The main limitations of traditional topic modeling methods like Latent Dirichlet Allocation (LDA) include topics that are insufficiently specific and interpretable, difficulty in relating topics back to document content, and limited control over topic specificity and formatting. Because these models represent topics as distributions over words, the resulting topics are often incoherent and hard to interpret, and users have little ability to guide the generation process toward high-quality, easily interpretable topics.
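To make the interpretability problem concrete, a minimal illustration (the vocabulary and probabilities below are invented, not from any real model): an LDA-style topic is just a probability distribution over words, so the analyst sees only a ranked bag of words with no label or description.

```python
# Hypothetical LDA-style topic: a probability distribution over the vocabulary.
# The model gives no label; the analyst must guess what the topic "means".
topic = {"court": 0.09, "law": 0.07, "game": 0.06, "ball": 0.05, "state": 0.04}

# The usual presentation is the top-k most probable words.
top_words = sorted(topic, key=topic.get, reverse=True)[:5]
print(top_words)
# A mixed list like this is ambiguous: is "court" legal or sports-related?
```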
TopicGPT, a novel topic modeling framework, operates in two main stages: topic generation and topic assignment. During topic generation, an LLM is iteratively prompted with a sample of documents from the input dataset together with the list of previously generated topics, which encourages the creation of distinctive and specific topics. The generated topics are then refined to remove redundant and infrequent ones. In the topic assignment stage, an LLM assigns topics to new documents and supports each assignment with a quotation from the document, enhancing verifiability. GPT-4 is used for topic generation and GPT-3.5-turbo for topic assignment.
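The two-stage pipeline might be sketched as follows. This is a simplified illustration, not the paper's implementation: the `llm` callable, batch size, and prompt wording are all assumptions standing in for the actual prompts and API calls.

```python
def generate_topics(documents, llm, batch_size=5):
    """Stage 1 (sketch): iteratively prompt the LLM with a document sample
    plus the topics generated so far, asking only for new, distinct topics."""
    topics = []
    for i in range(0, len(documents), batch_size):
        sample = documents[i:i + batch_size]
        prompt = (
            "Existing topics:\n" + "\n".join(topics) +
            "\n\nDocuments:\n" + "\n".join(sample) +
            "\n\nPropose new topics not already listed, one per line, "
            "formatted as 'Label: description'."
        )
        # Deduplicate so repeated suggestions are not added twice;
        # the paper's separate refinement step also drops infrequent topics.
        for line in llm(prompt).splitlines():
            if line and line not in topics:
                topics.append(line)
    return topics

def assign_topic(document, topics, llm):
    """Stage 2 (sketch): ask the LLM to pick a topic from the generated list
    and return a supporting quotation for verifiability."""
    prompt = (
        "Topics:\n" + "\n".join(topics) +
        "\n\nDocument:\n" + document +
        "\n\nReturn the best-fitting topic and a quotation from the "
        "document that supports the assignment."
    )
    return llm(prompt)
```

In the paper's setup, `llm` would be GPT-4 inside `generate_topics` and GPT-3.5-turbo inside `assign_topic`; here any callable that maps a prompt string to a response string will do.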
TopicGPT differs from traditional topic modeling methods like SeededLDA and BERTopic in several ways. First, it uses a prompt-based framework built on large language models (LLMs) for topic generation and assignment, producing topics that align better with human categorizations. Second, it provides natural language labels and descriptions for topics, enhancing interpretability. Third, it allows high-quality topics to be generated and customized without retraining a model. Finally, TopicGPT has shown superior alignment of its topics and assignments with human-annotated ground-truth topics.