Understanding the Distinctions in Topics per Class Utilizing BERTopic – A User-Friendly and Captivating Approach | Written by Mariya Mansurova in September 2023

Introduction:

Understanding the differences in texts by categories is essential for analyzing large sets of free-form texts. One tool that can help with this is Topic Modelling. By using BERTopic, an easy-to-use and powerful package, you can uncover hidden semantic patterns and assign topics to your texts. In this article, we will explore how to build a topic model and compare topics across different categories, using hotel reviews as an example.

Understanding the Differences in Texts by Categories: A Guide to Topic Modelling

In today’s world of product analytics, we are faced with large amounts of free-form texts. These texts come in various forms: user comments on app stores, customer support inquiries, and survey responses. Analyzing all of these texts manually would be a daunting task, but luckily, there are tools available to help automate the process. One such tool is Topic Modelling, which we will explore in this article.

Segmenting customers based on their texts can provide valuable insights. For example, if we find that 14.2% of reviews mention too many ads in the app, we may not know if this is a significant issue. However, when we segment these reviews by platform and discover that 34.8% of Android users mention this issue compared to only 3.2% of iOS users, it becomes clear that we need to investigate the ad experience on Android. This is where comparing topics across categories becomes useful.
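The per-segment comparison described above is simple arithmetic. Here is a minimal sketch with entirely hypothetical reviews and counts (not the article's dataset), just to show the shape of the computation:

```python
# Toy illustration of segmenting an issue rate by platform.
# Reviews and counts here are hypothetical examples.
reviews = [
    {"platform": "Android", "text": "Too many ads, unusable"},
    {"platform": "Android", "text": "Great app"},
    {"platform": "Android", "text": "Ads everywhere"},
    {"platform": "iOS", "text": "Love it"},
    {"platform": "iOS", "text": "Smooth experience"},
]

def share_mentioning(reviews, keyword):
    """Share of reviews per platform whose text mentions `keyword`."""
    totals, hits = {}, {}
    for r in reviews:
        p = r["platform"]
        totals[p] = totals.get(p, 0) + 1
        if keyword in r["text"].lower():
            hits[p] = hits.get(p, 0) + 1
    return {p: hits.get(p, 0) / totals[p] for p in totals}

shares = share_mentioning(reviews, "ads")
print(shares)  # {'Android': 0.6666666666666666, 'iOS': 0.0}
```

A large gap between segments, as in the Android-vs-iOS example, is what makes the overall rate actionable.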

To illustrate how Topic Modelling works, let’s use a dataset of hotel reviews for several hotel chains in London. Before diving into the analysis, let’s take a look at the data. We have a total of 12,890 reviews spread across 7 different hotel chains.

Once we have the data, we can apply the BERTopic package, a powerful and easy-to-use tool for Topic Modelling using HuggingFace transformers and class-based TF-IDF. Topic Modelling is an unsupervised machine learning technique that identifies hidden patterns in texts and assigns topics to them automatically. In our case, we don’t need to preprocess the data much, thanks to the BERTopic package.

To utilize BERTopic effectively, we translate all comments into English using the deep-translator package. This ensures that the model works optimally with texts in one language. After translation, we analyze the distribution of review lengths to identify any potential noise in the dataset. We discover that around 5% of reviews are extremely short and likely not meaningful.
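A translation step with deep-translator might look like the sketch below. This is an assumption-laden example: it requires `pip install deep-translator` and network access to the translation backend, so the import is kept inside the function.

```python
def translate_to_english(texts):
    """Translate a list of review texts to English.

    Kept lazy: deep-translator calls a remote translation service,
    so this sketch assumes the package is installed and the machine
    is online.
    """
    from deep_translator import GoogleTranslator

    translator = GoogleTranslator(source="auto", target="en")
    return [translator.translate(t) for t in texts]


if __name__ == "__main__":
    print(translate_to_english(["Das Hotel war wunderbar"]))
```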

To filter out these short comments, we remove all reviews shorter than 20 characters, reducing the dataset by 4.3%. We then check whether this filter disproportionately affects specific hotels, but find that the distribution of short comments is similar across all categories.
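The length filter itself is a one-liner. A small self-contained sketch with made-up reviews (the threshold of 20 characters is the article's; the texts are not):

```python
# Hypothetical review list; the article drops reviews shorter
# than 20 characters before modelling.
reviews = [
    "Ok",
    "Nice!",
    "The room was clean and the staff were friendly.",
    "Breakfast could be better, but location is great.",
]

MIN_LEN = 20
filtered = [r for r in reviews if len(r) >= MIN_LEN]
removed_share = 1 - len(filtered) / len(reviews)
print(f"kept {len(filtered)} reviews, removed {removed_share:.1%}")
```

Checking the removed share per hotel chain, as the article does, is the same list comprehension grouped by chain.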

Now it’s time to build our first topic model using BERTopic. With just a few lines of code, we can train the model and obtain 113 topics. However, the largest group, Topic -1, represents outliers and accounts for nearly 50% of all reviews. To better understand the topics, we visualize them using a bar chart, giving us a glimpse into the main terms associated with each topic.
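Those "few lines of code" follow the standard BERTopic pattern. This is a sketch, not the article's exact configuration; it assumes `bertopic` (and its sentence-transformers dependency) is installed, and fitting downloads an embedding model on first run:

```python
def fit_topic_model(docs):
    """Fit a BERTopic model and return (model, per-document topics).

    Import kept lazy because bertopic pulls in heavy dependencies.
    """
    from bertopic import BERTopic

    topic_model = BERTopic(language="english")
    topics, probs = topic_model.fit_transform(docs)
    return topic_model, topics
```

Given a fitted model, `topic_model.get_topic_info()` shows topic sizes (with Topic -1 as the outlier bucket) and `topic_model.visualize_barchart()` produces the bar chart of main terms per topic mentioned above.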

To further explore the differences in reviews across hotel chains, we use the Topics per Class representation provided by BERTopic. However, interpreting the resulting graph can be challenging, as it does not accurately show the share of different topics for each class. Although this representation hasn’t fully solved our initial task, it provides us with valuable insights.
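The Topics per Class representation is produced roughly as follows; a sketch assuming an already-fitted BERTopic model and one hotel-chain label per document:

```python
def topics_by_hotel(topic_model, docs, hotel_labels):
    """Compare topic frequencies across hotel chains.

    `topic_model` is a fitted BERTopic instance; `hotel_labels`
    holds one class label (hotel chain) per document.
    """
    topics_per_class = topic_model.topics_per_class(docs, classes=hotel_labels)
    # Interactive chart of topic frequency per class. Note it plots
    # absolute frequencies, not per-class shares, which is why the
    # graph is hard to read when class sizes differ.
    fig = topic_model.visualize_topics_per_class(topics_per_class,
                                                 top_n_topics=10)
    return topics_per_class, fig
```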

To optimize our topic model, we can address the issue of outlier reviews. BERTopic offers four strategies for dealing with outliers: based on topic-document probabilities, based on topic distributions, based on c-TF-IDF representations, and based on document and topic embeddings. Each strategy has its own advantages, and the best approach depends on the specific dataset.
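BERTopic exposes these four strategies through a single method. A hedged sketch (the strategy names below are the library's; which one works best for the hotel-review dataset is an empirical question):

```python
def reassign_outliers(topic_model, docs, topics, strategy="c-tf-idf"):
    """Reassign outlier documents (Topic -1) to real topics.

    `strategy` is one of BERTopic's four options: "probabilities",
    "distributions", "c-tf-idf", or "embeddings".
    """
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)
    # Refresh topic representations so terms reflect the new assignments.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics
```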

Examining examples of outliers, we find that these reviews often contain multiple topics. BERTopic assigns only one topic to each document, which may not accurately represent texts that mix several themes. However, we can use Topic Distributions to overcome this limitation: by sliding a window of tokens over each document and scoring each fragment against the topics, we can better capture the nuances of mixed-topic texts.
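In BERTopic this token-window estimate is available via `approximate_distribution`. A sketch, assuming a fitted model; the window size below is an illustrative choice, not the article's:

```python
def mixed_topic_distributions(topic_model, docs):
    """Estimate a topic mixture per document instead of a single label.

    BERTopic slides a token window over each document and scores every
    window against the topics, yielding one distribution per document
    (and, optionally, per token).
    """
    topic_distr, token_distr = topic_model.approximate_distribution(
        docs, window=4, calculate_tokens=True
    )
    return topic_distr, token_distr
```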

By implementing these strategies and continuously refining our topic model, we can gain valuable insights from free-form texts. With automated tools like BERTopic, we can analyze large volumes of texts efficiently and effectively to inform decision-making and improve products and services.

Summary:

Understanding the differences in texts by categories can be a challenging task, especially when dealing with large amounts of data. However, there are tools like Topic Modelling that can help automate this process. In this article, the author discusses how to build a topic model using BERTopic and compares topics across categories using insightful graphs.




Frequently Asked Questions

How to understand the differences in topics per class using BERTopic?

BERTopic is a powerful topic modeling algorithm that can be used to identify and cluster topics within a given set of documents. By applying BERTopic to a dataset split into classes, you can compare and analyze the differences in the topics extracted for each class. To understand these differences, follow these steps:

  1. Split your dataset into classes or categories based on a specific criterion. For example, if your dataset contains news articles, you can split them into classes based on their respective topics (e.g., sports, politics, entertainment).
  2. Apply BERTopic to each class separately, generating a set of topics for each class.
  3. Analyze the resulting topics for each class, looking for key terms and recurrent themes to identify the differences. You can use visualizations, such as word clouds or topic similarity graphs, to gain a comprehensive understanding.
  4. Compare the topics extracted for each class. Look for similarities and differences in terms of content, language, or any other relevant characteristics. This comparison will provide insights into how topics vary across different classes.
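The per-class steps above can be sketched as follows. Note this fits a separate model per class (as the FAQ describes), which differs from the single-model `topics_per_class` approach used in the article; it assumes `bertopic` is installed and each class has enough documents to cluster:

```python
def topics_per_class_separately(docs, labels):
    """Fit one BERTopic model per class and collect its topics.

    Returns {class_label: topic-info table}, one entry per class.
    """
    from bertopic import BERTopic

    by_class = {}
    for label in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == label]
        model = BERTopic()
        model.fit(class_docs)
        by_class[label] = model.get_topic_info()
    return by_class
```

Comparing the resulting per-class topic tables (step 4) is then a matter of inspecting shared and class-specific terms.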
