Extracting Insights from Reviews using Cluster Analysis

Ayush Hans1, Nihar Khera1
1 National Institute of Technology, Kurukshetra, India

International Conference on Smart Systems and Advanced Computing (Syscom-2021), December 25–26, 2021
ayushhans2011@gmail.com (A. Hans); nkniharkhera@gmail.com (N. Khera)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract
Top-performing organizations understand the essential role that customer feedback plays in business. These businesses consistently listen to consumer feedback to stay ahead of the competition. Customer feedback gives crucial insights into how products and services are working and into what could be done within the company’s domain to improve the consumer experience. Customers’ opinions help companies ensure that the final product meets their expectations, solves their problems, and fulfils their needs. Hence, customer feedback is one of the most reliable and easily obtained sources of tangible data for making sound business decisions. The proposed approach provides a method to make effective use of this feedback and generate insights for the Product Team. Since it is not feasible to go through all reviews to find out what customers are talking about, the reviews are grouped together using a Topic Modelling approach. The Business Team is presented with the top keywords for each group of reviews, which makes it easy to identify actionable areas. The way the results are presented guides the team towards improving their products and services. Once the reviews have been labelled with topics, a classification model is trained on them. This model classifies the new reviews that keep arriving from customers. The Topic Modelling algorithm is run again once the team has collected a good number of new reviews, which further improves the model.

Keywords
Reviews, Natural Language Processing, Machine Learning, LDA, Topic Modelling, BERT

1. Introduction

Existing customers’ reviews not only help new customers find the right product but also serve as a means for product teams to improve their products and services. In this era of digitization, organizations collect customer reviews and other feedback from various sources and generate insights from them. Machine Learning and Natural Language Processing (NLP) are both used to process this wide variety and huge volume of reviews. Approaches such as Topic Modeling and Text Clustering are used in NLP for customer feedback analysis.

Data Preprocessing: Preprocessing textual data is an important step before performing NLP tasks. It converts the available data into a more usable and convenient form, removes redundant and irrelevant data from the dataset, and helps maintain the standard of the text [1].

Figure 1: Flow Diagram of the model

Topic Modeling: Topic Modeling is used to find the different topics in a collection of documents (some form of textual data) without any prior knowledge of those topics.

LDA (Latent Dirichlet Allocation): This topic modeling approach treats each document as a mixture of topics, and every word is considered to be drawn from one of those topics. A good LDA model requires tuning hyperparameters such as the word-topic density and the document-topic density.
To obtain good-quality topics, a suitable number of topics has to be selected. This can be done by measuring Topic Coherence, which captures the degree of semantic similarity between the highest-scoring words in a topic [2].

Text Classification: Text classification is a good starting point for getting familiar with textual data processing, and it has many interesting applications in daily life. A significant amount of research has been done in this field, one result of which is the BERT model. BERT stands for “Bidirectional Encoder Representations from Transformers”.

The remainder of this paper is structured as follows: Section 2 describes the proposed approach. Section 3 presents related work. Section 4 presents the implementation of the proposed method. Sections 5 and 6 discuss the results, provide a conclusion, and propose recommendations for future work.

2. Proposed approach

The proposed approach combines two kinds of Machine Learning algorithms: Clustering and Classification. Clustering avoids the manual task of labeling the product reviews by dividing them into topics or clusters. The labels then serve as the basis for classifying new reviews. Topic Modeling finds themes across reviews and discovers hidden topics. It can be thought of as creating a set of buckets and placing each review into one of them.

First, the reviews are split into positive and negative according to the rating given by the customer. Then LDA Topic Modeling is used to find themes within each of these two categories. The output of Topic Modeling is visualized on a webpage that displays the top bigrams (two words frequently occurring together) for each topic or cluster identified by LDA. This visualization is helpful for the Business or Product Team, as it gives them a clear picture of what customers are talking about in the reviews.
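The rating-based split and the bigram listing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sample reviews, the threshold of 4 stars, and the helper names `split_by_rating` and `top_bigrams` are all assumptions made for illustration, and the bigrams are counted per sentiment group rather than per LDA topic to keep the example short.

```python
from collections import Counter

# Hypothetical sample data: (review text, star rating from 1 to 5).
reviews = [
    ("battery life is great and battery life lasts long", 5),
    ("great battery life for the price", 4),
    ("customer service was slow and customer service was rude", 1),
    ("slow delivery and poor customer service", 2),
]

POSITIVE_THRESHOLD = 4  # assumption: ratings of 4 or more count as positive


def split_by_rating(rated_reviews, threshold=POSITIVE_THRESHOLD):
    """Split (text, rating) pairs into positive and negative review texts."""
    positive = [text for text, rating in rated_reviews if rating >= threshold]
    negative = [text for text, rating in rated_reviews if rating < threshold]
    return positive, negative


def top_bigrams(texts, n=3):
    """Return the n most frequent adjacent word pairs across the texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip pairs each token with its successor, giving adjacent bigrams
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


positive, negative = split_by_rating(reviews)
print("Positive:", top_bigrams(positive))
print("Negative:", top_bigrams(negative))
```

In the approach described here, the same kind of bigram listing would be produced for each topic found by LDA within the positive and negative groups.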
The team also gets the list of actual customer reviews to read as and when needed. The webpage also displays an Inter-topic Distance Map, in which each cluster is represented as a bubble. This is very helpful for Data Scientists analysing the topics or clusters formed. In addition, the webpage shows a list of the most relevant words for each topic, along with their frequency in the selected topic and their overall frequency.

At this point the clusters have been formed [3][4], but new reviews keep arriving from customers. These reviews are assigned to the existing clusters by a classification model that is trained on the topics or clusters produced by the topic modeling algorithm. The topic modeling algorithm can be run again after a specified time (for example, after two or three months), when the Product Team has a reasonable quantity of new reviews. This in turn improves the quality of the new topics or clusters formed.

3. Related works

Much research has been done on text summarization and terminology identification [5]. These techniques require designing templates by adequately identifying and extracting the primary elements and significant facts in a document. Researchers are still working on information extraction from text, with the main focus on machine learning and NLP methods for extracting or classifying entities and relations. Another area of research in this field is the extraction of opinions and reviews from web pages, and opinion summarization based on product features, with the help of edge [6] and cloud computing [7][8]. The central problem with existing studies on reviews is that they treat all reviews as equally significant, which may not give relevant and accurate results. That is why classifying reviews by importance is a significant task.
Hiremath proposed a system that automatically assesses the quality of a review using a quartile measure and identifies a customer review as a Most Significant, More Significant, Significant, or Insignificant review. Other approaches include Topic Modeling algorithms such as Latent Dirichlet Allocation and Latent Semantic Analysis, which discover topics from a set of documents.

In Topic Modeling using LDA, different topic groups are created, and it is the researcher’s role to decide the number of groups in the final output. Since the best number of groups is not known in advance, models are generated with different numbers of groups and then analyzed and compared. The model whose topics are most meaningful and sensible is selected from all the models generated with different hyperparameters.

Topic Modeling is useful for finding themes across data, which makes it quite effective for customer reviews. Each review is assigned the theme to which it belongs in the highest proportion, making it easy for businesses to identify the difficulties customers face with their products and services.

4. Implementation

4.1. NLP Preprocessing

• Contractions Expansion: Contractions are quite common in the English language. They are formed by removing specific letters and sounds from words. This step expands every contraction to its original form to maintain the standard of the text [9].
• Removal of URLs: A review may contain a URL, which needs to be removed before further processing.
• Removal of HTML Tags: This step is useful when the reviews have been extracted from a website, because some HTML-specific code may have become part of the review during scraping.
• Lower Casing: Lower casing is a text preprocessing technique.
It converts the text to a single casing format, so that the same word written in different cases is not treated as different words.
• Removal of Punctuation: This step is performed to maintain the standard of the text. The list of punctuation marks to exclude should be chosen with the downstream task in mind.
• Tokenization: This step splits textual strings into smaller pieces called “tokens”. It involves splitting textual data into sentences, which are then split into words. It is a necessary step in almost all textual data processing tasks. Tokenization is also known as Text Segmentation.
• Lemmatization: This is one of the most important NLP preprocessing steps. Lemmatization reduces a word to its base or dictionary form, which is called the “lemma”. Unlike stemming, which just chops off word endings, it maps words to their true root form. For example, the words “playing”, “plays”, and “played” are mapped to “play”. It can be done with the Python “nltk” package, which uses a dictionary such as “WordNet” to produce the mappings. Lemmatization plays a significant role in Natural Language Processing and Artificial Intelligence tasks. In languages other than English, lemmatization can be quite complicated.

4.2. Topic Modelling

• LDA (Latent Dirichlet Allocation): Topic Modeling is an approach used to find themes across the reviews and discover hidden topics. It is based on extracting a certain number of groups of specific words from the reviews. These groups represent the topics that help the Business or Product Team find out what customers are talking about in the reviews. LDA (Latent Dirichlet Allocation) is one of the most popular methods of Topic Modeling. LDA takes two hyperparameters into consideration, the “alpha parameter” and the “beta parameter”.
The “alpha parameter” controls the mixture of topics for any given document. If it is low, each document will likely contain a mixture of fewer topics; if it is high, each document will likely contain a mixture of more topics. The “beta parameter” controls the distribution of words per topic. If it is low, each topic will likely be made up of fewer words; if it is high, each topic will likely be made up of more words. Another factor that LDA takes into account is K, the number of topics or groups to form.
• Topic Modeling using Nouns and Adjectives: The topics generated by LDA can be a mixture of nouns, verbs, adjectives, etc., because the LDA algorithm treats all tokens with the same importance. When dealing with reviews, removing all words except nouns and adjectives helps to improve the semantic coherence of the topics.
• Bigrams Formation: Bigrams are pairs of words that frequently occur together in the text. Applying LDA Topic Modeling after taking bigrams (or, in general, n-grams) into account helps to improve the quality of the topic models. In Python, Gensim’s Phrases model can be used to build bigrams, trigrams, etc.

4.3. Evaluation of LDA Topic Modeling: Topic Coherence

Probabilistic topic models such as LDA are popular approaches for textual processing and analysis. They provide a predictive, latent topic representation of the corpus. The latent space discovered by these models is generally assumed to be meaningful and useful, but evaluating this assumption is challenging because the training process is unsupervised.

Topic Coherence is a method that can be used to evaluate LDA topics. It combines a number of measures into a framework for evaluating the coherence of the topics generated by the model. A set of sentences or facts is said to be coherent if they support each other. Topic Coherence scores a single topic by measuring the degree of semantic similarity between the high-importance words in the topic.
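The idea of scoring a topic by how strongly its top words go together can be sketched with the UMass coherence measure, which uses document co-occurrence counts. The paper does not state which coherence measure was used, so the choice of UMass, the function name `umass_coherence`, and the toy corpus below are assumptions made for illustration only.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: the topic's top words, most probable first.
    documents: list of tokenized documents (lists of words).
    Sums log((D(w_i, w_j) + 1) / D(w_i)) over all pairs where w_i is
    ranked above w_j, with D counting the documents that contain the words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):  # guarantees i < j
        higher, lower = top_words[i], top_words[j]
        denominator = doc_freq(higher)
        if denominator == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(higher, lower) + 1) / denominator)
    return score


# Hypothetical tokenized reviews.
documents = [
    ["battery", "life", "great"],
    ["battery", "life", "short"],
    ["battery", "charger", "life"],
    ["screen", "bright"],
    ["delivery", "slow"],
]

print(umass_coherence(["battery", "life"], documents))      # words that co-occur
print(umass_coherence(["battery", "delivery"], documents))  # words that never co-occur
```

In this toy corpus the first topic scores higher than the second, because its words appear in the same documents. In practice, a library implementation would normally be used, and the score would be averaged over all topics of a model to compare models with different numbers of topics.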
The higher the Topic Coherence value of a model, the better the quality of the topics it forms.

4.4. Visualization

Visualizing clusters makes it convenient for the Business or Product Team to evaluate, explore, and interpret the results of Cluster Analysis [10]. The webpage lists the top bigrams for each topic or cluster identified by LDA Topic Modeling, which gives the Product Team a clear picture of what customers are saying about their product in the reviews. The webpage also displays the list of actual customer reviews for deeper analysis. It has an Inter-topic Distance Map, which helps Data Scientists evaluate the clusters formed. Hence, both unigrams and bigrams are available for each cluster, which helps the Product Team find the areas to focus on to improve their product.

4.5. Classification of New Reviews

Once clustering is finished, the reviews are labeled: each review has a label corresponding to the number of the topic to which it belongs. These labeled reviews are used to train a BERT-based classifier for new reviews.

BERT uses the Transformer encoder architecture to process each token of the input text in the full context of all the tokens before and after it. Such models are pre-trained on a large corpus of text before being fine-tuned for specific NLP tasks. BERT is the encoder stack of the Transformer architecture [11]; the Transformer itself is an encoder-decoder network that uses self-attention on the encoder side and attention on the decoder side. BERT models also use large hidden representations: 768 hidden units in BERT-Base and 1024 hidden units in BERT-Large.

During pre-training, the BERT model takes pairs of sentences and learns to predict whether the second sentence is the one that follows the first in the original text. In 50 percent of the inputs, the second sentence is the subsequent sentence in the original text.
For the other 50 percent of the inputs, a random sentence from the corpus is chosen as the second sentence [12].

Figure 2: Using BERT for Classification

5. Results

Finally, after combining the two kinds of Machine Learning algorithms, Clustering and Classification, we visualize the insights to see whether they yield meaningful results.

6. Conclusion and Future plans

Customers’ reviews are of utmost importance for any firm or organization. Organizations that look into the feedback given by their customers tend to excel in their domain. It is not possible to go through every piece of customer feedback manually, and clustering the reviews is a better way to get insights from them. Topic Modeling can be used to find themes across the reviews and discover hidden topics. LDA (Latent Dirichlet Allocation) is one of the most popular methods of Topic Modeling. It is a “generative probabilistic model”. After applying the LDA model, each customer review is assigned the topic or cluster to which it belongs with the highest probability. As a result, we have labelled reviews, each of which belongs to one of the topics or clusters. These clusters are visualized so that they can be presented to the Product Team in a form that is easy to interpret and analyze. Once the organization has a significant number of fresh reviews, it may run the clustering technique again to improve the quality of the topics or clusters, since a model generally performs better with more data.

Figure 3: Cluster Reviews

This approach is quite effective from an organization’s perspective and helps it improve the quality of its products and services by making it easy to identify actionable areas.

Another improvement that could be made in the future is to incorporate sentence embeddings from a model like BERT into the Topic Modeling technique. The vectors from the model and from LDA can be combined using a weight or hyperparameter to improve the results.
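One possible way to combine the two kinds of vectors with a weight, as suggested above for future work, is to scale the LDA topic probabilities and concatenate them with the sentence embedding. This is only a sketch of one option and not a method from the paper: the function name `combine_vectors`, the weight `gamma`, and the short example vectors are hypothetical (real BERT embeddings have hundreds of dimensions).

```python
def combine_vectors(lda_vector, embedding, gamma=1.0):
    """Concatenate a topic-probability vector, scaled by gamma, with an embedding.

    gamma is a hypothetical weight: larger values make the LDA part
    dominate distances computed on the combined vector.
    """
    return [gamma * p for p in lda_vector] + list(embedding)


# Hypothetical inputs for a single review.
lda_vector = [0.7, 0.3]        # topic probabilities from an LDA model
embedding = [0.1, -0.2, 0.4]   # stand-in for a BERT sentence embedding

combined = combine_vectors(lda_vector, embedding, gamma=2.0)
print(combined)
```

The combined vectors could then be clustered in place of the plain LDA output, with `gamma` tuned like any other hyperparameter.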
Figure 4: Classification results in form of clusters and relevance metrics

Acknowledgments

We would like to thank our college, National Institute of Technology, Kurukshetra, for giving us the platform to express ourselves. We would also like to thank our mentor Dr. B.B. Gupta, Asst. Professor, NIT Kurukshetra.

References

[1] S. Kapadia, "Towards Data Science," 19 08 2019. [Online]. Available: https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0. [Accessed 07 02 2021].
[2] Kaggle, "Clustering with Topic Modeling using LDA," Kaggle, 01 09 2020. [Online]. Available: https://www.kaggle.com/panks03/clustering-with-topic-modeling-using-lda. [Accessed 19 03 2021].
[3] Shahabadi, M. S. E., Tabrizchi, H., Rafsanjani, M. K., Gupta, B. B., Palmieri, F. (2021). A combination of clustering-based under-sampling with ensemble methods for solving imbalanced class problem in intelligent systems. Technological Forecasting and Social Change, 169, 120796.
[4] Manasrah, A. M., Gupta, B. B. (2019). An optimized service broker routing policy based on differential evolution algorithm in fog/cloud environment. Cluster Computing, 22(1), 1639-1653.
[5] Gou, Z., et al. (2017). Analysis of various security issues and challenges in cloud computing environment: a survey. In Identity Theft: Breakthroughs in Research and Practice (pp. 221-247). IGI Global.
[6] Dahiya, A., Gupta, B. (2021). Edge Intelligence: A New Emerging Era. Insights2Techinfo, pp. 1.
[7] Dahiya, A. (2021). Integration of Cloud and Fog Computing for Energy Efficient and Scalable Services. Insights2Techinfo, pp. 1.
[8] Mirsadeghi, F., Rafsanjani, M. K., Gupta, B. B. (2020). A trust infrastructure based authentication method for clustered vehicular ad hoc networks. Peer-to-Peer Networking and Applications, 1-17.
[9] Kaggle, "Getting started with Text Preprocessing," Kaggle, 25 03 2019. [Online]. Available: https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing. [Accessed 10 02 2021].
[10] S. A. S. S. Prakash Hiremath, "Cluster Analysis of Customer Reviews Extracted from Web Pages," Journal of Applied Computer Science & Mathematics, 24 07 2014. [Online]. Available: https://www.researchgate.net/publication/47807593_Cluster_Analysis_of_Customer_Reviews_Extracted_from_Web_Pages. [Accessed 06 02 2021].
[11] "A Visual Notebook to Using BERT for the First Time," Google Colab, 28 01 2020. [Online]. Available: https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb. [Accessed 08 02 2021].
[12] H. E. Boukkouri, "Text Classification: The First Step Toward NLP Mastery," Medium, 18 06 2018. [Online]. Available: https://medium.com/data-from-the-trenches/text-classification-the-first-step-toward-nlp-mastery-f5f95d525d73. [Accessed 03 08 2021].