<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting Insights from Reviews using Cluster Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ayush Hans</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nihar Khera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Technology</institution>
          ,
          <addr-line>Kurukshetra</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Top-performing organizations understand the essential role that customer feedback plays in business. These businesses consistently listen to their consumers' feedback to stay ahead of the competition. Customer feedback gives crucial insights into how products and services are working and what could be done within the company's domain to improve the consumer experience. Customers' opinions help companies ensure that the final product will actually suffice their expectations, solve their problems and meet their needs. Customer feedback is therefore one of the most reliable and easily obtained sources of tangible data that can also be used to make wise business decisions. The proposed approach provides a method to make effective use of this feedback and generate insights for the Product Team. Since it is not feasible to go through all reviews to find out what the customers are talking about, the reviews are grouped together with a Topic Modelling approach. The Business Team is presented with the top keywords corresponding to each group of reviews, which makes it easy for them to identify actionable areas. The way the results are presented guides the team in the right direction to improve their products and services. Once the reviews have been labelled with topics, a classification model is generated; this is used to classify the new reviews that keep coming in from customers. The Topic Modelling algorithm is run again once the team has a good number of new reviews, which further helps to improve the model.</p>
      </abstract>
      <kwd-group>
        <kwd>Reviews</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>LDA</kwd>
        <kwd>Topic Modelling</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Existing customers’ reviews not only help new customers find the right
product; they also serve as a means for product teams to improve their products and
services. In this era of digitization, organizations use customer reviews and other feedback
information from various sources and generate insights from those reviews. Machine Learning
and Natural Language Processing are both used to process this wide variety and huge
volume of reviews. Different approaches, such as Topic Modeling and Text Clustering, are used in
Natural Language Processing for Customer Feedback Analysis.</p>
      <p>
        Data Preprocessing: Preprocessing textual data is an important step before performing
Natural Language Processing tasks. It converts the available data into a more usable and
convenient form, helps to get rid of redundant and irrelevant data present in the dataset,
and also plays a role in maintaining the standard of the text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Topic Modeling: Topic Modeling is used to find different topics in documents
(basically some form of textual data) without any prior knowledge.</p>
      <p>
        LDA (Latent Dirichlet Allocation): This topic modeling approach treats each
document as a different mixture of topics, and every word is considered to be drawn from one of those topics.
A good LDA model involves tuning hyperparameters such as word-topic density and document-topic
density. To get good-quality topics, a suitable number of topics has to be
selected, which can be done by measuring the Topic Coherence: the degree of
semantic similarity between the words that scored highest in the topic [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Text Classification: Text Classification is a good way to get familiar with textual data
processing, and it has many interesting applications in daily life. There has been a significant
amount of research in this field; one notable result is the BERT model, where BERT stands for
“Bidirectional Encoder Representations from Transformers”.</p>
      <p>The remainder of this paper is structured as follows: Section 2 provides the proposed approach
followed throughout the paper. In Section 3, we present the related works. In Section 4, we
present the implementation of the proposed method. In Sections 5 and 6, we discuss the results,
provide a conclusion and propose recommendations for some future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed approach</title>
      <p>The proposed approach combines two aspects of Machine Learning:
Clustering and Classification.</p>
      <p>Clustering helps to avoid the manual task of labeling the product reviews by dividing them
into topics or clusters. The labels then serve as the basis for classifying the new reviews. Topic
Modeling finds a theme across reviews and discovers hidden topics. It can be interpreted as
creating some buckets and putting each review into one of these buckets. First, the reviews are split
into positive and negative groups depending on the rating value given by the customer. Then LDA Topic
Modeling is used to find themes across these two categories.</p>
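The splitting step above can be sketched in a few lines of Python; the review texts, the ratings, and the 4-star/2-star thresholds are illustrative assumptions, not values fixed by the paper.

```python
# Split reviews into positive and negative groups by rating before
# topic modeling each group separately. Thresholds are assumptions.
reviews = [
    ("Battery lasts all day, love it", 5),
    ("Screen cracked within a week", 1),
    ("Camera quality is excellent", 4),
    ("Support never answered my emails", 2),
]

positive = [text for text, rating in reviews if rating >= 4]
negative = [text for text, rating in reviews if rating <= 2]

print(len(positive), len(negative))  # each group is then topic-modeled on its own
```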
      <p>The output of Topic Modeling is visualized on a webpage that displays the top Bigrams (two
words frequently occurring together) corresponding to each topic or cluster identified by the
LDA Topic Modeling. This type of visualization is really helpful from the perspective of the
Business or Product Team as they get a clear picture of what the customers are talking about in
the reviews. The team also gets the list of actual customer reviews to read them as and when
needed. It also displays an Inter-topic Distance Map which reflects the clusters formed where
each cluster is represented in the form of a bubble. This is very helpful for Data Scientists
for analysis of the topics or clusters formed. The webpage shows a list of the most relevant
words corresponding to each topic along with their frequency in the selected topic and overall
frequency.</p>
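The "top Bigrams per cluster" view described above amounts to counting adjacent word pairs across a cluster's reviews. A minimal sketch, with a made-up cluster of tokenized reviews:

```python
from collections import Counter

def top_bigrams(tokenized_reviews, n=3):
    """Count adjacent word pairs across a cluster's reviews and return
    the most frequent ones -- the 'top bigrams' shown per topic."""
    counts = Counter()
    for tokens in tokenized_reviews:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)

cluster = [
    "battery life is too short".split(),
    "poor battery life overall".split(),
    "battery life disappointed me".split(),
]
print(top_bigrams(cluster, n=1))  # [(('battery', 'life'), 3)]
```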
      <p>
        Now we have the clusters [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], but new reviews still keep coming in from customers.
These reviews are classified into the existing clusters with the help of a classification model
built using the topics or clusters from the topic modeling algorithm.
      </p>
      <p>The topic modeling algorithm can be run again after a specified time (for example, after
two or three months), once the Product Team has a reasonable quantity of new reviews.
This will in turn improve the quality of the topics or clusters formed.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related works</title>
      <p>
        Much research has been done on text summarization and terminology identification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
This technique requires designing templates by adequately identifying and extracting primary
elements and significant facts in a document. Researchers are still working on information
extraction from text. The main focus is on machine learning and NLP methods
for proper extraction or classification of entities and relations. Along the same lines, another
area of research in this field is opinion and review extraction from online web pages,
and opinion summarization based on product features with the help of edge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and cloud
computing [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The central problem with existing studies on reviews is
that they treat all reviews with the same significance, which may not give relevant and
accurate results. That is why classifying reviews by importance is a significant
task. Hiremath proposed a system that automatically assesses a review’s quality using a quartile
measure and identifies a customer review as a Most Significant review, More Significant review,
Significant review, or Insignificant review.
      </p>
      <p>Other approaches include Topic Modeling algorithms, such as Latent Dirichlet Allocation and Latent
Semantic Analysis, which enable us to discover topics from a set of documents. In Topic
Modeling using LDA, different topic groups are created, and it is the role of the researcher to decide
the number of groups in the final output. Since there is no prior knowledge of the
best number of groups, models are generated with different numbers of groups, the resulting
topic models are analyzed and compared, and the model that is most meaningful and
sensible out of all those generated with different hyperparameters is selected.</p>
      <p>Topic Modeling is an approach that is useful for finding themes across the data; hence
it is quite effective when dealing with customer reviews. Each review is assigned the
theme to which it belongs with the highest proportion, making it easy for businesses to
figure out the difficulties customers face with their products and services.</p>
    </sec>
    <sec id="sec-3a">
      <title>4. Implementation</title>
      <sec id="sec-3a-1">
        <title>4.1. NLP Preprocessing</title>
        <p>
          • Contractions Expansion: Contractions are quite common in the English language and are created by removing specific letters and sounds from words. This step expands each and every contraction to its original form to maintain the standard of the text [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
• Removal of URLs: A review may contain URLs, which need to be removed before further processing.
• Removal of HTML Tags: This step is useful when the reviews have been extracted from a website, because HTML-specific code may become part of a review during scraping.
• Lower Casing: Lower casing converts the text into the same casing format, so that the same word in different cases is not treated as different words.
• Removal of Punctuation: This step is performed to maintain the standard of the text. The list of punctuation marks to exclude should be chosen with the downstream task in mind.
• Tokenization: This step splits textual strings into smaller pieces referred to as “tokens”: the text is split into sentences, which are then split into words. It is a necessary step in almost all textual-data processing tasks and is also known as Text Segmentation.
• Lemmatization: This is one of the most important NLP preprocessing steps. Lemmatization reduces a word to its base or dictionary form, called the “lemma”, transforming words to their true root form instead of just chopping them; for example, the words “playing”, “plays”, and “played” are all mapped to “play”. It can be done with the Python “nltk” package, which uses a dictionary such as “WordNet” to produce the mappings. Lemmatization plays a significant role in Natural Language Processing and Artificial Intelligence tasks, and in languages other than English it can be quite complicated.
        </p>
      </sec>
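A minimal sketch of the preprocessing steps above using only the Python standard library; the tiny contraction map stands in for a full one, and in a real pipeline nltk's word tokenizer and WordNet lemmatizer would replace the simple split at the end.

```python
import re
import string

# Illustrative contraction map; a full pipeline would use a complete one.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def preprocess(review: str) -> list:
    review = review.lower()                                   # lower casing
    for short, full in CONTRACTIONS.items():                  # expand contractions
        review = review.replace(short, full)
    review = re.sub(r"https?://\S+", " ", review)             # remove URLs
    review = re.sub(r"<[^>]+>", " ", review)                  # remove HTML tags
    review = review.translate(
        str.maketrans("", "", string.punctuation))            # remove punctuation
    return review.split()                                     # naive tokenization

print(preprocess("It's great! See <b>specs</b> at https://example.com/x"))
```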
      <sec id="sec-3-1">
        <title>4.2. Topic Modelling</title>
        <p>• LDA (Latent Dirichlet Allocation): Topic Modeling is
an approach used to find themes across the reviews and discover hidden topics.
It is based on extracting a certain number of groups consisting of specific words from
the reviews. These groups represent the topics, which are useful from the perspective of
the Business or Product Team for finding out what the customers are talking about in the
reviews. LDA is one of the most popular methods of Topic
Modeling. LDA takes two hyperparameters into consideration, the “alpha parameter” and
the “beta parameter”. The “alpha parameter” controls the mixture of topics for any given
document: if it is low, the documents will have less of a mixture of topics, and if it is high,
they will have more of a mixture of topics. The “beta parameter” controls the
distribution of words per topic: if it is low, the topics will likely have fewer words; if it is
high, they will likely have more words. Another factor that LDA takes into account
is K, the number of topics or groups to form.
• Topic Modeling using Nouns and Adjectives: The topics generated by LDA can be a
mixture of nouns, verbs, adjectives, etc., since the LDA algorithm treats all tokens with
the same importance. When dealing with reviews, removing all words except
nouns and adjectives helps to improve the semantic coherence of the topics.
• Bigrams Formation: Bigrams are pairs of words that frequently occur together in the text.
Applying LDA Topic Modeling after taking bigrams (or, in general, n-grams) into account
helps to improve the quality of the topic models. In Python, Gensim’s Phrases model can
build and implement bigrams, trigrams, etc.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.3. Evaluation of LDA Topic Modeling: Topic Coherence</title>
        <p>Probabilistic topic models (such as LDA) are popular approaches for textual processing and
analysis. They provide a predictive, latent topic representation of the corpus. It is assumed that
the latent space discovered by these models is generally meaningful and useful, but evaluating
this assumption is challenging because of the unsupervised training process. Topic Coherence is a
method that can be used to evaluate LDA topics. It combines
a number of measures into a framework for evaluating the coherence of the topics
generated by the model. If a set of sentences or facts support each other, they are said
to be coherent. Topic Coherence scores a single topic by measuring the degree of
semantic similarity between the high-importance words in that topic: the higher the Topic
Coherence of a model, the better the quality of the topics it forms.</p>
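The idea can be illustrated with a hand-rolled UMass-style coherence score, which compares how often a topic's top words co-occur in the same documents; real pipelines would use a library implementation (e.g. Gensim's CoherenceModel), and the three-document corpus here is a made-up assumption.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """UMass-style coherence: sum of log co-occurrence ratios over word
    pairs. Higher (less negative) means the words co-occur more often.
    Assumes every word appears in at least one document."""
    docs = [set(d.split()) for d in documents]

    def df(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in d for w in words) for d in docs)

    score = 0.0
    for w_i, w_j in combinations(topic_words, 2):
        score += math.log((df(w_i, w_j) + 1) / df(w_j))
    return score

docs = [
    "battery life short",
    "battery life poor battery",
    "camera photo quality",
]
print(umass_coherence(["battery", "life"], docs))    # words that co-occur
print(umass_coherence(["battery", "camera"], docs))  # words that do not
```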
      </sec>
      <sec id="sec-3-3">
        <title>4.4. Visualization</title>
        <p>
          Visualizing clusters makes it convenient for the Business or Product Team to evaluate, explore
and interpret the results of Cluster Analysis [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The webpage lists the top Bigrams corresponding to
each topic or cluster identified by the LDA Topic Modeling, which gives the Product Team a
clear picture of what the customers are saying about their product in the reviews. The
webpage also displays the list of actual customer reviews for deeper analysis, as well as an Inter-topic
Distance Map that helps Data Scientists evaluate the clusters formed. Hence, we
have both unigrams and bigrams for each cluster, which helps the Product Team find
the areas to focus on to improve their product.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>4.5. Classification of New Reviews</title>
        <p>After clustering is finished, the reviews are labeled: each review now has a label that
corresponds to the topic number to which it belongs. To process text tokens,
BERT employs the Transformer encoder architecture, which processes each token in the full context
of all the tokens before and after it. Such models are pre-trained on a large corpus of text before
being fine-tuned for specific NLP tasks.</p>
        <p>
          BERT is an encoder stack of the transformer architecture [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], an encoder-decoder
network that makes use of self-attention on the encoder side and attention on the decoder side.
BERT models also have large feed-forward networks: 768 hidden units in BERT Base and
1024 hidden units in BERT Large. During training, the BERT model takes pairs of
sentences and learns to predict whether the second sentence is the subsequent sentence of the first
in the original text. For 50 percent of the inputs, the second sentence is
the subsequent sentence in the original text; for the other 50 percent, a random
sentence from the corpus is chosen as the second sentence [12].
        </p>
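The overall flow (topic-labeled reviews train a classifier, which then routes new reviews to existing clusters) can be sketched as below. A TF-IDF plus logistic-regression model stands in for the BERT fine-tuning described in the text, and the labeled reviews are made-up assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Reviews labeled with the topic number assigned by the clustering step.
texts = [
    "battery drains too fast", "battery life is terrible",
    "short battery life", "amazing camera quality",
    "camera photos look great", "sharp photos from the camera",
]
topics = [0, 0, 0, 1, 1, 1]

# Lightweight stand-in for a fine-tuned BERT classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, topics)

# A freshly arrived review is routed to one of the existing clusters.
print(clf.predict(["the battery dies within hours"])[0])
```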
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>Finally, after combining the two aspects of Machine Learning, Clustering and
Classification, we visualize the insights to check whether they yield meaningful results.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion and Future plans</title>
      <p>Customers’ reviews are of utmost importance for any firm or organization. The organizations
that act on the feedback given by their customers excel in their domain. Since it is not possible
to go through each and every piece of customer feedback manually, clustering the reviews
is a better way to get insights from them. Topic Modeling can be used to find themes across
the reviews and discover hidden topics, and LDA (Latent Dirichlet Allocation), a “generative
probabilistic model”, is one of the most popular methods of Topic Modeling. After applying the
LDA model, each customer review has the topic or cluster to which it belongs with the
highest probability value. As a result, we have labelled reviews, each of which belongs to one of
the topics or clusters. These clusters are visualized to present them to the Product Team in a
form that is easy to interpret and analyze.</p>
      <p>Once the organization has a significant number of fresh reviews, it may run the clustering
technique again to improve the quality of the topics or clusters, since the more data there
is, the better the model performs. This approach is quite effective from the perspective of an
organization and helps it improve the quality of its products and services by making it
easy to identify actionable areas.</p>
      <p>Another improvement that could be made in the future is to incorporate sentence
embeddings from a model like BERT into the Topic Modeling technique. The vectors from such a model
and from LDA can be combined with some weight or hyperparameter to improve the results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank our college, National Institute of Technology, Kurukshetra for giving us
the platform to express ourselves. Also, we would like to thank our mentor Dr. B.B. Gupta, Asst.
Professor, NIT Kurukshetra.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] S. Kapadia, Towards Data Science, 19 08 2019. [Online]. Available: https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0. [Accessed 07 02 2021].</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Kaggle, "Clustering with Topic Modeling using LDA," Kaggle, 01 09 2020. [Online]. Available: https://www.kaggle.com/panks03/clustering-with-topic-modeling-using-lda. [Accessed 19 03 2021].</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Shahabadi</surname>
            ,
            <given-names>M. S. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tabrizchi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rafsanjani</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>B. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmieri</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>A combination of clustering-based under-sampling with ensemble methods for solving imbalanced class problem in intelligent systems</article-title>
          .
          <source>Technological Forecasting and Social Change</source>
          ,
          <volume>169</volume>
          ,
          <fpage>120796</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Manasrah</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>B. B.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>An optimized service broker routing policy based on differential evolution algorithm in fog/cloud environment</article-title>
          .
          <source>Cluster Computing</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1639</fpage>
          -
          <lpage>1653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Gou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , et al. (
          <year>2017</year>
          ).
          <article-title>Analysis of various security issues and challenges in cloud computing environment: a survey</article-title>
          .
          <source>In Identity Theft: Breakthroughs in Research and Practice</source>
          (pp.
          <fpage>221</fpage>
          -
          <lpage>247</lpage>
          ).
          <source>IGI global.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dahiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gupta</surname>
          </string-name>
          (
          <year>2021</year>
          ),
          <source>Edge Intelligence: A New Emerging Era, Insights2Techinfo</source>
          , pp.
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Dahiya (2021), Integration of Cloud and Fog Computing for Energy Efficient and Scalable Services, Insights2Techinfo, pp. 1.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Mirsadeghi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rafsanjani</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>B. B.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A trust infrastructure based authentication method for clustered vehicular ad hoc networks</article-title>
          .
          <source>Peer-to-Peer Networking and Applications</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Kaggle, "Getting started with Text Preprocessing," Kaggle, 25 03 2019. [Online]. Available: https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing. [Accessed 10 02 2021].</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. A. S. S. Prakash and Hiremath, "Cluster Analysis of Customer Reviews Extracted from Web Pages," Journal of Applied Computer Science &amp; Mathematics, 24 07 2014. [Online]. Available: https://www.researchgate.net/publication/47807593_Cluster_Analysis_of_Customer_Reviews_Extracted_from_Web_Pages. [Accessed 06 02 2021].</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] "A Visual Notebook to Using BERT for the First Time," Google Colab, 28 01 2020. [Online].</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>