Automatic Aspect-Based Sentiment Analysis
        (AABSA) from Customer Reviews

                                Ella Jiaming Xu¹, Bo Tang²,
                                Xiao Liu¹, and Feiyu Xiong²

                     ¹ Stern School of Business, New York University
                            {jx1258, xliu}@stern.nyu.edu
                                   ² Alibaba Group
                     {tangbo.t, feiyu.xfy}@alibaba-inc.com


       Abstract. Online review platforms provide enormous information for
       users to evaluate products and services. However, the sheer volume of
       reviews can create information overload that could increase user search
       costs and cognitive burden. To reduce information overload, in this paper,
       we propose an Automatic Aspect-Based Sentiment Analysis (AABSA)
       model to automatically identify key aspects from Chinese online reviews
       and conduct aspect-based sentiment analysis. We create a hierarchical
       structure of hypernyms and hyponyms, apply deep-learning-based rep-
       resentation learning and clustering to identify aspects that are the core
       content in the reviews, and then calculate the sentiment score of each
       aspect. To evaluate the performance of the identified aspects, we use an
       econometric model to estimate the impact of each aspect on product
       sales. We collaborate with one of Asia’s largest online shopping plat-
       forms and employ the model in its product review tagging system to
       help consumers search for product aspects. Compared with benchmark
       models, our model is both more effective, because it creates a more com-
       prehensive list of aspects that are indicative of customer needs, and more
       efficient because it is fully automated without any human labor cost.

       Keywords: Aspect-Based Sentiment Analysis · Representation Learn-
       ing · Deep Learning · Sentiment Analysis · Econometric Model


1    Introduction

Online reviews are critical for multiple stakeholders. Consumers can obtain rich
information from reviews to evaluate products and services. Firms can leverage
reviews to gain insights on customer needs and opportunities to improve their
products. Despite the enormous informational value provided by reviews, the
sheer volume of reviews has created a problem of information overload. Con-
sumers cannot easily process information of thousands of reviews to understand
the strengths and weaknesses of each product fully. Firms cannot easily glean
insights from immense unstructured review content of their own products and
competitors’. Although review rating, usually on a five-point Likert scale, is a


 Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License
 Attribution 4.0 International (CC BY 4.0). In: N. Chhaya, K. Jaidka, J. Healey, L. H. Ungar, A. Sinha
 (eds.): Proceedings of the 3rd Workshop of Affective Content Analysis, New York, USA, 07-
 FEB-2020, published at http://ceur-ws.org
useful summary statistic, it is a single-dimensional value that fails to capture
the multi-dimensional facets of products.
    To overcome the information overload problem, a few online platforms, such
as Yelp and TripAdvisor, have started to provide aspect-based sentiment scores
on review pages, which help customers select reviews containing the selected
aspects. The current literature also pays growing attention to identifying customer
needs from online reviews [2, 28, 33, 30, 37, 6, 34, 7, 25]. However, both the
platforms and the literature face several fundamental problems:
    First, many previous solutions applied the supervised learning approach to
define aspects manually and then match them to corresponding reviews [6, 2, 34,
33]. This approach has many drawbacks. First of all, e-commerce platforms, such
as Amazon and Alibaba, often cover a wide range of product categories. The as-
pects that are relevant to one product category might be irrelevant for another
category. For example, sound quality is important for the TV category but not for
the sofa category. Therefore, it is time-consuming to identify aspects manually
for each product category. Moreover, e-commerce evolves rapidly. New product
categories constantly arrive, and customer tastes are continuously changing. It is
hard to keep up with the trend and manually identify aspects for each newly
developed category and each new customer need. For example, in 2019, a new
product category, the anti-smoking smart lighter, was introduced, and new aspects
such as social connectedness and windproofing had to be defined and added. Furthermore,
even if defining a set of aspects is feasible, annotating large datasets is highly
demanding on human labor costs and time. Last but not least, with significant
human intervention, uncontrollable bias may arise.
    Second, although some previous papers also proposed automatic aspect
detection, they selected the most
frequently mentioned aspects as the core product aspects [40, 6, 2]. The draw-
back of this approach is that it ignores word similarities. For example, although
hue and color each might not be the most frequently mentioned keyword, they
can be constructed as an important aspect jointly. Some follow-up research ap-
plied word embedding techniques, such as word2vec and wordnet clustering, to
capture word similarities [40, 36, 38]. But these works fail to capture complex
semantic relationships of aspect keywords.
    Third, previous literature assumes that all the aspects are at the same level [6,
2, 34]. However, there are limitations to flattening the aspect structure. Consider
a laptop retailer aiming at improving the quality of laptops. Quality, undeniably,
is an important aspect of a laptop, but it is an abstract aspect that consists
of many sub-aspects like durability and speed. A review without the keyword
“quality,” but with more specific words such as “durability” and “speed” could
also reflect a consumer’s overall sentiment towards the quality of a laptop.
    In summary, the following questions are left unsolved to conduct aspect-based
sentiment analysis:
    1. How can we identify aspects automatically with reduced cost and improved
flexibility?
    2. How can we better capture the semantic relationship among keywords to
construct aspects?
    3. Does the hierarchical structure among aspects exist, and can the hierar-
chical structure improve the comprehensiveness of identified aspects?
    In this paper, we develop the Automatic Aspect-Based Sentiment Analysis
(AABSA) model to extract hierarchical-structured product aspects from online
consumer reviews. Specifically, we provide solutions to the three questions men-
tioned above:
    1. We propose a fully automated aspect-based sentiment analysis model
(AABSA). The model creates aspect-based sentiment scores from online reviews
without any human intervention or domain knowledge. In the AABSA
model, we apply k-means clustering to group sentence embeddings and
select the center words of the clusters as aspects. A prominent advantage of
k-means clustering is that it is an unsupervised learning method, so we
do not need to pre-determine the aspects manually. AABSA automatically
identifies the aspects, the aspect structure, and the number of aspects. No labels
are needed for the learning process; the model is left on its own to find structure
in its input. This saves the time of defining labels and allows us to identify aspects
automatically.
    2. We introduce the Bidirectional Encoder Representations from Transformers
(BERT) model to transform short reviews into sentence embeddings, which we
then cluster [9]. Unlike other recent language representation methods, BERT jointly
considers both the left and right context of words in all layers, which helps us better
capture the semantic relationship among aspects.
    3. We develop a hierarchical aspect structure consisting of hypernym aspects,
which are defined as the core content that can summarize the semantics, and
hyponym aspects, which are defined as the sub-aspects of hypernym aspects.
We first cluster sentence embeddings and identify center words of clusters as
hypernym candidates. We then build a weighted word map from the
synonyms of the hypernym candidates and apply PageRank to it to identify
hyponyms [26]. An essential advantage of the model is that the hypernyms and
hyponyms are not necessarily the words that appear most frequently, but those
that can capture the theme of the entire sentence. The hierarchical structure
significantly increases the comprehensiveness and accuracy of identified aspects.
    In summary, this paper makes several substantive and methodological con-
tributions. We propose an innovative method to identify product aspects by in-
troducing a hierarchical system. We demonstrate three comparative advantages
of the proposed model against benchmark methods: 1) improved comprehensive-
ness, 2) better prediction accuracy on sales, and 3) full automation without time-
consuming hand-coding. The method has been employed in Alibaba’s Chinese
product review tagging system to help consumers search for product aspects.


2     Literature Review
2.1   Aspect Identification
Online review platforms allow customers to express their attitudes towards prod-
ucts and services freely, and customers rely on online reviews to make decisions.
The sheer volume of online reviews makes it difficult for a human to process
and extract all meaningful information. Hence, research on identifying product
aspects from user-generated content and ranking their relative importance has
been prolific in the past few decades [31, 40, 42, 12]. The most common methods
rely on focus groups, experiential interviews, or ethnography as input. Trained
professional analysts then review the input, manually identify customer needs,
remove redundancy, and structure the customer needs [34, 16, 1]. [40] identified
important aspects according to the frequency of each aspect and the influence of
consumers’ opinions on that aspect on their overall opinions, using a shallow dependency
parser. [42] then extended [40]’s paper by performing extensive evaluations on
more products in more diverse domains and more real-world applications. [12]
applied an automatic clustering approach to aspect identification. One common
limitation of these approaches is that they assume that the frequency with which
an aspect appears is positively correlated with its importance. However, high-level
and abstract concepts, such as “quality,” may not appear very frequently in the
reviews, whereas the associated low-level, concrete concepts, such as durability and
conformance, may appear very frequently. The approaches mentioned above
could therefore fail to detect important high-level and abstract aspects. We instead
propose a method that relies on the hierarchical structure between hypernyms
and hyponyms to detect important aspects. Moreover, our method is fully automatic
and incurs no human labor cost.
    In the marketing field, researchers often rely on existing psychological and
economic theory to pre-define a list of aspects and then extract the pre-defined
aspects from user-generated reviews [41, 24, 19, 35, 8, 10, 20]. However, this
approach is theory-driven instead of data-driven and is therefore hard to generalize
across contexts. For example, the “health” aspect that one paper extracted from
weight-loss products might not be relevant for another product category, such
as automobiles. In contrast, we propose a data-driven method that can extract
the most relevant aspects tailored to the specific context. Our method is also
agnostic to domain knowledge and does not rely on human expertise.

2.2   Aspect Sentiment Analysis
Sentiment analysis is a type of subjectivity analysis that aims to identify opin-
ions, emotions, and evaluations expressed in natural language [27]. The main
goal is to predict the sentiment orientation by analyzing opinion words and
expressions and to detect trends. Sentiment analysis plays an important role in
identifying customers’ attitudes towards brands, and recent studies are paying more
attention to developing more fine-grained aspect-based sentiment analysis on
user-generated content. Previously, researchers studied extraction of evaluating
expressions from customer opinions [4, 17, 27, 32, 43, 14]. [14] extracted features
and summarized opinions from consumer reviews by part-of-speech tagging and
built an opinion word list. [4] summarized the sentiment of reviews for a local
service and focused on aspect-based summarization models.
    With the development of machine learning techniques, researchers applied
these advanced techniques to sentiment analysis. [23] introduced support vec-
tor machines (SVMs) and unigram models to sentiment analysis. [22] applied
Naive Bayes to analyze aspect-based sentiment. [18] applied the Maximum En-
tropy (MaxEnt) classification to classify consumer messages into either positive
or negative. [27] researched the performance of various machine learning tech-
niques, including MaxEnt classification, and showed that MaxEnt classification
was powerful with classifying reviews. Researchers then applied deep learning
methods such as XLNet and LSTM to conduct sentiment analysis [39, 11, 15,
29]. In this paper, we tested both MaxEnt classification and fastText and found
that MaxEnt outperforms fastText because the e-commerce platform has created
a rich sentiment vocabulary pertaining to product reviews. The comparison
of MaxEnt and fastText is reported in Table 2 of the Appendices.


3     AABSA Model Framework
In this section, we describe the details of the structure of our aspect-sentiment
analysis model, AABSA. We start with an overview of its framework, which
consists of two main components: aspect identification and sentiment analysis.
We then describe the baseline model used for comparison.

3.1   Aspect-based Sentiment Analysis Problem
The aspect-based sentiment analysis problem is to identify product aspects from
a review document, and the aspects represent the most important customer needs
in the document. Having identified the aspects, we then need to associate a
sentiment score with every aspect. We make two assumptions. First, we assume that
each sentence can be associated with more than one aspect. Second,
we assume that a hierarchical structure exists among aspects. We classify
aspects into hypernyms and hyponyms. Hypernyms are the core content that can
summarize the theme of the content; they are often abstract and involve various
sub-aspects. For example, “battery” is a core theme in the camera category
identified by previous research [2]. However, from the retailer’s perspective,
“battery” alone cannot provide detailed and comprehensive information on how
to improve the battery aspect. We therefore identify the sub-aspects of each
hypernym as its hyponyms. For example, if “battery” is identified as a hypernym,
then “battery life” and “battery production place” are its possible hyponyms.
Hyponyms can provide more precise direction for product improvement.
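
The two-level structure described above can be pictured as a simple mapping from each hypernym to its hyponyms. The sketch below is only an illustration: `ASPECT_HIERARCHY` and `hypernym_of` are hypothetical names that do not appear in the paper, and the entries are taken from the battery example in the text.

```python
# Hypothetical two-level aspect hierarchy: each hypernym maps to its hyponyms.
# The entries come from the battery example in the text.
ASPECT_HIERARCHY = {
    "battery": ["battery life", "battery production place"],
}


def hypernym_of(term, hierarchy=ASPECT_HIERARCHY):
    """Return the hypernym a term belongs to, or None if the term is unknown."""
    for hypernym, hyponyms in hierarchy.items():
        if term == hypernym or term in hyponyms:
            return hypernym
    return None


print(hypernym_of("battery life"))
```

With such a mapping, a review that mentions only a hyponym can still be credited to the more abstract hypernym.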

3.2   AABSA Model Framework
Our model consists of eight steps:
    1. Pre-process reviews. We collected reviews from Alibaba, one of the biggest
e-commerce platforms in Asia. In our research, we analyze reviews for two
product categories: camera and toothbrush. We choose these two categories for
two reasons. First, cameras and toothbrushes are common products that
are widely analyzed in the marketing literature, so we are able to compare our
results with previous works. Second, the camera represents high-end product
categories, whereas the toothbrush represents lower-end, daily-use
products. We can compare the impact of reviews on the sales of both and generate
business insights. We divide the entire review document into short sentences
and identify informative sentences, as defined by the company’s existing
internal rules. For example, the sentence “Very good” is classified as
uninformative, whereas the sentence “The battery can last more than 10 hours” is
classified as informative.
    2. Train word embeddings. The hierarchical aspect structure is based on the
relationships between hypernyms and hyponyms, which are represented by word
similarities. To represent these similarities quantitatively, we convert words
into vectors using the Word2vec algorithm and eliminate the lower-frequency
word in each synonym pair [21].
    3. Train sentence embeddings. In order to measure the similarity between
reviews quantitatively, we convert the 50% most frequent short sentences into
sentence embeddings with BERT [9].
    4. Select hypernym candidates. We assume that a few core words, which
are defined as hypernym candidates, could summarize each sentence. Hence, we
apply k-means clustering to generate semantic categories and select the most
important words, whose accumulated cosine distances to their cluster centers
are the shortest within the clusters, as hypernym candidates. We then filter out
invalid words among the hypernym candidates.
    5. Recall hyponym candidates. We introduce the concept of hyponyms to assist
in the subsequent sentiment analysis step. We select the words closest to the
hypernym candidates in each cluster as hyponym candidates. We then select the
words closest to the hyponym candidates as their subordinates and construct a
weighted word network comprising hypernym candidates, hyponym candidates, and
their subordinates. Finally, we apply PageRank to select hyponym candidates
according to their relative importance.
    6. Merge hypernym and hyponym candidates. We find that there are overlaps
between hypernym and hyponym candidates. To avoid redundancy, we rank all
the hypernyms according to their importance and merge the hypernym and
hyponym candidates to finalize the hypernyms. If a hypernym belongs to several
hyponym sets, we merge the hypernym and its hyponym candidates into
the hyponym set of the highest-ranked hypernym.
    7. Match hypernyms to reviews. We select the words closest to the hypernyms
and hyponyms from the content and then apply regular expression matching
to match them to reviews.
    8. Sentiment analysis. We use Maximum Entropy (MaxEnt) classification to
classify the review sentences associated with each aspect as positive, neutral, or
negative. The sentiment scores of the reviews of each product are aggregated at
the week level.
    The framework and algorithm of the AABSA model are shown in Figure 1 and
Algorithm 1, respectively.
Algorithm 1 Aspect Identification
Input: Reviews $\{R_i\}_{i=1}^{I}$
Output: Hypernym set $H_1$ and hyponym set $H_2$
Learn word vectors from reviews: $\{WV_j\}_{j=1}^{J} = \mathrm{Word2Vec}(\{R_i\}_{i=1}^{I})$
Divide all reviews $\{R_i\}_{i=1}^{I}$ into short reviews $\{SR_i\}_{i=1}^{SI}$
Calculate vector representation of short reviews $\{SRV_i\}_{i=1}^{SI}$ with BERT
Cluster $\{SRV_i\}_{i=1}^{SI}$ into $k$ clusters with k-means clustering
Calculate the center of each cluster $m$: $C_m$
for $m$ in clusters do
    for word $w_j$ in cluster $m$ do
        Calculate the frequency of $w_j$ in cluster $m$: $N_{mj}$
        for instance $n_{w_j}$ of word $w_j$ do
            Calculate the cosine distance between $w_j$ and $C_m$: $D_n(w_j, m)$
        Calculate importance of $w_j$ in $m$: $F_{mj} = \sum_{n_{w_j}=1}^{N_{mj}} D_n(w_j, m)$
    Select word $\hat{w}_j$ with highest $F_{mj}$ as the hypernym in cluster $m$
Form hypernym set $H_1$
for $w_{1i}$ in $H_1$ do
    for $w_{2j}$ in mostSimilarN($w_{1i}$) do
        Add edge($w_{1i}$, $w_{2j}$) (= $D(WV_{1i}, WV_{2j})$) to $Net_N$
        for $w_{3k}$ in mostSimilarN($w_{2j}$) do
            Add edge($w_{2j}$, $w_{3k}$) to $Net_N$
Rank words using WeightedPageRank($Net_N$)
for $h_j \in H_1$ do
    Select TopN words with highest ranking as hyponym set $H_{2j}$
Sort hypernyms by descending importance
for $h_{1j} \in H_1$ do
    for $h_{1i} \in H_1$ and $i > j$ do
        if $h_{1i} \in H_{2j}$ then
            Merge $h_{1i}$ and $H_{2i}$ with $H_{1j}$
Sentiment analysis using MaxEnt


Pre-process reviews In general, online reviews are complex sentences that express
complicated sentiments and contain both informative and uninformative
content [34]. For example, in a review such as “I just got this camera
today, and it looks fantastic but it’s too heavy for me!”, the first clause is
uninformative since it is irrelevant to the camera’s aspects, while the remaining
clauses describe the customer’s positive attitude towards its appearance and negative
attitude towards its weight. To better identify sentiments and informative
content, we separate the original comments into single sentences and then
automatically eliminate uninformative sentences by regular expression matching
against predefined rules. We then automatically remove stop-words,
numbers, brand names, and punctuation.
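
The pre-processing just described can be sketched as follows. This is a minimal illustration and not the platform's implementation: the actual rules are internal to the company and the reviews are in Chinese, so the patterns, stop-word list, brand list, and all function names below are hypothetical stand-ins applied to the English example from the text.

```python
import re

# Hypothetical stand-ins for the platform's internal rules and word lists.
UNINFORMATIVE_PATTERNS = [r"^very good$", r"^i just got\b"]
STOP_WORDS = {"the", "a", "it", "it's", "is", "and", "but", "for", "me", "too"}
BRAND_NAMES = {"acme"}


def split_into_clauses(review):
    """Split a review at punctuation and at the conjunctions 'and' / 'but'."""
    parts = re.split(r"[,.!?;]+|\b(?:and|but)\b", review)
    return [part.strip() for part in parts if part.strip()]


def is_informative(clause):
    """A clause is informative if no uninformative pattern matches it."""
    text = clause.lower()
    return not any(re.search(p, text) for p in UNINFORMATIVE_PATTERNS)


def clean(clause):
    """Lower-case, drop numbers and punctuation, then drop stop words and brands."""
    tokens = re.findall(r"[a-z']+", clause.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in BRAND_NAMES]


def preprocess(review):
    """Return the cleaned token list of every informative clause in a review."""
    return [clean(c) for c in split_into_clauses(review) if is_informative(c)]


example = "I just got this camera today, and it looks fantastic but it's too heavy for me!"
print(preprocess(example))
```

In this sketch the first clause is dropped as uninformative, and the two remaining clauses are kept as separate short sentences so that each can later receive its own sentiment.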

[Figure 1: flowchart of the AABSA pipeline. Input: Raw Consumer Reviews ->
Pre-process (Parsing) -> Train Word Embeddings (Word2vec) and Train Sentence
Embeddings (BERT) -> Select Hypernym Candidates (K-means clustering) ->
Recall Hyponym Candidates (PageRank) -> Merge Hypernym and Hyponym
Candidates -> Match Hypernyms and Hyponyms to Reviews -> Sentiment
Analysis (MaxEnt) -> Output: Aspects and sentiment score]

                           Fig. 1. AABSA Framework

Train word embeddings To measure similarities and identify synonyms
quantitatively, we need to transform words into vectors with word embedding.
Word embedding is a representation of document vocabulary that utilizes the
context of a word and captures semantic and syntactic similarity and word
relationships. With word embedding, words used in similar contexts have similar
representations, and the cosine similarity between word vectors quantitatively
represents the similarity between words. We apply a skip-gram word2vec
model to train word embeddings [21]. Skip-gram takes as its input a large corpus
of text and produces a vector space, typically of several hundred dimensions,
with each unique word in the corpus being assigned a corresponding vector in
the space [21].
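
To make the role of cosine similarity concrete, the sketch below ranks neighbours of a word by the cosine similarity of their vectors. The three-dimensional vectors in `TOY_VECTORS` are invented for illustration (trained word2vec vectors typically have several hundred dimensions), and `most_similar` is a hypothetical helper, not a library call.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors; real embeddings would come from a trained model.
TOY_VECTORS = {
    "hue": [0.9, 0.1, 0.0],
    "color": [0.8, 0.2, 0.1],
    "battery": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors=TOY_VECTORS, top_n=1):
    """Return the top_n (word, similarity) pairs closest to the given word."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


print(most_similar("hue"))
```

Under these toy vectors, "hue" and "color" come out as near-synonyms even though neither needs to be frequent on its own, which is the property the aspect construction relies on.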


Train Sentence Embedding (with BERT) Sentence embeddings are use-
ful for keyword expansion and are used to identify the relationship between
words and sentences. In order to quantify the relationship between the sentences
and discover the latent customer needs, we formulate sentence embeddings and
extract keywords from the sentence clusters afterward. Consider the following
examples:
    “The toothbrush hair is super soft, and it really protects my son’s teeth!”
    “I am really disappointed that its toothbrush hair too soft, and it cannot
clean my teeth.”
    These two sentences are different expressions of opposite attitudes towards
the same product aspect, but they are similar in the semantic structure. In
earlier works, researchers often created sentence embeddings by directly taking
the average of word embeddings, which ignores the semantic and contextual
relationships between sentences [9]. For example, word2vec would produce the
same word embedding for the word “soft” in both sentences. Language models
that train word embeddings use only directionless or unidirectional context and
map each word to a fixed representation regardless of the context in which
the word appears. Yet sentences that discuss the same aspect in a similar semantic
structure may have different meanings. In this paper, we apply BERT (Bidirectional
Encoder Representations from Transformers) to obtain sentence embeddings.
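
The limitation of averaging static word vectors can be seen in a few lines. In the sketch below, `STATIC_VECTORS` holds invented two-dimensional vectors and `average_embedding` is a hypothetical helper; two token sequences with the same words in a different order, and therefore a different meaning, receive exactly the same sentence vector.

```python
# Invented static word vectors; each word has one fixed vector,
# regardless of the sentence it appears in.
STATIC_VECTORS = {
    "soft": [1.0, 0.0],
    "not": [0.0, 1.0],
    "hard": [-1.0, 0.0],
    "but": [0.0, 0.0],
}


def average_embedding(tokens, vectors=STATIC_VECTORS):
    """Sentence embedding as the plain mean of static word vectors."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for token in tokens:
        for k, value in enumerate(vectors[token]):
            total[k] += value
    return [value / len(tokens) for value in total]


sentence_a = ["soft", "but", "not", "hard"]
sentence_b = ["hard", "but", "not", "soft"]
print(average_embedding(sentence_a) == average_embedding(sentence_b))
```

A contextual encoder such as BERT avoids this collapse because the representation of each token depends on the tokens around it.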
    BERT is a deep learning-based model in natural language processing, and the
architecture is a multi-layer bidirectional Transformer encoder. It is designed to
learn deep bidirectional representations from the unlabeled text by cooperatively
considering both sides of context. BERT trains contextual representations on
text corpus and produces word representations that are dynamically informed by
the words around them. In contrast to previous efforts that read text sequentially
either from left to right or right to left, BERT introduces more comprehensive
and global word relationships to the word representation. BERT is bidirectional,
generalizable, high-performing, and universal. Since the pre-training procedure
is comparatively hardware-demanding and time-consuming, we use a
pre-trained model released with BERT, Chinese L-12 H-768 A-12, which was trained
by Google on Chinese Wikipedia data.


Select Hypernym Candidates After training sentence embeddings, we clus-
ter them with the k-means clustering algorithm. We assume that sentence vectors
within a cluster describe the same customer needs, and a limited number of core
words could summarize the opinions of each cluster. To ensure variety and
comprehensiveness, we select the non-repeated central words of each of the 10
largest clusters as hypernym candidates. We determine the optimal number of
clusters using both silhouette coefficients and BIC.
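
For readers unfamiliar with the clustering step, the following is a minimal sketch of Lloyd's k-means algorithm on toy two-dimensional points. It rests on several assumptions that the paper does not state: squared Euclidean distance, initial centers taken from the first k points, and a fixed number of iterations. Selecting k with silhouette coefficients and BIC is not shown, and `kmeans` and `squared_distance` are hypothetical names.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


toy_points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centers = kmeans(toy_points, k=2)
print(labels)
```

In the model itself the points would be BERT sentence embeddings rather than toy coordinates, and each resulting cluster is treated as one group of sentences describing the same customer need.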
    The process of selecting hypernym candidates is as follows. Denote the
embedding of sentence $i$ in cluster $m$ as $s_{mi}$ and word $j$ as $w_j$. The
indicator $a_{mij}$ is 1 if $w_j$ appears in $s_{mi}$ and 0 otherwise. The number
of sentence embeddings in cluster $m$ is $N_m$, and the number of appearances of
$w_j$ in $s_{mi}$ is $n_{mij}$. We first calculate the cosine distance between
$s_{mi}$ and its cluster center, $d_{mi}$, and the distance is proportional to its
representativeness. We sum up the cosine similarities between $w_j$ and the
cluster center, once for every appearance of $w_j$, as its importance in cluster $m$:

$$F_{mj} = \sum_{i=1}^{N_m} \sum_{n=1}^{n_{mij}} d_{mi}\, a_{mij} \qquad (1)$$



    The cosine distance also represents the similarity between the words in a
sentence and its cluster center. Since words repeatedly appear in different sentences,
we sum up the cosine distances between words and their cluster center as their
final distances. The words with the largest similarity within each cluster are the
core words, and we select the top two words from each cluster as hypernym
candidates. The whole process of selecting hypernym candidates is shown
in Figure 2. Hypernym candidates are then finalized after eliminating repeated
candidates chosen from all clusters.
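
The scoring in Equation (1) can be sketched for a single cluster as below. The text uses "cosine distance" and "cosine similarity" interchangeably; this sketch assumes that a larger value means a sentence is closer to the cluster center, that is, it uses cosine similarity. The function names and the toy embeddings are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def word_importance(sentences, embeddings, center):
    """Accumulate, for each word, the sentence-to-center similarity d_mi
    once per appearance of the word in the cluster."""
    scores = defaultdict(float)
    for tokens, embedding in zip(sentences, embeddings):
        d_mi = cosine_similarity(embedding, center)
        for token in tokens:  # repeated words accumulate repeatedly
            scores[token] += d_mi
    return dict(scores)


def top_words(scores, n=2):
    """Pick the n highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


cluster_sentences = [["battery", "lasts"], ["battery", "charges"], ["screen"]]
cluster_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
cluster_center = [1.0, 0.0]
scores = word_importance(cluster_sentences, cluster_embeddings, cluster_center)
print(top_words(scores))
```

A word therefore scores highly when it appears often in sentences that lie near the center of the cluster, not merely when it is frequent overall.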

[Figure 2: diagram. Sentences 1..N are each assigned a cluster ID (CID 1..M);
the sentences of each cluster (Sentence_11, ..., Sentence_MxM) yield the
hypernym candidate of that cluster (Hypernym Candidate 1..M)]

                                      Fig. 2. Select Hypernyms


Recall Hyponym Candidates (with PageRank) As we mentioned earlier,
hyponyms provide retailers with more detailed and granular information about
a product. Another purpose of introducing hyponyms is that they also help
match hypernyms to more related reviews. We select the closest words to each
hypernym candidate as second-order related words and again select the closest
words to each second-order related word as third-order ones. Then we construct
a weighted directed wordnet in which the weights are determined by the distances
between pre-trained word embeddings. The process of building the wordnet is shown in
Figure 3. Then we apply the PageRank algorithm to generate the final hyponyms
according to their relative closeness and importance. PageRank is an iterative
algorithm that determines the importance of a web page based on the importance
of its parent pages [26, 5, 13]. The core idea of PageRank is that the rank of an
element is divided evenly among its forward links to contribute to the ranks of
the elements they point to. After the PageRank of each element is obtained, we
select the words with the highest PageRank for each hypernym as hyponym candidates.
Compared with simply selecting the words closest to the hypernyms, the
main advantage of using PageRank is that it uses the entire graph rather than
a small subset to estimate relative relationships between words. As a result, it
increases the diversity, reliability, and richness of the identified aspects.
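
The construction of the word network can be sketched as a two-level expansion from each hypernym candidate. The sketch assumes a precomputed neighbour table (`toy_neighbours`, invented here) that lists, for each word, its closest words and an edge weight; in the model these would come from the pre-trained word embeddings. `build_word_network` is a hypothetical name.

```python
def build_word_network(hypernyms, neighbours, top_n=2):
    """Return weighted directed edges {(source, target): weight}.

    neighbours maps a word to a list of (related_word, weight) pairs,
    sorted from closest to farthest.
    """
    edges = {}
    for first in hypernyms:
        # Second-order related words: closest words to the hypernym candidate.
        for second, weight_12 in neighbours.get(first, [])[:top_n]:
            edges[(first, second)] = weight_12
            # Third-order related words: closest words to each second-order word.
            for third, weight_23 in neighbours.get(second, [])[:top_n]:
                edges[(second, third)] = weight_23
    return edges


# Invented neighbour table for illustration only.
toy_neighbours = {
    "battery": [("battery life", 0.9), ("charger", 0.7), ("power", 0.6)],
    "battery life": [("endurance", 0.8)],
    "charger": [("cable", 0.5)],
}
network = build_word_network(["battery"], toy_neighbours, top_n=2)
print(sorted(network))
```

The resulting edge dictionary is the graph on which a ranking algorithm such as PageRank can then be run.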
    The procedure for calculating PageRank is as follows. Let $F_i$ be the
set of words that word $i$ points to and $B_i$ be the set of words that point to
$i$. Let $N_i = |F_i|$ be the number of links from $i$, and let $c$ be a factor used
for normalization. The PageRank of $i$ is then

$$R(i) = c \sum_{v \in B_i} \frac{R(v)}{N_v} \qquad (2)$$
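
Equation (2) can be iterated directly, as in the sketch below. This is the simplified, unweighted form of the formula: it omits the damping factor found in most practical PageRank implementations and the edge weights of the weighted variant named in Algorithm 1. `simple_pagerank` and the toy edge list are hypothetical.

```python
def simple_pagerank(edges, n_iter=100):
    """Iterate R(i) = c * sum_{v in B_i} R(v) / N_v on a directed graph.

    edges is a list of (source, target) pairs; c rescales the ranks so
    that they sum to one after every iteration.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {n: [t for s, t in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(n_iter):
        new_rank = {n: 0.0 for n in nodes}
        for v in nodes:
            for target in out_links[v]:
                # The rank of v is divided evenly among its forward links.
                new_rank[target] += rank[v] / len(out_links[v])
        total = sum(new_rank.values())
        if total == 0:
            break  # no rank left to propagate (graph without cycles)
        rank = {n: r / total for n, r in new_rank.items()}
    return rank


toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = simple_pagerank(toy_edges)
print(ranks)
```

On this toy graph the ranks settle at 0.4 for "a" and "c" and 0.2 for "b", since "b" receives only half of the rank of "a".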


[Figure 3: diagram. A hypernym candidate (node 1) links to its 2nd-order related
words (nodes 2), each of which links to its 3rd-order related words (nodes 3)]

                                   Fig. 3. Build wordnet


Merge Hypernyms and Hyponyms After finalizing the recall process, we
notice that there are overlaps between hypernyms and hyponyms. For example,
hypernym candidate A may also be a hyponym of hypernym candidate B. Overlaps
would cause redundancy and confusion when mapping sentiment to aspects in
the following steps. In order to further improve the precision of the constructed
aspect lexicon and account for the internal similarity between hypernym
candidates, we merge hypernym and hyponym candidates.

Match Hypernyms to Reviews The next step is to match hypernyms to
the reviews that discuss the corresponding aspects. Using the pre-trained word
embeddings, we select the words closest to the hyponyms, and then match
hypernyms and hyponyms to reviews with regular expression matching.
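The matching step can be sketched with one compiled regular expression per hypernym. This is an illustrative sketch only: the lexicon entries are hypothetical English stand-ins (the paper works with Chinese reviews), and the helper names `compile_patterns` and `match_aspects` are ours.

```python
import re

# Hypothetical aspect lexicon: hypernym -> hyponyms.
lexicon = {
    "price": ["value", "half-price"],
    "battery": ["battery life", "charging"],
}


def compile_patterns(lexicon):
    """Build one alternation pattern per hypernym, longest terms first."""
    patterns = {}
    for hypernym, hyponyms in lexicon.items():
        terms = sorted([hypernym, *hyponyms], key=len, reverse=True)
        patterns[hypernym] = re.compile("|".join(re.escape(t) for t in terms))
    return patterns


def match_aspects(review, patterns):
    """Return {hypernym: matched terms} for every aspect found in the review."""
    return {h: p.findall(review) for h, p in patterns.items() if p.search(review)}


patterns = compile_patterns(lexicon)
matches = match_aspects("Great battery life, and half-price too", patterns)
```

Sorting the terms by length makes the longer phrase ("battery life") win over its prefix ("battery"), so each mention is attributed once.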

Sentiment Analysis In the next step of the AABSA model, we identify
the sentiment evaluation of the identified aspects. We apply the MaxEnt
classification algorithm to this sentiment classification problem. MaxEnt
models are feature-based models that can address both feature selection and
model selection. MaxEnt classification has proved effective in a number of
natural language processing applications [27, 3]. The goal is to assign to a
given document d the class c that maximizes P(c|d), which is calculated as
follows:
\[ P_{ME}(c \mid d) = \frac{1}{Z(d)} \exp\Big( \sum_i \lambda_{i,c} F_{i,c}(d, c) \Big) \tag{3} \]

where $Z(d)$ is a normalization function and $F_{i,c}$ is a feature function
for aspect $f_i$ and class $c$: $F_{i,c}(d, c') = 1$ if $n_i(d) > 0$ and
$c' = c$, and 0 otherwise, where $n_i(d)$ is the number of times $f_i$ occurs
in $d$. The parameter $\lambda_{i,c}$ is an aspect weight; a large
$\lambda_{i,c}$ means that $f_i$ is considered a strong indicator for class $c$.
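To make Eq. (3) concrete, the sketch below evaluates the MaxEnt class probabilities for one document from a given set of weights. The weights are hypothetical numbers chosen for illustration, not trained parameters, and `maxent_probs` is our own helper name; the sketch covers only the scoring step, not parameter estimation.

```python
import math


def maxent_probs(doc_tokens, classes, weights):
    """Evaluate P_ME(c | d) for every class c.

    `weights[(feature, c)]` plays the role of lambda_{i,c}. Because the feature
    function is 1 only when the feature occurs in d, the inner sum reduces to
    the weights of the features that are present in the document.
    """
    present = set(doc_tokens)
    scores = {c: sum(weights.get((f, c), 0.0) for f in present) for c in classes}
    z = sum(math.exp(s) for s in scores.values())  # Z(d)
    return {c: math.exp(s) / z for c, s in scores.items()}


# Hypothetical weights, for illustration only.
weights = {
    ("clear", "positive"): 1.2,
    ("clear", "negative"): -0.4,
    ("blurry", "negative"): 1.5,
}
probs = maxent_probs(["photo", "clear"], ["positive", "negative"], weights)
```

With these weights, the token "clear" pushes the document toward the positive class, while "photo" has no weight and therefore no effect.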
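    Now, each review can be represented as a vector consisting of aspects and
sentiment evaluations. We measure the overall sentiment evaluation of aspect $i$
in week $t$ as:

\[ \mathit{sentiment}_{it} = \frac{\sum_{j=1}^{n_{it}} \frac{\sum_{k=1}^{m_{ijt}} \mathit{sentiment}_{kjt}}{m_{ijt}}}{n_{it}} \tag{4} \]

where $n_{it}$ is the number of reviews that mention aspect $i$ in week $t$,
$m_{ijt}$ is the number of times aspect $i$ appears in review $j$ in week $t$,
and $\mathit{sentiment}_{kjt}$ is the sentiment of the $k$-th such mention.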
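Eq. (4) is a mean of per-review means, which the following sketch spells out. The numeric coding of sentiment (+1 positive, 0 neutral, -1 negative) is our assumption for illustration, and `aspect_sentiment` is a hypothetical helper name.

```python
def aspect_sentiment(mention_scores_by_review):
    """Average the per-review mean sentiment of one aspect in one week.

    Each inner list holds the sentiment scores of the m_ijt mentions of the
    aspect in one review; the outer list covers the n_it reviews of the week.
    """
    per_review = [sum(scores) / len(scores) for scores in mention_scores_by_review if scores]
    if not per_review:
        return None  # the aspect was not mentioned this week
    return sum(per_review) / len(per_review)


# Three reviews mention the aspect; assumed coding: +1, 0, -1.
weekly_score = aspect_sentiment([[1, 1], [-1], [1, 0]])
```

Averaging within each review first keeps a single long review with many mentions from dominating the weekly score.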


3.3   Baseline Model

The baseline model, developed by [2], derives product aspects by mining
consumer reviews. It consists of four main steps: pre-process content, eliminate
synonyms, obtain core word candidates, and select hypernyms. After splitting
reviews into sentences and removing stopwords, they calculate the TF-IDF value
of each word and convert words into one-hot vectors with the TF-IDF values of
context words. They then cluster the word vectors with k-means clustering and
choose the center words of the clusters as hypernyms. The major differences
between the baseline and AABSA are the training process of the word vectors and
the use of a hierarchical structure of aspects. The architecture of the baseline
model is shown in Figure 4.
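The core of the baseline pipeline (TF-IDF word vectors followed by k-means and selection of center words) can be sketched as below. This is a simplified reading rather than the implementation of [2]: we assume each word is represented by its TF-IDF weight in every sentence, initialize k-means deterministically from the first k words, and use hypothetical tokenized sentences; all function names are ours.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Represent each word by its TF-IDF weight in every (non-empty) sentence."""
    n = len(sentences)
    doc_freq = Counter(word for s in sentences for word in set(s))
    return {
        word: [s.count(word) / len(s) * math.log(n / df) for s in sentences]
        for word, df in doc_freq.items()
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_center_words(vectors, k, iterations=20):
    """Cluster the word vectors and return {center word: cluster members}."""
    words = sorted(vectors)
    centroids = [vectors[w] for w in words[:k]]  # deterministic initialization
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for w in words:
            nearest = min(
                range(k), key=lambda j: squared_distance(vectors[w], centroids[j])
            )
            clusters[nearest].append(w)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                columns = zip(*(vectors[m] for m in members))
                centroids[j] = [sum(col) / len(members) for col in columns]
    # The center word of a cluster is the member closest to its centroid.
    return {
        min(members, key=lambda w: squared_distance(vectors[w], centroids[j])): members
        for j, members in enumerate(clusters)
        if members
    }


# Hypothetical pre-processed sentences (tokenized, stopwords removed).
sentences = [
    ["price", "cheap"],
    ["price", "discount"],
    ["battery", "life"],
    ["battery", "charge"],
]
vectors = tfidf_vectors(sentences)
centers = kmeans_center_words(vectors, k=2)
```

The returned keys are the candidate core words; in this flat scheme every other word is only a cluster member, which is the point where a hierarchy of hypernyms and hyponyms differs.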


      Input: Raw Consumer Reviews -> Pre-process (Parsing) -> Eliminate Synonyms
      -> Build Word Vectors (TF-IDF) -> Select Core Words (K-means clustering)
      -> Sentiment Analysis (MaxEnt) -> Output: Aspects and Sentiment Score

                        Fig. 4. Baseline Model Framework

4     Empirical Applications

In this section, we evaluate the AABSA model with review data from the
product categories “Toothbrush” and “Camera” provided by Alibaba. Section 4.1
describes our data set, and Section 4.2 describes the aspects identified with
AABSA.


4.1   Data

Alibaba Group, launched in China in 1999, is one of the largest e-commerce
companies in Asia. It is commonly referred to as the “Chinese Amazon.” As of
June 2019, Alibaba had 755 million active users in more than 200 countries. Its
three largest digital shopping platforms, Alibaba, Taobao, and Tmall, focus on
B2B, C2C, and B2C business, respectively. Since 2010, Alibaba has run sales on
Singles' Day in November and during the Spring Festival in January or February.
In our work, we use panel data from 20 weeks between March and July to avoid
fluctuations caused by these sales events. For each item, we observe reviews,
ratings, weekly sales, and essential attributes (e.g., price, weight,
popularity), which are defined by retailers when the products are launched. The
full data set consists of 295,628 reviews of 13,944 camera products and
18,550,956 reviews of 147,337 toothbrush products. In the pre-processing step,
we use the reviews to build a vocabulary of nouns from which we select hypernyms
and hyponyms. In the sentiment analysis step, we hired human taggers to classify
aspect sentiments into three categories: positive, neutral, and negative.


4.2   Selecting Aspects

Tables 1 and 2 list the top 10 hypernyms and 50 hyponyms from camera and
toothbrush reviews. We find that the aspects obtained from our model provide
more detailed and comprehensive information on product aspects and customer
needs. Each aspect captured by the AABSA model is specific enough to give firms
clear guidance for product improvement. For example, a positive “pixel” aspect
indicates that the photos taken by the camera are clear. In contrast, some words
obtained by the baseline model, such as “cell phone” and “camera,” are broadly
defined aspects, and it is hard for firms to make specific improvements given
this information.
    To make an apples-to-apples comparison, we select the 10 most frequently
mentioned aspects among those identified by the baseline model. They are shown
in Table 3.


5     Experimental Results

We describe two main sets of results: i) a performance comparison of our AABSA
model and the benchmark model and ii) the marketing insights. First, we select
the 10 most popular and frequent aspects from all hypernyms and hyponyms for
AABSA and the baseline model. The aspects are shown in Table 4.
                    Table 1. Hypernyms and Hyponyms of Cameras

                hypernyms  hyponyms
                price      unworthy, value, half-price, replacement,
                           incredible
                pixel      effect, clear, figure, recorder,
                           sense of camera
                function   advancement, night-vision, fun,
                           stabilization, fish-eye
                packaging  beautiful packaging, delicate packaging,
                           solid packaging, thickness, protection
                outlook    chic, draft, specialty, delicacy,
                           eye catching
                efficacy   outstanding, slow motion, movie,
                           sense of color, flashlight
                photo      outstanding, clean, sense of color,
                           color tune, high definition
                battery    charging battery, battery life, duration,
                           camera battery, forbidden
                hue        brightness, gradation, sense of color,
                           beauty, charm
                color      outlook color, fashion, delicacy, red, pink

    Accuracy To compare the performance of AABSA and the benchmark
model, we create an econometric model to estimate the impact of each aspect
on product sales. The intuition is that if the identified aspects are more
useful for consumers and firms, they should be better predictors of product
sales. We calculate the percentage of positive reviews of an aspect in the past
180 days as of week t as the support rate of the aspect, and then use the
support rate to predict product sales in week t in a linear regression. In our
model, the selected aspects are 9 hypernyms and 1 hyponym (“offline,” a hyponym
of “price”). We then report the performance of our model and the baseline model
in terms of sales prediction accuracy and analyze the prediction power of each
aspect. The regression result for cameras is shown in Table 5. The first column
reports the estimates from AABSA, and the second column reports the estimates
from the baseline model.
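The support rate used as the regressor can be sketched as follows. The record layout, the helper name `support_rate`, and the example reviews are hypothetical; we also assume that the window covers the 180 days immediately before the first day of week t, which the text does not pin down.

```python
from datetime import date, timedelta


def support_rate(reviews, aspect, week_start, window_days=180):
    """Share of positive reviews among those mentioning `aspect` in the window."""
    # Assumption: the window is [week_start - window_days, week_start).
    window_start = week_start - timedelta(days=window_days)
    labels = [
        r["sentiment"]
        for r in reviews
        if r["aspect"] == aspect and window_start <= r["date"] < week_start
    ]
    if not labels:
        return None  # no review mentioned the aspect in the window
    return sum(label == "positive" for label in labels) / len(labels)


# Hypothetical review records, one per (review, aspect) pair.
reviews = [
    {"date": date(2019, 5, 1), "aspect": "pixel", "sentiment": "positive"},
    {"date": date(2019, 5, 20), "aspect": "pixel", "sentiment": "negative"},
    {"date": date(2019, 6, 2), "aspect": "pixel", "sentiment": "positive"},
    {"date": date(2018, 6, 2), "aspect": "pixel", "sentiment": "positive"},  # too old
    {"date": date(2019, 6, 3), "aspect": "battery", "sentiment": "negative"},
]
rate = support_rate(reviews, "pixel", week_start=date(2019, 6, 10))
```

Here two of the three in-window "pixel" reviews are positive, so the support rate is 2/3; one such value per aspect and week would then enter the linear regression on weekly sales.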
    We can make several inferences from the regression coefficients. First, in
our model, the coefficients for every aspect are significant, and the adjusted
R-squared (0.294) is 5 percentage points higher than that of the baseline model
(0.244). Second, we find that while positive reviews on most aspects have
positive effects on sales, positive reviews on offline stores have negative
effects on sales. One plausible explanation is that offline and online stores
compete with each other, so customers tend to switch to offline stores if they
read positive reviews about them on online retailing platforms.
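For readers who want to reproduce the comparison metric, the adjusted R-squared follows from the ordinary R-squared by the standard degrees-of-freedom correction. The sample size and R-squared in the example are hypothetical, since the text does not report the number of observations; only the number of predictors (10 aspects) is taken from the setup above.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: penalize R^2 for the number of regressors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values: R^2 = 0.30 with 1001 observations and 10 aspect regressors.
adj = adjusted_r_squared(0.30, n_obs=1001, n_predictors=10)
```

Because both models use the same number of aspect regressors, the adjustment penalizes them equally, so a gap in adjusted R-squared reflects a gap in explained variance.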
    The regression result for toothbrushes is shown in Table 6. In the
toothbrush category, the coefficients are all significant, and our model again
outperforms the baseline model, by about 2 percentage points of adjusted
R-squared (0.186 versus 0.164). However, the improvement is smaller than in the
camera category. A plausible explanation is that toothbrushes are daily
necessities and much cheaper than cameras. As a result, consumers spend less
time reading textual reviews when they purchase toothbrushes, which weakens the
sales prediction power of reviews.
               Table 2. Hypernyms and Hyponyms of Toothbrushes

               hypernyms   hyponyms
               price       offline, average price, market price,
                           value package, retail store
               brush head  children, easy to clean, gum pain, soft,
                           toughness
               package     box, bag, small bag, foam bag, hole
               smell       chocolate, juice, orange, mellow, fragrant
               service     sincere
               attitude    passionate
               efficacy    outstanding, white, enhance, disease,
                           stain removal
               brush hair  thick, soft, weak, toughness, plentiful
               gift        floss, pencil sharpener, color pen, origami,
                           case
               color       brown, pink, beige, purple, green


             Table 3. Top 10 Aspects Identified by the Baseline Model

                Category    Aspects
                Camera      clear, price, item, photo, satisfaction,
                            efficacy, camera, delivery, style, cell phone
                Toothbrush  affordable, discount, offline, satisfaction,
                            cheap, easy to use, delivery, purchase,
                            strength, praise

    Comprehensiveness We then compare the comprehensiveness of the
aspects identified by our model to develop some intuition of what drives the
performance discrepancy. [34] identified 6 primary customer needs and 22
secondary customer needs of oral care products with a machine-learning hybrid
method and then classified the needs into the primary group and the secondary
group. We compare the toothbrush aspects extracted from the AABSA model with
the aspects extracted from [34]'s model. The comparison is listed in Table 1 in
the Appendix.
    In [34]'s results, “feel clean and fresh” captures the customer's own oral
feeling during and after using oral care products. In AABSA's results, the
“easy to clean” aspect likewise describes the customer's impression of oral
smell after brushing, and “toothbrush hair” and “toothbrush head” capture
comfort while using the toothbrush. “Strong teeth and gums” describes
preventing gingivitis and protecting the gums, for which we identified “gum
pain.” “Product efficacy” describes the efficacy of oral care products and
focuses on a more subjective aspect; it matches “efficacy” in our model, which
reflects the effect of using the toothbrush. “Convenience” describes how
conveniently the oral product achieves its cleaning purpose, and AABSA
identified “easy to clean,” which also describes the toothbrush's ability to
clean teeth. Finally, “Shopping/product choice” describes the competitiveness
between brands. From the above results, we conclude that our model can create a
more comprehensive list of aspects than [34].


                  Table 4. Top 10 Aspects Identified by AABSA

                Product Category  Aspects
                Camera            price, price offline, pixel,
                                  features, package, exterior,
                                  efficacy, photo, battery, color
                Toothbrush        price, brush head, package, smell,
                                  service, attitude, efficacy,
                                  brush hair, gift, color


                         Table 5. Estimation Results of Cameras

                      AABSA model                  Baseline model
                 price           0.050 *       clear           0.110 ***
                                (0.023)                       (0.030)
                 price offline  -0.062 *       price           0.198 ***
                                (0.027)                       (0.020)
                 pixel           0.127 ***     item            0.340 ***
                                (0.016)                       (0.019)
                 function        0.144 ***     photo           0.101 ***
                                (0.020)                       (0.021)
                 package         0.209 ***     satisfaction    0.002
                                (0.016)                       (0.057)
                 exterior        0.307 ***     efficacy        0.140 ***
                                (0.021)                       (0.019)
                 efficacy        0.215 ***     camera          0.289 ***
                                (0.017)                       (0.019)
                 photo           0.166 ***     delivery       -0.162 ***
                                (0.019)                       (0.029)
                 battery         0.202 ***     style           0.064 *
                                (0.022)                       (0.030)
                 color           0.215 ***     cell phone      0.219 ***
                                (0.023)                       (0.244)
                 Adjusted R-squared: 0.294     Adjusted R-squared: 0.244


6    Discussions
In this paper, we propose an innovative method to identify product aspects by
introducing a hierarchical system. Compared with previous aspect identification
and sentiment analysis models, the AABSA model improves comprehensiveness and
sales prediction accuracy, and its aspect identification is fully automatic,
with no time-consuming hand-coding. The method has been adopted by Alibaba's
product review tagging system to help consumers search for product aspects.


                    Table 6. Estimation Results of Toothbrushes

                      AABSA model                  Baseline model
                 price           0.060 ***     affordable      0.038 ***
                                (0.009)                       (0.006)
                 brush head      0.379 ***     discount        0.266 ***
                                (0.019)                       (0.007)
                 package         0.544 ***     offline        -0.354 ***
                                (0.010)                       (0.010)
                 smell           0.726 ***     satisfaction    0.381 ***
                                (0.011)                       (0.006)
                 service         0.087 ***     cheap           0.316 ***
                                (0.008)                       (0.007)
                 attitude        0.773 ***     easy to use     0.246 ***
                                (0.007)                       (0.008)
                 efficacy        0.381 ***     delivery        0.073 ***
                                (0.010)                       (0.007)
                 brush hair      0.603 ***     purchase        0.376 ***
                                (0.033)                       (0.006)
                 gift            0.620 ***     strength        0.444 ***
                                (0.010)                       (0.009)
                 color           0.659 ***     praise         -0.212 ***
                                (0.013)                       (0.010)
                 Adjusted R-squared: 0.186     Adjusted R-squared: 0.164


References
 1. Alam, I., Perry, C.: A customer-oriented new service development process. Journal
    of services Marketing 16(6), 515–534 (2002)

 2. Archak, N., Ghose, A., Ipeirotis, P.G.: Show me the money!: deriving the pricing
    power of product features by mining consumer reviews. In: Proceedings of the 13th
    ACM SIGKDD international conference on Knowledge discovery and data mining.
    pp. 56–65. ACM (2007)
 3. Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to
    natural language processing. Computational linguistics 22(1), 39–71 (1996)
 4. Blair-Goldensohn, S., Hannan, K., McDonald, R., Neylon, T., Reis, G., Reynar,
    J.: Building a sentiment summarizer for local service reviews (2008)
 5. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine.
    Computer networks and ISDN systems 30(1-7), 107–117 (1998)
 6. Chakraborty, I., Kim, M., Sudhir, K.: Attribute sentiment scoring with online text
    reviews: Accounting for language structure and attribute self-selection (2019)
 7. Che, W., Zhao, Y., Guo, H., Su, Z., Liu, T.: Sentence compression for aspect-
    based sentiment analysis. IEEE/ACM Transactions on audio, speech, and language
    processing 23(12), 2111–2124 (2015)
 8. Chevalier, J.A., Mayzlin, D.: The effect of word of mouth on sales: Online book
    reviews. Journal of marketing research 43(3), 345–354 (2006)
 9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
    tional transformers for language understanding. arXiv preprint arXiv:1810.04805
    (2018)
10. Dhar, V., Chang, E.A.: Does chatter matter? the impact of user-generated content
    on music sales. Journal of Interactive Marketing 23(4), 300–307 (2009)
11. Gray, S., Radford, A., Kingma, D.P.: Gpu kernels for block-sparse weights. arXiv
    preprint arXiv:1711.09224 (2017)
12. Hadano, M., Shimada, K., Endo, T.: Aspect identification of sentiment sentences
    using a clustering algorithm. Procedia - Social and Behavioral Sciences
    27, 22–31 (2011). https://doi.org/10.1016/j.sbspro.2011.10.579,
    http://www.sciencedirect.com/science/article/pii/S1877042811024062,
    Computational Linguistics and Related Fields
13. Haveliwala, T.: Efficient computation of pagerank. Tech. rep., Stanford (1999)
14. Hu, M., Liu, B.: Mining opinion features in customer reviews. In: AAAI. vol. 4,
    pp. 755–760 (2004)
15. Johnson, R., Zhang, T.: Supervised and semi-supervised text categorization using
    lstm for region embeddings. arXiv preprint arXiv:1602.02373 (2016)
16. Kaulio, M.A.: Customer, consumer and user involvement in product development:
    A framework and a review of selected methods. Total quality management 9(1),
    141–149 (1998)
17. Kobayashi, N., Inui, K., Matsumoto, Y.: Extracting aspect-evaluation and aspect-
    of relations in opinion mining. In: Proceedings of the 2007 Joint Conference on
    Empirical Methods in Natural Language Processing and Computational Natural
    Language Learning (EMNLP-CoNLL). pp. 1065–1074 (2007)
18. Lee, H.Y., Renganathan, H.: Chinese sentiment analysis using maximum entropy.
    In: Proceedings of the Workshop on Sentiment Analysis where AI meets Psychol-
    ogy (SAAIP 2011). pp. 89–93. Asian Federation of Natural Language Processing,
    Chiang Mai, Thailand (Nov 2011), https://www.aclweb.org/anthology/W11-3713
19. Lee, T.Y., Bradlow, E.T.: Automated marketing research using online customer
    reviews. Journal of Marketing Research 48(5), 881–894 (2011)
20. Liu, X., Lee, D., Srinivasan, K.: Large scale cross category analysis of consumer
    review content on sales conversion leveraging deep learning. Available at SSRN
    2848528 (2017)
21. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
    sentations of words and phrases and their compositionality. In: Advances in neural
    information processing systems. pp. 3111–3119 (2013)
22. Mubarok, M.S., Adiwijaya, Aldhi, M.D.: Aspect-based sentiment analysis to review
    products using naı̈ve bayes. In: AIP Conference Proceedings. vol. 1867, p. 020060.
    AIP Publishing (2017)
23. Mullen, T., Collier, N.: Sentiment analysis using support vector machines with
    diverse information sources. In: Proceedings of the 2004 conference on empirical
    methods in natural language processing. pp. 412–418 (2004)
24. Netzer, O., Feldman, R., Goldenberg, J., Fresko, M.: Mine your own business:
    Market-structure surveillance through text mining. Marketing Science 31(3), 521–
    543 (2012)
25. Nguyen, T.H., Shirai, K.: Phrasernn: Phrase recursive neural network for aspect-
    based sentiment analysis. In: Proceedings of the 2015 Conference on Empirical
    Methods in Natural Language Processing. pp. 2509–2514 (2015)
26. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
    Bringing order to the web. Tech. rep., Stanford InfoLab (1999)
27. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using
    machine learning techniques. In: Proceedings of the ACL-02 conference on Empir-
    ical methods in natural language processing-Volume 10. pp. 79–86. Association for
    Computational Linguistics (2002)
28. Pontiki, M., Galanis, D., Papageorgiou, H., Androutsopoulos, I., Manandhar, S.,
    Mohammad, A.S., Al-Ayyoub, M., Zhao, Y., Qin, B., De Clercq, O., et al.: Semeval-
    2016 task 5: Aspect based sentiment analysis. In: Proceedings of the 10th interna-
    tional workshop on semantic evaluation (SemEval-2016). pp. 19–30 (2016)
29. Radford, A., Jozefowicz, R., Sutskever, I.: Learning to generate reviews and dis-
    covering sentiment. arXiv preprint arXiv:1704.01444 (2017)
30. Ruder, S., Ghaffari, P., Breslin, J.G.: A hierarchical model of reviews for aspect-
    based sentiment analysis. arXiv preprint arXiv:1609.02745 (2016)
31. Snyder, B., Barzilay, R.: Multiple aspect ranking using the good grief algorithm.
    In: Human Language Technologies 2007: The Conference of the North American
    Chapter of the Association for Computational Linguistics; Proceedings of the Main
    Conference. pp. 300–307 (2007)
32. Tadano, R., Shimada, K., Endo, T.: Effective construction and expansion of a
    sentiment corpus using an existing corpus and evaluative criteria estimation. In:
    Proceedings of the 11th Conference of the Pacific Association for Computational
    Linguistics (PACLING2009). pp. 211–216. Citeseer (2009)
33. Thet, T.T., Na, J.C., Khoo, C.S.: Aspect-based sentiment analysis of movie reviews
    on discussion boards. Journal of information science 36(6), 823–848 (2010)
34. Timoshenko, A., Hauser, J.R.: Identifying customer needs from user-generated
    content. Marketing Science 38(1), 1–20 (2019)
35. Tirunillai, S., Tellis, G.J.: Mining marketing meaning from online chatter: Strategic
    brand analysis of big data using latent dirichlet allocation. Journal of Marketing
    Research 51(4), 463–479 (2014)
36. Tsai, Y.L., Wang, Y.C., Chung, C.W., Su, S.C., Tsai, R.T.H.: Aspect-category-
    based sentiment classification with aspect-opinion relation. In: 2016 Conference on
    Technologies and Applications of Artificial Intelligence (TAAI). pp. 162–169. IEEE
    (2016)
37. Wang, W., Pan, S.J., Dahlmeier, D., Xiao, X.: Recursive neural conditional random
    fields for aspect-based sentiment analysis. arXiv preprint arXiv:1603.06679 (2016)
38. Xue, B., Fu, C., Shaobin, Z.: A study on sentiment computing and classification
    of sina weibo with word2vec. In: 2014 IEEE International Congress on Big Data.
    pp. 358–363. IEEE (2014)
39. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: Xlnet:
    Generalized autoregressive pretraining for language understanding. arXiv preprint
    arXiv:1906.08237 (2019)
40. Yu, J., Zha, Z.J., Wang, M., Chua, T.S.: Aspect ranking: Identifying im-
    portant product aspects from online consumer reviews. In: Proceedings of
    the 49th Annual Meeting of the Association for Computational Linguis-
    tics: Human Language Technologies - Volume 1. pp. 1496–1505. HLT ’11,
    Association for Computational Linguistics, Stroudsburg, PA, USA (2011),
    http://dl.acm.org/citation.cfm?id=2002472.2002654
41. Zha, Z., Yu, J., Tang, J., Wang, M., Chua, T.: Product aspect ranking and its
    applications. IEEE Transactions on Knowledge and Data Engineering 26(5), 1211–
    1224 (May 2014). https://doi.org/10.1109/TKDE.2013.136
42. Zha, Z.J., Yu, J., Tang, J., Wang, M., Chua, T.S.: Product aspect ranking and its
    applications. IEEE transactions on knowledge and data engineering 26(5), 1211–
    1224 (2013)
43. Zhu, J., Wang, H., Tsou, B.K., Zhu, M.: Multi-aspect opinion polling from textual
    reviews. In: Proceedings of the 18th ACM conference on Information and knowledge
    management. pp. 1799–1802. ACM (2009)


7    Appendix

                     Table 1. Comprehensiveness Comparison

          [34]'s oral care attributes                   AABSA
          Feel clean and fresh
            Clean feeling in my mouth                   easy to clean
            Fresh breath all day long                   smell
            Pleasant taste and texture                  toothbrush hair,
                                                        toothbrush head
          Strong teeth
            Prevent gingivitis                          gums, efficacy disease
          Product efficacy
            Able to protect my teeth                    efficacy
            Whiter teeth                                efficacy whitening
            Effectively clean hard to reach areas       easy to clean
          Knowledge and confidence
            Gentle oral care products
            Oral care products that last
            Tools are easy to maneuver and manipulate
            Knowledge of proper techniques              -
            Long-term oral care health
            Motivation for good check-ups
            Able to differentiate products
          Convenience
            Efficient oral care routine                 easy to clean
            Oral care “away from the bathroom”          -
          Shopping/product choice
            Faith in the products                       -
            Provides a good deal                        price value
            Effective storage                           package
            Environmentally friendly products           -
            Easy to shop for oral care items            -
            Product aesthetics                          -


                 Table 2. Robustness check: MaxEnt and fastText

                     MaxEnt     Precision  Recall  F1-score  Support
                       0          0.910     0.900    0.905     23830
                       1          0.939     0.946    0.942     39055
                     Avg/total    0.928     0.928    0.928     62885
                     fastText   Precision  Recall  F1-score  Support
                       0          0.867     0.869    0.868     23830
                       1          0.920     0.919    0.919     39055
                     Avg/total    0.900     0.900    0.900     62885
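
The Avg/total rows appear to be support-weighted averages of the per-class scores; this is our assumption, as the table does not state the averaging scheme. The sketch below recomputes the average precision of both classifiers from the per-class rows; `weighted_average` is a hypothetical helper name.

```python
def weighted_average(values, supports):
    """Average per-class scores, weighting each class by its support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


supports = [23830, 39055]  # class 0 and class 1, from the table
maxent_precision = weighted_average([0.910, 0.939], supports)
fasttext_precision = weighted_average([0.867, 0.920], supports)
```

Rounded to three decimals, these give 0.928 for MaxEnt and 0.900 for fastText, in line with the reported averages. Recomputing recall the same way can differ in the last digit because the per-class inputs are already rounded.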