=Paper=
{{Paper
|id=Vol-2960/paper15
|storemode=property
|title=A General Aspect-Term-Extraction Model for Multi-Criteria Recommendations (Long paper)
|pdfUrl=https://ceur-ws.org/Vol-2960/paper15.pdf
|volume=Vol-2960
|authors=Paolo Pastore,Andrea Iovine,Fedelucio Narducci,Giovanni Semeraro
|dblpUrl=https://dblp.org/rec/conf/recsys/PastoreINS21
}}
==A General Aspect-Term-Extraction Model for Multi-Criteria Recommendations (Long paper)==
Paolo Pastore<sup>1</sup>, Andrea Iovine<sup>2</sup>, Fedelucio Narducci<sup>1</sup> and Giovanni Semeraro<sup>2</sup>

<sup>1</sup> Polytechnic University of Bari, Italy

<sup>2</sup> Dept. of Computer Science, University of Bari, Italy

paolo.pastore1@poliba.it (P. Pastore); andrea.iovine@uniba.it (A. Iovine); fedelucio.narducci@poliba.it (F. Narducci); giovanni.semeraro@uniba.it (G. Semeraro)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, 27 September – 1 October 2021, Amsterdam, Netherlands

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

===Abstract===
In recent years, increasingly large quantities of user reviews have been made available by several e-commerce platforms. This content is very useful for recommender systems (RSs), since it reflects the users' opinion of the items regarding several aspects. In fact, it is especially valuable for RSs that are able to exploit multi-faceted user ratings. However, extracting aspect-based ratings from unstructured text is not a trivial task. Deep Learning models for aspect extraction have proven to be effective, but they need to be trained on large quantities of domain-specific data, which are not always available. In this paper, we explore the possibility of transferring knowledge across domains for automatically extracting aspects from user reviews, and its implications in terms of recommendation accuracy. We performed different experiments with several Deep Learning-based Aspect Term Extraction (ATE) techniques and Multi-Criteria recommendation algorithms. Results show that our framework is able to improve recommendation accuracy compared to several baselines based on single-criteria recommendation, despite the fact that no labeled data in the target domain was used when training the ATE model.

'''Keywords:''' multi-criteria recommendation, deep learning, aspect term extraction, domain adaptation, transfer learning

===1. Introduction===
Nowadays, many Web platforms and e-commerce websites allow customers to express their opinions by providing reviews on items, services, or media. Such user-generated content is extremely valuable for recommendation, since it reflects the user's perception of a specific item and of specific features of that item, listing its strengths and weaknesses, the most important features, and the tasks for which it is more (or less) suitable. Extracting this information and exploiting it to enrich user profiles and item descriptions can give enormous advantages to Recommender Systems (RSs). Given the considerable importance of reviews in the recommendation process, many works in the literature proposed the idea of integrating them into RSs, as a way to improve their accuracy. Specifically, text reviews can be a solution to the rating sparsity problem often encountered by RSs based on Collaborative Filtering (CF), and can be used to capture a much more fine-grained model of the customer's preferences [1]. Accordingly, instead of modeling the user's profile as a set of (item, rating) pairs, it might be represented as a set of (item, aspect, rating) triples. Of course, the problem with this approach is that both aspects and ratings must be extracted automatically from unstructured text. This task is usually referred to as Aspect-Based Sentiment Analysis (ABSA). ABSA is not a trivial task, because there is no stable definition of "aspect", due to its intrinsic subjectivity. Also, the same aspect can appear in many different forms inside user reviews. For instance, a reviewer could use "service", "staff" or "waiter" for referring to the "service" category. For this reason, we distinguish between the aspect itself and its representation forms in the reviews, also called aspect terms. Furthermore, the aspects used in a domain are completely different to those in other domains: for restaurants, users will mention features such as the food or the quality of the service, whereas when talking about smartphones, they will instead refer to other aspects such as the screen or the camera. In recent years, many models for automatically extracting aspects from text based on Deep Learning have been proposed. However, these techniques need to be trained on domain-specific labeled datasets that are not always available.

In this paper, we investigate the application of domain adaptation strategies for aspect-based recommendation. The aim is to evaluate the effectiveness of modern Deep Learning-based Aspect Term Extraction (ATE) models when no annotated data is available for the target domain. For this purpose, we developed an aspect-based recommendation framework that includes an ATE module, an Aspect Clustering module, a Sentiment Analysis (SA) module, and a Multi-Criteria Recommender System. We performed an experimental study to compare several ATE models both in a single domain scenario and in a domain adaptation setting. We then chose the model that obtained the best performance in both settings, i.e. the model that is most able to capture the essential, domain-invariant characteristics of aspect terms. Finally, we tested the framework in a recommendation scenario, to understand whether the models involved in this study actually improve the accuracy of RSs, compared to single-criteria recommendation baselines. This will show that our framework is able to successfully extract fine-grained ratings from text, and exploit them for improving the quality of the recommendations.

In summary, the main contributions of this work are:
* (a) The definition of a novel framework for aspect-based recommendation, that can automatically extract aspect-based ratings from unstructured text (i.e. reviews) independently from the domain, using Deep Learning models;
* (b) An evaluation of the performance of Deep Learning-based ATE models in a domain adaptation setting (i.e. when no annotated data in the target domain is available);
* (c) An evaluation of the performance of our framework, compared to a set of single-criteria recommendation baselines, in terms of rating prediction accuracy.

===2. Related work===
A great amount of work has been dedicated to researching techniques for enhancing RSs by using data extracted from reviews. Chen et al. [1] and He et al. [2] contain a review of the state of the art of review-aware RSs. There are three main types of approaches: Word-based, that consists of directly using words found in the review as the user profile; Sentiment-based, that aims to extract the user's overall rating of an item via Sentiment Analysis; Aspect-based, that exploits multi-faceted ratings from reviews. Our work is strictly focused on aspect-based recommendation, extracting explicit factors from text reviews rather than latent factors (such as in [3, 4, 5]). The main advantage is that aspects can also be useful outside recommendation, e.g. for explanation.

Many works employ strategies such as topic modeling [6], sentiment lexicons [7], or rule-based systems [2, 8] in order to extract aspect-based ratings from reviews for recommendation purposes. The experiments performed in these works prove that aspect-based ratings can indeed improve recommendation accuracy over single-criteria baselines. In our work, we plan to instead perform the ATE task by using techniques based on Deep Learning. In Musto et al. [9], ABSA is applied to a Multi-Criteria RS for the restaurant recommendation scenario using a tool called SABRE [10], which is able to extract relevant aspects from review text using the Kullback-Leibler divergence [11], as well as the rating assigned to each aspect. Aspects can also be organized into sub-aspects to obtain fine-grained information. Multi-criteria User-to-User and Item-to-Item CF algorithms were both proposed as recommendation algorithms. Our work follows a similar approach. In our framework however, the ATE task is performed using state-of-the-art Deep Learning models.

ABSA has proven to be a very effective method for improving the accuracy, usefulness and persuasiveness of the recommendations. As a result, Natural Language Processing (NLP) research focused on improving ABSA and ATE models, and more resources have been made available for these tasks. Examples of such resources are the SemEval datasets [12, 13, 14], and Hu and Liu [15]. Earlier works on ATE proposed strategies such as association rule mining [15], Conditional Random Fields (CRF) [16], knowledge-based topic modeling [17], or double propagation [18, 19]. In recent years, the success of Deep Learning models in Natural Language Processing tasks meant that research focus has moved towards using neural networks for ATE. Pavlopoulos and Androutsopoulos [20] improved the method described in [15] by using word embeddings generated via Word2Vec. Poria et al. [21] used Convolutional Neural Networks (CNNs) and several word embedding strategies. Giannakopoulos et al. [22] developed a model for both supervised and unsupervised ATE in large review datasets, based on Bi-Directional Long-Short Term Memory (Bi-LSTM) networks and CRF. Li and Lam [23] propose a multi-task learning framework for ATE and sentiment analysis based on LSTMs. Li et al. [24] use aspect detection history and opinion summary to enhance the ATE model. Some works investigate the addition of dependency relationships in order to improve the accuracy of neural network-based models, such as Ye et al. [25] and Luo et al. [26].

Finally, some works are focused on developing ATE methods that can generalize over different domains, using transfer learning or domain adaptation approaches. An early example is Jakob and Gurevych [16], which used a CRF-based approach. Ding et al. [27] use RNNs combined with rule-based auxiliary labels. Wang and Pan [28] incorporate dependency tree information using Recursive Neural Networks for both Aspect Term Extraction and Opinion Target Extraction tasks in order to transfer information between domains. Later, in [29] they introduce Transferable Interactive Memory Networks (TIMN) that can effectively model a representation for aspect terms across domains. Marcacini et al. [30] use transductive learning to map linguistic features of source and target domains in a heterogeneous network. Lee et al. [31] propose a transfer learning approach for ATE that is based on sequentially fine-tuning pre-trained features over different product groups. Pereg et al. [32] investigate the introduction of external syntactic features into a BERT-based model in order to exploit structural similarities of aspects across domains. Liang et al. [33] exploit the correlation between coarse-grained aspect categories and fine-grained aspect terms via a multi-level reconstruction mechanism. In our work, we not only evaluate the performance of several ATE approaches in a domain adaptation setting, but we also assess their effectiveness in improving the accuracy of the recommendations.

Recently, Da'u et al. [34] investigated the application of Deep Learning aspect extraction models for recommendation. While this work has the same premise as ours, there are two major differences: first, the architecture used is based on CNNs, while we included several configurations based on residual LSTM and BERT. Second, their work relies on the presence of annotated ATE data for the target domain, and does not deal with domain adaptation.

Based on this analysis, we have identified a gap in the literature. In fact, the papers mentioned above either describe domain adaptation strategies for ATE, or employ ATE for recommendation purposes. To the best of our knowledge, none combine the two ideas together, by explicitly measuring the impact of domain adaptation on the quality of the recommendations. We believe that this is very important, especially due to the extreme scarcity of annotated datasets for training ATE systems, which hinders their applicability to the recommendation scenario.

===3. Aspect-based recommendation framework===
In this section, we describe a novel review-aware aspect-based recommendation framework that has been created for the purposes of this study. We exploit user reviews in order to go beyond item ratings, by extracting richer aspect-based evaluations. The main advantage of this framework is that it lets us discover new aspects directly from user reviews. Additionally, the aspect-based item ratings enrich the user profile, as they let us understand which aspects users care more about. Finally, they allow us to identify the individual strengths and weaknesses of each item from the user's point of view.

The proposed architecture is composed of several sub-modules, as shown in the example in Figure 1. The first one is the ATE module, which is in charge of identifying aspects mentioned in the user reviews, by extracting the corresponding aspect terms from the review text. The framework supports several ATE approaches, which will be detailed in Section 3.1.

The second component is the Aspect Clustering module, whose role is to group aspect terms that express similar concepts together into aspects. The Sentiment Analysis module works in parallel with the previous two. Its role is to extract the user's sentiment from the review in order to assign a score to each aspect term. Details on this step will be discussed in Section 3.2.

The outputs of the Aspect Clustering and Sentiment Analysis modules are used to compose the aspect-based item ratings, which are organized into a 3-dimensional tensor (i.e. a tensor in which the first dimension represents the users, the second represents the items, and the third represents the aspect clusters), which is then passed to the Multi-Criteria recommendation algorithm. More details on this component are discussed in Section 3.3.

Figure 1: Example of recommendation process

Figure 1 shows an example of execution of our framework. Each review is split into atomic sentences, and then each sentence is given as input to both the ATE module and the SA module, in order to extract both aspect terms and ratings. In the example, starting from the sentence "As always we had a great glass of wine while we waited", the ATE module extracts the "glass of wine" aspect term, and the SA module assigns a positive rating to it. The extracted aspect is then given as input to the Aspect Clustering module, that assigns it to the right cluster, i.e. Beverage. The cluster information and the predicted sentiments are used to generate the aspect-based ratings tensor. The Recommendation Algorithm takes this tensor as input for generating a list of recommendations.

====3.1. Aspect Term Extraction====
This section is focused on describing the ATE component of the framework. ATE is one of the sub-tasks of ABSA [14]. Most approaches treat the task of extracting relevant aspects as a sequence labeling problem [21], in which the review is first tokenized, and then each token is classified as either being an aspect term or not. A classifier can be trained by supplying supervised data, i.e. pre-annotated reviews. The standard schema for annotating reviews is BIO tagging. According to this schema, three distinct labels can be associated to each token: B means that the token represents the beginning of an aspect term, I means that it represents the continuation of an aspect term, while O means that it is not an aspect term. This schema is shared with other sequence labeling tasks, such as Named Entity Recognition (NER).

Figure 2: Execution of the ATE task with the residual Bi-LSTM and BERT

Figure 2 shows the architecture of the ATE module. For this task, we focused on techniques based on Deep Learning, which have proven to be the most promising in the state of the art. In our study, we focused on the well known BERT model and on the residual Bi-LSTM. BERT is one of the most recent pre-trained frameworks for NLP and it can be exploited for many tasks, including NER and ATE. The residual Bi-LSTM is a variant of the classical Bi-directional LSTM which was successfully used in other sequence labeling tasks such as Tran et al. [35]. It is composed of two stacked Bi-LSTM layers, where the sum of the output of the first and second layer is sent to the final softmax layer, instead of sending only the output of the second layer. Different embedding strategies have been used in order to encode the tokens into real-valued vectors. In particular, we aim to use the ability to capture a contextual representation of words to learn a model that is independent from the domain, i.e. that is able to extract aspect terms from reviews of any domain. In this way, we can exploit a model trained on a given domain to extract aspect terms from another, unseen domain. Hence the term domain adaptation.

The following is a list of all the ATE approaches that are included in the evaluation.

'''Pre-trained Word2Vec-Residual LSTM.''' Word2Vec is one of the first successful word embedding techniques, introduced in Mikolov et al. [36]. For this configuration, we employed embeddings that were previously trained from a part of the Google News datasets (https://code.google.com/archive/p/word2vec/?fbclid=IwAR3poHsG_4PZdqfbR_JESidu9WLMf44ffd0A8ZFmrxCPiKTDghc5hQCLUeQ). The neural network architecture used in this configuration is the Residual Bi-directional Long-Short Term Memory (LSTM) described earlier.

'''Pre-trained GloVe-Residual LSTM.''' For this approach, we used a set of pre-trained embeddings from GloVe. GloVe is a model for distributed word representation, introduced in Pennington et al. [37]. It is developed as an open-source project at Stanford University, and the pre-trained embeddings are publicly available (https://nlp.stanford.edu/projects/glove/?fbclid=IwAR3JafEUyzBT5kwgdKHcQH20nQeTzG1NZs2_BHAhuOgaluO0HC7P5WW6EC8). The neural network architecture used is the Residual LSTM, like in the previous configurations.

'''ELMo embeddings-Residual LSTM.''' ELMo (Peters et al. [38]) stands for Embeddings from Language Models, and is a novel contextualized embedding strategy. That is, instead of using a single vector for each word in the dictionary, ELMo looks at the entire sentence before assigning each word in it its embedding. The result is that the embeddings generated by ELMo are deeply contextualized, and are more capable of handling polysemy. In this configuration, the architecture is defined as follows: an ELMo embedding layer is used, followed by the residual Bi-LSTM layers described in the previous configurations.

'''BERT.''' For this configuration, we employed BERT, introduced in Devlin et al. [39], which has been successfully applied in a variety of NLP tasks such as NER and text classification. Specifically, we employed a pre-trained BERT model available from the PyTorch library (https://pypi.org/project/pytorch-pretrained-bert/). This model is then fine-tuned, i.e. its parameters are updated by training it on the ATE task. The NN architecture used by BERT is a multi-layer bidirectional Transformer encoder, as described in [39].

====3.2. Aspect Term Clustering and Sentiment Analysis====
As stated in the Introduction, one of the main problems of extracting aspect-based ratings from reviews is that users may refer to the same aspect in many different forms. Therefore, a strategy for grouping together all aspect forms that refer to the same concept is needed. We propose to group aspect terms together based on their Word2Vec representation. In the case of multi-word aspect terms, we calculated the average of the embeddings of each word. We then perform a clustering task by using the K-means algorithm. This allows us to automatically group aspect terms into aspect categories in an unsupervised way.

We then used the VADER sentiment analysis model offered by the NLTK library (https://www.nltk.org/) to obtain the rating assigned to each aspect term in the review. Each review is split into atomic sentences, which are fed to the sentiment analyzer in order to predict their polarity. We then use this sentiment to assign a score to all the aspect terms appearing in that sentence. The final output is the transformation of each review into a set of (user, item, aspect, rating) tuples. This information will be the input to the Multi-Criteria RS.

====3.3. Aspect-Based Multi-Criteria recommendation====
Once the proposed framework has extracted all aspect-based ratings from the reviews, the last step is the recommendation. Recommendations are generated via a multi-criteria algorithm based on collaborative filtering [40]. For this purpose, we treated the sentiments extracted by our framework as the ratings given by the user to the item for each aspect. For each aspect that was not mentioned in the user review, we decided to assign the item's overall rating. This choice was made empirically, as it improved the performance of the recommendation algorithm. The rest of this section contains a description of the recommendation algorithms.

'''User-to-User Multi-Criteria CF:''' This is an extension of the similarity-based approaches for CF. The distance <math>d(u_j, u_k)</math> between users <math>u_j</math> and <math>u_k</math> is calculated using a multi-criteria distance function that takes the ratings given to each aspect into account (Equation 13 in [40]). For a new user-item pair, we generate a neighborhood of top-n most similar users, and then we calculate the predicted overall rating using the adjusted weighted sum of the neighbors' ratings (Equation 3 in [40]).

'''Item-to-Item Multi-Criteria CF:''' This is the multi-criteria equivalent of the item-based CF technique. As for the previous technique, the distance <math>d(i_j, i_k)</math> between items is calculated using a multi-criteria distance function (Equation 5 in [9]). For any given user-item pair, we generate a neighborhood of the top-n most similar items. The overall predicted rating is calculated using the item-based equivalent of the adjusted weighted sum approach found in [40].

'''Multi-Criteria SVD:''' This approach is based on Singular Value Decomposition (SVD), which is a matrix factorization technique. More details about the SVD technique can be found in Koren et al. [41]. This technique was originally developed for single-criteria RSs. In order to extend it to a multi-criteria scenario, we used a naive aggregation function-based approach [40, 42]: we divided the <math>k</math>-dimensional multi-criteria recommendation task into a set of <math>k</math> single-criteria tasks. This means that we trained <math>k</math> SVD models, one for each aspect <math>a_c</math>, for <math>c \in \{1, ..., k\}</math>. Each model predicts the rating for a specific aspect, <math>r_{a_c}(u, i)</math>. In order to predict the overall rating <math>r(u, i)</math> for a given user <math>u</math> and an item <math>i</math>, we calculate an aggregate function: <math>r(u, i) = f(r_{a_1}(u, i), ..., r_{a_k}(u, i))</math>. In our case, the aggregate function is a simple average of the aspect-based ratings.

===4. Evaluation===
This section describes the in-vitro experiment that we set up to evaluate the performance of our framework. The experiment is divided into two parts. First, we evaluate the ATE models that were described in Section 3.1, in order to determine which one has the best performance when trained in a domain adaptation scenario. The second step of the experiment is the recommendation test: we extract aspect-based ratings from a dataset of restaurant reviews using the best ATE model from the previous test, and then we evaluate each of the multi-criteria recommendation approaches discussed in Section 3.3 in terms of their rating prediction accuracy. These approaches will also be compared to several baselines. This experiment will assess whether the multi-criteria recommendations generated by our framework are more accurate than the ones obtained by using single-criteria ratings.

====4.1. Evaluation of the ATE approaches====
We collected six datasets for the ATE task from the literature, three of which come from the SemEval ABSA challenges with reviews about restaurants, laptops and hotels [12, 13, 14], while the other three are found in Liu et al. [18] and contain reviews about computers, speakers and routers. Table 1 reports the number of sentences and aspect terms contained in each dataset.

{| class="wikitable"
|+ Table 1: Description of the datasets
! Dataset !! #Sentences !! #Aspect terms
|-
| Restaurants (SemEval 2014-15-16) || 7841 || 8183
|-
| Laptops (SemEval 2014) || 3845 || 2918
|-
| Hotels (SemEval 2015) || 266 || 213
|-
| Computers (Liu et al.) || 531 || 363
|-
| Speakers (Liu et al.) || 689 || 454
|-
| Routers (Liu et al.) || 879 || 325
|}

A single domain study was conducted by training and testing each ATE model on the same dataset. Train-test split was performed via 5-fold cross validation. The metrics used to evaluate the performance are Precision, Recall, and F1-score. An aspect term was considered correctly recognized if all the tokens that compose it were correctly tagged by the system. Therefore, partial matches were not considered in the evaluation. For each configuration, we calculated the overall score by averaging the metrics obtained for each fold.

In addition to the single domain study, we performed a domain adaptation experiment, which tests each model's ability to generalize the ATE task onto a new, unseen domain. We performed six tests, one for each dataset. In each test, we used one dataset as the test set, and all remaining datasets as the training and development set, using a random 80-20 split.

Table 2 describes the results of the experiments. Single refers to the single domain tests, while DA refers to the domain adaptation tests. We report the Precision, Recall and F1-measure for each dataset and each model.

{| class="wikitable"
|+ Table 2: Results of the ATE task experiments
! rowspan="2" | Setting !! rowspan="2" | Metric !! colspan="4" | Speakers !! colspan="4" | Computers !! colspan="4" | Routers
|-
! ELMo !! BERT !! GloVe !! W2V !! ELMo !! BERT !! GloVe !! W2V !! ELMo !! BERT !! GloVe !! W2V
|-
| Single || P || 0.682 || 0.372 || 0.486 || 0.452 || 0.506 || 0.334 || 0.448 || 0.462 || 0.462 || 0.24 || 0.424 || 0.24
|-
| Single || R || 0.516 || 0.4 || 0.338 || 0.38 || 0.521 || 0.286 || 0.306 || 0.394 || 0.388 || 0.168 || 0.226 || 0.14
|-
| Single || F1 || 0.576 || 0.38 || 0.39 || 0.408 || 0.514 || 0.3 || 0.332 || 0.41 || 0.406 || 0.188 || 0.29 || 0.174
|-
| DA || P || 0.55 || 0.412 || 0.17 || 0.146 || 0.61 || 0.46 || 0.31 || 0.258 || 0.39 || 0.276 || 0.084 || 0.048
|-
| DA || R || 0.534 || 0.54 || 0.19 || 0.216 || 0.452 || 0.486 || 0.26 || 0.304 || 0.428 || 0.444 || 0.076 || 0.056
|-
| DA || F1 || 0.534 || 0.464 || 0.178 || 0.176 || 0.52 || 0.472 || 0.282 || 0.28 || 0.408 || 0.336 || 0.078 || 0.052
|-
! rowspan="2" | Setting !! rowspan="2" | Metric !! colspan="4" | Laptops !! colspan="4" | Hotels !! colspan="4" | Restaurants
|-
! ELMo !! BERT !! GloVe !! W2V !! ELMo !! BERT !! GloVe !! W2V !! ELMo !! BERT !! GloVe !! W2V
|-
| Single || P || 0.684 || 0.514 || 0.628 || 0.604 || 0.626 || 0.4 || 0.648 || 0.568 || 0.792 || 0.692 || 0.644 || 0.646
|-
| Single || R || 0.68 || 0.514 || 0.622 || 0.632 || 0.63 || 0.308 || 0.596 || 0.5 || 0.784 || 0.706 || 0.642 || 0.638
|-
| Single || F1 || 0.676 || 0.51 || 0.626 || 0.618 || 0.624 || 0.332 || 0.612 || 0.528 || 0.784 || 0.696 || 0.642 || 0.638
|-
| DA || P || 0.508 || 0.436 || 0.092 || 0.08 || 0.648 || 0.592 || 0.61 || 0.542 || 0.67 || 0.59 || 0.186 || 0.186
|-
| DA || R || 0.282 || 0.31 || 0.04 || 0.046 || 0.624 || 0.672 || 0.552 || 0.464 || 0.496 || 0.364 || 0.096 || 0.096
|-
| DA || F1 || 0.358 || 0.36 || 0.056 || 0.06 || 0.632 || 0.628 || 0.578 || 0.5 || 0.564 || 0.444 || 0.126 || 0.126
|}

The table shows that the combination of ELMo embeddings with the residual Bi-LSTM is able to outperform all the other approaches, except for the domain adaptation scenario in the Laptops dataset, in which case BERT achieves slightly higher performance. Concerning the single domain experiment, it is also interesting to note that all four approaches perform better on the Restaurants dataset than on the Laptops dataset. This is not surprising, due to the fact that the Restaurants dataset is larger than the Laptops one. Even on the smaller datasets (Hotels, Speakers, Computers, Routers), ELMo still obtained the best performance.

However, the situation is less clear for the other approaches. On the Hotels dataset, which is the smallest one, GloVe and Word2Vec obtain second and third place, having an F1 of 0.612 and 0.528 respectively. BERT is again last, with 0.332, which may suggest that this approach is especially affected by training set size. An interesting observation can be made about the Routers dataset: despite not being the smallest dataset, all approaches performed especially poorly on it.

In the domain adaptation test, ELMo outperforms the other three models in five out of six datasets. We also compare the scores obtained from the single domain and domain transfer tests. In the largest datasets, we can observe that the latter induces a substantial loss in F1 compared to the former: around 28% in the Restaurants domain, and around 47% in the Laptops domain. This loss can be attributed to the lack of domain-specific data in the respective domains. In the smaller datasets such as Hotels, the loss is either very small, or nonexistent. Similar observations can be made for the BERT approach in the larger datasets. In the smaller datasets however, the domain transfer configuration actually outperforms the single domain one. This gives more credibility to the hypothesis that BERT is more susceptible to training set size compared to ELMo. The GloVe and Word2Vec approaches show much larger losses. This is a clear indication that they are less capable of transferring knowledge on the ATE task from one domain to another.

Based on the results from this Section, we can say with enough confidence that ELMo is the approach that obtained the best performance in the ATE task. Not only did it outperform the other three approaches in the single domain setting, but it also demonstrated a good ability to transfer the aspect extraction task over different domains. For this reason, we chose this approach as part of the ATE component of our framework.

====4.2. Evaluation of the Recommender System====
We performed an experiment to measure our framework's recommendation accuracy. In particular, the objective of this experiment is to answer the following research questions:
* RQ1: What is the impact of domain adaptation strategies for ATE on the quality of multi-criteria recommendations?
* RQ2: How does our framework compare against several single-criteria baselines?

For this experiment, we employed the Yelp Recruiting Competition dataset (https://www.kaggle.com/c/yelp-recruiting/data), which contains restaurant reviews. This dataset is composed of 45,981 users, 11,537 items, and 229,906 reviews, with a sparsity of around 99.95%. Each review in the dataset contains the user ID, the business ID, the review text, and an overall score given by the user on a 1-5 scale. The review set was also filtered by excluding all users that rated fewer than 10 items. The filtered dataset contains 4,393 users, 10,801 items, and 138,301 reviews.

=====4.2.1. Experimental protocol=====
The dataset was input to our framework, and all the steps described in Section 3 were performed. Aspect terms were extracted by using the ELMo approach. For this experiment, we used two ATE models: one trained on all six datasets described in Section 4.1, and another trained without the Restaurants dataset, which allows us to assess the difference in recommendation quality caused by the lack of annotated ATE training data in the target domain.

The aspect terms were then grouped together into <math>k</math> aspects, and ratings were assigned via the Sentiment Analysis component described in Section 3.2, which transformed each review into a <math>(k + 1)</math>-dimensional vector, containing the user's rating of the restaurant for each of the <math>k</math> aspects, plus the overall rating. We experimented with different sizes of <math>k</math> (10, 30 and 50) in order to increase the generality of the results. Finally, the aspect-based rating vectors were passed to the recommendation algorithms described in Section 3.3. We evaluated the rating prediction accuracy of the algorithms by measuring the Mean Absolute Error (MAE). 10-fold cross-validation was performed on the dataset, and the MAE values for each fold were averaged together. For each of the three multi-criteria recommendation algorithms (User-to-user, Item-to-item, and SVD), we chose the combination of parameters that obtained the best results. These models were then compared against several baselines: single-criteria user-to-user CF (with MSD and Pearson similarity measures), single-criteria item-to-item CF (with MSD and Pearson similarity measures), Singular Value Decomposition (SVD), and Non-negative Matrix Factorization (NMF), which were also trained and tested using 10-fold cross-validation. For both user-to-user and item-to-item CF baselines, we employed the variants that take into account the user and item means, to make them more comparable with the multi-criteria equivalents. This lets us understand whether the aspect-based ratings extracted by our framework actually cause an improvement in recommendation accuracy.

=====4.2.2. Results=====
Table 3 reports the results obtained by the three multi-criteria recommendation algorithms supported by our framework, with different combinations of parameters. For the user-to-user and item-to-item algorithms, we chose to set the neighborhood size to 10, 20, 30, 80, and 200. We chose these numbers as using a higher number of neighbors caused a decrease in the accuracy. For all three algorithms, we can observe that the best performance is obtained by using 10 aspects. This means that by increasing the number of aspects, the performance decreases. This makes sense, since the effectiveness of the multi-criteria distance metrics largely depends on the number of commonly rated aspects between the two users (or the two items). Increasing the number of aspects also increases the sparsity of the aspect-based ratings, which makes these metrics less effective.

{| class="wikitable"
|+ Table 3: Results for the Multi-Criteria algorithms (MAE). The best results for each algorithm are in italic. The best overall results are in bold.
! rowspan="2" | Algorithm !! rowspan="2" | #N. !! colspan="2" | 10 Aspects !! colspan="2" | 30 Aspects !! colspan="2" | 50 Aspects
|-
! W/ Rest. !! W/O Rest. !! W/ Rest. !! W/O Rest. !! W/ Rest. !! W/O Rest.
|-
| M.C. U2U || 10 || 0.83 || 0.8306 || 0.8314 || 0.8333 || 0.8329 || 0.8349
|-
| M.C. U2U || 20 || 0.8196 || 0.8206 || 0.821 || 0.8228 || 0.8222 || 0.8244
|-
| M.C. U2U || 30 || 0.8169 || 0.8178 || 0.8182 || 0.8199 || 0.8194 || 0.8214
|-
| M.C. U2U || 80 || 0.8148 || 0.8157 || 0.8161 || 0.8176 || 0.8172 || 0.8191
|-
| M.C. U2U || 200 || ''0.8147'' || 0.8155 || 0.8159 || 0.8174 || 0.817 || 0.8189
|-
| M.C. I2I || 10 || 0.831 || 0.8321 || 0.8333 || 0.8346 || 0.8347 || 0.8364
|-
| M.C. I2I || 20 || 0.8221 || 0.8228 || 0.8239 || 0.8252 || 0.8252 || 0.8269
|-
| M.C. I2I || 30 || 0.82 || 0.8206 || 0.8216 || 0.8229 || 0.8228 || 0.8246
|-
| M.C. I2I || 80 || ''0.8183'' || 0.819 || 0.8199 || 0.8211 || 0.8211 || 0.8227
|-
| M.C. I2I || 200 || 0.8184 || 0.8189 || 0.8199 || 0.8211 || 0.8211 || 0.8227
|-
| M.C. SVD || - || 0.8062 || '''0.8053''' || 0.8064 || 0.8069 || 0.8074 || 0.8081
|}

{| class="wikitable"
|+ Table 4: Results of the recommendation test. Best results are in bold.
! Configuration !! MAE
|-
| M.C. U2U (W/ Rest.) || 0.8147
|-
| M.C. U2U (W/O Rest.) || 0.8155
|-
| U2U (MSD) || 0.8169
|-
| U2U (Pearson) || 0.8565
|-
| M.C. I2I (W/ Rest.) || 0.8183
|-
| M.C. I2I (W/O Rest.) || 0.8189
|-
| I2I (MSD) || 0.8202
|-
| I2I (Pearson) || 0.8582
|-
| M.C. SVD (W/ Rest.) || 0.8062
|-
| M.C. SVD (W/O Rest.) || '''0.8053'''
|-
| SVD || 0.8107
|-
| NMF || 0.8737
|}

Table 3 shows that the multi-criteria user-to-user algorithm performs best by setting the neighborhood size to 200, with a MAE of 0.8147 and 0.8155 respectively for the model trained with and without the Restaurants dataset. For the multi-criteria item-to-item variant, the best neighborhood size is 80 for the model trained with the Restaurants dataset, and 200 for the model trained without it. In both the neighborhood-based models, we can observe that the model trained without the Restaurants dataset performs slightly worse than the one trained with all datasets. This is consistent with the observations made during the experiment described in Section 4.1, i.e. the loss in recommendation accuracy may be caused by a loss in ATE accuracy. However, this is not true for the multi-criteria SVD approach. In fact, the model trained without the Restaurants dataset achieved better performance (MAE: 0.8053) compared to the one trained on all datasets (MAE: 0.8062). This suggests that this approach is less susceptible to the aspect-based rating sparsity problem. A Wilcoxon test was performed to evaluate the significance of these differences. The test confirms that they are all significant (<math>p < 0.01</math>). We can answer RQ1 by stating that the proposed domain adaptation strategy for ATE does indeed cause a noticeable loss in recommendation performance in the multi-criteria user-to-user and item-to-item algorithms. However, it also was associated to an equally [...] recommendation accuracy.

===5. Conclusion===
In this paper, we presented an investigation on the use of domain adaptation strategies in order to perform Aspect Term Extraction without the need for domain-specific training data, as well as the impact of using this strategy in a multi-criteria recommender system. For this purpose, we developed an aspect-based recommendation framework that automatically extracts multi-criteria ratings from text reviews using state-of-the-art Deep Learning ATE models. We performed several experiments to evaluate the ATE component both in a single domain and in a domain adaptation setting in order to find the best model to use in the multi-criteria recommendation scenario.
We small increase in the multi-criteria SVD algorithm. trained the aspect term extraction component twice: with Finally, in Table 4 we compare the performance of our domain-specific data, and without domain-specific data, framework with the baselines described earlier. We eval- and tested several combinations of parameters and differ- uated the single-criteria user-to-user and item-to-item ent multi-criteria recommendation algorithms in order baselines by setting the neighborhood size to 10, 20, 30, to increase the generality of the results. In all cases, the 80, and 200, and reported the best performance for each framework was able to outperform single-criteria base- baseline. The results show that all three multi-criteria lines, with small differences between the two models. algorithms are able to outperform their single-criteria Moreover, the proposed strategy improves the quality equivalents. The best result overall is achieved by the of the recommendations even when no domain-specific multi-criteria SVD on the model trained without restau- ATE training data is available. rants. In fact, even though it is based on a basic aggre- The most important limitation to the validity of our gation function-based approach, it managed to obtain a experiment is related to the small amount of data avail- significant improvement over all baselines. A Wilcoxon able for the ATE task. However, it is worth noting that statistical test was performed in order to verify the sig- this is a limitation of the state of the art, since all works nificance of the difference in MAE. The test was able to on the subject use the same datasets (or a subset of them) prove that indeed the multi-criteria SVD approach per- that we used in our work. As future work, we plan to ex- formed significantly better than all the baselines with tend this work by including more recent Deep Learning 𝑝 < 0.01. This allows us to confidently answer RQ2 by architectures for ATE. 
We also plan to extend the recom- stating that our framework compares favorably against mendation test, by including more multi-criteria recom- all the selected baselines even when no domain-specific mendation algorithms, and by comparing our framework ATE data was available during training. This proves with systems that extract latent factors from reviews. that the proposed domain adaptation approach is able to effectively exploit review data in order to improve the References Conference on Recommender Systems - RecSys ’17, ACM Press, Como, Italy, 2017, pp. 321–325. [1] L. Chen, G. Chen, F. Wang, Recommender URL: http://dl.acm.org/citation.cfm?doid=3109859. systems based on user reviews: the state of 3109905. doi:1 0 . 1 1 4 5 / 3 1 0 9 8 5 9 . 3 1 0 9 9 0 5 . the art, User Modeling and User-Adapted [10] A. Caputo, P. Basile, M. de Gemmis, P. Lops, G. Se- Interaction 25 (2015) 99–154. URL: http://link. meraro, G. Rossiello, SABRE: A Sentiment Aspect- springer.com/10.1007/s11257-015-9155-5. doi:1 0 . Based Retrieval Engine, in: C. Lai, A. Giuliani, 1007/s11257- 015- 9155- 5. G. Semeraro (Eds.), Information Filtering and Re- [2] X. He, T. Chen, M.-Y. Kan, X. Chen, TriRank: trieval: DART 2014: Revised and Invited Papers, Review-aware Explainable Recommendation by Studies in Computational Intelligence, Springer Modeling Aspects, in: Proceedings of the 24th International Publishing, Cham, 2017, pp. 63–78. ACM International on Conference on Information URL: https://doi.org/10.1007/978-3-319-46135-9_4. and Knowledge Management - CIKM ’15, ACM doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 3 1 9 - 4 6 1 3 5 - 9 _ 4 . Press, Melbourne, Australia, 2015, pp. 1661–1670. [11] J. M. Joyce, Kullback-Leibler Divergence, URL: http://dl.acm.org/citation.cfm?doid=2806416. Springer Berlin Heidelberg, Berlin, Hei- 2806504. doi:1 0 . 1 1 4 5 / 2 8 0 6 4 1 6 . 2 8 0 6 5 0 4 . delberg, 2011, pp. 720–722. URL: https: [3] R. Catherine, W. 
Cohen, Transnets: Learning to //doi.org/10.1007/978-3-642-04898-2_327. transform for recommendation, in: Proceedings doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 6 4 2 - 0 4 8 9 8 - 2 _ 3 2 7 . of the eleventh ACM conference on recommender [12] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageor- systems, 2017, pp. 288–296. giou, I. Androutsopoulos, S. Manandhar, SemEval- [4] S. Seo, J. Huang, H. Yang, Y. Liu, Representation 2014 Task 4: Aspect Based Sentiment Analysis learning of users and items for review rating pre- (2014) 9. diction using attention-based convolutional neural [13] M. Pontiki, D. Galanis, H. Papageorgiou, S. Man- network, in: International Workshop on Machine andhar, I. Androutsopoulos, Semeval-2015 task 12: Learning Methods for Recommender Systems, 2017. Aspect based sentiment analysis, in: Proceedings [5] P. Li, A. Tuzhilin, Latent multi-criteria ratings for of the 9th international workshop on semantic eval- recommendations, in: Proceedings of the 13th ACM uation (SemEval 2015), 2015, pp. 486–495. Conference on Recommender Systems, 2019, pp. [14] M. Pontiki, D. Galanis, H. Papageorgiou, I. Androut- 428–431. sopoulos, S. Manandhar, A.-S. Mohammad, M. Al- [6] Q. Diao, M. Qiu, C.-Y. Wu, A. J. Smola, J. Jiang, Ayyoub, Y. Zhao, B. Qin, O. De Clercq, SemEval- C. Wang, Jointly modeling aspects, ratings and 2016 Task 5: Aspect Based Sentiment Analysis, in: sentiments for movie recommendation (JMARS), Proceedings of the 10th International Workshop in: Proceedings of the 20th ACM SIGKDD interna- on Semantic Evaluation (SemEval-2016), 2016, pp. tional conference on Knowledge discovery and data 19–30. mining - KDD ’14, ACM Press, New York, New York, [15] M. Hu, B. Liu, Mining and summarizing cus- USA, 2014, pp. 193–202. URL: http://dl.acm.org/ tomer reviews, in: Proceedings of the tenth ACM citation.cfm?doid=2623330.2623758. doi:1 0 . 1 1 4 5 / SIGKDD international conference on Knowledge 2623330.2623758. discovery and data mining, KDD ’04, Association [7] Y. 
Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, for Computing Machinery, Seattle, WA, USA, 2004, S. Ma, Explicit factor models for explainable recom- pp. 168–177. URL: https://doi.org/10.1145/1014052. mendation based on phrase-level sentiment anal- 1014073. doi:1 0 . 1 1 4 5 / 1 0 1 4 0 5 2 . 1 0 1 4 0 7 3 . ysis, in: Proceedings of the 37th international [16] N. Jakob, I. Gurevych, Extracting opinion targets ACM SIGIR conference on Research & development in a single-and cross-domain setting with condi- in information retrieval - SIGIR ’14, ACM Press, tional random fields, in: Proceedings of the 2010 Gold Coast, Queensland, Australia, 2014, pp. 83–92. conference on empirical methods in natural lan- URL: http://dl.acm.org/citation.cfm?doid=2600428. guage processing, Association for Computational 2609579. doi:1 0 . 1 1 4 5 / 2 6 0 0 4 2 8 . 2 6 0 9 5 7 9 . Linguistics, 2010, pp. 1035–1045. [8] K. Bauman, B. Liu, A. Tuzhilin, Recommending [17] Z. Chen, A. Mukherjee, B. Liu, M. Hsu, M. Castel- Items with Conditions Enhancing User Experiences lanos, R. Ghosh, Exploiting domain knowledge in Based on Sentiment Analysis of Reviews., in: aspect extraction, in: Proceedings of the 2013 Con- CBRecSys@ RecSys, 2016, pp. 19–22. ference on Empirical Methods in Natural Language [9] C. Musto, M. de Gemmis, G. Semeraro, P. Lops, Processing, 2013, pp. 1655–1667. A Multi-criteria Recommender System Exploiting [18] Q. Liu, Z. Gao, B. Liu, Y. Zhang, Automated rule Aspect-based Sentiment Analysis of Users’ Re- selection for aspect extraction in opinion mining, views, in: Proceedings of the Eleventh ACM in: Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015. with auxiliary labels for cross-domain opinion tar- [19] Q. Liu, B. Liu, Y. Zhang, D. S. Kim, Z. Gao, Im- get extraction, in: Thirty-First AAAI Conference proving opinion aspect extraction using semantic on Artificial Intelligence, 2017. similarity and aspect associations, in: Thirtieth [28] W. Wang, S. J. 
Pan, Recursive Neural Structural AAAI Conference on Artificial Intelligence, 2016. Correspondence Network for Cross-domain As- [20] J. Pavlopoulos, I. Androutsopoulos, Aspect Term pect and Opinion Co-Extraction, in: Proceedings Extraction for Sentiment Analysis: New Datasets, of the 56th Annual Meeting of the Association New Evaluation Measures and an Improved Un- for Computational Linguistics (Volume 1: Long supervised Method, in: Proceedings of the 5th Papers), Association for Computational Linguis- Workshop on Language Analysis for Social Media tics, Melbourne, Australia, 2018, pp. 2171–2181. (LASM), Association for Computational Linguistics, URL: http://aclweb.org/anthology/P18-1202. doi:1 0 . Gothenburg, Sweden, 2014, pp. 44–52. URL: http: 18653/v1/P18- 1202. //aclweb.org/anthology/W14-1306. doi:1 0 . 3 1 1 5 / v 1 / [29] W. Wang, S. J. Pan, Transferable interactive mem- W14- 1306. ory network for domain adaptation in fine-grained [21] S. Poria, E. Cambria, A. Gelbukh, Aspect ex- opinion extraction, in: Proceedings of the AAAI traction for opinion mining with a deep convolu- Conference on Artificial Intelligence, volume 33, tional neural network, Knowledge-Based Systems 2019, pp. 7192–7199. Issue: 01. 108 (2016) 42–49. URL: https://linkinghub.elsevier. [30] R. M. Marcacini, R. G. Rossi, I. P. Matsuno, S. O. com/retrieve/pii/S0950705116301721. doi:1 0 . 1 0 1 6 / Rezende, Cross-domain aspect extraction for j.knosys.2016.06.009. sentiment analysis: A transductive learning ap- [22] A. Giannakopoulos, C. Musat, A. Hossmann, proach, Decision Support Systems 114 (2018) M. Baeriswyl, Unsupervised Aspect Term Extrac- 70–80. URL: http://www.sciencedirect.com/science/ tion with B-LSTM & CRF using Automatically La- article/pii/S0167923618301386. doi:1 0 . 1 0 1 6 / j . d s s . belled Datasets, in: Proceedings of the 8th Work- 2018.08.009. shop on Computational Approaches to Subjectiv- [31] Y. Lee, M. Chung, S. Cho, J. 
Choi, Extraction of ity, Sentiment and Social Media Analysis, 2017, pp. Product Evaluation Factors with a Convolutional 180–188. Neural Network and Transfer Learning, Neural [23] X. Li, W. Lam, Deep Multi-Task Learning for Aspect Processing Letters 50 (2019) 149–164. URL: https: Term Extraction with Memory Interaction, in: Pro- //doi.org/10.1007/s11063-018-9964-8. doi:1 0 . 1 0 0 7 / ceedings of the 2017 Conference on Empirical Meth- s11063- 018- 9964- 8. ods in Natural Language Processing, Association [32] O. Pereg, D. Korat, M. Wasserblat, Syntac- for Computational Linguistics, Copenhagen, Den- tically Aware Cross-Domain Aspect and Opin- mark, 2017, pp. 2886–2892. URL: http://aclweb.org/ ion Terms Extraction, in: Proceedings of the anthology/D17-1310. doi:1 0 . 1 8 6 5 3 / v 1 / D 1 7 - 1 3 1 0 . 28th International Conference on Computational [24] X. Li, L. Bing, P. Li, W. Lam, Z. Yang, Aspect term ex- Linguistics, International Committee on Compu- traction with history attention and selective trans- tational Linguistics, Barcelona, Spain (Online), formation, in: Proceedings of the 27th International 2020, pp. 1772–1777. URL: https://www.aclweb.org/ Joint Conference on Artificial Intelligence, 2018, pp. anthology/2020.coling-main.158. doi:1 0 . 1 8 6 5 3 / v 1 / 4194–4200. 2020.coling- main.158. [25] H. Ye, Z. Yan, Z. Luo, W. Chao, Dependency- [33] T. Liang, W. Wang, F. Lv, Weakly Supervised Do- Tree Based Convolutional Neural Networks for main Adaptation for Aspect Extraction via Multi- Aspect Term Extraction, in: J. Kim, K. Shim, level Interaction Transfer, IEEE Transactions on L. Cao, J.-G. Lee, X. Lin, Y.-S. Moon (Eds.), Ad- Neural Networks and Learning Systems (2021). Pub- vances in Knowledge Discovery and Data Mining, lisher: IEEE. volume 10235, Springer International Publishing, [34] A. Da’u, N. Salim, I. Rabiu, A. Osman, Recommenda- Cham, 2017, pp. 350–362. URL: http://link.springer. 
tion system exploiting aspect-based opinion mining com/10.1007/978-3-319-57529-2_28. doi:1 0 . 1 0 0 7 / with deep learning method, Information Sciences 9 7 8 - 3 - 3 1 9 - 5 7 5 2 9 - 2 _ 2 8 , series Title: Lecture Notes 512 (2020) 1279–1292. Publisher: Elsevier. in Computer Science. [35] Q. Tran, A. MacKinlay, A. J. Yepes, Named Entity [26] H. Luo, T. Li, B. Liu, B. Wang, H. Unger, Improv- Recognition with stack residual LSTM and trainable ing aspect term extraction with bidirectional de- bias decoding, arXiv:1706.07598 [cs] (2017). URL: pendency tree representation, IEEE/ACM Transac- http://arxiv.org/abs/1706.07598, arXiv: 1706.07598. tions on Audio, Speech, and Language Processing [36] T. Mikolov, K. Chen, G. Corrado, J. Dean, Effi- 27 (2019) 1201–1212. Publisher: IEEE. cient Estimation of Word Representations in Vec- [27] Y. Ding, J. Yu, J. Jiang, Recurrent neural networks tor Space, arXiv:1301.3781 [cs] (2013). URL: http: //arxiv.org/abs/1301.3781, arXiv: 1301.3781. [37] J. Pennington, R. Socher, C. Manning, Glove: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Associa- tion for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. URL: https://www.aclweb.org/ anthology/D14-1162. doi:1 0 . 3 1 1 5 / v 1 / D 1 4 - 1 1 6 2 . [38] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextual- ized word representations, arXiv:1802.05365 [cs] (2018). URL: http://arxiv.org/abs/1802.05365, arXiv: 1802.05365. [39] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv:1810.04805 [cs] (2019). URL: http://arxiv.org/abs/1810.04805, arXiv: 1810.04805. [40] G. Adomavicius, Y. Kwon, New Recommendation Techniques for Multicriteria Rating Systems, IEEE Intelligent Systems 22 (2007) 48–55. doi:1 0 . 1 1 0 9 / MIS.2007.58. [41] Y. Koren, R. Bell, C. 
Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30–37. Publisher: IEEE. [42] F. Ricci, L. Rokach, B. Shapira, P. B. Kan- tor (Eds.), Recommender Systems Handbook, Springer US, Boston, MA, 2011. URL: http://link. springer.com/10.1007/978-0-387-85820-3. doi:1 0 . 1007/978- 0- 387- 85820- 3.
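The evaluation protocol of Section 4.2.1 (10-fold cross-validation, with the MAE of each fold averaged together) can be illustrated with the following minimal sketch. It is not the implementation used in the experiments: the function names, the toy ratings, and the global-mean predictor standing in for the recommendation algorithms are all assumptions introduced only to make the example self-contained.

```python
import random
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired ratings."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def k_fold_indices(n_items, k, seed=0):
    """Shuffle the indices once and deal them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_mae(ratings, fit, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, average the fold MAEs."""
    fold_maes = []
    for held_out in k_fold_indices(len(ratings), k, seed):
        held_out_set = set(held_out)
        train = [r for i, r in enumerate(ratings) if i not in held_out_set]
        test = [ratings[i] for i in held_out]
        predict = fit(train)
        actual = [rating for _, _, rating in test]
        predicted = [predict(user, item) for user, item, _ in test]
        fold_maes.append(mean_absolute_error(actual, predicted))
    return mean(fold_maes)


def fit_global_mean(train):
    """Hypothetical stand-in model: always predict the training-set mean."""
    mu = mean(rating for _, _, rating in train)
    return lambda user, item: mu


# Toy data invented for illustration: (user, item, overall rating in 1..5).
toy_ratings = [(u, i, 1 + (u * 3 + i) % 5) for u in range(10) for i in range(5)]
score = cross_validated_mae(toy_ratings, fit_global_mean, k=10)
print(f"10-fold MAE of the global-mean stand-in: {score:.4f}")
```

Any model that exposes the same `fit(train)` interface and returns a `predict(user, item)` function could be passed in place of `fit_global_mean`, so multi-criteria and single-criteria variants would be scored on identical folds.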