<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) &amp; 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>General Aspect-Term-Extraction Model for Multi-Criteria Recommendations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Pastore</string-name>
          <email>paolo.pastore1@poliba.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Iovine</string-name>
          <email>andrea.iovine@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fedelucio Narducci</string-name>
          <email>fedelucio.narducci@poliba.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>giovanni.semeraro@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science University of Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Polytechnic University of Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>In recent years, increasingly large quantities of user reviews have been made available by several e-commerce platforms. This content is very useful for recommender systems (RSs), since it reflects the users' opinion of the items regarding several aspects. In fact, they are especially valuable for RSs that are able to exploit multi-faceted user ratings. However, extracting aspect-based ratings from unstructured text is not a trivial task. Deep Learning models for aspect extraction have proven to be effective, but they need to be trained on large quantities of domain-specific data, which are not always available. In this paper, we explore the possibility of transferring knowledge across domains for automatically extracting aspects from user reviews, and its implications in terms of recommendation accuracy. We performed different experiments with several Deep Learning-based Aspect Term Extraction (ATE) techniques and Multi-Criteria recommendation algorithms. Results show that our framework is able to improve recommendation accuracy compared to several baselines based on single-criteria recommendation, despite the fact that no labeled data in the target domain was used when training the ATE model.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-criteria recommendation</kwd>
        <kwd>deep learning</kwd>
        <kwd>aspect term extraction</kwd>
        <kwd>domain adaptation</kwd>
        <kwd>transfer learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Nowadays, many Web platforms and e-commerce websites allow customers to express their opinions by providing reviews on items, services, or media. Such user-generated content is extremely valuable for recommendation, since it reflects the user's perception of a specific item and of specific features of that item, listing its strengths and weaknesses, the most important features, and the tasks for which it is more (or less) suitable.</p>
      <p>Extracting this information and exploiting it to enrich user profiles and item descriptions can give enormous advantages to Recommender Systems (RSs). Given the considerable importance of reviews in the recommendation process, many works in the literature proposed the idea of integrating them into RSs, as a way to improve their accuracy. Specifically, text reviews can be a solution to the rating sparsity problem often encountered by RSs based on Collaborative Filtering (CF), and can be used to capture a much more fine-grained model of the customer's preferences [1]. Accordingly, instead of modeling the user's profile as a set of (item, rating) pairs, it might be represented as a set of (item, aspect, rating) triples. Of course, the problem with this approach is that aspect-based ratings need to be extracted from unstructured text. This task is usually referred to as Aspect-Based Sentiment Analysis (ABSA). ABSA is not a trivial task, since users may discuss many different facets of an item, such as the quality of the service or, when talking about smartphones, the screen or the camera. In recent years, many Deep Learning models have been proposed. However, they need to be trained on large quantities of domain-specific annotated data, which are not always available. In this paper, we explore the use of Deep Learning-based Aspect Term Extraction (ATE) models when no annotated data is available for the target domain. For this purpose, we developed an aspect-based recommendation framework composed of an ATE module, an Aspect Clustering module, a Sentiment Analysis (SA) module, and a Multi-Criteria Recommender System.</p>
      <p>We performed an experimental study to compare several ATE models both in a single domain scenario and in a domain adaptation setting. We then chose the model that obtained the best performance in both settings, i.e. the model that is most able to capture the essential, domain-invariant characteristics of aspect terms. Finally, we tested the framework in a recommendation scenario, to understand whether the models involved in this study actually improve the accuracy of RSs, compared to single-criteria recommendation baselines. This will prove that our framework is able to successfully extract fine-grained ratings from text, and exploit them for improving the quality of the recommendations.</p>
      <p>In summary, the main contributions of this work are: (a) the definition of a novel framework for aspect-based recommendation, which can automatically extract aspect-based ratings from unstructured text (i.e. reviews) independently from the domain, using Deep Learning models; (b) an evaluation of the performance of Deep Learning-based ATE models in a domain adaptation setting (i.e. when no annotated data in the target domain is available); (c) an evaluation of the performance of our framework, compared to a set of single-criteria recommendation baselines, in terms of rating prediction accuracy.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related work</title>
      <p>A great amount of work has been dedicated to researching techniques for enhancing RSs by using data extracted from reviews. Chen et al. [1] and He et al. [2] contain a review of the state of the art of review-aware RSs. There are three main types of approaches: Word-based, which consists of directly using words found in the review as the user profile; Sentiment-based, which aims to extract the user's overall rating of an item via Sentiment Analysis; and Aspect-based, which exploits multi-faceted ratings from reviews. Our work is strictly focused on aspect-based recommendation, extracting explicit factors from text reviews rather than latent factors (such as in [3, 4, 5]). The main advantage is that aspects can also be useful outside recommendation, e.g. for explanation.</p>
      <p>Many works employ strategies such as topic modeling [6], sentiment lexicons [7], or rule-based systems [2, 8] in order to extract aspect-based ratings from reviews for recommendation purposes. The experiments performed in these works prove that aspect-based ratings can indeed improve recommendation accuracy over single-criteria baselines. In our work, we plan to instead perform the ATE task by using techniques based on Deep Learning.</p>
      <p>In Musto et al. [9], ABSA is applied to a Multi-Criteria RS for the restaurant recommendation scenario using a tool called SABRE [10], which is able to extract relevant aspects from review text using the Kullback-Leibler divergence [11], as well as the rating assigned to each aspect. Aspects can also be organized into sub-aspects to obtain fine-grained information. Multi-criteria User-to-User and Item-to-Item CF algorithms were both proposed as recommendation algorithms. Our work follows a similar approach. In our framework, however, the ATE task is performed using state-of-the-art Deep Learning models.</p>
      <p>ABSA has proven to be a very effective method for improving the accuracy, usefulness and persuasiveness of the recommendations. As a result, Natural Language Processing (NLP) research focused on improving ABSA and ATE models, and more resources have been made available for these tasks. Examples of such resources are the SemEval datasets [12, 13, 14], and Hu and Liu [15]. Earlier works on ATE proposed strategies such as association rule mining [15], Conditional Random Fields (CRF) [16], knowledge-based topic modeling [17], or double propagation [18, 19]. In recent years, the success of Deep Learning models in NLP tasks meant that research focus has moved towards using neural networks for ATE. Pavlopoulos and Androutsopoulos [20] improved the method described in [15] by using word embeddings generated via Word2Vec. Poria et al. [21] used Convolutional Neural Networks (CNNs) and several word embedding strategies. Giannakopoulos et al. [22] developed a model for both supervised and unsupervised ATE in large review datasets, based on Bi-Directional Long-Short Term Memory (Bi-LSTM) networks and CRF. Li and Lam [23] propose a multi-task learning framework for ATE and sentiment analysis based on LSTMs. Li et al. [24] use aspect detection history and opinion summary to enhance the ATE model. Some works investigate the addition of dependency relationships in order to improve the accuracy of neural network-based models, such as Ye et al. [25] and Luo et al. [26].</p>
      <p>Finally, some works are focused on developing ATE methods that can generalize over different domains, using transfer learning or domain adaptation approaches. An early example is Jakob and Gurevych [16], which used a CRF-based approach. Ding et al. [27] use RNNs combined with rule-based auxiliary labels. Wang and Pan [28] incorporate dependency tree information using Recursive Neural Networks for both Aspect Term Extraction and Opinion Target Extraction tasks, in order to transfer information between domains. Later, in [29] they introduce Transferable Interactive Memory Networks (TIMN) that can effectively model a representation for aspect terms across domains. Marcacini et al. [30] use transductive learning to map linguistic features of source and target domains in a heterogeneous network. Lee et al. [31] propose a transfer learning approach for ATE that is based on sequentially fine-tuning pre-trained features over different product groups. Pereg et al. [32] investigate the introduction of external syntactic features into a BERT-based model in order to exploit structural similarities of aspects across domains. Liang et al. [33] exploit the correlation between coarse-grained aspect categories and fine-grained aspect terms via a multi-level reconstruction mechanism. In our work, we not only evaluate the performance of several ATE approaches in a domain adaptation setting, but we also assess their effectiveness in improving the accuracy of the recommendations.</p>
      <p>Recently, Da'u et al. [34] investigated the application of Deep Learning aspect extraction models for recommendation. While this work has the same premise as ours, there are two major differences: first, their architecture is based on CNNs, while we included several configurations based on residual LSTM and BERT. Second, their work relies on the presence of annotated ATE data for the target domain, and does not deal with domain adaptation.</p>
      <p>Based on the analysis of the literature, we have identified a gap: the papers mentioned above either describe domain adaptation strategies for ATE, or employ ATE for recommendation purposes. To the best of our knowledge, none combine the two ideas together, by explicitly measuring the impact of domain adaptation on the quality of the recommendations. We believe that this is very important, especially due to the extreme scarcity of annotated datasets for training ATE systems, which hinders their applicability to the recommendation scenario.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Aspect-based recommendation framework</title>
      <p>In this section, we describe a novel review-aware aspect-based recommendation framework that has been created for the purposes of this study. We exploit user reviews in order to go beyond item ratings, by extracting richer aspect-based evaluations. The main advantage of this framework is that it lets us discover new aspects directly from user reviews. Additionally, the aspect-based item ratings enrich the user profile, as they let us understand which aspects users care more about. Finally, they allow us to identify the individual strengths and weaknesses of each item from the user's point of view.</p>
      <p>The proposed architecture is composed of several sub-modules, as shown in the example in Figure 1. The first one is the ATE module, which is in charge of identifying aspects mentioned in the user reviews, by extracting the corresponding aspect terms from the review text. The framework supports several ATE approaches, which will be detailed in Section 3.1. The second component is the Aspect Clustering module, whose role is to group aspect terms that express similar concepts together into aspects. The Sentiment Analysis module works in parallel with the previous two. Its role is to extract the user's sentiment from the review in order to assign a score to each aspect term. Details on this step will be discussed in Section 3.2.</p>
      <p>The outputs of the Aspect Clustering and Sentiment Analysis modules are used to compose the aspect-based item ratings, which are organized into a 3-dimensional tensor (i.e. a tensor in which the first dimension represents the users, the second represents the items, and the third represents the aspect clusters), which is then passed to the Multi-Criteria recommendation algorithm. More details on this component are discussed in Section 3.3.</p>
      <p>Figure 1 shows an example of execution of our framework. Each review is split into atomic sentences, and then each sentence is given as input to both the ATE module and the SA module, in order to extract both aspect terms and ratings. In the example, starting from the sentence "As always we had a great glass of wine while we waited", the ATE module extracts the "glass of wine" aspect term, and the SA module assigns a positive rating to it. The extracted aspect is then given as input to the Aspect Clustering module, which assigns it to the right cluster, i.e. Beverage. The cluster information and the predicted sentiments are used to generate the aspect-based ratings tensor. The Recommendation Algorithm takes this tensor as input for generating a list of recommendations.</p>
      <sec id="sec-2-1">
        <title>3.1. Aspect Term Extraction</title>
        <p>This section is focused on describing the ATE component of the framework. ATE is one of the sub-tasks of ABSA [14]. Most approaches treat the task of extracting relevant aspects as a sequence labeling problem [21], in which the review is first tokenized, and then each token is classified as either being an aspect term or not. A classifier can be trained by supplying supervised data, i.e. pre-annotated reviews. The standard schema for annotating reviews is BIO tagging. According to this schema, three distinct labels can be associated with each token: B means that the token represents the beginning of an aspect term, I means that it represents the continuation of an aspect term, while O means that it is not an aspect term. This schema is shared with other sequence labeling tasks, such as Named Entity Recognition (NER).</p>
        <p>Figure 2 shows the architecture of the ATE module. For this task, we focused on techniques based on Deep Learning, which have proven to be the most promising in the state of the art. In our study, we focused on the well-known BERT model and on the residual Bi-LSTM. BERT is one of the most recent pre-trained frameworks for NLP, and it can be exploited for many tasks, including NER and ATE. The residual Bi-LSTM is a variant of the classical Bi-directional LSTM which was successfully used in other sequence labeling tasks, such as in Tran et al. [35]. It is composed of two stacked Bi-LSTM layers, where the sum of the output of the first and second layer is sent to the final softmax layer, instead of sending only the output of the second layer.</p>
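        <p>As an illustration, the BIO schema described above can be decoded into aspect-term spans with the following short sketch; the tokens, labels, and function name are illustrative and not part of the original framework.</p>
        <preformat>
```python
# Sketch of BIO-tag decoding for Aspect Term Extraction: given one label
# per token, recover the aspect-term spans. Tokens and labels below are
# illustrative, not taken from the datasets used in the paper.
def decode_bio(tokens, labels):
    """Group tokens tagged B/I into aspect terms; O tokens are skipped."""
    terms, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":                 # beginning of a new aspect term
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif label == "I" and current:   # continuation of the current term
            current.append(token)
        else:                            # O (or stray I): close any open term
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms

tokens = ["a", "great", "glass", "of", "wine", "while", "we", "waited"]
labels = ["O", "O", "B", "I", "I", "O", "O", "O"]
print(decode_bio(tokens, labels))  # ['glass of wine']
```
        </preformat>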
        <p>Different embedding strategies have been used in order to encode the tokens into real-valued vectors. In particular, we aim to use the ability to capture a contextual representation of words to learn a model that is independent from the domain, i.e. that is able to extract aspect terms from reviews of any domain. In this way, we can exploit a model trained on a given domain to extract aspect terms from another, unseen domain. Hence the definition of domain adaptation. The following is a list of all the ATE approaches that are included in the evaluation.</p>
        <p>Pre-trained Word2Vec-Residual LSTM. Word2Vec is one of the first successful word embedding techniques, introduced in Mikolov et al. [36]. For this configuration, we employed embeddings that were previously trained on a part of the Google News dataset (https://code.google.com/archive/p/word2vec/). The neural network architecture used in this configuration is the residual Bi-directional Long-Short Term Memory (LSTM) network described earlier.</p>
        <p>Pre-trained GloVe-Residual LSTM. For this approach, we used a set of pre-trained embeddings from GloVe. GloVe is a model for distributed word representation, introduced in Pennington et al. [37]. It is developed as an open-source project at Stanford University, and the pre-trained embeddings are publicly available (https://nlp.stanford.edu/projects/glove/). The neural network architecture used is the residual LSTM, like in the previous configuration.</p>
        <p>ELMo embeddings-Residual LSTM. ELMo (Peters et al. [38]) stands for Embeddings from Language Models, and is a novel contextualized embedding strategy. That is, instead of using a single vector for each word in the dictionary, ELMo looks at the entire sentence before assigning each word in it its embedding. The result is that the embeddings generated by ELMo are deeply contextualized, and are more capable of handling polysemy. In this configuration, the architecture is defined as follows: an ELMo embedding layer is used, followed by the residual Bi-LSTM layers described in the previous configurations.</p>
        <p>BERT. For this configuration, we employed BERT, introduced in Devlin et al. [39], which has been successfully applied in a variety of NLP tasks such as NER and text classification. Specifically, we employed a pre-trained BERT model available from the PyTorch library (https://pypi.org/project/pytorch-pretrained-bert/). This model is then fine-tuned, i.e. its parameters are updated by training it on the ATE task. The neural network architecture used by BERT is a multi-layer bidirectional Transformer encoder, as described in [39].</p>
      </sec>
      <sec id="sec-2-1b">
        <title>3.2. Aspect Term Clustering and Sentiment Analysis</title>
        <p>As stated in the Introduction, one of the main problems of extracting aspect-based ratings from reviews is that users may refer to the same aspect in many different forms. Therefore, a strategy for grouping together all aspect forms that refer to the same concept is needed. We propose to group aspect terms together based on their Word2Vec representation. In the case of multi-word aspect terms, we calculated the average of the embeddings of each word. We then perform a clustering task by using the K-means algorithm. This allows us to automatically group aspect terms into aspect categories in an unsupervised way.</p>
        <p>We then used the VADER sentiment analysis model offered by the NLTK library (https://www.nltk.org/) to obtain the rating assigned to each aspect term in the review. Each review is split into atomic sentences, which are fed to the sentiment analyzer in order to predict their polarity. We then use this sentiment to assign a score to all the aspect terms extracted from that sentence.</p>
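        <p>The clustering step described above can be sketched as follows, assuming toy two-dimensional vectors in place of the real Word2Vec embeddings and using the K-means implementation from scikit-learn; all names and values below are illustrative.</p>
        <preformat>
```python
import numpy as np
from sklearn.cluster import KMeans

# Toy word vectors standing in for Word2Vec embeddings (the real ones
# are high-dimensional); the values below are invented for illustration.
word_vec = {
    "glass":  np.array([0.9, 0.1]),
    "of":     np.array([0.8, 0.2]),
    "wine":   np.array([1.0, 0.0]),
    "beer":   np.array([0.9, 0.0]),
    "waiter": np.array([0.0, 1.0]),
    "staff":  np.array([0.1, 0.9]),
}

def term_vector(term):
    """Average the embeddings of each word for multi-word aspect terms."""
    return np.mean([word_vec[w] for w in term.split()], axis=0)

terms = ["glass of wine", "beer", "waiter", "staff"]
X = np.stack([term_vector(t) for t in terms])
# Group aspect terms into aspect categories in an unsupervised way.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
cluster = dict(zip(terms, labels))
print(cluster)
```
        </preformat>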
      </sec>
      <sec id="sec-2-1c">
        <title>3.3. Aspect-Based Multi-Criteria Recommendation</title>
        <p>Once the proposed framework has extracted all aspect-based ratings from the reviews, the last step is the recommendation. Recommendations are generated via a multi-criteria algorithm based on collaborative filtering [40]. For this purpose, we treated the sentiments extracted by our framework as the ratings given by the user to the item for each aspect. For each aspect that was not mentioned in the user review, we decided to assign the item's overall rating. This choice was made empirically, as it improved the performance of the recommendation algorithm. The rest of this section contains a description of the recommendation algorithms.</p>
        <sec id="sec-2-1-1">
          <title>User-to-User Multi-Criteria CF</title>
          <p>This is an extension of the similarity-based approaches for CF. The distance d(u<sub>i</sub>, u<sub>j</sub>) between users u<sub>i</sub> and u<sub>j</sub> is calculated using a multi-criteria distance function that takes the ratings given to each aspect into account (Equation 13 in [40]). For a new user-item pair, we generate a neighborhood of the top-n most similar users, and then we calculate the predicted overall rating using the adjusted weighted sum of the neighbors' ratings (Equation 3 in [40]).</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Item-to-Item Multi-Criteria CF</title>
          <p>This is the multi-criteria equivalent of the item-based CF technique. As for the previous technique, the distance d(i<sub>k</sub>, i<sub>l</sub>) between items is calculated using a multi-criteria distance function (Equation 5 in [9]). For any given user-item pair, we generate a neighborhood of the top-n most similar items. The overall predicted rating is calculated using the item-based equivalent of the adjusted weighted sum approach found in [40].</p>
          <p>Multi-Criteria SVD: This approach is based on Singular Value Decomposition (SVD); a description of this technique can be found in Koren et al. [41]. This technique was originally developed for single-criteria RSs. In order to extend it to a multi-criteria scenario, we used a naive aggregation function-based approach [40, 42]: we divided the k-dimensional multi-criteria recommendation task into a set of k single-criteria tasks. This means that we trained k SVD models, one for each aspect a<sub>i</sub>, for i ∈ {1, ..., k}. Each model predicts the rating for a specific aspect, r<sub>i</sub>(u, j). In order to predict the overall rating r(u, j) for a given user u and an item j, we calculate an aggregate function: r(u, j) = f(r<sub>1</sub>(u, j), ..., r<sub>k</sub>(u, j)). In our case, the aggregate function is a simple average of the aspect-based ratings.</p>
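          <p>The aggregation function used here can be illustrated with a minimal sketch; the per-aspect predictions below are invented.</p>
          <preformat>
```python
# The aggregation step described above: k single-criteria models each
# predict one aspect rating r_i(u, j); the overall rating r(u, j) is
# their simple average. The per-aspect predictions are invented.
def aggregate(per_aspect_predictions):
    return sum(per_aspect_predictions) / len(per_aspect_predictions)

# e.g. k = 3 aspect models predicted these ratings for one (user, item) pair
print(aggregate([4.0, 3.5, 4.5]))  # 4.0
```
          </preformat>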
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4. Evaluation</title>
        <p>This section describes the in-vitro experiment that we set up to evaluate the performance of our framework. The
experiment is divided into two parts. First, we evaluate the
ATE models that were described in Section 3.1, in order
to determine which one has the best performance when
trained in a domain adaptation scenario. The second step
of the experiment is the recommendation test: we extract
aspect-based ratings from a dataset of restaurant reviews
using the best ATE model from the previous test, and
then we evaluate each of the multi-criteria
recommendation approaches discussed in Section 3.3 in terms of
their rating prediction accuracy. These approaches will
also be compared to several baselines. This experiment
will assess whether the multi-criteria recommendations
generated by our framework are more accurate than the
ones obtained by using single-criteria ratings.</p>
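        <p>The rating prediction accuracy used in this comparison can be sketched as follows; the predicted and actual ratings below are invented for illustration.</p>
        <preformat>
```python
# Mean Average Error (MAE) between predicted and actual overall ratings,
# the rating-prediction accuracy measure used in this evaluation.
def mae(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

print(mae([4.0, 3.5, 5.0], [4.0, 3.0, 4.0]))  # 0.5
```
        </preformat>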
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4.1. Evaluation of the ATE approaches</title>
      <sec id="sec-3-1">
        <p>We collected six datasets for the ATE task from the literature. Three of them come from the SemEval ABSA challenges, with reviews about restaurants, laptops and hotels [12, 13, 14], while the other three are found in Liu et al. [18] and contain reviews about computers, speakers and routers. Table 1 reports the number of sentences and aspect terms contained in each dataset.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Description of the datasets</p></caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>#Sentences</th><th>#Aspect terms</th></tr>
            </thead>
            <tbody>
              <tr><td>Restaurants (SemEval 2014-15-16)</td><td>7841</td><td>8183</td></tr>
              <tr><td>Laptops (SemEval 2014)</td><td>3845</td><td>2918</td></tr>
              <tr><td>Hotels (SemEval 2015)</td><td>266</td><td>363</td></tr>
              <tr><td>Computers (Liu et al.)</td><td>531</td><td>213</td></tr>
              <tr><td>Speakers (Liu et al.)</td><td>689</td><td>454</td></tr>
              <tr><td>Routers (Liu et al.)</td><td>879</td><td>325</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>A single domain study was conducted by training and testing each ATE model on the same dataset. The train-test split was performed via 5-fold cross validation. The metrics used to evaluate the performance are Precision, Recall, and F1-score. An aspect term was considered correctly recognized if all the tokens that compose it were correctly tagged by the system; therefore, partial matches were not considered in the evaluation. For each configuration, we calculated the overall score by averaging the metrics obtained for each fold.</p>
        <p>In addition to the single domain study, we performed a domain adaptation experiment, which tests each model's ability to generalize the ATE task onto a new, unseen domain. We performed six tests, one for each dataset. In each test, we used one dataset as the test set, and all remaining datasets as the training and development set, using a random 80-20 split.</p>
        <p>Table 2 describes the results of the experiments. Single refers to the single domain tests, while DA refers to the domain adaptation tests. We report the Precision, Recall and F1-measure for each dataset and each model. The table shows that the combination of ELMo embeddings with the residual Bi-LSTM is able to outperform all the other approaches, except for the domain adaptation scenario in the Laptops dataset, in which case BERT achieves slightly higher performance. Concerning the single domain experiment, it is also interesting to note that all four approaches perform better on the Restaurants dataset than on the Laptops dataset. This is not surprising, due to the fact that the Restaurants dataset is larger than the Laptops one. Even on the smaller datasets (Hotels, Speakers, Computers, Routers), ELMo still obtained the best performance. However, the situation is less clear for the other approaches. On the Hotels dataset, which is the smallest one, GloVe and Word2Vec obtain second and third place, with an F1 of 0.612 and 0.528 respectively. BERT is again last, with 0.332, which may suggest that this approach is especially affected by training set size. An interesting observation can be made about the Routers dataset: despite not being the smallest dataset, all approaches performed especially poorly on it.</p>
        <p>In the domain adaptation test, ELMo outperforms the other three models in five out of six datasets. We also compare the scores obtained from the single domain and domain transfer tests. In the largest datasets, we can observe that the latter induces a substantial loss in F1 compared to the former: around 28% in the Restaurants domain, and around 47% in the Laptops domain. This loss can be attributed to the lack of domain-specific data in the respective domains. In the smaller datasets, such as Hotels, the loss is either very small or nonexistent. Similar observations can be made for the BERT approach in the larger datasets. In the smaller datasets, however, the domain transfer configuration actually outperforms the single domain one. This gives more credibility to the hypothesis that BERT is more susceptible to training set size compared to ELMo. The GloVe and Word2Vec approaches show much larger losses. This is a clear indication that they are less capable of transferring knowledge on the ATE task from one domain to another.</p>
        <p>Based on these results, we can say with enough confidence that ELMo is the approach that obtained the best performance on the ATE task. Not only did it outperform the other three approaches in the single domain setting, but it also demonstrated a good ability to transfer the aspect extraction task over different domains. For this reason, we chose this approach as part of the ATE component of our framework.</p>
      </sec>
      <sec id="sec-3-1b">
        <title>4.2. Evaluation of the Recommender System</title>
        <p>We performed an experiment to measure our framework's recommendation accuracy. In particular, the objective of this experiment is to answer the following research questions. RQ1: What is the impact of domain adaptation strategies for ATE on the quality of multi-criteria recommendations? RQ2: How does our framework compare against several single-criteria baselines?</p>
        <p>For this experiment, we employed the Yelp Recruiting Competition dataset, which contains restaurant reviews. This dataset is composed of 45,981 users, 11,537 items, and 229,906 reviews, with a sparsity of around 99.95%. Each item in the dataset contains the user ID, the business ID, the review text, and an overall score given by the user on a 1-5 scale. The review set was also filtered by excluding all users that rated less than 10 items. The filtered dataset contains 4,393 users, 10,801 items, and 138,301 reviews.</p>
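        <p>The sparsity figures reported above can be reproduced from the user, item, and review counts; this sketch only assumes that sparsity is computed as one minus the ratio of observed ratings to all possible user-item pairs.</p>
        <preformat>
```python
# Sparsity of a user-item rating matrix: the fraction of (user, item)
# pairs with no observed rating.
def sparsity(users, items, ratings):
    return 1.0 - ratings / (users * items)

full = sparsity(45981, 11537, 229906)     # around 0.9996, i.e. 99.95+%
filtered = sparsity(4393, 10801, 138301)  # around 0.9971
print(round(full, 4), round(filtered, 4))
```
        </preformat>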
      </sec>
      <sec id="sec-3-2">
<title>4.2.1. Experimental protocol</title>
        <p>The dataset was input to our framework, and all the steps described in Section 3 were performed. Aspect terms were extracted using the ELMo approach. For this experiment, we used two ATE models: one trained on all six datasets described in Section 4.1, and another trained without the Restaurants dataset. This allows us to assess the difference in recommendation quality caused by the lack of annotated ATE training data in the target domain.</p>
        <p>The aspect terms were then grouped together into k aspects, and ratings were assigned via the Sentiment Analysis component described in Section 3.2, which transformed each review into a (k + 1)-dimensional vector containing the user's rating of the restaurant for each of the k aspects, plus the overall rating. We experimented with different values of k (10, 30, and 50) in order to increase the generality of the results. Finally, the aspect-based rating vectors were passed to the recommendation algorithms described in Section 3.3. We evaluated the rating prediction accuracy of the algorithms by measuring the Mean Absolute Error (MAE). 10-fold cross-validation was performed on the dataset, and the MAE values for each fold were averaged together. For each of the three multi-criteria recommendation algorithms (user-to-user, item-to-item, and SVD), we chose the combination of parameters that obtained the best results. These models were then compared against several baselines: single-criteria user-to-user CF (with MSD and Pearson similarity measures), single-criteria item-to-item CF (with MSD and Pearson similarity measures), Singular Value Decomposition (SVD), and Non-negative Matrix Factorization (NMF), all of which were also trained and tested using 10-fold cross-validation. For both the user-to-user and item-to-item CF baselines, we employed the variants that take into account the user and item means, to make them more comparable with their multi-criteria equivalents. This lets us understand whether the aspect-based ratings extracted by our framework actually cause an improvement in recommendation accuracy.</p>
      </sec>
      <sec>
        <title>4.2.2. Results</title>
        <p>Table 3 reports the results obtained by the three multi-criteria recommendation algorithms supported by our framework, with different combinations of parameters. For the user-to-user and item-to-item algorithms, we set the neighborhood size to 10, 20, 30, 80, and 200; we chose these values because using a higher number of neighbors caused a decrease in accuracy. For all three algorithms, the best performance is obtained with 10 aspects: increasing the number of aspects decreases performance. This makes sense, since the effectiveness of the multi-criteria distance metrics largely depends on the number of commonly rated aspects between the two users (or the two items). Increasing the number of aspects also increases the sparsity of the aspect-based ratings, which makes these metrics less effective. Table 3 shows that the multi-criteria user-to-user algorithm performs best with a neighborhood size of 200, with an MAE of 0.8147 and 0.8155 for the models trained with and without the Restaurants dataset, respectively. For the multi-criteria item-to-item variant, the best neighborhood size is 80 for the model trained with the Restaurants dataset, and 200 for the model trained without it. In both neighborhood-based models, the model trained without the Restaurants dataset performs slightly worse than the one trained with all datasets. This is consistent with the observations made during the experiment described in Section 4.1, i.e. the loss in recommendation accuracy may be caused by a loss in ATE accuracy. However, this is not true for the multi-criteria SVD approach: the model trained without the Restaurants dataset achieved better performance (MAE: 0.8053) than the one trained on all datasets (MAE: 0.8062). This suggests that this approach is less susceptible to the aspect-based rating sparsity problem. A Wilcoxon test was performed to evaluate the significance of these differences, and it confirms that they are all significant (p &lt; 0.01). We can answer RQ1 by stating that the proposed domain adaptation strategy for ATE does cause a slight loss in recommendation performance in the multi-criteria user-to-user and item-to-item algorithms; however, it was also associated with an equally small increase in accuracy for the multi-criteria SVD algorithm.</p>
        <p>Finally, in Table 4 we compare the performance of our framework with the baselines described earlier. We evaluated the single-criteria user-to-user and item-to-item baselines by setting the neighborhood size to 10, 20, 30, 80, and 200, and reported the best performance for each baseline. The results show that all three multi-criteria algorithms are able to outperform their single-criteria equivalents. The best result overall is achieved by the multi-criteria SVD on the model trained without the Restaurants dataset: even though it is based on a basic aggregation-function approach, it obtained a significant improvement over all baselines. A Wilcoxon statistical test was performed in order to verify the significance of the differences in MAE, and it showed that the multi-criteria SVD approach performed significantly better than all the baselines (p &lt; 0.01). This allows us to confidently answer RQ2 by stating that our framework compares favorably against all the selected baselines even when no domain-specific ATE data was available during training. This shows that the proposed domain adaptation approach is able to effectively exploit review data in order to improve recommendation accuracy.</p>
      </sec>
      <sec>
        <title>5. Conclusion</title>
        <p>In this paper, we presented an investigation on the use of domain adaptation strategies to perform Aspect Term Extraction (ATE) without the need for domain-specific training data, as well as the impact of using this strategy in a multi-criteria recommender system. For this purpose, we developed an aspect-based recommendation framework that automatically extracts multi-criteria ratings from text reviews using state-of-the-art Deep Learning ATE models. We performed several experiments to evaluate the ATE component both in a single-domain and in a domain adaptation setting, in order to find the best model to use in the multi-criteria recommendation scenario. We trained the aspect term extraction component twice, with and without domain-specific data, and tested several combinations of parameters and different multi-criteria recommendation algorithms in order to increase the generality of the results. In all cases, the framework was able to outperform single-criteria baselines, with small differences between the two models. Moreover, the proposed strategy improves the quality of the recommendations even when no domain-specific ATE training data is available.</p>
        <p>The most important limitation to the validity of our experiment is the small amount of data available for the ATE task. However, it is worth noting that this is a limitation of the state of the art, since all works on the subject use the same datasets (or a subset of them) that we used in our work. As future work, we plan to extend this work by including more recent Deep Learning architectures for ATE. We also plan to extend the recommendation test by including more multi-criteria recommendation algorithms, and by comparing our framework with systems that extract latent factors from reviews.</p>
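<p>The dependence on commonly rated aspects noted above can be illustrated with a sketch of a multi-criteria distance. This is an illustrative Euclidean variant restricted to shared aspects, not necessarily the exact metric used by the framework:</p>

```python
import math

def aspect_distance(ratings_a, ratings_b):
    """Distance between two users' aspect-based rating dicts.

    Only aspects rated by both users contribute; with few shared aspects the
    distance rests on little evidence, which is why sparser aspect-based
    ratings make neighborhood metrics less reliable.
    """
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return None  # no overlap: the distance is undefined
    sq = sum((ratings_a[k] - ratings_b[k]) ** 2 for k in shared)
    return math.sqrt(sq / len(shared))

# Toy usage: only "food" and "service" are rated by both users.
u = {"food": 5, "service": 3, "price": 4}
v = {"food": 4, "service": 3, "ambience": 5}
print(aspect_distance(u, v))
```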
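<p>The evaluation procedure used throughout the experiments (MAE averaged over 10 cross-validation folds) can be sketched as follows; the interleaved fold split and the constant-mean predictor are purely illustrative assumptions:</p>

```python
def mae(pred, truth):
    """Mean Absolute Error over paired rating lists."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def cross_validated_mae(ratings, predict, k=10):
    """Average MAE over k folds. `predict` maps (train, test) to a list of
    predictions for the held-out examples (signature assumed for the sketch)."""
    folds = [ratings[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        preds = predict(train, test)
        scores.append(mae(preds, [t[-1] for t in test]))
    return sum(scores) / len(scores)

# Toy usage: (user, item, rating) triples scored by a global-mean predictor.
data = [("u", "i", float(s)) for s in (1, 2, 3, 4, 5) * 4]
global_mean = lambda train, test: [sum(r[-1] for r in train) / len(train)] * len(test)
print(cross_validated_mae(data, global_mean))
```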
      </sec>
    </sec>
  </body>
</article>