=Paper=
{{Paper
|id=Vol-2481/paper17
|storemode=property
|title=What Makes a Review helpful? Predicting the Helpfulness of Italian TripAdvisor Reviews
|pdfUrl=https://ceur-ws.org/Vol-2481/paper17.pdf
|volume=Vol-2481
|authors=Giulia Chiriatti,Dominique Brunato,Felice Dell'Orletta,Giulia Venturi
|dblpUrl=https://dblp.org/rec/conf/clic-it/ChiriattiBDV19
}}
==What Makes a Review Helpful? Predicting the Helpfulness of Italian TripAdvisor Reviews==
Giulia Chiriatti (University of Pisa) chiriattigiulia@gmail.com
Dominique Brunato, Felice Dell'Orletta, Giulia Venturi
Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR), ItaliaNLP Lab - www.italianlp.it
{dominique.brunato,felice.dellorletta,giulia.venturi}@ilc.cnr.it

Abstract

In this paper we introduce a classification system devoted to predicting the helpfulness of Italian online reviews. It is based on a wide set of features reflecting the different factors involved, and it is tested on different categories of TripAdvisor reviews. For this purpose, we collected the first Italian corpus of online reviews enriched with metadata related to their helpfulness, and we carried out an in-depth analysis of the most predictive features. (Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).)

1 Introduction

Predicting and modeling the factors that determine the helpfulness of online reviews has been attracting growing attention in the Natural Language Processing (NLP) community, driven both by practical applications and by the interest in studying the human variables underlying the assignment of helpful/unhelpful votes. Identifying the product reviews that are useful to customers is important for several e-business purposes (e.g. the development of product recommendation systems), as well as for investigating the persuasive elements that make a review helpful for a reader (Hong et al., 2012; Park, 2018). Several approaches have been devised, differing in the prediction method (mainly regression or classification algorithms) and in the typologies of factors considered, including content elements found within the review and contextual ones referring to user profiles. Although various strategies have already been followed, according to the recent survey by Diaz and Ng (2018) a number of issues are still open and deserve to be explored. Among others, they include i) the need for "more sophisticated textual features" that can model the writing style typical of helpful reviews, and ii) the lack of studies focused on languages other than English.

In this paper, we address these open issues and present a study devoted to predicting Italian review helpfulness, with a specific focus on the role played by linguistic features in modelling the style of helpful reviews. Similarly to previous studies, we tackled the task as a text classification problem, but with two main novelties. Firstly, we relied on different sets of predictors, considering both lexical (content) features and structural (i.e. morpho-syntactic and syntactic) features aimed at reconstructing the style of a text (the linguistic "form"). Secondly, we investigated which typology of features is the most effective for predicting the helpfulness of online reviews, and whether it remains the same across different review categories.

Our contribution. i) We collected a corpus of Italian online reviews enriched with metadata related to their helpfulness (available for research purposes at http://www.italianlp.it/resources/). ii) We developed the first classification system devoted to predicting the helpfulness of Italian online reviews, based on features modelling both the lexical and the linguistic factors involved, and tested it in two experimental scenarios, i.e. in-domain and out-domain with respect to the training category of reviews. iii) We identified and ranked the most predictive features, showing the key role played by linguistic features, especially for predicting the helpfulness of reviews belonging to a category very different from the training one.

2 Corpus

We collected a sample of almost 1 million user-generated reviews from the Italian section of TripAdvisor, focusing on two travel-related categories, restaurants and attractions (e.g. parks, historical sites), and two geographical areas, Rome and Milan. We also gathered two types of metadata associated with each review: the review rating and the number of helpful votes. Firstly, we filtered our data according to language (Italian) and length (> 7 tokens), discarding 52.29% of the total reviews. Then we empirically set a threshold at a minimum of 3 votes in order to distinguish helpful reviews (3+ votes) from unhelpful ones (0 votes); the threshold value was chosen considering the mean and the standard deviation of the number of votes in the initial dataset (2.21 ± 0.59). Some examples of reviews belonging to the two classes are reported in Table 2. In line with studies carried out for English (Park, 2018), review votes in our data tend to be sparse across all categories: in particular, reviews with 3+ votes constitute only 5.10% of the unfiltered dataset. For this reason we balanced the data by selecting a comparable number of helpful and unhelpful reviews per restaurant or attraction. As shown in Table 1, our final corpus consists of 42,107 reviews from 1,218 restaurants and 383 attractions, for a total of 4,133,312 tokens.

  Category      #Helpful   #Unhelpful   #Reviews
  Rome rest.      12,635       12,404     25,039
  Milan rest.      6,105        5,991     12,096
  Attractions      2,564        2,408      4,972
  TOTAL           21,304       20,803     42,107

Table 1: Corpus of helpful and unhelpful TripAdvisor reviews.

3 Helpfulness Predictors

According to our research purposes, we considered various categories of features aimed at modeling both the content and the linguistic "form" of online reviews. They can be grouped into three main classes: lexical, linguistic and metadata features. The first typology has already been tested in the literature (Diaz and Ng, 2018) to predict review helpfulness on the basis of meaningful words. On the contrary, the use of linguistic features extracted from sentence structure is introduced for the first time in this paper. Differently from previous studies (Kim et al., 2006; Hong et al., 2012), where the distribution of some Parts-of-Speech was exploited as a helpfulness predictor, we rely here on a wide set of linguistic features automatically extracted from the linguistically annotated corpus of reviews. Since such features have been shown to have a high discriminative power in different tasks, e.g. the assessment of text readability (Collins-Thompson, 2014) or the identification of the textual genre of a document (Cimino et al., 2017), we investigated in this study whether they are able to model the linguistic "form" (the style) of helpful reviews. In addition, we explored the contribution of a metadata feature (the star rating given by the reviewer) that has also been widely tested in studies on helpfulness prediction, as reported in Diaz and Ng (2018).

In order to extract the lexical and linguistic predictors of helpfulness, the corpus was linguistically annotated at different levels of analysis. In particular, it was tagged by the PoS tagger described in Dell'Orletta (2009) and dependency-parsed by the DeSR parser (Attardi et al., 2009).

Lexical features. They include two types of features: (i) the distribution of unigrams and bigrams of characters, words and lemmas (hereafter NGR); (ii) word embedding combinations (WE), obtained by separately computing the average of the vector representations of the nouns, verbs and adjectives in the review. The word embeddings were trained on the ItWaC corpus (Baroni et al., 2009) and on a collection of Italian tweets (http://www.italianlp.it/resources/italian-word-embeddings/) using the word2vec toolkit (Mikolov et al., 2013).
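The WE combination described above (a separate average over the vectors of the nouns, verbs and adjectives, followed by concatenation) can be sketched as follows. This is only an illustrative reconstruction, not the authors' code: the three-dimensional toy vectors, the `embeddings` dictionary and the hand-made PoS tags stand in for a real word2vec model and for the output of the PoS tagger.

```python
# Sketch of the WE feature: per-PoS averaged word embeddings, concatenated.
# The toy vectors and the hand-made PoS lookup are placeholders for a real
# word2vec model and a PoS-tagged review.

def average(vectors, dim):
    """Component-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(components) / len(vectors) for components in zip(*vectors)]

def we_features(tagged_review, embeddings, dim=3):
    """Concatenate the averaged embeddings of nouns, verbs and adjectives."""
    feats = []
    for pos in ("NOUN", "VERB", "ADJ"):
        vecs = [embeddings[word]
                for word, tag in tagged_review
                if tag == pos and word in embeddings]
        feats.extend(average(vecs, dim))
    return feats

# Toy example: "la pizza era cruda" -> one averaged vector per content PoS.
embeddings = {"pizza": [0.2, 0.1, 0.0], "cruda": [0.4, 0.3, 0.2],
              "era": [0.1, 0.0, 0.1]}
tagged = [("la", "DET"), ("pizza", "NOUN"), ("era", "VERB"), ("cruda", "ADJ")]
print(we_features(tagged, embeddings))
```

Averaging per PoS class rather than over the whole review keeps the contribution of nouns, verbs and adjectives separate in the final feature vector, which is the design choice the paper describes.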
Linguistic features. They refer to four main types, modelling diverse aspects of writing style:
- raw text features, i.e. review, sentence and word length, calculated in terms of sentences, tokens and characters, respectively;
- features related to lexical richness, captured by considering i) the internal composition of the vocabulary of the review with respect to the Basic Italian Vocabulary and its usage repertories (De Mauro, 2000), and ii) the Type/Token Ratio;
- morpho-syntactic features, i.e. the distribution of unigrams of Parts-of-Speech, and of verb moods, tenses and persons;
- syntactic features, which refer to diverse characteristics of sentence structure: i) the depth of the whole parse tree (calculated in terms of the longest path from the root of the dependency tree to some leaf); ii) the length of dependency links (i.e. the number of tokens occurring between the head and the dependent); iii) the distribution of dependency types; iv) the average depth and the distribution of embedded prepositional chains modifying a noun; v) a set of features aimed at modeling the behaviour of verbal predicates, i.e. the number of verbal roots, the average verbal arity, the distribution of verbs by arity, and the distribution of verbal predicates with elliptical subject; vi) the usage of subordination, calculated considering the ratio between principal and subordinate clauses, and the average depth and the distribution of embedded chains of subordinate clauses; vii) a last set of features related to the canonical construction of a sentence in Italian, i.e. the relative ordering of subordinate clauses with respect to the main clause, and of subject and object with respect to their verbal head.

Helpful (Rome restaurants)
  Italian: "La prima regola di un buon ristorante che fa pizza no stop è: Scegliere la pizza che preferisco. Qui non solo non si può scegliere la pizza ma capita spesso che escano le stesse pizze più volte così uno è costretto a mangiare sempre la stessa!! Per non parlare dell'ambiente poi, un vero casino, capisco che l'area bambini è la principale attrazione del ristorante, rivolto soprattutto alle famiglie, ma il casino che si crea non è cmq giustificabile. La pizza è di una qualità davvero scadente, praticamente era cruda!!! La pizza con la Lonza....una semplice focaccia con un pezzo di prosciutto preso molto probabilmente al discount! Ragazzi, carina l'idea di prendersi cura dei pargoli, ma non prendiamoci in giro però."
  English: "The first rule of a good restaurant that makes non-stop pizza is: choose the pizza I prefer. Here not only can you not choose the pizza, but it often happens that the same pizzas come out several times, so one is forced to always eat the same one!! Not to mention the environment: a real mess. I understand that the children's area is the main attraction of the restaurant, which is aimed above all at families, but the mess that is created is still not justifiable. The pizza is of really poor quality; practically, it was raw!!! The pizza with Lonza.... a simple focaccia with a piece of ham most probably bought at the discount store! Guys, nice idea to take care of the little ones, but let's not kid ourselves."

Unhelpful (Milan restaurants)
  Italian: "Devo dire che trovandomi per caso in quella zona con i miei amici abbiamo provato il posto è devo dire ché è molto accogliente e che la zona per mangiare nel cortile è proprio intima e carina... Per quanto riguarda il mangiare posso dire di essere soddisfatto perché le portate erano nelle mie corde ed avendo preso il pesce ero soddisfatto di quanto cucinato dal cuoco. Bravi mica male."
  English: "I must say that, finding myself by chance in that area with my friends, we tried the place, and I must say that it is very welcoming and that the eating area in the courtyard is really intimate and pretty... As for the food, I can say I am satisfied, because the courses were to my taste and, having ordered the fish, I was satisfied with what the cook had prepared. Well done, not bad at all."

Table 2: Examples of helpful vs unhelpful reviews.

The effectiveness of these features in predicting helpful online reviews is confirmed by the Wilcoxon rank sum test: 75% of the considered features (160 out of 212) turned out to vary in a statistically significant way between helpful and unhelpful reviews. As shown in Table 3, helpful reviews are on average one sentence longer than unhelpful ones, and they also contain much longer sentences. The correlation between length and helpfulness is not surprising, since longer sentences are likely to be more informative, thus offering more content that might influence the outcome of the voting process. The higher sentence length also has an expected effect on some syntactic features correlated with complexity: sentences occurring in helpful texts have deeper syntactic trees (Avg. max depth) and contain more subordinate clauses and embedded prepositional chains. However, they appear simpler with respect to other features, related for instance to canonicity effects: they show a more standard syntactic structure, with a higher proportion of objects in post-verbal position and of subjects preceding the main verb. Interestingly, helpfulness is also positively correlated with a reader-focused style, as shown by the greater use of pronouns and of verbs in the first and second person.

  Feature                    Help    UnHelp    Diff.
  N. sentences               4.61      3.46     1.15
  Avg. sent length          36.79     26.22    10.57
  Avg. clause length        10        11.65    -1.65
  % Nouns                   23.5      24.5     -1
  % Verbs                   14.28     12.79     1.49
  % Adj                      8.32     10.37    -2.41
  % Negative adv             1.33      0.97     0.36
  % Pronouns                 4.99      4.14     0.85
  % 1st sing. person         9.23      8.15     1.08
  % 2nd plur. person         1.34      1.08     0.26
  Avg. prep chains length   11.4       6.3      5.1
  Avg. max depth             7.64      6.28     1.36
  % Subord. clauses         62.09     43.89    18.2
  % Post-verbal obj         78.84     68.66    10.18
  % Pre-verbal subj         73.13     65.03     8.01

Table 3: A subset of linguistic features whose values vary in a statistically significant way between helpful and unhelpful reviews.

Metadata feature. Review star rating (STR) is the rating score assigned by the reviewer, ranging from 1 to 5. Previous research reported in Diaz and Ng (2018) has shown that a connection exists between the rating of a review and its helpfulness. In our dataset, rating scores are unequally distributed across the review categories. Restaurant reviews are more likely to have an extreme rating, either low or high, rather than a neutral one, and helpful reviews follow the same pattern: e.g., in the Rome restaurant category 37.05% of the helpful reviews have a rating of 1 and 25.76% a rating of 5. On the contrary, attraction reviews tend to have higher ratings, with 56.12% of the helpful ones belonging to the highest-rated class. Only the attractions category, therefore, seems to confirm the positivity bias discussed in Diaz and Ng (2018), according to which reviews with positive ratings are seen as more helpful.

4 Experiments and Results

We addressed helpfulness prediction as a binary classification problem. In order to assess the contribution of each set of features illustrated in Section 3, we defined two experimental scenarios, differing in the review categories chosen as test data and in the set-up (in terms of feature configurations). We built a classifier based on the LIBLINEAR implementation of Support Vector Machines with a linear kernel (Fan et al., 2008), trained on a set of 12,516 reviews written for 411 Rome restaurants. All the features were previously scaled to the same range [0, 1]. We evaluated our system by computing the accuracy score for each feature configuration. As the baseline for each review category we took the score of a classifier that always outputs the most probable class according to the class distribution of the dataset (in this case, the helpful class).

In the first experimental scenario we tested the feature models generated by the SVM classifier on a test set of 12,523 reviews that belong to the same domain as the training data (i.e. the Rome restaurants category) but were written for restaurants different from the ones in the training set. As shown in Table 4, we obtained a general improvement over the baseline with all feature configurations apart from the one that exploits only the metadata feature (STR, the star rating of the review). Nevertheless, this feature does improve the accuracy of all models by at least one point, thus confirming its usefulness for helpfulness prediction (Diaz and Ng, 2018). The results also highlight the prominent role of lexical information (NGR+WE) in assessing helpfulness, although this holds primarily in the in-domain scenario. Even if the accuracy of the linguistic model (LING) is lower than that of the other feature models, we found that linguistic information plays a main role in helpfulness prediction: it achieves an accuracy of 66%, and of 70.81% when review ratings are added, a value in line with that of the lexical model.

  Model         Accuracy
  STR             49.6%
  NGR             69.9%
  NGR+STR        71.13%
  WE             68.54%
  WE+STR         69.96%
  NGR+WE         70.17%
  NGR+WE+STR     71.14%
  LING              66%
  LING+STR       70.81%
  ALL            70.04%
  ALL+STR        71.05%
  Baseline       50.46%

Table 4: In-domain classification of helpful vs. unhelpful reviews using different feature models.

In the out-domain scenario we tested the considered feature models on reviews belonging to the other two categories (Milan restaurants and attractions). As reported in Table 5, the performance of the classifier tested on the Milan restaurant reviews, even if slightly worse, is very similar to that obtained on the Rome restaurant test set. This result suggests that the system may perform consistently across different geographical areas, although further experiments should be carried out; for example, we might test our models on a greater number of cities or on other types of geographical areas. As expected, accuracy decreases mainly in the domain most distant from the training one, i.e. the attractions category. This is especially the case for the lexical classification model, which drops by 10.5 points. The star rating feature is also shown to worsen the accuracy scores here, probably because of the way ratings are distributed in the attractions category with respect to the restaurant ones. It is interesting to note that the best performing model turned out to be the one exploiting the linguistic features (with a lower drop of 5.24 points), thus showing the predictive power of sentence structure information in predicting review helpfulness.

  Model        Milan     Attractions
  NGR+WE      69.38%       59.67%
  NGR+WE+STR  70.92%       58.02%
  LING        65.82%       60.76%
  LING+STR    70.92%       60.28%
  ALL          69.2%        59.9%
  ALL+STR     70.78%       58.49%
  Baseline    50.47%       51.56%

Table 5: Out-domain classification of helpful vs. unhelpful reviews in terms of accuracy using different feature models.
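The evaluation set-up described above can be sketched as follows. This is a minimal illustration, not the authors' code: the SVM itself (LIBLINEAR) is not reproduced, only the [0, 1] feature scaling, the majority-class baseline and the accuracy computation, with toy labels and feature values standing in for the real data.

```python
# Sketch of the evaluation set-up: min-max feature scaling to [0, 1],
# a majority-class baseline, and accuracy scoring. The toy labels and
# values below are placeholders for the real corpus.
from collections import Counter

def minmax_scale(values):
    """Scale a feature column to [0, 1] (a constant column maps to 0.0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in test_labels if y == majority)
    return hits / len(test_labels)

train = ["helpful"] * 6 + ["unhelpful"] * 4   # "helpful" is most frequent
test = ["helpful"] * 5 + ["unhelpful"] * 5
print(minmax_scale([3, 7, 11]))        # [0.0, 0.5, 1.0]
print(majority_baseline(train, test))  # 0.5
```

On a roughly balanced test set such a baseline sits near 50%, which matches the baseline accuracies reported in Tables 4 and 5 for this balanced corpus.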
5 Discussion

As discussed in the previous section, we found that the linguistic features achieve an accuracy almost in line with the one obtained using only lexical information and, interestingly enough, that they are the most predictive ones in the out-of-domain scenario. In order to gain insight into which of these features are the most effective for automatic classification, we ranked them according to the absolute value of their weight in the linear SVM model generated with the linguistic feature configuration. Among the 50 top-ranked ones, besides the raw text features (whose role in predicting helpfulness has already been proven in the literature), we found morpho-syntactic and syntactic features. They are typically related to a rich and articulated writing style. This is the case, for example, of features concerning nominal modification, in particular the number of prepositional chains (holding the 1st position in the ranking) and their average length, but also the distribution of adjectives and determiners. Others involve verbal structures, e.g. the number of dependents instantiated by verbal heads and the frequency of adverbs (especially negation adverbs). Features related to the usage of subordination, such as the number of subordinate structures and the average depth of parse trees, also appear among the top-ranked. Finally, another group of high-ranked features concerns a subjective writing style, as shown by the distribution of verbs in the first and second person. These types of features also turned out to be discriminant in the comparison between helpful and unhelpful reviews (Section 3). This shows that the writing style of helpful reviews, informative but also personal and reader-focused, has a high predictive power.

The importance of the linguistic features is further confirmed by a second inspection, in which the same ranking method was applied to the all-feature model. Also in this case, we found that 59.6% of the whole set of 212 linguistic features we considered falls within the 90th percentile of the ranking of the total 741,339 features.

6 Conclusion

In this paper, we have presented the first approach to the task of review helpfulness prediction for the Italian language. Two experimental scenarios were tested on a corpus of TripAdvisor reviews belonging to different categories (restaurants and attractions). In line with previous findings obtained for English, we confirmed that lexical information plays a significant role in classifying helpful reviews. In addition, we showed for the first time the high predictive power of linguistic features modeling the writing style independently of the content. This is particularly true in the two out-domain experiments: in the first case (same category, different geographical area), the classifier based on the linguistic features achieves the same accuracy as the model using lexical features, and it even outperforms all the other configurations when tested on the most distant review category (restaurants vs attractions).

Among the issues that we would like to investigate in the future, an interesting one concerns the role played by metadata features. In the reported results, we showed that star ratings are not relevant when considered alone, but that they give a plus when combined with both lexical and linguistic features. Beyond this metadata, we would like to extend the analysis to further user information possibly related to review helpfulness.

Acknowledgments

This work was partially supported by the 2-year project PERFORMA (Personalizzazione di pERcorsi FORMativi Avanzati), co-funded by Regione Toscana (POR FSE 2014-2020).

References

G. Attardi, F. Dell'Orletta, M. Simi and J. Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. Proceedings of Evalita'09, Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, December.

M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3), pp. 209-226.

A. Cimino, M. Wieling, F. Dell'Orletta, S. Montemagni and G. Venturi. 2017. Identifying Predictive Features for Textual Genre Classification: the Key Role of Syntax. Proceedings of the 4th Italian Conference on Computational Linguistics (CLiC-it), 11-13 December 2017, Rome.

K. Collins-Thompson. 2014. Computational assessment of text readability: a survey of current and future research. Recent Advances in Automatic Readability Assessment and Text Simplification, Special issue of International Journal of Applied Linguistics, 165(2), 97-135.

F. Dell'Orletta. 2009. Ensemble system for Part-of-Speech tagging. Proceedings of Evalita'09, Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, December.

T. De Mauro. 2000. Grande dizionario italiano dell'uso (GRADIT). Torino, UTET.

G. O. Diaz and V. Ng. 2018. Modeling and Prediction of Online Product Review Helpfulness: A Survey. Proceedings of ACL 2018.

R. E. Fan, K.-W. Chang, C.-J. Hsieh, X. Wang and C.-J. Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871-1874.

Y. Hong, J. Lu, J. Yao, Q. Zhu and G. Zhou. 2012. What Reviews are Satisfactory: Novel Features for Automatic Helpfulness Voting. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 495-504.

S. M. Kim, P. Pantel, T. Chklovski and M. Pennacchiotti. 2006. Automatically assessing review helpfulness. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 423-430.

T. Mikolov, K. Chen, G. Corrado and J. Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781.

Y.-J. Park. 2018. Predicting the helpfulness of online customer reviews across different product types. Sustainability, 10(1735), pp. 1-20.