CEUR Workshop Proceedings, Vol-2481, paper 17 — https://ceur-ws.org/Vol-2481/paper17.pdf
DBLP: https://dblp.org/rec/conf/clic-it/ChiriattiBDV19
                             What Makes a Review Helpful?
                Predicting the Helpfulness of Italian TripAdvisor Reviews
Giulia Chiriatti•, Dominique Brunato⋄, Felice Dell'Orletta⋄, Giulia Venturi⋄

•University of Pisa
chiriattigiulia@gmail.com

⋄Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR)
ItaliaNLP Lab - www.italianlp.it
{dominique.brunato,felice.dellorletta,giulia.venturi}@ilc.cnr.it
Abstract

In this paper we introduce a classification system devoted to predicting the helpfulness of Italian online reviews. It is based on a wide set of features reflecting the different factors involved and is tested on different categories of TripAdvisor reviews. For this purpose, we collected the first Italian corpus of online reviews enriched with metadata related to their helpfulness, and we carried out an in-depth analysis of the most predictive features.¹

¹ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Predicting and modeling the relevant factors that determine the helpfulness of online reviews has been attracting growing attention in the Natural Language Processing (NLP) community, motivated both by practical applications and by the interest in studying the human variables underlying the assignment of helpful/unhelpful votes. The identification of product reviews which are useful to customers can be important for several e-business purposes (e.g. the development of product recommendation systems) as well as for investigating the persuasive elements that make a review helpful for its readers (Hong et al., 2012; Park, 2018). Several approaches have been devised, differing at the level of prediction methods (mainly regression or classification algorithms) and of the typologies of factors considered, including content elements found within the review and contextual ones referring to user profiles. Although various strategies have already been followed, according to the recent survey by Diaz and Ng (2018) a number of issues are still open and deserve to be explored. Among others, they include i) the need for "more sophisticated textual features" that can be useful to model a writing style typical of helpful reviews, and ii) the lack of studies focused on languages other than English.

In this paper, we address these open issues and present a study devoted to predicting Italian review helpfulness, with a specific focus on the role played by linguistic features in modelling the style of helpful reviews. Similarly to previous studies, we tackled the task as a text classification problem, but with two main novelties. Firstly, we relied on different sets of predictors, considering both lexical (content) and structural features (i.e. morpho-syntactic and syntactic) aimed at reconstructing the style of a text (the linguistic "form"). Secondly, we investigated which typology of features is the most effective for predicting the helpfulness of online reviews and whether it remains the same across different review categories.

Our contribution. i) We collected a corpus of Italian online reviews enriched with metadata related to their helpfulness². ii) We developed the first classification system devoted to predicting the helpfulness of Italian online reviews, based on features modelling both the lexical and linguistic factors involved, and tested it in two experimental scenarios, i.e. in- and out-domain with respect to the training category of reviews. iii) We identified and ranked the most predictive features, showing the key role played by linguistic features, especially in predicting the helpfulness of reviews belonging to a category very different from the training one.

² The corpus is available for research purposes at http://www.italianlp.it/resources/

2 Corpus

We collected a sample of almost 1 million user-generated reviews from the Italian section of TripAdvisor, focusing on two travel-related categories, restaurants and attractions (e.g. parks, historical sites), and two geographical areas, Rome
and Milan. We also gathered two types of metadata associated with each review: the review rating and the number of helpful votes. Firstly, we filtered our data according to language (Italian) and length (> 7 tokens), discarding 52.29% of the total reviews. Then we empirically³ set a threshold at a minimum of 3 votes in order to distinguish helpful reviews (3+ votes) from unhelpful ones (0 votes). Some examples of reviews that belong to the two classes are reported in Table 2. In line with studies carried out for the English language (Park, 2018), also in our case review votes tend to be sparse across all categories: in particular, reviews with 3+ votes constitute only 5.10% of the unfiltered dataset. For this reason we balanced the data by selecting a comparable number of helpful and unhelpful reviews per restaurant or attraction. As shown in Table 1, our final corpus consists of 42,107 reviews from 1,218 restaurants and 383 attractions, for a total of 4,133,312 tokens.

³ In order to choose the threshold value, we considered the mean and the standard deviation of the number of votes in the initial dataset (2.21 ± 0.59).

Category      #Helpful   #Unhelpful   #Reviews
Rome rest.    12,635     12,404       25,039
Milan rest.   6,105      5,991        12,096
Attractions   2,564      2,408        4,972
TOTAL         21,304     20,803       42,107

Table 1: Corpus of helpful and unhelpful TripAdvisor reviews.

Helpful (Rome restaurants). Italian: "La prima regola di un buon ristorante che fa pizza no stop è: scegliere la pizza che preferisco. Qui non solo non si può scegliere la pizza ma capita spesso che escano le stesse pizze più volte così uno è costretto a mangiare sempre la stessa!! Per non parlare dell'ambiente poi, un vero casino, capisco che l'area bambini è la principale attrazione del ristorante, rivolto soprattutto alle famiglie, ma il casino che si crea non è cmq giustificabile. La pizza è di una qualità davvero scadente, praticamente era cruda!!! La pizza con la Lonza....una semplice focaccia con un pezzo di prosciutto preso molto probabilmente al discount! Ragazzi, carina l'idea di prendersi cura dei pargoli, ma non prendiamoci in giro però."
English: "The first rule of a good restaurant that makes pizza no stop is: choose the pizza I prefer. Here not only can you not choose the pizza, but it often happens that the same pizzas come out several times, so one is forced to always eat the same one!! Not to mention the environment then, a real mess; I understand that the children's area is the main attraction of the restaurant, aimed above all at families, but the mess that is created is not justifiable anyway. The pizza is of a really poor quality, practically it was raw!!! Pizza with Lonza....a simple focaccia with a piece of ham most probably bought at the discount store! Guys, nice idea to take care of the little ones, but let's not fool ourselves."

Unhelpful (Milan restaurants). Italian: "Devo dire che trovandomi per caso in quella zona con i miei amici abbiamo provato il posto è devo dire ché è molto accogliente e che la zona per mangiare nel cortile è proprio intima e carina...Per quanto riguarda il mangiare posso dire di essere soddisfatto perché le portate erano nelle mie corde ed avendo preso il pesce ero soddisfatto di quanto cucinato dal cuoco. Bravi mica male."
English: "I must say that, finding myself by chance in that area with my friends, we tried the place and I must say that it is very welcoming and that the area for eating in the courtyard is really intimate and pretty... As for the food, I can say I'm satisfied because the courses were to my taste, and having ordered the fish I was satisfied with what the cook had cooked. Well done, not bad at all."

Table 2: Examples of helpful vs unhelpful reviews.

3 Helpfulness Predictors

According to our research purposes, we considered various categories of features aimed at modeling both the content and the linguistic "form" of online reviews. They can be grouped into three main classes: lexical, linguistic and metadata features. The first typology has already been tested in the literature (Diaz and Ng, 2018) in order to predict review helpfulness on the basis of meaningful words. On the contrary, the use of linguistic features extracted from sentence structure is introduced for the first time in this paper. Differently from previous studies (Kim et al., 2006; Hong et al., 2012), where the distribution of some Parts-of-Speech was exploited as a helpfulness predictor, we rely here on a wide set of linguistic features automatically extracted from the linguistically annotated corpus of reviews. Since these features have been shown to have a high discriminative power in different tasks, e.g. the assessment of text readability (Collins-Thompson, 2014) and the identification of the textual genre of a document (Cimino et al., 2017), we investigated in this study whether they are able to model the linguistic "form" (the style) of helpful reviews. In addition, we explored the contribution of a metadata feature (i.e. the star rating given by the reviewer) that has also been widely tested in studies on helpfulness prediction, as reported in Diaz and Ng (2018).

In order to extract lexical and linguistic predictors of helpfulness, the corpus was linguistically annotated at different levels of analysis. In particular, it was tagged by the PoS tagger described in Dell'Orletta (2009) and dependency-parsed by the DeSR parser (Attardi et al., 2009).

Lexical features. They include two types of features: (i) the distribution of unigrams and bigrams of characters, words and lemmas (hereafter NGR); (ii) word embedding combinations (WE) obtained by separately computing the average of the vector representations of nouns, verbs and adjectives in the review. The word embeddings were trained on the ItWaC corpus (Baroni et al., 2009) and on a collection of Italian tweets⁴ using the word2vec toolkit (Mikolov et al., 2013).

⁴ http://www.italianlp.it/resources/italian-word-embeddings/

Linguistic features. They refer to four main types, modelling diverse aspects of writing style:
raw text features, i.e. review, sentence and word length, calculated in terms of sentences, tokens and characters, respectively;
features related to lexical richness, captured by considering i) the internal composition of the vocabulary of the review with respect to the Basic Italian Vocabulary and its usage repertories (De Mauro, 2000), and ii) the Type/Token Ratio;
morpho-syntactic features, i.e. the distribution of unigrams of Parts-of-Speech, and of verb moods, tenses and persons;
syntactic features, which refer to diverse characteristics of sentence structure: i) the depth of the whole parse tree (calculated in terms of the longest path from the root of the dependency tree to some leaf); ii) the length of dependency links (i.e. the number of tokens occurring between the head and the dependent); iii) the distribution of dependency types; iv) the average depth and the distribution of embedded prepositional chains modifying a noun; v) a set of features aimed at modeling the behaviour of verbal predicates, i.e. the number of verbal roots, the average verbal arity, the distribution of verbs by arity, and the distribution of verbal predicates with an elliptical subject; vi) the usage of subordination, calculated considering the ratio between principal and subordinate clauses, and the average depth and the distribution of embedded chains of subordinate clauses; vii) a last set of features related to the canonical construction of a sentence in Italian, i.e. the relative ordering of subordinates with respect to the main clause and of subject and object with respect to their verbal head.

The effectiveness of these features in predicting helpful online reviews is confirmed by the fact that, according to the Wilcoxon rank sum test, 75% of the considered features (i.e. 160 out of 212) turned out to vary in a statistically significant way between helpful and unhelpful reviews. As shown in Table 3, helpful reviews are on average one sentence longer than unhelpful ones and they also contain much longer sentences. The correlation between length and helpfulness is not surprising, since longer sentences are likely to be more informative, thus offering more content that might influence the outcome of the voting process. The higher sentence length also has an expected effect on some syntactic features correlated to complexity: sentences occurring in helpful texts have deeper syntactic trees (Avg. max depth) and contain more subordinate clauses and embedded prepositional chains. However, they appear simpler with respect to other features related, for instance, to canonicity effects: they show a more standard syntactic structure, with a higher distribution of objects in post-verbal position and of subjects preceding the main verb. Interestingly, helpfulness is also positively correlated with a reader-focused style, as shown by the greater use of pronouns and verbs in the first and second person.

Feature                   Help     UnHelp   Diff.
N. sent                   4.61     3.46     1.15
Avg. sent length          36.79    26.22    10.57
Avg. clause length        10       11.65    -1.65
% Nouns                   23.5     24.5     -1
% Verbs                   14.28    12.79    1.49
% Adj                     8.32     10.37    -2.41
% Negative adv            1.33     0.97     0.36
% Pronouns                4.99     4.14     0.85
% 1st sing p.             9.23     8.15     1.08
% 2nd pl p.               1.34     1.08     0.26
Avg. prep chains length   11.4     6.3      5.1
Avg. max depth            7.64     6.28     1.36
% Subord clause           62.09    43.89    18.2
% Post obj                78.84    68.66    10.18
% Pre subj                73.13    65.03    8.01

Table 3: A subset of linguistic features whose values vary in a statistically significant way between helpful and unhelpful reviews.

Metadata feature. Review star rating (STR) is the rating score assigned by the reviewer, ranging from 1 to 5. Previous research reported in Diaz and Ng (2018) has shown that a connection exists between the rating of a review and its helpfulness. In our dataset, rating scores are unequally distributed across the different review categories. Restaurant reviews are more likely to have an extreme rating, either low or high, rather than a neutral one, and helpful reviews follow the same pattern: e.g., in the Rome restaurants category 37.05% of the helpful reviews have a rating of 1 and 25.76% a rating of 5. On the contrary, attraction reviews tend to have higher ratings, with 56.12% of the helpful ones belonging to the highest-rated class. Only the attractions category seems to confirm the presence of the positivity bias discussed in Diaz and Ng (2018), according to which reviews with positive ratings are seen as more helpful.

4 Experiments and Results

We addressed the helpfulness prediction task as a binary classification problem. In order to assess the contribution of each set of features illustrated in Section 3, we defined two experimental scenarios differing at the level of the review categories chosen as test data and of the set-up (in terms of feature configurations). We built a classifier based on the LIBLINEAR implementation of Support Vector Machines with a linear kernel (Fan et al., 2008) and trained it on a set of 12,516 reviews written for 411 Rome restaurants. All the features were previously scaled to the same range [0, 1]. We evaluated our system by computing the accuracy score for each feature configuration. As baseline for each review category we used the score of a classifier which always outputs the most probable class according to the class distribution of the dataset (in this case the helpful class).

In the first experimental scenario we tested the feature models generated by the SVM classifier on a test set of 12,523 reviews that belong to the same domain as the training data (i.e. the Rome restaurants category) but were written for restaurants different from the ones in the training set. As shown in Table 4, we obtained a general improvement over the baseline with all feature configurations apart from the one that exploits only the metadata feature (STR, the star rating of the reviews). Nevertheless, this feature does improve the accuracy score of all models by at least one point, thus confirming its usefulness for helpfulness prediction (Diaz and Ng, 2018). The results also highlight the prominent role of lexical information (NGR+WE) in assessing helpfulness, although this is primarily explained by the in-domain scenario. Even if the accuracy of the linguistic model (LING) is lower with respect to the one obtained by the other feature models, we found that linguistic information plays a main role in helpfulness prediction: it allows achieving an accuracy score of 66%, and of 70.81% when review ratings are also added, a value that is in line with that of the lexical model.

Model          Accuracy
STR            49.6%
NGR            69.9%
NGR+STR        71.13%
WE             68.54%
WE+STR         69.96%
NGR+WE         70.17%
NGR+WE+STR     71.14%
LING           66%
LING+STR       70.81%
ALL            70.04%
ALL+STR        71.05%
Baseline       50.46%

Table 4: In-domain classification of helpful vs. unhelpful reviews using different feature models.

In the out-domain scenario we tested the considered feature models on reviews that belong to the other two categories (Milan restaurants and attractions). As reported in Table 5, we observed that the performances of the classifier tested on the reviews of Milan restaurants, even if slightly worse, are very similar to the ones obtained on the test set of Rome restaurants. This result suggests that the system may perform consistently across different geographical areas, although further experiments should be carried out; for example, we might test our models on a greater number of cities or on other types of geographical areas. As we expected, the accuracy decreases mainly in the domain more distant from the training one (i.e. the attractions category). This is especially the case of the lexical classification model, which has a drop of 10.5 points.
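The experimental setup described in Section 4 — features scaled to the range [0, 1], a linear-kernel SVM trained with LIBLINEAR, and accuracy as the evaluation metric — can be sketched as follows. This is a minimal illustration on random data, not the authors' code or corpus; scikit-learn's `LinearSVC` is used here as a convenient wrapper around the LIBLINEAR library the paper cites.

```python
# Sketch of the Section 4 pipeline: [0, 1] scaling, linear SVM, accuracy.
# The data below is random and purely illustrative (NOT the TripAdvisor corpus).
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train = rng.random((200, 10))       # e.g. 200 reviews, 10 features
y_train = rng.integers(0, 2, 200)     # 1 = helpful, 0 = unhelpful
X_test = rng.random((50, 10))
y_test = rng.integers(0, 2, 50)

scaler = MinMaxScaler()               # scale every feature into [0, 1]
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)   # apply the same scaling to the test set

clf = LinearSVC()                     # linear-kernel SVM (LIBLINEAR backend)
clf.fit(X_train_s, y_train)
acc = accuracy_score(y_test, clf.predict(X_test_s))
print(f"accuracy: {acc:.2%}")
```

With the paper's real feature configurations (NGR, WE, LING, STR and their combinations), one such classifier would be trained per configuration and compared against the majority-class baseline.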
The star rating feature is also shown to worsen the accuracy scores, probably because of the way the ratings are distributed in the attractions category with respect to the restaurant ones. It is interesting to note that the best performing model turned out to be the one exploiting the linguistic features (with a lower drop of 5.24%), thus showing the predictive power of sentence structure information in predicting review helpfulness.

Model        Milan     Attractions
NGR+WE       69.38%    59.67%
NGR+WE+STR   70.92%    58.02%
LING         65.82%    60.76%
LING+STR     70.92%    60.28%
ALL          69.2%     59.9%
ALL+STR      70.78%    58.49%
Baseline     50.47%    51.56%

Table 5: Out-domain classification of helpful vs. unhelpful reviews in terms of accuracy using different feature models.

5 Discussion

As discussed in the previous section, we found that linguistic features allow achieving an accuracy almost in line with the one obtained using only lexical information. Interestingly enough, they are the most predictive ones in the out-of-domain scenario. In order to gain insight into which of these features are the most effective in the automatic classification task, we ranked them according to the absolute value of their weight in the linear SVM model generated with the linguistic feature configuration. Among the 50 top-ranked ones, besides the raw text features (whose role in predicting helpfulness has already been proven in the literature), we found morpho-syntactic and syntactic features. They are typically related to a rich and articulated writing style. This is the case, for example, of features concerning nominal modification, in particular the number of prepositional chains (holding the 1st position in the ranking) and their average length, but also the distribution of adjectives and determiners. Others involve verbal structures, e.g. the number of dependents instantiated by verbal heads and the frequency of adverbs (especially negation ones). Features related to the usage of subordination, such as the number of subordinate structures and the average depth of parse trees, also appear among the top-ranked. Finally, another group of high-ranked features concerns a subjective writing style, as shown by the distribution of verbs in the first and second person. These types of features turned out to be discriminant in the comparison between helpful and unhelpful reviews (Section 3). This shows that the writing style of helpful reviews, informative but also personal and reader-focused, has a high predictive power.

The importance of the linguistic features is further confirmed by a second inspection, in which the same ranking method was applied to the all-feature model. Also in this case, we found that 59.6% of the whole set of 212 linguistic features we considered is in the 90th percentile of the ranking of the total 741,339 features.

6 Conclusion

In this paper, we have presented the first approach to the task of review helpfulness prediction for the Italian language. Two experimental scenarios have been tested on a corpus of TripAdvisor reviews belonging to different categories (restaurants and attractions). In line with previous findings obtained for the English language, we confirmed that lexical information plays a significant role in classifying helpful reviews. In addition, we proved for the first time the highly predictive power of linguistic features modeling the writing style independently from the content. This is particularly true in the two out-domain experiments: in the first case (same category, different geographical area), the classifier based on the linguistic features achieves the same accuracy as the model using lexical features, and it even outperforms all the other configuration models when tested on the most distant review category (restaurants vs attractions).

Among the possible future issues that we would like to investigate, an interesting one concerns the role played by metadata features. In the reported results, we showed that star ratings are not relevant when considered alone, but they provide a boost when combined with both lexical and linguistic features. Beyond this metadata feature, we would like to extend the analysis to further user information possibly related to review helpfulness.

Acknowledgments

This work was partially supported by the 2-year project PERFORMA (Personalizzazione di pERcorsi FORMativi Avanzati), co-funded by Regione Toscana (POR FSE 2014-2020).
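The feature-ranking analysis described in Section 5 — sorting features by the absolute value of their weight in the trained linear SVM — can be sketched as follows. Feature names and data here are hypothetical stand-ins for illustration, not the 212 linguistic features or the model actually used in the paper.

```python
# Illustrative sketch of ranking features by |weight| in a linear SVM.
# Data and feature names are synthetic: feature 0 is made informative
# on purpose so that it should dominate the ranking.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
feature_names = ["prep_chain_count", "avg_sent_length",
                 "pct_adjectives", "pct_subord_clauses"]
X = rng.random((300, len(feature_names)))
# The label depends (almost) only on feature 0, plus a little noise.
y = (X[:, 0] + 0.2 * rng.random(300) > 0.6).astype(int)

clf = LinearSVC().fit(X, y)
weights = np.abs(clf.coef_[0])        # one signed weight per feature
ranking = sorted(zip(feature_names, weights), key=lambda t: -t[1])
for name, w in ranking:
    print(f"{name:20s} |w| = {w:.3f}")
```

On the synthetic data above, the informative feature ends up at the top of the ranking, mirroring how the paper identifies prepositional-chain and subordination features as the strongest linguistic predictors.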
References
G. Attardi, F. Dell'Orletta, M. Simi and J. Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. Proceedings of Evalita'09, Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, December.
M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3), pp. 209-226.
A. Cimino, M. Wieling, F. Dell'Orletta, S. Montemagni and G. Venturi. 2017. Identifying Predictive Features for Textual Genre Classification: the Key Role of Syntax. Proceedings of the 4th Italian Conference on Computational Linguistics (CLiC-it), 11-13 December, 2017, Rome.

K. Collins-Thompson. 2014. Computational assess-
  ment of text readability: a survey of current and fu-
  ture research. Recent Advances in Automatic Read-
  ability Assessment and Text Simplification, Special
  issue of International Journal of Applied Linguistics,
  (165-2), 97-135.
F. Dell'Orletta. 2009. Ensemble system for Part-of-Speech tagging. Proceedings of Evalita'09, Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, December.
T. De Mauro. 2000. Grande dizionario italiano
   dell’uso (GRADIT). Torino, UTET.

G. O. Diaz and V. Ng. 2018. Modeling and Prediction
  of Online Product Review Helpfulness: A Survey.
  ACL, 2018.

R. E. Fan, K.-W. Chang, C.-J. Hsieh, X. Wang and C.-
   J. Lin. 2008. LIBLINEAR: A Library for Large
   Linear Classification. Journal of Machine Learning
   Research, 9:1871–1874.

Y. Hong, J. Lu, J. Yao, Q. Zhu and G. Zhou. 2012.
   What Reviews are Satisfactory: Novel Features for
   Automatic Helpfulness Voting. Proceedings of the
   35th International ACM SIGIR Conference on Re-
   search and Development in Information Retrieval,
   pp. 495-504.
S. M. Kim, P. Pantel, T. Chklovski and M. Pennacchiotti. 2006. Automatically assessing review helpfulness. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 423-430.
T. Mikolov, K. Chen, G. Corrado and J. Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781.
Y-J. Park. 2018. Predicting the helpfulness of on-
  line customer reviews across different product types.
  Sustainability, 10(1735), pp. 1-20.