=Paper=
{{Paper
|id=Vol-2083/paper-09
|storemode=property
|title=Mining Implicit Data Association from Tripadvisor Hotel Reviews
|pdfUrl=https://ceur-ws.org/Vol-2083/paper-09.pdf
|volume=Vol-2083
|authors=Vittoria Cozza,Marinella Petrocchi,Angelo Spognardi
|dblpUrl=https://dblp.org/rec/conf/edbt/CozzaPS18
}}
==Mining Implicit Data Association from Tripadvisor Hotel Reviews==
<pdf width="1500px">https://ceur-ws.org/Vol-2083/paper-09.pdf</pdf>
<pre>
         Mining implicit data association from Tripadvisor hotel
                                 reviews
                 Vittoria Cozza                                     Marinella Petrocchi                                    Angelo Spognardi
          Department of Information                                       IIT-CNR                                       Dipartimento di Informatica,
                  Engineering,                                            Pisa, Italy                                   Sapienza Università di Roma
              University of Padua                               marinella.petrocchi@iit.cnr.it                                  Rome, Italy
                  Padua, Italy                                                                                           spognardi@di.uniroma1.it
          vittoria.cozza@dei.unipd.it

ABSTRACT                                                                                  techniques to identify mismatches between the text and the score
In this paper, we analyse a dataset of hotel reviews. In detail, we                       in online review platforms.
enrich the review dataset by extracting additional features,                                 Since several aspects can influence the customer experience
consisting of information on the reviewers’ profiles and the reviewed                     (e.g., the hotel price, the presence of restaurants, cafes, and discos
hotels. We argue that the enriched data can yield insights into the                       in the hotel neighbourhood, the connections with bus/train
factors that most influence consumers when composing reviews                              stations and airports, etc.), in this work we propose an automatic
(e.g., if the appreciation for a certain kind of hotel is tied to spe-                    approach - based on association rules - to understand which
cific users’ profiles). Thus, we apply statistical analyses to reveal                     factors most influence consumers’ reviews. We consider a very
if there are specific characteristics of reviewers (almost) always                        large dataset consisting of around 190k hotel reviews collected
related to specific characteristics of hotels. Our experiments are                        from Tripadvisor, enriching the dataset by extracting a series of
carried out on a very large dataset, consisting of around 190k                            hotel-centric and reviewer-centric features. We leverage these
hotel reviews, collected from the Tripadvisor website.                                    features to list correlations among hotel properties, reviewer’s
                                                                                          characteristics, and the review score. The results are obtained
                                                                                          applying association rules techniques to our dataset. Findings are
1 INTRODUCTION                                                                            both expected - such as that the hotels close to entertainment and
Social media, forums, and blogs are privileged vehicles for posting                       food areas are ranked with the highest scores - and less intuitive
and spreading online reviews. Among the goods and services                                - such as that those reviewers featuring a very low activity (measured
that are discussed every day on the Internet, we can find those                           with a lower bound in terms of given reviews), considering
belonging to the most disparate categories, such as food, clothes,                        their stay in a particular area, select - very often - hotels with a
music, and toys. In particular, the practice of choosing and booking                      low number of transportation means in the neighbourhood.
preferred destinations has been greatly eased by the possibility                             We argue that, with our approach, sociologists and marketing
for users to consult previous feedback about hotels and restaurants.                      experts could analyse the results of the association rules to better
According to comScore Media Metrix1 , Tripadvisor is the                                  understand additional characteristics of reviewers and their connections
world’s largest travel e-advice site, providing advice as reported                        with the reviewed service. This kind of analysis paves the
by actual travellers. Tripadvisor counts more than 87 million                             way for surveying a larger segment of the population than that
visitors per month2 .                                                                     usually interviewed through standard polls.
   Not only common users, but also service providers have strong
motivations to analyse the myriad posts, tweets, and comments                            2    DATASET
available online. The latter will benefit by adjusting, e.g.,                            We ground our study in a dataset composed
their product lines and advertisement campaigns, while the former                        of real reviews taken from the Tripadvisor 3 website. In particular,
by relying on previous experiences for addressing their needs                            our dataset contains all the reviews that can be accessed on
and matching their expectations. Furthermore, online reviews are                         the website between the 26th of June 2013 and the 25th of June
a precious source of information, e.g., to unveil implicit and/or                        2014 (the date of the most recent extracted review) for hotels in New
unexpected characteristics of the reviewers. As an example, in [13]                      York, Rome, Paris, Rio de Janeiro, and Tokyo. With a straightforward
the authors investigate if and how the words (and their use) in                          approach, we were able to collect the following pieces of
a review are linked to the reviewer’s gender, country, and age.                          information for each review:
   In [8], the authors present a novel approach to build feature-                             • the review date, text, and numeric score;
based user profiles and item descriptions by mining user-generated                            • the reviewer username, location, and triptype, being the
reviews. Such additional information can be integrated into rec-                                type of trip, one among the following five categories: Fam-
ommender systems to deliver better recommendations and an                                       ily, Friends, Couple, Solo Traveler, and Businessman;
improved user experience.                                                                     • the ID of the hotel which the review refers to.
   In our previous work [9], we exploited a Tripadvisor dataset                             In addition to the above elements, we collected from Tripad-
in order to investigate how subjectivity of reviewers affects the                        visor all the hotels of the considered reviews and included in
scores assigned to hotels. Thus, we leverage sentiment analysis                          our review dataset some additional data regarding the reviewed
                                                                                         hotels. In particular, leveraging the ID of the hotel which the
1 https://www.comscore.com/Products/Audience-Analytics/Media-Metrix - All sites          review refers to, we have gathered
last accessed December 23, 2017.                                                              • the hotel name and full address (where full address in-
2 https://www.comscore.com/Insights/Rankings - Statistics updated to June 2017.                 cludes the street address, the city, and the country);

© 2018 Copyright held by the owner/author(s). Published in the Workshop
Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,
Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted        3 http://www.tripadvisor.com
under the terms of the Creative Commons license CC-by-nc-nd 4.0.


    • the category of the hotel (number of stars);                            2.1     Hotel-centric and reviewer-centric
    • the number of guest pictures for the hotel.                                     features
It is worth noting that the above lists are not exhaustive, i.e., they        Starting from the information collected in the basic dataset, we
do not represent all the information accessible from Tripadvisor.             have augmented it through some further processing. In particular,
As an example, further information available for a review                     we enriched the data regarding the reviewed hotel with
is the set of scores assigned by reviewers to specific aspects of a           the following features:
hotel, like location, cleanliness, sleep quality, rooms, and service.               • the popularity, defined as the number of reviews for a given
However, for the scope of the current work, we focus on those                         hotel. Although we do not have the list of actual bookings,
summarised for the reader’s convenience in Table 1. We exploited                      and Tripadvisor does not require reviewers to prove
such pieces of information to further expand the dataset with                         that they have been guests of the hotel, this feature, when
enriched features, as described in Section 2.1.                                       computed on a large number of reviews per hotel, could
                                                                                      indirectly quantify the actual hotel clients;
                                                                                    • the hotel triptype, defined as the most frequent reviewer
                         Basic information                                            triptype for a given hotel (where the triptypes are Families,
              Review                   Hotel                                          Friends, Couples, Solo Travelers, and Businessmen);
                                                                                    • the geospatial coordinates (latitude and longitude);
              Date                     Name                                         • three points of interest (POI) features, defined as the num-
              Text                     Street address                                 ber of transportation services, restaurants, and attractions,
              Score                    City                                           respectively, within a radius of 300 meters around the hotel.
              Reviewer username        Country
              Reviewer location        Guest pictures                            Popularity and Hotel triptype have been computed by looking at
              Triptype                                                        how many reviewers, and of which kind, have reviewed the hotel.
              Hotel ID                                                        The geospatial coordinates have been calculated with Google
                                                                              Places APIs4 , starting from the hotel name and full address. Then,
                                                                              latitude and longitude, together with the parameter “radius=300”,
   Table 1: Considered information in the basic dataset                       have been given as input to the Google Radarsearch API5 to find
                                                                              the number of points of interest (POI) related to transportation,
                                                                              food, and entertainment.
                                                                                 The data regarding a reviewer, instead, have been enriched
                                                                              with the following features:
   We have discarded reviews by “Anonymous” users, since they
represent users of the platform http://www.daodao.com (the                          • the reviewers’ activity, defined as the number of reviews
Chinese version of Tripadvisor), where all the reviewers are grouped                  they have written (during the observation period). Our
indistinctly under this single virtual username. We have further                      intuition is that this feature could be useful to discriminate
limited our analysis to reviews whose textual part is in English,                     between frequent travelers and sporadic ones;
following the language identification and analysis approach                         • the gender of the reviewer. This feature has been extracted
presented in [5]. While the reviews accessible from Tripadvisor in                    with the Namsor Onomastics6 machine learning tool, able
the year under investigation are 353,167, after the pre-processing                    to recognise the language behind a name, thus identifying
the resulting dataset is made up of 189,304 reviews in English,                       the gender according to that language’s vocabulary with
provided by 142,583 registered Tripadvisor users who reviewed                         high accuracy [4].
4,019 hotels. Table 1 recaps the information extracted from the               We first used regular expressions to clean the username of numbers
dataset, while Table 2 shows the distribution of the reviews per              and symbols and to split it into two parts (where one is likely to be
given score value. As shown, the distribution of values is highly             the name and the other one, when available, the surname). We then
unbalanced, with the highest score being the most frequent in the             called the “onomastics/api/json/gendre” API, a service that takes as
dataset (reflecting the distribution usually featured by                      input name and surname and returns the recognised gender.
review platforms).                                                            The splitting was possible since, in many cases, the name and
                                                                              surname were separated by a space, or the surname started with an
                                                                              uppercase letter. Some examples of usernames are: “Eldon S” and
                  Rating Value Occurrences                                    “MeganJones88”.
                  1             6,504                                            Unfortunately, for a subset of reviewers, it was not possible
                  2             8,826                                         to derive the gender from their usernames. This happened for
                  3             24,627                                        9,507 reviewers (corresponding to 6% of the entire reviewer
                  4             64,949                                        set), who wrote 12,653 reviews. Examples of usernames for
                  5             84,398                                        which it was not possible to derive the gender are Hope-and-Dreams,
  Table 2: Distribution of the given scores in the dataset                    mistyrabbit, A TripAdvisor Member, R W, E A, Nickeykol,
                                                                              NawakRed, and FreeTravel81. We labeled the gender of
                                                                              these 9,507 reviewers as unknown.
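
The username pre-processing described above can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not the authors' code: the function name
split_username and the exact regular expressions are hypothetical, and the call to the Namsor
service is deliberately left out because its request format is not given in the text.

```python
import re


def split_username(username):
    """Guess (name, surname) from a username; surname is '' when absent.

    Hypothetical helper: the regular expressions are illustrative
    assumptions, not the ones used in the paper.
    """
    # Remove digits and symbols, keeping only ASCII letters and spaces.
    cleaned = re.sub(r"[^A-Za-z ]+", "", username).strip()

    # Case 1: name and surname are separated by a space ("Eldon S").
    parts = cleaned.split()
    if len(parts) >= 2:
        return parts[0], parts[1]

    # Case 2: the surname starts with an uppercase letter ("MeganJones").
    match = re.fullmatch(r"([A-Z][a-z]+)([A-Z][a-z]*)", cleaned)
    if match:
        return match.group(1), match.group(2)

    # Otherwise treat the whole cleaned string as the name.
    return cleaned, ""
```

With the two usernames quoted in the text, split_username("Eldon S") yields ("Eldon", "S") and
split_username("MeganJones88") yields ("Megan", "Jones"). A username such as mistyrabbit is
returned unsplit, which is consistent with it ending up in the unknown group. Keeping only
ASCII letters is a simplification that would discard accented names.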

  Hereafter, we will refer to this dataset as the basic dataset. In-          4 https://developers.google.com/places
deed, in the following, we will extract hotel-centric and reviewer-           5 https://maps.googleapis.com/maps/api/place/radarsearch
centric features to enrich the basic set (see Section 2.1).                   6 http://api.namsor.com/onomastics/api
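
Popularity, Hotel triptype, and Activity are all simple aggregations over the basic dataset.
The sketch below shows one way to compute them, assuming each review is a dictionary with the
hypothetical keys hotel_id, username, and triptype (the paper does not describe its storage
format). The two activity labels are the ones reported in Table 4.

```python
from collections import Counter, defaultdict


def hotel_features(reviews):
    """Return (popularity, hotel_triptype), both keyed by hotel ID."""
    # Popularity: number of reviews per hotel.
    popularity = Counter(r["hotel_id"] for r in reviews)

    # Hotel triptype: the most frequent reviewer triptype per hotel.
    per_hotel = defaultdict(Counter)
    for r in reviews:
        per_hotel[r["hotel_id"]][r["triptype"]] += 1
    hotel_triptype = {
        hotel: counts.most_common(1)[0][0] for hotel, counts in per_hotel.items()
    }
    return popularity, hotel_triptype


def reviewer_activity(reviews):
    """Activity: number of reviews written by each reviewer."""
    return Counter(r["username"] for r in reviews)


def activity_label(n_reviews):
    """Map a review count to the two activity classes of Table 4."""
    return "up to 5 reviews" if n_reviews <= 5 else "6 or more"


# Tiny made-up sample, for illustration only.
sample = [
    {"hotel_id": "h1", "username": "Eldon S", "triptype": "couple"},
    {"hotel_id": "h1", "username": "MeganJones88", "triptype": "couple"},
    {"hotel_id": "h1", "username": "Eldon S", "triptype": "family"},
    {"hotel_id": "h2", "username": "Eldon S", "triptype": "solo"},
]
popularity, hotel_triptype = hotel_features(sample)
activity = reviewer_activity(sample)
```

Ties between triptypes are resolved arbitrarily by most_common in this sketch; the paper does
not say how ties were handled.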


                                 Features                                   For example, one rule could have a very high confidence, but
               Hotel                           Reviewer                     only due to the fact that the item in the consequence is very
                                                                            frequent. In this case, the rule is not relevant. Instead, one rule
               Popularity                      Activity                     could have a low confidence, due to the fact that the item in
               Hotel Triptype                  Gender                       the consequence is very infrequent in general, but it could still
               Geospatial Coordinates                                       be relevant. Considering the above observation, to evaluate the
               Points of Interest                                           statistical significance of the ARs, two other metrics are often
                                                                            used: lift and conviction.
Table 3: Hotel-centric and reviewer-centric features aug-                      Lift is defined as the confidence divided by the support of the
menting the basic Tripadvisor dataset                                       consequence:
                                                                                                                 supp(X ∩ Y )
                                                                                          li f t(X =⇒ Y ) =                                 (1)
                                                                                                              supp(X ) ∗ supp(Y )
   It is worth noting that Popularity, Hotel triptype, and Activity         With respect to confidence, the lift measures the importance of
have been calculated as the result of queries to the basic dataset,         the association considering also its dependence on the support
with the aim of making explicit some data that originally were              of the consequence.
implicit in the available information. The computation of the                  Conviction is defined as the ratio of the frequency of itemsets
reviewer gender, the points of interest close to the hotel, and             that do not contain the consequence to the frequency of incorrect
its geospatial coordinates deserves a separate mention. As described        predictions:
above, these have been computed relying on external data sources,              \mathrm{conv}(X \Rightarrow Y) = \frac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)}    (2)
namely the Google Points of Interest and the Namsor database,                  For both lift and conviction, values in the interval (0, 1)
containing 800k names and statistical information about names               indicate negative dependence, values above 1 indicate positive
in each country of the world.                                               dependence, and a value of 1 indicates independence.
   Table 3 recaps the hotel-centric and reviewer-centric features
we used to enrich the basic dataset.
                                                                               When items are also divided according to different classes, it
3 ASSOCIATION ANALYSIS                                                      is possible to force the AR analysis to return a specific class in
Association rule mining is a well known and widely applied                  the consequence. The obtained rule is called “class association
methodology for discovering frequent patterns, correlations, and            rule” (CAR). The CAR is an implication of the form:
causal structures in transaction and relational databases, as well                         X ⇒ y, where X ⊆ I and y ∈ Y                   (3)
as in other information repositories [12]. Given a set of                   where I stands for the itemsets and Y for the classes. The definition
items (or itemsets), association rule mining allows one to define           of the aforementioned metrics also holds for CARs.
rules predicting the occurrence of one or more items, given the                The Apriori algorithm [2, 16] is one of the most popular
occurrence of other items in the same itemsets.                             algorithms to find frequent itemsets, i.e., itemsets whose support ≥
   A popular application is basket data analysis, where itemsets            minsup, a minimum support threshold.
are transactions, representing lists of items in the consumers’                In this work, we apply association rule mining to the hotel
baskets. An example of a transaction is: {Bread, Steak, Juice, Butter,      reviews scenario. Each itemset corresponds to a distinct
Chips, Beer}. When many transactions are collected, e.g., in a large        review, and it is a vector whose components are the values of
database, the methodology allows one to automatically find                  the features extracted and detailed in Section 2. The same features
associations such as {Bread} ⇒ {Steak} (steaks are often purchased          are reported in Tables 4, 5, and 6 for the reader’s convenience,
with bread). Besides sales transactions, basket analysis can be             together with additional information that is useful here. CAR
applied to other situations like click stream tracking, spare parts         analysis can be applied when considering also the class, which in
ordering, and online recommendation engines, just to name a                 our scenario corresponds to the review score, a discrete value
few7 .                                                                      ranging between 1 and 5.
   An association rule (AR) is generally defined as an implication             To enable the application of the Apriori algorithm, we have
expression of the form X ⇒ Y, where X and Y are disjoint                    first discretised those features that natively ranged over a large
itemsets. They represent, respectively, the condition and the               set of values. As an example, in Table 5, a very low label for Guest
consequence of the rule.                                                    Pictures indicates a hotel with a number of pictures between
   The strength of an AR is commonly measured through the                   0 and 11. Still in that table, a medium label for Popularity
two metrics support and confidence. Support gives the fraction of           means a hotel that has been reviewed n times, where n ranges
itemsets in the dataset that contain both X and Y. Confidence               over [433, 1156]. The values in Table 6 should be read as follows:
says how frequently items in Y appear in itemsets that contain X.
As an example, we want to know the strength of the rule {Bread}             looking at the first line of the “Geo Food” part of the table, our
⇒ {Steak} in a dataset with 100 transactions, corresponding to 100          review set contains 37,851 reviews about hotels that have a
consumers’ baskets. Suppose that itemset {Bread, Steak} occurs 30           number of restaurants in the range [0, 37] within a radius of 300
times, and that itemset {Bread} occurs 40 times; then the support           m. Indeed, many different reviews are on the same hotels, since
of the rule is equal to 30/100, while its confidence is 30/40.              the number of reviewed hotels is 4,019 (see Section 2).
   As discussed in [3], rules with high values for confidence and              All the tables also report the Frequency indication, i.e., how
support do not always correspond to meaningful ARs, especially              many reviews correspond to those values for those features
when working with real datasets, since data can be unbalanced.              (quite obviously, the values in each Frequency column sum to
                                                                            the total number of reviews considered, 189,304).
7 http://pbpython.com/market-basket-analysis.html
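
The four metrics discussed in this section can be checked on the {Bread} ⇒ {Steak} example
with a few lines of Python. This is a minimal sketch, not the Weka implementation used in the
paper. The function name rule_metrics is hypothetical, and the count of 50 baskets containing
Steak is an assumption made for illustration: the text only gives the counts 100, 40, and 30.

```python
def rule_metrics(n_total, n_x, n_y, n_xy):
    """Support, confidence, lift, and conviction of the rule X => Y.

    n_total: number of itemsets in the dataset
    n_x:     itemsets containing X
    n_y:     itemsets containing Y
    n_xy:    itemsets containing both X and Y
    """
    support = n_xy / n_total
    confidence = n_xy / n_x
    supp_y = n_y / n_total
    # Eq. (1): confidence divided by the support of the consequence.
    lift = confidence / supp_y
    # Eq. (2): undefined when confidence is 1, reported here as infinity.
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - supp_y) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# 100 baskets, 40 with Bread, 30 with Bread and Steak (from the text);
# 50 with Steak is an assumed value.
bread_steak = rule_metrics(n_total=100, n_x=40, n_y=50, n_xy=30)
```

Under the assumed Steak count, the rule has support 0.3, confidence 0.75, lift 1.5, and
conviction 2.0, i.e., a positive dependence according to both lift and conviction.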


                                     Activity                                                     Gender
                       value                         frequency                       value                      frequency
                  up to 5 reviews                       138,419                    male                            102,565
                  6 or more                              50,885                    female                           74,086
                                                                                   unknown                          12,653

               TripType                                                               Month
     value                   frequency                value               frequency                       value                  frequency
 solo                             9,795             January                   14,280                   July                          18,096
 couple                          66,557             February                  11,440                   August                        16,870
 family                          35,833             March                     14,146                   September                     18,466
 friends                         19,621             April                     16,120                   October                       18,616
 business                        23,600             May                       18,978                   November                      14,149
 unspecified                     33,898             June                      14,937                   December                      13,206
                                              Table 4: Discretised features on reviewers


                                    Stars                                                 Guest Pictures
                     value                      frequency                      label: range                     frequency
                  1                                    694                  very low: (0-11)                        39,932
                  2                                  8,224                  low: (12-104)                           37,790
                  3                                 55,464                  medium: (105-271)                       37,882
                  4                                 83,584                  high: (272-525)                         37,840
                  5                                 15,456                  very high: >=526                        37,860
                  unspecified                       25,882

                            HotelPopularity                      Hotel Trip Type                    Country
                         label: range   frequency                value     frequency           value        frequency
                     low: (0-432)           63,165           solo                 974   Italy (it)              31,224
                     medium: (433-1156)     63,093           couple           139,860   United States (us)      83,605
                     high: >=1157           63,046           friends              672   Brasil (br)              3,631
                                                             family            20,429   Japan (jp)              11,966
                                                             business           9,112   France (fr)             40,621
                                                             unspecified       18,257   unspecified             18,257

                                                Table 5: Discretised features on hotels


                               Geo Food                    Geo Entertainment                    Geo Transport
                       label: range     frequency         label: range  frequency           label: range  frequency
                    very low: (0-37)        37,851      very low: (0-3)     38,282        very low: (0)       34,471
                    low: (38-136)           37,921      low: (4-15)         37,602        low: (1-3)          38,079
                    medium: (137-197)       36,827      medium: (16-35)     37,814        medium: (4-11)      40,382
                    high: (198-199)         21,848      high: (36-63)       37,708        high: (12-18)       39,674
                    very high: >=200        54,857      very high: >=64     37,998        very high: >=19     36,698

                                            Table 6: Discretised geolocation-based features
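
The discretisation step can be illustrated with the following Python sketch. The paper does
not state how the bin boundaries were chosen; the near-equal frequencies in Tables 5 and 6
suggest equal-frequency (quantile) binning, and that is the assumption made here. The
function names quantile_edges and discretise are hypothetical.

```python
LABELS = ["very low", "low", "medium", "high", "very high"]


def quantile_edges(values, n_bins):
    """Inclusive upper edges of the first n_bins - 1 equal-frequency bins."""
    ordered = sorted(values)
    n = len(ordered)
    if n < n_bins:
        raise ValueError("need at least as many values as bins")
    return [ordered[(k * n) // n_bins - 1] for k in range(1, n_bins)]


def discretise(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    # Values above every edge fall in the last bin.
    return labels[-1]


# Upper edges of the Guest Pictures bins as reported in Table 5.
guest_picture_edges = [11, 104, 271, 525]
label_for_105 = discretise(105, guest_picture_edges, LABELS)

# Edges derived from made-up data, for illustration only.
toy_edges = quantile_edges(range(1, 11), 5)
```

With the Table 5 edges, a hotel with 105 guest pictures is labelled medium and one with 526
is labelled very high. Heavily repeated values can make equal-frequency bins uneven, which may
explain the narrow high: (198-199) bin of Geo Food in Table 6.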


    In order to find ARs and CARs, we applied the Weka framework [11] implementation of the
Apriori algorithm. The Weka Apriori implementation allows ranking the rules according to
different metrics. Among them, we rely on confidence, lift, and conviction. For AR analysis
we generate a large number of rules with lift above 1. For CAR analysis, we generate a large
number of rules with confidence above 0.2 and then we compute the lift (since, for CAR, Weka
does not natively include the ranking based on lift). We finally select the rules with lift
greater than 1.
    Both for the generated ARs and CARs, we then manually select the most interesting rules,
among those with the highest lift and conviction. Table 7 and Table 8 report an excerpt of the
results for both scenarios.

3.1    Discussion
Association analysis results are reported in Table 7 and Table 8. Please notice that we only
consider those rules that lead to a lift and a conviction greater than 1. It is worth noting
that |X ∩ Y|, when divided by the size of the dataset, corresponds to the support of the given
rule.
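
The ranking metrics named above, together with the support, can be derived from the counts |X|, |Y| and |X ∩ Y| and from the dataset size. The following is a minimal Python sketch under the standard definitions of confidence, lift and conviction; it is not the Weka implementation. The names `RuleCounts` and `select_rules` are hypothetical, and the toy counts are illustrative only and are not taken from the review dataset. The selection step mirrors the CAR procedure described in the text: keep rules with confidence above 0.2, compute the lift, keep those with lift greater than 1, and sort by decreasing lift.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RuleCounts:
    """Counts for a rule X ==> Y (hypothetical helper, not a Weka class)."""
    name: str
    n_x: int      # |X|: records matching the premise
    n_y: int      # |Y|: records matching the consequence
    n_xy: int     # |X ∩ Y|: records matching both premise and consequence
    n_total: int  # size of the dataset

    @property
    def support(self) -> float:
        return self.n_xy / self.n_total

    @property
    def confidence(self) -> float:
        return self.n_xy / self.n_x

    @property
    def lift(self) -> float:
        # Observed confidence divided by the baseline frequency of Y.
        return self.confidence / (self.n_y / self.n_total)

    @property
    def conviction(self) -> float:
        # Diverges when the rule never fails (confidence equal to 1).
        if self.n_xy == self.n_x:
            return float("inf")
        return (1 - self.n_y / self.n_total) / (1 - self.confidence)


def select_rules(rules: List[RuleCounts],
                 min_confidence: float = 0.2) -> List[RuleCounts]:
    """Keep rules above the confidence threshold with lift > 1,
    sorted by decreasing lift."""
    kept = [r for r in rules if r.confidence > min_confidence and r.lift > 1]
    return sorted(kept, key=lambda r: r.lift, reverse=True)


# Toy counts, chosen only for illustration.
toy = [
    RuleCounts("a", n_x=40, n_y=50, n_xy=30, n_total=100),  # lift 1.5
    RuleCounts("b", n_x=40, n_y=50, n_xy=10, n_total=100),  # lift 0.5, dropped
    RuleCounts("c", n_x=20, n_y=25, n_xy=10, n_total=100),  # lift 2.0
]
selected = select_rules(toy)
```

As a check against the reported figures, rule r14 in Table 8 has |X| = 37,998 and |X ∩ Y| = 19,120, so its confidence is 19,120 / 37,998 ≈ 0.50, which matches the value in the table.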


   We summarise the main findings as follows. Rule r1 states that reviewers with a very low
activity, considering their stay in France, very often select hotels with a low number of
transportation means in the neighbourhood. The rule holds for 19,199 of the 29,837 reviews
that match its premise. Rule r2 says that males visiting the US prefer hotels with a high
popularity. Rule r7 says that, when the hotel has low transportation means in the
neighbourhood, and the number of stars for that hotel is unknown (this may correspond to
accommodation facilities like hostels), its rating is equal to 3. Rule r10 states that
Japanese people staying in 3-star hotels rate those hotels with a score equal to 4. Rule r14
in Table 8 states that reviews of hotels close to entertainment venues, which number 37,998,
assign the top score of 5 in 50% of cases.
   This kind of study provides a general approach for a preliminary data exploration. While
the explanation for certain rules is very intuitive, a well-grounded justification for others
is left to experts in the field. We argue that this kind of analysis corresponds to a
preliminary step, useful for suggesting which extra features could be exploited to build an
enhanced hotel recommendation system. Also, we acknowledge that the analysis is based on the
available (direct or indirect) information, obtained from Tripadvisor's website. More detailed
features could consider elements like price or number of guests. This would make it possible
to obtain other interesting rules, which remain an exclusive prerogative of the hoteliers.

4 RELATED WORK
E-advice technology offers a form of “electronic word-of-mouth”, with new potential for
gathering valid suggestions that guide the consumer's choice. Extensive and nationally
representative surveys have been carried out in the recent past, “to evaluate the specific
aspects of ratings information that affect people attitudes toward e-commerce”. This is the
case, e.g., of the work in [10], which highlights how people, while taking into account the
average of ratings for a product, still do not pay attention to the number of reviews leading
to that average. Recent work showed that, instead of first showing users the reviews with the
highest scores, a different order, based, e.g., on the user profile, could be considered [8]:
that work integrates new features based on the user profile into recommender systems, to
deliver better recommendations and provide an improved user experience. Similarly, in [19],
the authors focus on score values given by previous contributors whose preferences are close
to the user's preference. Almost a decade ago, the work in [1] applied text mining tools to
online reviews to define rule sets that identify contextual information in the texts, which
goes beyond a mere ordering of numerical scores. Similarly to our work, they rely on
Tripadvisor, focusing however on text analysis only.
   However, the cited literature proposes systems that recommend a service based on the
intrinsic characteristics of that service (e.g., characteristics of the hotel and its
facilities). Other works, similar to ours, investigate if, and how, the review data hide
social and/or economic information of the reviewers. One example is mining reviews to exploit
them as a textual resource for sociolinguistic studies at a large scale, as done in [13]. This
work leverages the size of the review corpus as a more statistically solid base for the
analysis, with respect to manually collected corpora. Since review sites, such as Trustpilot
(https://www.trustpilot.com/), may contain reviewer metadata like, e.g., age, gender and
location, the work highlights gender-specific lexical differences, the distribution of
regional markers, spelling variations and the use of grammatical constructions across the
reviewers.
   The work from [17], which focused on review manipulation, exploits reviewer-centric and
hotel-centric features to identify outliers: the work compares hotel reviews and related
features across different review sites, improving the detection of suspicious hotels with
respect to checking the reviews on each site in isolation. Relying on visualization tools, the
authors of [6] highlight suspicious changes in review scores, while the work in [7] proposes
new score aggregators to make review systems robust with respect to the injection of fake
scores.
   Research effort has also been spent on understanding which factors make a review be
perceived as useful: in [15], the authors highlight how the reviewer history is a dominant
factor in whether a review is voted as useful or not. The authors of [14] propose to use the
reviews as a source for demographic recommendations.
   In this work we enhance the review dataset with additional features based on characteristics
of the reviewer (e.g., gender) and the hotel (e.g., popularity and the neighbourhood). On the
contrary, the work in [18] studies how, independently from the type of service or the type of
reviewer, the scores may be affected by external factors, such as the weather conditions and
the daylight length of the service cities. We leverage an extensive experimental campaign,
addressing around 190k real reviews, which leads to the provision of statistically sound
results. Addressing data at a large scale has been done also in [13], which has already
targeted users' reviews as a rich source of information for sociolinguistic studies. While
they derive correlations between metadata in the reviewers' profile and the review text to let
writing styles emerge, we highlight association evidence among hotel and reviewer features and
the reviewer's attitude to score the hotel.

5 CONCLUSIONS
We focused on hotel reviews to investigate which factors could impact the scores that
reviewers assign to hotels throughout the world. First, we have enriched review data with
novel hotel-centric and reviewer-centric features, obtained for example through linked data
information available from the web. Then, we have applied association rule mining to focus on
the features possibly motivating the classification scores.
   The approach can help both consumers and providers: the former could achieve a better
awareness on how to read the reviews, the latter on how to improve their services. The
providers can also query a very large segment of the population, in an automatic way and
without relying on standard interviews.
   The proposed technique is also applicable to a wide range of services: accommodation, car
rental, food services, to cite a few. Since association rule mining is parametric with respect
to the input itemsets, the approach is easily extensible to further features not considered
here, such as, e.g., the service price.

6 ACKNOWLEDGMENTS
This research is partly supported by the EU H2020 Program, grant agreement #675320 (NECS:
European Network of Excellence in Cybersecurity). Funding has also been received by Fondazione
Cassa di Risparmio di Lucca that partially finances the regional project ReviewLand. Vittoria
Cozza is also supported by the Starting Grants Project DAKKAR (DAta benchmarK for
Keyword-based Access and Retrieval) promoted by University of Padua, Italy and


Rule Condition                                                              Confidence     |X|  |X ∩ Y|  Lift  Conviction
r1   {memberActivity=1 country=fr ==> geotransp=low}                              0.64  29,837   19,199   3.2        2.24
r2   {gender=male country=us ==> hotelPopularity=very high}                       0.59  44,703   26,505  1.78        1.64
r3   {memberActivity=1 country=us ==> guestPics=very high}                        0.34  61,155   20,926  1.71        1.22
r4   {memberActivity=1 country=us ==> geoenter=high}                              0.54  61,155  20,4316  1.68        1.48
r5   {gender=male country=us ==> hotelTripType=couple}                            0.76  44,703   34,192  1.04        1.11
r6   {memberActivity=very low revtripType=family ==> hotelTripType=couple}        0.74  27,343   20,362  1.01        1.02

Table 7: Excerpt of ARs where the premises are user features and the consequences are features
of the selected hotel. Results are sorted by decreasing lift.


Rule Condition                                                              Confidence     |X|  |X ∩ Y|  Lift  Conviction
r7   {stars=0 country=None guestPics=very low geotransp=low ==> rating=3}         0.25   9,007    2,214  1.89        1.15
r8   {stars=5 hotelPopularity=medium geofood=very high ==> rating=5}              0.76   2,582    1,962  1.70        2.31
r9   {memberActivity=very low gender=female guestPics=very high                    0.7   2,744    1,918  1.57        1.84
     hotelTripType=couple geoenter=very high ==> rating=5}
r10  {stars=3 country=jp ==> rating=4}                                            0.47   5,265    2,492  1.38        1.25
r11  {memberActivity star=3 guestPics=low ==> rating=4}                           0.46   4,326    1,998  1.35        1.22
r12  {star=3 geofood=very high ==> rating=4}                                      0.44   4,312    1,901  1.28        1.17
r13  {country=jp hotelTripType=business ==> rating=4}                             0.44   4,483    1,954  1.27        1.16
r14  {geoenter=very high ==> rating=5}                                             0.5  37,998   19,120  1.13        1.12

Table 8: Excerpt of CARs; the class is the review rating. Results are sorted by decreasing lift.


Fondazione Cariparo, Padua, Italy. The first author would like to thank Giorgio Maria Di
Nunzio for his helpful support.

REFERENCES
[1] Silvana Aciar. 2009. Mining context information from consumer’s Reviews. Proceedings of
    the Context-Aware Recommender Systems (CARS) Workshop (2009).
[2] Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules
    in Large Databases. In Proceedings of the 20th International Conference on Very Large Data
    Bases (VLDB ’94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 487–499.
[3] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. 1997. Dynamic Itemset
    Counting and Implication Rules for Market Basket Data. In Proceedings of the 1997 ACM
    SIGMOD International Conference on Management of Data (SIGMOD ’97). ACM, New York, NY, USA,
    255–264.
[4] Elian Carsenat. 2013. Onomastics and Big Data Mining. CoRR abs/1310.6311 (2013).
    http://arxiv.org/abs/1310.6311
[5] Fabio Celli, F. Marta L. Di Lascio, Matteo Magnani, Barbara Pacelli, and Luca Rossi. 2010.
    Social Network Data and Practices: The Case of Friendfeed. In Advances in Social Computing.
    LNCS, Vol. 6007. Springer Berlin Heidelberg, 346–353.
[6] Alessandro Colantonio, Roberto Di Pietro, Marinella Petrocchi, and Angelo Spognardi. 2015.
    Visual detection of singularities in review platforms. In 30th Annual ACM Symposium on
    Applied Computing. 1294–1295.
[7] Roberto Di Pietro, Marinella Petrocchi, and Angelo Spognardi. 2014. A Lot of Slots -
    Outliers Confinement in Review-Based Systems. In Web Information Systems Engineering Part I.
    15–30.
[8] Ruihai Dong and Barry Smyth. 2016. From More-Like-This to Better-Than-This: Hotel
    Recommendations from User Generated Reviews. In Proceedings of the 2016 Conference on User
    Modeling Adaptation and Personalization (UMAP ’16). ACM, New York, NY, USA, 309–310.
[9] Michela Fazzolari, Vittoria Cozza, Marinella Petrocchi, and Angelo Spognardi. 2017. A Study
    on Text-Score Disagreement in Online Reviews. Cognitive Computation 9, 5 (01 Oct 2017),
    689–701.
[10] Andrew J. Flanagin, Miriam J. Metzger, Rebekah Pure, Alex Markov, and Ethan Hartsell. 2014.
    Mitigating risk in e-commerce transactions: perceptions of information credibility and the
    role of user-generated ratings in product quality and purchase intention. Electronic
    Commerce Research 14, 1 (2014), 1–23.
[11] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H
    Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter
    11, 1 (2009), 10–18.
[12] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. 2000. Algorithms for Association
    Rule Mining: a General Survey and Comparison. SIGKDD Explor. Newsl. 2, 1 (June 2000), 58–64.
[13] Dirk Hovy, Anders Johannsen, and Anders Søgaard. 2015. User Review Sites As a Resource for
    Large-Scale Sociolinguistic Studies. In 24th International Conference on World Wide Web
    (WWW ’15). 452–461.
[14] Nikolaos Korfiatis and Marios Poulos. 2013. Using online consumer reviews as a source for
    demographic recommendations: A case study using online travel reviews. Expert Systems with
    Applications 40, 14 (2013), 5507–5515.
[15] Asher Levi and Osnat Mokryn. 2014. The Social Aspect of Voting for Useful Reviews. In
    Social Computing, Behavioral-Cultural Modeling and Prediction. LNCS, Vol. 8393. Springer
    International Publishing, 293–300.
[16] Bing Liu, Wynne Hsu, and Yiming Ma. 1998. Integrating Classification and Association Rule
    Mining. In KDD. 80–86.
[17] Amanda J. Minnich, Nikan Chavoshi, Abdullah Mueen, Shuang Luan, and Michalis Faloutsos.
    2015. TrueView: Harnessing the Power of Multiple Review Sites. In 24th International
    Conference on World Wide Web (WWW ’15). 787–797.
[18] Syed A. Rahman, Tazin Afrin, and Don Adjeroh. 2015. Determinants of User Ratings in Online
    Business Rating Services. In Social Computing, Behavioral-Cultural Modeling, and Prediction.
    LNCS, Vol. 9021. Springer International Publishing, 412–420.
[19] Koji Takuma, Junya Yamamoto, Sayaka Kamei, and Satoshi Fujita. 2016. A Hotel Recommendation
    System Based on Reviews: What Do You Attach Importance To?. In Fourth International
    Symposium on Computing and Networking, CANDAR 2016, Hiroshima, Japan, November 22-25, 2016.
    710–712.



</pre>