=Paper=
{{Paper
|id=Vol-2083/paper-09
|storemode=property
|title=Mining Implicit Data Association from Tripadvisor Hotel Reviews
|pdfUrl=https://ceur-ws.org/Vol-2083/paper-09.pdf
|volume=Vol-2083
|authors=Vittoria Cozza,Marinella Petrocchi,Angelo Spognardi
|dblpUrl=https://dblp.org/rec/conf/edbt/CozzaPS18
}}
==Mining Implicit Data Association from Tripadvisor Hotel Reviews==
Vittoria Cozza, Department of Information Engineering, University of Padua, Padua, Italy, vittoria.cozza@dei.unipd.it

Marinella Petrocchi, IIT-CNR, Pisa, Italy, marinella.petrocchi@iit.cnr.it

Angelo Spognardi, Dipartimento di Informatica, Sapienza Università di Roma, Rome, Italy, spognardi@di.uniroma1.it

ABSTRACT

In this paper, we analyse a dataset of hotel reviews. In detail, we enrich the review dataset by extracting additional features, consisting of information on the reviewers’ profiles and the reviewed hotels. We argue that the enriched data can provide insights into the factors that most influence consumers when composing reviews (e.g., whether the appreciation for a certain kind of hotel is tied to specific user profiles). Thus, we apply statistical analyses to reveal whether specific characteristics of reviewers are (almost) always related to specific characteristics of hotels. Our experiments are carried out on a very large dataset, consisting of around 190k hotel reviews collected from the Tripadvisor website.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

Social media, forums, and blogs are privileged vehicles for posting and spreading online reviews. Among the goods and services that are discussed every day on the Internet, we can find those belonging to the most disparate categories, e.g., food, clothes, music, and toys. In particular, the practice of choosing and booking preferred destinations has been greatly eased by the possibility for users to consult previous feedback about hotels and restaurants. According to comScore Media Metrix (https://www.comscore.com/Products/Audience-Analytics/Media-Metrix; all sites cited in this paper were last accessed December 23, 2017), Tripadvisor is the world’s largest travel e-advice site, providing advice reported by actual travellers. Tripadvisor counts more than 87 million visitors per month (https://www.comscore.com/Insights/Rankings, statistics updated to June 2017).

Not only common users, but also service providers have strong motivations to analyse the myriads of posts, tweets, and comments available online. The latter will benefit by adjusting, e.g., their product lines and advertisement campaigns, while the former will benefit by relying on previous experiences for addressing their needs and matching their expectations. Furthermore, online reviews are a precious source of information, e.g., to unveil implicit and/or unexpected characteristics of the reviewers. As an example, in [13] the authors investigate if and how the words in a review, and their use, are linked to the reviewer’s gender, country, and age. In [8], the authors present a novel approach to build feature-based user profiles and item descriptions by mining user-generated reviews. Such additional information can be integrated into recommender systems to deliver better recommendations and an improved user experience.

In our previous work [9], we exploited a Tripadvisor dataset in order to investigate how the subjectivity of reviewers affects the scores assigned to hotels. There, we leveraged sentiment analysis techniques to identify mismatches between the text and the score in online review platforms.

Since several aspects can influence the customer experience (e.g., the hotel price, the presence of restaurants, cafes, and discos in the hotel neighborhood, the connections with bus/train stations and airports, etc.), in this work we propose an automatic approach, based on association rules, to understand which factors most influence consumers’ reviews. We consider a very large dataset consisting of around 190k hotel reviews collected from Tripadvisor, enriching the dataset by extracting a series of hotel-centric and reviewer-centric features. We leverage these features to list correlations among hotel properties, reviewers’ characteristics, and the review score. The results are obtained by applying association rule techniques to our dataset. Some findings are expected, such as that the hotels close to entertainment and food areas are ranked with the highest scores. Others are less intuitive, such as that reviewers featuring a very low activity (measured with a lower bound in terms of given reviews), considering their stay in a particular area, very often select hotels with a low number of transportation means in the neighbourhood.

We argue that, with our approach, sociologists and marketing experts could analyse the results of the association rules to better understand some extra characteristics of the reviewers and their connections with the reviewed service. This kind of analysis paves the way for surveying a larger segment of the population than that usually interviewed through standard polls.

2 DATASET

We ground our study in a dataset composed of real reviews taken from the Tripadvisor website (http://www.tripadvisor.com). In particular, our dataset contains all the reviews that can be accessed on the website between the 26th of June 2013 and the 25th of June 2014 (the date of the newest extracted review) for hotels in New York, Rome, Paris, Rio de Janeiro, and Tokyo. With a straightforward approach, we were able to collect the following pieces of information for each review:
• the review date, text, and numeric score;
• the reviewer username, location, and triptype, being the type of trip, one among the following five categories: Family, Friends, Couple, Solo Traveler, and Businessman;
• the ID of the hotel which the review refers to.

In addition to the above elements, we collected from Tripadvisor all the hotels of the considered reviews and included in our review dataset some additional data regarding the reviewed hotels. In particular, leveraging the ID of the hotel which the review refers to, we have gathered
• the hotel name and full address (where full address includes the street address, the city, and the country);
• the category of the hotel (number of stars);
• the number of guest pictures for the hotel.

It is worth noting that the above lists are not exhaustive, i.e., they do not represent all the information accessible from Tripadvisor. As an example, further information available for a review includes the scores assigned by reviewers to specific aspects of a hotel, like location, cleanliness, sleep quality, rooms, and service. However, for the scope of the current work, we focus on the items summarised for the reader’s convenience in Table 1. We exploited such pieces of information to further expand the dataset with enriched features, as described in Section 2.1.

Basic information
Review: Date, Text, Score, Reviewer username, Reviewer location, Triptype, Hotel ID
Hotel: Name, Street address, City, Country, Guest pictures
Table 1: Considered information in the basic dataset

We have discarded reviews by “Anonymous” users, since they represent users of the platform http://www.daodao.com (the Chinese version of Tripadvisor), where all the reviewers are indifferently grouped in this single virtual username. We have further limited our analysis to reviews whose textual part is in English, following the language identification and analysis approach presented in [5]. While the reviews accessible from Tripadvisor in the year under investigation are 353,167, after the pre-processing the resulting dataset is made up of 189,304 reviews in English, provided by 142,583 registered Tripadvisor users who reviewed 4,019 hotels. Table 1 recaps the information extracted from the dataset, while Table 2 shows the distribution of the reviews per given score value. As shown, the distribution is highly unbalanced, with the highest score being the most frequent in the dataset (reflecting the distribution usually featured by review platforms).

Rating Value Occurrences
1 6,504
2 8,826
3 24,627
4 64,949
5 84,398
Table 2: Distribution of the given scores in the dataset

Hereafter, we will refer to this dataset as the basic dataset. In the following, we extract hotel-centric and reviewer-centric features to enrich the basic set (see Section 2.1).

2.1 Hotel-centric and reviewer-centric features

Starting from the information collected in the basic dataset, we have augmented it through some further elaboration. In particular, we enriched the data regarding the reviewed hotel with the following features:
• the popularity, defined as the number of reviews for a given hotel. Although we do not have the list of actual bookings, and Tripadvisor does not require reviewers to prove that they were guests of the hotel, this feature, when computed on a large number of reviews per hotel, could indirectly act as a quantification of the actual hotel clients;
• the hotel triptype, defined as the most frequent reviewer triptype for a given hotel (where triptypes are Families, Friends, Couples, Solo Travelers, and Businessmen);
• the geospatial coordinates (latitude and longitude);
• three points of interest (POI) features, defined as the number of transportation services, restaurants, and attractions, respectively, in a range of 300 meters around the hotel.

Popularity and Hotel triptype have been computed by looking at how many and which kind of reviewers have reviewed the hotel. The geospatial coordinates have been calculated with the Google Places APIs (https://developers.google.com/places), starting from the hotel name and full address. Then, latitude and longitude, together with the parameter “radius=300”, have been given as input to the Google Radarsearch API (https://maps.googleapis.com/maps/api/place/radarsearch) to find the number of points of interest (POI) related to transportation, food, and entertainment.

The data regarding a reviewer, instead, have been enriched with the following features:
• the reviewers’ activity, defined as the number of reviews they have written (during the observation period). Our intuition is that this feature could be useful to discriminate between frequent travelers and sporadic ones;
• the gender of the reviewer. This feature has been extracted with the Namsor Onomastics (http://api.namsor.com/onomastics/api) machine learning tool, able to recognise the language behind a name, thus identifying the gender according to that language vocabulary with high accuracy [4].

After cleaning the username from numbers and symbols and splitting it in two parts (where one is likely to be the name and the other one, when available, the surname), we have called the “onomastics/api/json/gendre” API. This service takes as input name and surname and returns the recognised gender. We have used regular expressions both to clean the username from symbols and numbers and to split it. This was possible since, in many cases, the name and surname were separated by a space, or the surname started with an uppercase letter. Some examples of usernames are: “Eldon S”, “MeganJones88”.

Unfortunately, for a subset of reviewers, it was not possible to derive the gender from their usernames. This happened for 9,507 reviewers (corresponding to 6% of the entire reviewers set), who wrote 12,653 reviews. Examples of usernames for which it was not possible to derive the gender are Hope-and-Dreams, mistyrabbit, A TripAdvisor Member, R W, E A, Nickeykol, NawakRed, FreeTravel81. We labeled with unknown the gender of such 9,507 reviewers.
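
The username clean-up and splitting step of Section 2.1 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' actual code: the function name `split_username` and both regular expressions are hypothetical, only ASCII letters are kept, and the subsequent call to the Namsor service is deliberately left out.

```python
import re

def split_username(username):
    """Guess (name, surname) from a username; surname is '' when absent.

    Hypothetical helper: the paper does not publish its regular expressions.
    """
    # Drop digits and symbols, keeping ASCII letters and spaces only.
    cleaned = re.sub(r"[^A-Za-z ]+", "", username).strip()
    # Case 1: name and surname are separated by a space.
    parts = cleaned.split()
    if len(parts) >= 2:
        return parts[0], parts[1]
    # Case 2: the surname starts with an uppercase letter (e.g. MeganJones).
    chunks = re.findall(r"[A-Z][a-z]*|[a-z]+", cleaned)
    if len(chunks) >= 2:
        return chunks[0], chunks[1]
    # No split possible: return the cleaned string as the name.
    return cleaned, ""
```

With the two usernames quoted in the text, `split_username("Eldon S")` yields the pair `("Eldon", "S")` and `split_username("MeganJones88")` yields `("Megan", "Jones")`. A username such as mistyrabbit cannot be split and is returned whole with an empty surname.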

Features
Hotel: Popularity, Hotel Triptype, Geospatial Coordinates, Points of Interest
Reviewer: Activity, Gender
Table 3: Hotel-centric and reviewer-centric features augmenting the basic Tripadvisor dataset

It is worth noting that Popularity, Hotel triptype, and Activity have been calculated as the result of queries to the basic dataset, with the aim of making explicit some data that originally were implicit in the available information. The computation of the reviewer gender, the points of interest close to the hotel, and its geospatial coordinates is a different matter. As described above, these have been computed relying on external data sources, namely the Google Points of Interest and the Namsor database, containing 800k names and statistical information about names in each country of the world.

Table 3 recaps the hotel-centric and reviewer-centric features we used to enrich the basic dataset.

3 ASSOCIATION ANALYSIS

Association rule mining is a well known and widely applied methodology for discovering frequent patterns, correlations, and causal structures in transaction and relational databases, as well as in other information repositories [12]. Thus, given a set of items (or itemsets), association rule mining allows one to define rules predicting the occurrence of an item (or more), given the occurrence of other items in the same itemsets.

A popular application is basket data analysis, where itemsets are transactions, representing lists of items in the consumers’ baskets. An example of transaction is: {Bread, Steak, Juice, Butter, Chips, Beer}. When several others are collected, e.g., in a large database, the methodology allows one to automatically find associations like, e.g., {Bread} ⇒ {Steak} (steaks are often purchased with bread). Besides sales transactions, basket analysis can be applied to other situations like click stream tracking, spare parts ordering, and online recommendation engines, just to name a few (http://pbpython.com/market-basket-analysis.html).

An association rule (AR) is generally defined as an implication expression of the form X ⇒ Y, where X and Y are disjoint itemsets. They represent, respectively, the condition and the consequence of the rule.

The strength of an AR is commonly measured through the two metrics support and confidence. Support gives the fraction of itemsets in the dataset that contain both X and Y. Confidence says how frequently items in Y appear in itemsets that contain X. As an example, suppose we want to know the strength of the rule {Bread} ⇒ {Steak} in a dataset with 100 transactions, corresponding to 100 consumers’ baskets. Suppose that itemset {Bread, Steak} occurs 30 times, and that itemset {Bread} occurs 40 times; then the support of the rule is equal to 30/100, while its confidence is 30/40.

As discussed in [3], rules with high values for confidence and support do not always correspond to meaningful ARs, especially when working with real datasets, because the data can be unbalanced. For example, one rule could have a very high confidence, but only due to the fact that the item in the consequence is very frequent. In this case, the rule is not relevant. Conversely, one rule could have a low confidence, due to the fact that the item in the consequence is very infrequent in general, but it could still be relevant. Considering the above observation, to evaluate the statistical significance of the ARs, two other metrics are often used: lift and conviction.

Lift is defined as the confidence divided by the support of the consequence:

$$ \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cap Y)}{\mathrm{supp}(X) \cdot \mathrm{supp}(Y)} \qquad (1) $$

With respect to confidence, the lift measures the importance of the association considering also the dependence on the support of the consequence.

Conviction is defined as the ratio of the frequency of itemsets that do not contain the consequence to the frequency of incorrect predictions:

$$ \mathrm{conv}(X \Rightarrow Y) = \frac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)} \qquad (2) $$

Both lift and conviction values ranging over the (0,1) interval mean negative dependence, values above 1 mean positive dependence, and a value of 1 means independence.

When items are also divided according to different classes, it is possible to force the AR analysis to return a specific class in the consequence. The obtained rule is called “class association rule” (CAR). The CAR is an implication of the form:

$$ X \Rightarrow y, \quad \text{where } X \subseteq I \text{ and } y \in Y \qquad (3) $$

where I stands for the itemsets and Y for the classes. The definition of the aforementioned metrics holds also for CARs.

The Apriori algorithm [2, 16] is one of the most popular algorithms to find frequent itemsets, i.e., itemsets whose support ≥ minsup.

In this work, we apply association rule mining to the hotel reviews scenario. Each itemset corresponds to a distinct review, and it is a vector whose components are the values of the features extracted and detailed in Section 2. The same features are reported in Tables 4, 5, 6 for the reader’s convenience, together with additional information that is useful here. CAR analysis can be applied when considering also the class, which in our scenario corresponds to the review score, a discrete value with a range between 1 and 5.

To enable the application of the Apriori algorithm, we have first discretised those features that natively ranged over a large set of values. As an example, in Table 5, a very low label for Guest Pictures indicates a hotel with a number of pictures ranging from 0 to 11. Still in that table, a medium label for Popularity means a hotel that has been reviewed n times, where n ranges over [433, 1156]. The values in Table 6 should be read as follows: looking at the first line of the “Geo Food” part of the table, our review set contains 37,851 reviews about a hotel which has a number of restaurants in the range [0, 37] within a radius of 300 m. Indeed, many different reviews are on the same hotels, the number of hotels reviewed being equal to 4,019 (see Section 2).

All the tables also report the Frequency indication, i.e., how many reviews correspond to each value of each feature (quite obviously, the sum of the values in the Frequency column equals the total number of reviews considered, 189,304).
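
The discretisation of count-valued features into labels can be illustrated with a short sketch. The bin boundaries below are copied from Table 5 (Guest Pictures and HotelPopularity); the function name `discretise`, the constants, and the use of `bisect` are assumptions made for illustration, and the paper does not state how the boundaries themselves were chosen.

```python
from bisect import bisect_right

# Each entry is (lower bounds of the bins, labels), copied from Table 5.
GUEST_PICTURES = ([0, 12, 105, 272, 526],
                  ["very low", "low", "medium", "high", "very high"])
HOTEL_POPULARITY = ([0, 433, 1157], ["low", "medium", "high"])

def discretise(value, bins):
    """Map a non-negative count to the label of the bin that contains it."""
    lower_bounds, labels = bins
    if value < lower_bounds[0]:
        raise ValueError("value below the first bin")
    # bisect_right returns how many lower bounds are <= value,
    # so subtracting one gives the index of the containing bin.
    return labels[bisect_right(lower_bounds, value) - 1]
```

For instance, a hotel with 11 guest pictures falls in the very low bin, and a hotel reviewed 800 times falls in the medium popularity bin, in line with the ranges [0, 11] and [433, 1156] discussed in the text.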
Activity Gender
value frequency value frequency
up to 5 reviews 138,419 male 102,565
6 or more 50,885 female 74,086
unknown 12,653
TripType Month
value frequency value frequency value frequency
solo 9,795 January 14,280 July 18,096
couple 66,557 February 11,440 August 16,870
family 35,833 March 14,146 September 18,466
friends 19,621 April 16,120 October 18,616
business 23,600 May 18,978 November 14,149
unspecified 33,898 June 14,937 December 13,206
Table 4: Discretised features on reviewers
Stars Guest Pictures
value frequency label: range frequency
1 694 very low: (0-11) 39,932
2 8,224 low: (12-104) 37,790
3 55,464 medium: (105-271) 37,882
4 83,584 high: (272-525) 37,840
5 15,456 very high: >=526 37,860
unspecified 25,882
HotelPopularity Hotel Trip Type Country
label: range frequency value frequency value frequency
low: (0-432) 63,165 solo 974 Italy (it) 31,224
medium: (433-1156) 63,093 couple 139,860 United States (us) 83,605
high: >=1157 63,046 friends 672 Brazil (br) 3,631
family 20,429 Japan (jp) 11,966
business 9,112 France (fr) 40,621
unspecified 18,257 unspecified 18,257
Table 5: Discretised features on hotels
Geo Food Geo Entertainment Geo Transport
label: range frequency label: range frequency label: range frequency
very low: (0-37) 37,851 very low: (0-3) 38,282 very low: (0) 34,471
low: (38-136) 37,921 low: (4-15) 37,602 low: (1-3) 38,079
medium: (137-197) 36,827 medium: (16-35) 37,814 medium: (4-11) 40,382
high: (198-199) 21,848 high: (36-63) 37,708 high: (12-18) 39,674
very high: >=200 54,857 very high: >=64 37,998 very high: >=19 36,698
Table 6: Discretised geolocation-based features
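
Once every review is encoded as a set of discretised feature values, the frequent-itemset search performed by an Apriori-style algorithm can be sketched as below. This is a compact, unoptimised illustration and not the Weka implementation: the function name `apriori`, the "feature=value" string encoding, and the four toy reviews in `TOY_REVIEWS` are all hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= minsup:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= minsup:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy reviews, each a set of "feature=value" items.
TOY_REVIEWS = [
    {"country=fr", "geotransp=low", "rating=5"},
    {"country=fr", "geotransp=low", "rating=4"},
    {"country=fr", "geotransp=high", "rating=5"},
    {"country=us", "geotransp=low", "rating=5"},
]
```

Calling `apriori(TOY_REVIEWS, 0.5)` keeps the three single items and the three pairs that occur in at least half of the toy reviews, while the triple is discarded because only one review contains it. Rules are then derived from the frequent itemsets and ranked by the metrics of Section 3.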

In order to find ARs and CARs, we applied the Weka framework [11] implementation of Apriori. The Weka Apriori implementation allows one to rank the rules according to different metrics. Among them, we rely on confidence, lift, and conviction. For AR analysis we generate a large number of rules with lift above 1. For CAR analysis, we generate a large number of rules with confidence above 0.2 and then we compute the lift (since, for CAR, Weka does not natively include the ranking based on lift). We finally select the rules with lift greater than 1.

For both the generated ARs and CARs, we then manually select the most interesting rules among those with the highest lift and conviction. Table 7 and Table 8 report an excerpt of the results for both scenarios.

3.1 Discussion

Association analysis results are reported in Table 7 and Table 8. Please notice that we only consider those rules that lead to a lift and conviction greater than 1. It is worth noting that |X ∩ Y|, when divided by the size of the dataset, corresponds to the support of the given rule.
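
The metrics reported for each rule can be recomputed from raw counts, which gives a quick consistency check. The sketch below is an illustration with a hypothetical function name (`rule_metrics`). It uses the definitions of Section 3, with lift written as confidence divided by the support of the consequence, and applies them to rule r14 using |X| and |X ∩ Y| from Table 8, the number of reviews with score 5 from Table 2, and the dataset size of 189,304 reviews.

```python
def rule_metrics(n_total, n_x, n_xy, n_y):
    """Compute support, confidence, lift, and conviction from raw counts.

    n_total: number of itemsets (reviews) in the dataset
    n_x:     itemsets matching the condition X
    n_xy:    itemsets matching both X and the consequence Y
    n_y:     itemsets matching the consequence Y
    """
    support = n_xy / n_total
    confidence = n_xy / n_x
    supp_y = n_y / n_total
    lift = confidence / supp_y
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - supp_y) / (1 - confidence)
    return support, confidence, lift, conviction

# Rule r14: {geoenter=very high ==> rating=5}.
support, confidence, lift, conviction = rule_metrics(
    n_total=189_304, n_x=37_998, n_xy=19_120, n_y=84_398)
```

Rounded to two decimals, the resulting confidence, lift, and conviction are 0.5, 1.13, and 1.12, matching the r14 row of Table 8. The same function reproduces the Bread and Steak example of Section 3: with 100 baskets, 40 containing Bread and 30 containing both, support is 0.3 and confidence is 0.75 (the count of baskets containing Steak affects only lift and conviction).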

We summarise the main findings as follows. Rule r1 states that reviewers featuring a very low activity, considering their stay in France, very often select hotels with a low number of transportation means in the neighbourhood. The rule holds for 19,199 reviews, over a total of 29,837 reviews with equal premises. Rule r2 says that males visiting the US prefer hotels with a high popularity. Rule r7 says that, when the hotel has few transportation means in the neighbourhood, and the number of stars for that hotel is unknown (this may correspond to accommodation facilities like hostels), its rating is equal to 3. Rule r10 states that Japanese people staying in 3-star hotels rate those hotels with a score equal to 4. Rule r14 in Table 8 states that hotels close to entertainment venues, which account for 37,998 reviews, are scored with the top score 5 about 50% of the time.

This kind of study provides a general approach for a preliminary data exploration. While the explanation for certain rules is very intuitive, well-grounded justification for others is left to experts in the field. We argue that this kind of analysis corresponds to a preliminary step, useful for suggesting which extra features could be exploited to build an enhanced hotel recommendation system. Also, we acknowledge that the analysis is based on the available (direct or indirect) information, obtained from the Tripadvisor website. More detailed features could consider elements like price or number of guests. This would allow one to obtain other interesting rules, which remain an exclusive prerogative of the hoteliers.

Rule Condition Confidence |X| |X ∩ Y| Lift Conviction
r1 {memberActivity=1 country=fr ==> geotransp=low} 0.64 29,837 19,199 3.2 2.24
r2 {gender=male country=us ==> hotelPopularity=very high} 0.59 44,703 26,505 1.78 1.64
r3 {memberActivity=1 country=us ==> guestPics=very high} 0.34 61,155 20,926 1.71 1.22
r4 {memberActivity=1 country=us ==> geoenter=high} 0.54 61,155 20,4316 1.68 1.48
r5 {gender=male country=us ==> hotelTripType=couple} 0.76 44,703 34,192 1.04 1.11
r6 {memberActivity=very low revtripType=family ==> hotelTripType=couple} 0.74 27,343 20,362 1.01 1.02
Table 7: Excerpt of ARs where user features are the premises and features of the selected hotel are the consequences; results are sorted by decreasing lift

Rule Condition Confidence |X| |X ∩ Y| Lift Conviction
r7 {stars=0 country=None guestPics=very low geotransp=low ==> rating=3} 0.25 9,007 2,214 1.89 1.15
r8 {stars=5 hotelPopularity=medium geofood=very high ==> rating=5} 0.76 2,582 1,962 1.70 2.31
r9 {memberActivity=very low gender=female guestPics=very high hotelTripType=couple geoenter=very high ==> rating=5} 0.7 2,744 1,918 1.57 1.84
r10 {stars=3 country=jp ==> rating=4} 0.47 5,265 2,492 1.38 1.25
r11 {memberActivity star=3 guestPics=low ==> rating=4} 0.46 4,326 1,998 1.35 1.22
r12 {star=3 geofood=very high ==> rating=4} 0.44 4,312 1,901 1.28 1.17
r13 {country=jp hotelTripType=business ==> rating=4} 0.44 4,483 1,954 1.27 1.16
r14 {geoenter=very high ==> rating=5} 0.5 37,998 19,120 1.13 1.12
Table 8: Excerpt of CARs, where the class is the review rating; results are sorted by decreasing lift

4 RELATED WORK

E-advice technology offers a form of “electronic word-of-mouth”, with new potential for gathering valid suggestions that guide the consumer’s choice. Extensive and nationally representative surveys have been carried out in the recent past, “to evaluate the specific aspects of ratings information that affect people attitudes toward e-commerce”. It is the case, e.g., of the work in [10], which highlights how people, while taking into account the average of ratings for a product, still do not take care of the number of reviews leading to that average. Recent work showed that, instead of showing first to the users the reviews with the highest scores, a different order, based, e.g., on the user profile, could be considered [8]: that work integrates new features based on the user profile into recommender systems, to deliver better recommendations and provide an improved user experience. Similarly, in [19], the authors focus on score values given by previous contributors whose preferences are close to the user’s preference. Almost a decade ago, the work in [1] applied text mining tools to online reviews to define rule sets that identify contextual information in the texts, going beyond a mere order of numerical scores. Similarly to our work, they rely on Tripadvisor, focusing however on text analysis only.

However, the cited literature proposes systems that recommend a service based on the intrinsic characteristics of that service (e.g., characteristics of the hotel and its facilities). Other works, similar to ours, investigate if, and how, the review data hide social and/or economic information of the reviewers. One example is mining reviews to exploit them as a textual resource for sociolinguistic studies at a large scale, as done in [13]. This work leverages the size of the reviews corpus as a more statistically solid base for the analysis, with respect to manually-collected corpora. Since review sites, such as Trustpilot (https://www.trustpilot.com/), may contain reviewer metadata like, e.g., age, gender and location, the work highlights gender-specific lexical differences, the distribution of regional markers, spelling variations, and the use of grammatical constructions across the reviewers.

The work from [17], which focused on review manipulation, exploits reviewer-centric and hotel-centric features to identify outliers: the work compares hotel reviews and related features across different review sites, outperforming the detection of suspicious hotels obtained by checking the reviews on sites in isolation. Relying on visualization tools, the authors of [6] highlight suspicious changes in review scores, while the work in [7] proposes new score aggregators to make review systems robust with respect to the injection of fake scores.

Research effort has also been spent to understand which factors make a review be perceived as useful: in [15], the authors highlight how the reviewer history is a dominant factor for a review to be voted as useful or not. The authors of [14] propose to use the reviews as a source for demographic recommendations.

In this work we enhance the review dataset with additional features based on characteristics of the reviewer (e.g., gender) and the hotel (e.g., popularity and the neighbourhood). On the contrary, the work in [18] studies how, independently from the type of service or the type of reviewer, the scores may be affected by external factors, such as the weather conditions and the daylight length of the service cities. We leverage an extensive experimental campaign, addressing around 190k real reviews, which leads to the provision of statistically sound results. Addressing data at a large scale has been done also in [13], which already targeted users’ reviews as a rich source of information for sociolinguistic studies. While they achieve correlations between metadata in the reviewers’ profile and the review text to let writing styles emerge, we highlight association evidence among hotels and reviewers features and the reviewer’s attitude to score the hotel.

5 CONCLUSIONS

We focused on hotel reviews to investigate which factors could impact the scores that reviewers assign to hotels throughout the world. First, we have enriched review data with novel hotel-centric and reviewer-centric features, obtained for example through linked data information available from the web; then, we have applied association rule mining to focus on the features possibly motivating the classification scores.

The approach can help both consumers and providers: the former could achieve a better awareness of how to read the reviews, the latter of how to improve their services. The providers can also query a very large segment of the population, in an automatic way and without relying on standard interviews.

The proposed technique is also applicable to a wide range of services: accommodation, car rental, and food services, to cite a few. Since association rule mining is parametric with respect to the itemsets in input, the approach is easily extensible to further features not considered here, such as, e.g., the service price.

6 ACKNOWLEDGMENTS

This research is partly supported by the EU H2020 Program, grant agreement #675320 (NECS: European Network of Excellence in Cybersecurity). Funding has also been received from Fondazione Cassa di Risparmio di Lucca, which partially finances the regional project ReviewLand. Vittoria Cozza is also supported by the Starting Grants Project DAKKAR (DAta benchmarK for Keyword-based Access and Retrieval) promoted by University of Padua, Italy and Fondazione Cariparo, Padua, Italy. The first author would like to thank Giorgio Maria Di Nunzio for his helpful support.

REFERENCES

[1] Silvana Aciar. 2009. Mining context information from consumer’s Reviews. Proceedings of the Context-Aware Recommender Systems (CARS) Workshop (2009).

[2] Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 487–499.

[3] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. 1997. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD ’97). ACM, New York, NY, USA, 255–264.

[4] Elian Carsenat. 2013. Onomastics and Big Data Mining. CoRR abs/1310.6311 (2013). http://arxiv.org/abs/1310.6311

[5] Fabio Celli, F. Marta L. Di Lascio, Matteo Magnani, Barbara Pacelli, and Luca Rossi. 2010. Social Network Data and Practices: The Case of Friendfeed. In Advances in Social Computing. LNCS, Vol. 6007. Springer Berlin Heidelberg, 346–353.

[6] Alessandro Colantonio, Roberto Di Pietro, Marinella Petrocchi, and Angelo Spognardi. 2015. Visual detection of singularities in review platforms. In 30th Annual ACM Symposium on Applied Computing. 1294–1295.

[7] Roberto Di Pietro, Marinella Petrocchi, and Angelo Spognardi. 2014. A Lot of Slots - Outliers Confinement in Review-Based Systems. In Web Information Systems Engineering Part I. 15–30.

[8] Ruihai Dong and Barry Smyth. 2016. From More-Like-This to Better-Than-This: Hotel Recommendations from User Generated Reviews. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization (UMAP ’16). ACM, New York, NY, USA, 309–310.

[9] Michela Fazzolari, Vittoria Cozza, Marinella Petrocchi, and Angelo Spognardi. 2017. A Study on Text-Score Disagreement in Online Reviews. Cognitive Computation 9, 5 (01 Oct 2017), 689–701.

[10] Andrew J. Flanagin, Miriam J. Metzger, Rebekah Pure, Alex Markov, and Ethan Hartsell. 2014. Mitigating risk in e-commerce transactions: perceptions of information credibility and the role of user-generated ratings in product quality and purchase intention. Electronic Commerce Research 14, 1 (2014), 1–23.

[11] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter 11, 1 (2009), 10–18.

[12] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. 2000. Algorithms for Association Rule Mining: a General Survey and Comparison. SIGKDD Explor. Newsl. 2, 1 (June 2000), 58–64.

[13] Dirk Hovy, Anders Johannsen, and Anders Søgaard. 2015. User Review Sites As a Resource for Large-Scale Sociolinguistic Studies. In 24th International Conference on World Wide Web (WWW ’15). 452–461.

[14] Nikolaos Korfiatis and Marios Poulos. 2013. Using online consumer reviews as a source for demographic recommendations: A case study using online travel reviews. Expert Systems with Applications 40, 14 (2013), 5507–5515.

[15] Asher Levi and Osnat Mokryn. 2014. The Social Aspect of Voting for Useful Reviews. In Social Computing, Behavioral-Cultural Modeling and Prediction. LNCS, Vol. 8393. Springer International Publishing, 293–300.

[16] Bing Liu, Wynne Hsu, and Yiming Ma. 1998. Integrating Classification and Association Rule Mining. In KDD. 80–86.

[17] Amanda J. Minnich, Nikan Chavoshi, Abdullah Mueen, Shuang Luan, and Michalis Faloutsos. 2015. TrueView: Harnessing the Power of Multiple Review Sites. In 24th International Conference on World Wide Web (WWW ’15). 787–797.

[18] Syed A. Rahman, Tazin Afrin, and Don Adjeroh. 2015. Determinants of User Ratings in Online Business Rating Services. In Social Computing, Behavioral-Cultural Modeling, and Prediction. LNCS, Vol. 9021. Springer International Publishing, 412–420.

[19] Koji Takuma, Junya Yamamoto, Sayaka Kamei, and Satoshi Fujita. 2016. A Hotel Recommendation System Based on Reviews: What Do You Attach Importance To? In Fourth International Symposium on Computing and Networking, CANDAR 2016, Hiroshima, Japan, November 22-25, 2016. 710–712.