Mining implicit data association from Tripadvisor hotel reviews

Vittoria Cozza
Department of Information Engineering, University of Padua, Padua, Italy
vittoria.cozza@dei.unipd.it

Marinella Petrocchi
IIT-CNR, Pisa, Italy
marinella.petrocchi@iit.cnr.it

Angelo Spognardi
Dipartimento di Informatica, Sapienza Università di Roma, Rome, Italy
spognardi@di.uniroma1.it

ABSTRACT
In this paper, we analyse a dataset of hotel reviews. In detail, we enrich the review dataset by extracting additional features, consisting of information on the reviewers' profiles and the reviewed hotels. We argue that the enriched data can give insights into the factors that most influence consumers when composing reviews (e.g., whether the appreciation for a certain kind of hotel is tied to specific users' profiles). Thus, we apply statistical analyses to reveal whether there are specific characteristics of reviewers that are (almost) always related to specific characteristics of hotels. Our experiments are carried out on a very large dataset, consisting of around 190k hotel reviews, collected from the Tripadvisor website.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION
Social media, forums, and blogs are privileged vehicles for posting and spreading online reviews. The goods and services discussed every day on the Internet belong to the most disparate categories, e.g., food, clothes, music, and toys. In particular, the practice of choosing and booking preferred destinations has been greatly eased by the possibility for users to consult previous feedback about hotels and restaurants. According to comScore Media Metrix¹, Tripadvisor is the world's largest travel e-advice site, providing advice as reported by actual travellers. Tripadvisor counts more than 87 million visitors per month².

Not only common users, but also service providers have strong motivations to analyse the myriad posts, tweets, and comments available online. The latter will benefit by adjusting, e.g., their product lines and advertisement campaigns, while the former will benefit by relying on previous experiences to address their needs and match their expectations. Furthermore, online reviews are a precious source of information, e.g., to unveil implicit and/or unexpected characteristics of the reviewers. As an example, in [13] the authors investigate if and how the words in a review, and their use, are linked to the reviewer's gender, country, and age. In [8], the authors present a novel approach to build feature-based user profiles and item descriptions by mining user-generated reviews. Such additional information can be integrated into recommender systems to deliver better recommendations and an improved user experience.

In our previous work [9], we exploited a Tripadvisor dataset to investigate how the subjectivity of reviewers affects the scores assigned to hotels. There, we leveraged sentiment analysis techniques to identify mismatches between the text and the score in online review platforms.

Since several aspects can influence the customer experience (e.g., the hotel price, the presence of restaurants, cafes, and discos in the hotel neighbourhood, the connections with bus/train stations and airports), in this work we propose an automatic approach, based on association rules, to understand which factors most influence consumers' reviews. We consider a very large dataset consisting of around 190k hotel reviews collected from Tripadvisor, and we enrich it by extracting a series of hotel-centric and reviewer-centric features. We leverage these features to list correlations among hotel properties, reviewer characteristics, and the review score. The results are obtained by applying association rule techniques to our dataset. Some findings are expected, such as that hotels close to entertainment and food areas are ranked with the highest scores. Others are less intuitive, such as that reviewers featuring a very low activity (measured with a lower bound in terms of given reviews), considering their stay in a particular area, very often select hotels with a low number of transportation means in the neighbourhood.

We argue that, with our approach, sociologists and marketing experts could analyse the results of the association rules to better understand additional characteristics of the reviewers and their connections with the reviewed service. This kind of analysis paves the way for surveying a larger segment of the population than that usually interviewed through standard polls.

¹ https://www.comscore.com/Products/Audience-Analytics/Media-Metrix - All sites last accessed December 23, 2017.
² https://www.comscore.com/Insights/Rankings - Statistics updated to June 2017.

2 DATASET
We ground our study in a dataset composed of real reviews taken from the Tripadvisor³ website. In particular, our dataset contains all the reviews that could be accessed on the website between the 26th of June 2013 and the 25th of June 2014 (the date of the newest extracted review) for hotels in New York, Rome, Paris, Rio de Janeiro, and Tokyo. With a straightforward approach, we were able to collect the following pieces of information for each review:
• the review date, text, and numeric score;
• the reviewer username, location, and triptype, i.e., the type of trip, one among the following five categories: Family, Friends, Couple, Solo Traveler, and Businessman;
• the ID of the hotel which the review refers to.

In addition to the above elements, we collected from Tripadvisor all the hotels of the considered reviews and included in our review dataset some additional data regarding the reviewed hotels. In particular, leveraging the ID of the hotel which the review refers to, we have gathered:
• the hotel name and full address (where the full address includes the street address, the city, and the country);
• the category of the hotel (number of stars);
• the number of guest pictures for the hotel.

It is worth noting that the above lists are not exhaustive, i.e., they do not represent all the information accessible from Tripadvisor. As an example, further information available for a review includes the scores assigned by reviewers to specific aspects of a hotel, like location, cleanliness, sleep quality, rooms, and service. However, for the scope of the current work, we focus on the items summarised for the reader's convenience in Table 1. We exploited such pieces of information to further expand the dataset with enriched features, as described in Section 2.1.

Basic information
Review                  Hotel
Date                    Name
Text                    Street address
Score                   City
Reviewer username       Country
Reviewer location       Guest pictures
Triptype
Hotel ID

Table 1: Considered information in the basic dataset

We have discarded reviews by "Anonymous" users, since they represent users of the platform http://www.daodao.com (the Chinese version of Tripadvisor), where all the reviewers are indifferently grouped under this single virtual username. We have further limited our analysis to reviews whose textual part is in English, following the language identification and analysis approach presented in [5]. While the reviews accessible from Tripadvisor in the year under investigation are 353,167, after the pre-processing the resulting dataset is made up of 189,304 reviews in English, provided by 142,583 registered Tripadvisor users who reviewed 4,019 hotels. Table 1 recaps the information extracted from the dataset, while Table 2 shows the distribution of the reviews per given score value. As shown, the distribution of values is highly unbalanced, with the highest score being the most frequent in the dataset (reflecting the distribution usually featured by review platforms).

Rating Value            Occurrences
1                         6,504
2                         8,826
3                        24,627
4                        64,949
5                        84,398

Table 2: Distribution of the given scores in the dataset

Hereafter, we will refer to this dataset as the basic dataset. In the following, we extract hotel-centric and reviewer-centric features to enrich the basic set (see Section 2.1).

³ http://www.tripadvisor.com

2.1 Hotel-centric and reviewer-centric features
Starting from the information collected in the basic dataset, we have augmented it through some further elaboration. In particular, we enriched the data regarding the reviewed hotel with the following features:
• the popularity, defined as the number of reviews for a given hotel. We do not have the list of actual bookings, and Tripadvisor does not require reviewers to prove that they have been guests of the hotel. Nevertheless, this feature, when computed on a large number of reviews per hotel, could indirectly act as a quantification of the actual hotel clients;
• the hotel triptype, defined as the most frequent reviewer triptype for a given hotel (where triptypes are Families, Friends, Couples, Solo Travelers, and Businessmen);
• the geospatial coordinates (latitude and longitude);
• three points of interest (POI) features, defined as the number of transportation services, restaurants, and attractions, respectively, in a range of 300 meters around the hotel.

Popularity and Hotel triptype have been computed by looking at how many and which kind of reviewers have reviewed the hotel. The geospatial coordinates have been calculated with the Google Places APIs⁴, starting from the hotel name and full address. Then, latitude and longitude, together with the parameter "radius=300", have been given as input to the Google Radarsearch API⁵ to find the number of points of interest (POI) related to transportation, food, and entertainment.

The data regarding a reviewer, instead, have been enriched with the following features:
• the reviewers' activity, defined as the number of reviews they have written (during the observation period). Our intuition is that this feature could be useful to discriminate between frequent travelers and sporadic ones;
• the gender of the reviewer. This feature has been extracted with the Namsor Onomastics⁶ machine learning tool, which is able to recognise the language behind a name, thus identifying the gender according to that language's vocabulary with high accuracy [4].

After cleaning the username from numbers and symbols and splitting it in two parts (where one is likely to be the name and the other one, when available, the surname), we have called the "onomastics/api/json/gendre" API. This service takes as input name and surname and returns the recognised gender. We have used regular expressions both to clean the username from symbols and numbers and to split it. This was possible since, in many cases, the name and surname were separated by a space, or the surname started with an uppercase letter. Some examples of usernames are: "Eldon S", "MeganJones88".

Unfortunately, for a subset of reviewers, it was not possible to derive the gender from their usernames. This happened for 9,507 reviewers (corresponding to 6% of the entire reviewers set), who wrote 12,653 reviews. Examples of usernames for which it was not possible to derive the gender are Hope-and-Dreams, mistyrabbit, A TripAdvisor Member, R W, E A, Nickeykol, NawakRed, FreeTravel81. We labeled the gender of these 9,507 reviewers as unknown.

It is worth noting that Popularity, Hotel triptype, and Activity have been calculated as the result of queries to the basic dataset, with the aim of making explicit some data that were originally implicit in the available information. The reviewer gender, the points of interest close to the hotel, and its geospatial coordinates are a separate case. As described above, they have been computed relying on external data sources, namely the Google Points of Interest and the Namsor database, which contains 800k names and statistical information about names in each country of the world.

Table 3 recaps the hotel-centric and reviewer-centric features we used to enrich the basic dataset.

Features
Hotel                     Reviewer
Popularity                Activity
Hotel Triptype            Gender
Geospatial Coordinates
Points of Interest

Table 3: Hotel-centric and reviewer-centric features augmenting the basic Tripadvisor dataset

⁴ https://developers.google.com/places
⁵ https://maps.googleapis.com/maps/api/place/radarsearch
⁶ http://api.namsor.com/onomastics/api

3 ASSOCIATION ANALYSIS
Association rule mining is a well known and widely applied methodology for discovering frequent patterns, correlations, and causal structures in transaction and relational databases, as well as in other information repositories [12]. Given a set of items (or itemsets), association rule mining allows one to define rules predicting the occurrence of one or more items, given the occurrence of other items in the same itemsets.

A popular application is basket data analysis, where itemsets are transactions, representing lists of items in the consumers' baskets. An example of a transaction is: {Bread, Steak, Juice, Butter, Chips, Beer}. When many such transactions are collected, e.g., in a large database, the methodology allows one to automatically find associations like, e.g., {Bread} ⇒ {Steak} (steaks are often purchased with bread). Besides sales transactions, basket analysis can be applied to other situations like click stream tracking, spare parts ordering, and online recommendation engines, just to name a few⁷.

An association rule (AR) is generally defined as an implication expression of the form X ⇒ Y, where X and Y are disjoint itemsets. They represent, respectively, the condition and the consequence of the rule.

The strength of an AR is commonly measured through two metrics, support and confidence. Support gives the fraction of itemsets in the dataset that contain both X and Y. Confidence says how frequently items in Y appear in itemsets that contain X. As an example, suppose we want to know the strength of the rule {Bread} ⇒ {Steak} in a dataset with 100 transactions, corresponding to 100 consumers' baskets. If itemset {Bread, Steak} occurs 30 times and itemset {Bread} occurs 40 times, then the support of the rule is equal to 30/100, while its confidence is 30/40.

As discussed in [3], rules with high values for confidence and support do not always correspond to meaningful ARs, especially when working with real datasets, because data can be unbalanced. For example, a rule could have a very high confidence only because the item in the consequence is very frequent. In this case, the rule is not relevant. Conversely, a rule could have a low confidence because the item in the consequence is very infrequent in general, and yet it could still be relevant. Considering the above observation, two other metrics are often used to evaluate the statistical significance of ARs: lift and conviction.

Lift is defined as the confidence divided by the support of the consequence:

\[ \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cap Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)} \tag{1} \]

With respect to confidence, the lift measures the importance of the association while also taking into account the support of the consequence.

Conviction is defined as the ratio of the frequency of itemsets that do not contain the consequence to the frequency of incorrect predictions:

\[ \mathrm{conv}(X \Rightarrow Y) = \frac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)} \tag{2} \]

For both lift and conviction, values in the (0,1) interval mean negative dependence, values above 1 mean positive dependence, and a value of 1 means independence.

When items are also divided according to different classes, it is possible to force the AR analysis to return a specific class in the consequence. The obtained rule is called a "class association rule" (CAR). A CAR is an implication of the form:

\[ X \Rightarrow y, \quad \text{where } X \subseteq I \text{ and } y \in Y \tag{3} \]

where I stands for the itemsets and Y for the classes. The definitions of the aforementioned metrics also hold for CARs.

The Apriori algorithm [2, 16] is one of the most popular algorithms to find frequent itemsets, i.e., itemsets whose support ≥ minsup.

In this work, we apply association rule mining to the hotel reviews scenario. Each itemset corresponds to a distinct review, and it is a vector whose components are the values of the features extracted and detailed in Section 2. The same features are reported in Tables 4, 5, and 6 for the reader's convenience, together with additional information that is useful here. CAR analysis can be applied when also considering the class, which in our scenario corresponds to the review score, a discrete value ranging between 1 and 5.

To enable the application of the Apriori algorithm, we have first discretised those features that natively ranged over a large set of values. As an example, in Table 5, a very low label for Guest Pictures indicates a hotel with a number of pictures between 0 and 11. Still in that table, a medium label for Popularity means a hotel that has been reviewed n times, where n ranges over [433, 1156]. The values in Table 6 should be read as follows: looking at the first line of the "Geo Food" part of the table, our review set contains 37,851 reviews about hotels that have a number of restaurants in the range [0, 37] within a radius of 300 meters. Many different reviews are about the same hotels, since the number of hotels reviewed is 4,019 (see Section 2).

All the tables also report the Frequency indication, i.e., how many reviews correspond to each value of each feature (the values in each Frequency column sum to the total number of reviews considered, 189,304).

⁷ http://pbpython.com/market-basket-analysis.html

Activity
value                   frequency
up to 5 reviews         138,419
6 or more                50,885

Gender
value                   frequency
male                    102,565
female                   74,086
unknown                  12,653

TripType
value                   frequency
solo                      9,795
couple                   66,557
family                   35,833
friends                  19,621
business                 23,600
unspecified              33,898

Month
value                   frequency
January                  14,280
February                 11,440
March                    14,146
April                    16,120
May                      18,978
June                     14,937
July                     18,096
August                   16,870
September                18,466
October                  18,616
November                 14,149
December                 13,206

Table 4: Discretised features on reviewers

Stars
value                   frequency
1                           694
2                         8,224
3                        55,464
4                        83,584
5                        15,456
unspecified              25,882

Guest Pictures
label: range            frequency
very low: (0-11)         39,932
low: (12-104)            37,790
medium: (105-271)        37,882
high: (272-525)          37,840
very high: >=526         37,860

HotelPopularity
label: range            frequency
low: (0-432)             63,165
medium: (433-1156)       63,093
high: >=1157             63,046

Hotel Trip Type
value                   frequency
solo                        974
couple                  139,860
friends                     672
family                   20,429
business                  9,112
unspecified              18,257

Country
value                   frequency
Italy (it)               31,224
United States (us)       83,605
Brasil (br)               3,631
Japan (jp)               11,966
France (fr)              40,621
unspecified              18,257

Table 5: Discretised features on hotels

Geo Food
label: range            frequency
very low: (0-37)         37,851
low: (38-136)            37,921
medium: (137-197)        36,827
high: (198-199)          21,848
very high: >=200         54,857

Geo Entertainment
label: range            frequency
very low: (0-3)          38,282
low: (4-15)              37,602
medium: (16-35)          37,814
high: (36-63)            37,708
very high: >=64          37,998

Geo Transport
label: range            frequency
very low: (0)            34,471
low: (1-3)               38,079
medium: (4-11)           40,382
high: (12-18)            39,674
very high: >=19          36,698

Table 6: Discretised geolocation-based features

In order to find ARs and CARs, we applied the Apriori implementation of the Weka framework [11]. The Weka Apriori implementation allows one to rank the rules according to different metrics. Among them, we rely on confidence, lift, and conviction. For AR analysis, we generate a large number of rules with lift above 1. For CAR analysis, we generate a large number of rules with confidence above 0.2 and then we compute the lift (since, for CARs, Weka does not natively include the ranking based on lift). We finally select the rules with lift greater than 1.

For both the generated ARs and CARs, we then manually select the most interesting rules among those with the highest lift and conviction. Table 7 and Table 8 report an excerpt of the results for both scenarios.

3.1 Discussion
Association analysis results are reported in Table 7 and Table 8. Note that we only consider rules that lead to a lift and a conviction greater than 1. It is worth noting that |X ∩ Y|, when divided by the size of the dataset, corresponds to the support of the given rule.

Rule  Condition                                                      Confidence  |X|     |X ∩ Y|  Lift  Conviction
r1    {memberActivity=1 country=fr ==> geotransp=low}                0.64        29,837  19,199   3.2   2.24
r2    {gender=male country=us ==> hotelPopularity=very high}         0.59        44,703  26,505   1.78  1.64
r3    {memberActivity=1 country=us ==> guestPics=very high}          0.34        61,155  20,926   1.71  1.22
r4    {memberActivity=1 country=us ==> geoenter=high}                0.54        61,155  20,4316  1.68  1.48
r5    {gender=male country=us ==> hotelTripType=couple}              0.76        44,703  34,192   1.04  1.11
r6    {memberActivity=very low revtripType=family ==>
       hotelTripType=couple}                                         0.74        27,343  20,362   1.01  1.02

Table 7: Excerpt of ARs where user features are the premises and features of the selected hotel are the consequences; results are sorted by decreasing lift

Rule  Condition                                                      Confidence  |X|     |X ∩ Y|  Lift  Conviction
r7    {stars=0 country=None guestPics=very low geotransp=low
       ==> rating=3}                                                 0.25        9,007   2,214    1.89  1.15
r8    {stars=5 hotelPopularity=medium geofood=very high
       ==> rating=5}                                                 0.76        2,582   1,962    1.70  2.31
r9    {memberActivity=very low gender=female guestPics=very high
       hotelTripType=couple geoenter=very high ==> rating=5}         0.7         2744    1918     1.57  1.84
r10   {stars=3 country=jp ==> rating=4}                              0.47        5,265   2,492    1.38  1.25
r11   {memberActivity star=3 guestPics=low ==> rating=4}             0.46        4,326   1,998    1.35  1.22
r12   {star=3 geofood=very high ==> rating=4}                        0.44        4,312   1,901    1.28  1.17
r13   {country=jp hotelTripType=business ==> rating=4}               0.44        4,483   1,954    1.27  1.16
r14   {geoenter=very high ==> rating=5}                              0.5         37,998  19,120   1.13  1.12

Table 8: Excerpt of CARs, where the class is the review rating; results are sorted by decreasing lift

We summarise the main findings as follows. Rule r1 states that reviewers featuring a very low activity, considering their stay in France, very often select hotels with a low number of transportation means in the neighbourhood. The rule holds for 19,199 reviews, out of a total of 29,837 reviews with the same premises. Rule r2 says that males visiting the US prefer hotels with a high popularity. Rule r7 says that, when the hotel has few transportation means in the neighbourhood and the number of stars for that hotel is unknown (this may correspond to accommodation facilities like hostels), its rating is equal to 3 (with confidence 0.25). Rule r10 states that reviewers staying in 3-star hotels in Japan rate those hotels with a score equal to 4. Rule r14 in Table 8 states that reviews of hotels close to entertainment venues, which number 37,998, assign the top score of 5 in 50% of cases.

This kind of study provides a general approach for a preliminary data exploration. While the explanation for certain rules is very intuitive, a well-grounded justification for others is left to experts in the field. We argue that this kind of analysis is a preliminary step, useful for suggesting which extra features could be exploited to build an enhanced hotel recommendation system. Also, we acknowledge that the analysis is based on the available (direct or indirect) information obtained from the Tripadvisor website. More detailed features could consider elements like price or number of guests. This would yield other interesting rules, but such data remain an exclusive prerogative of the hoteliers.

4 RELATED WORK
E-advice technology offers a form of "electronic word-of-mouth", with new potential for gathering valid suggestions that guide the consumer's choice. Extensive and nationally representative surveys have been carried out in the recent past "to evaluate the specific aspects of ratings information that affect people attitudes toward e-commerce". This is the case, e.g., of the work in [10], which highlights how people, while taking into account the average of ratings for a product, still do not consider the number of reviews leading to that average. Recent work showed that, instead of first showing users the reviews with the highest scores, a different order, based, e.g., on the user profile, could be considered [8]: that work integrates new features based on the user profile into recommender systems, to deliver better recommendations and provide an improved user experience. Similarly, in [19], the authors focus on score values given by previous contributors whose preferences are close to the user's preference. Almost a decade ago, the work in [1] applied text mining tools to online reviews to define rule sets that identify contextual information in the texts, going beyond a mere ordering of numerical scores. Similarly to our work, they rely on Tripadvisor, focusing however on text analysis only.

However, the cited literature proposes systems that recommend a service based on the intrinsic characteristics of that service (e.g., characteristics of the hotel and its facilities). Other works, similar to ours, investigate if, and how, the review data hide social and/or economic information about the reviewers. One example is mining reviews to exploit them as a textual resource for sociolinguistic studies at a large scale, as done in [13]. That work leverages the size of the reviews corpus as a more statistically solid base for the analysis, with respect to manually collected corpora. Since review sites, such as Trustpilot⁸, may contain reviewer metadata like, e.g., age, gender, and location, the work highlights gender-specific lexical differences, the distribution of regional markers, spelling variations, and the use of grammatical constructions across the reviewers.

The work in [17], which focuses on review manipulation, exploits reviewer-centric and hotel-centric features to identify outliers: it compares hotel reviews and related features across different review sites, and detects suspicious hotels better than checking the reviews on each site in isolation. Relying on visualization tools, the authors of [6] highlight suspicious changes in review scores, while the work in [7] proposes new score aggregators to make review systems robust with respect to the injection of fake scores.

Research effort has also been spent to understand which factors make a review be perceived as useful: in [15], the authors highlight how the reviewer history is a dominant factor in whether a review is voted as useful or not. The authors of [14] propose to use the reviews as a source for demographic recommendations.

In this work, we enhance the review dataset with additional features based on characteristics of the reviewer (e.g., gender) and the hotel (e.g., popularity and the neighbourhood). In contrast, the work in [18] studies how, independently of the type of service or the type of reviewer, the scores may be affected by external factors, such as the weather conditions and the daylight length of the service cities. We leverage an extensive experimental campaign, addressing around 190k real reviews, which leads to statistically sound results. Large-scale data have also been addressed in [13], which already targeted users' reviews as a rich source of information for sociolinguistic studies. While they find correlations between metadata in the reviewers' profiles and the review text to let writing styles emerge, we highlight association evidence among hotel features, reviewer features, and the reviewer's attitude when scoring the hotel.

⁸ https://www.trustpilot.com/

5 CONCLUSIONS
We focused on hotel reviews to investigate which factors could impact the scores that reviewers assign to hotels throughout the world. First, we enriched the review data with novel hotel-centric and reviewer-centric features, obtained for example through linked data information available from the web. Then, we applied association rule mining to focus on the features that possibly motivate the classification scores.

The approach can help both consumers and providers: the former could achieve a better awareness of how to read the reviews, the latter of how to improve their services. Providers can also query a very large segment of the population, in an automatic way and without relying on standard interviews.

The proposed technique is also applicable to a wide range of services: accommodation, car rental, and food services, to cite a few. Since association rule mining is parametric with respect to the itemsets in input, the approach is easily extensible to further features not considered here, such as, e.g., the service price.

6 ACKNOWLEDGMENTS
This research is partly supported by the EU H2020 Program, grant agreement #675320 (NECS: European Network of Excellence in Cybersecurity). Funding has also been received from Fondazione Cassa di Risparmio di Lucca, which partially finances the regional project ReviewLand. Vittoria Cozza is also supported by the Starting Grants Project DAKKAR (DAta benchmarK for Keyword-based Access and Retrieval) promoted by University of Padua, Italy and Fondazione Cariparo, Padua, Italy. The first author would like to thank Giorgio Maria Di Nunzio for his helpful support.

REFERENCES
[1] Silvana Aciar. 2009. Mining context information from consumer's Reviews. Proceedings of the Context-Aware Recommender Systems (CARS) Workshop (2009).
[2] Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 487–499.
[3] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. 1997. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD '97). ACM, New York, NY, USA, 255–264.
[4] Elian Carsenat. 2013. Onomastics and Big Data Mining. CoRR abs/1310.6311 (2013). http://arxiv.org/abs/1310.6311
[5] Fabio Celli, F. Marta L. Di Lascio, Matteo Magnani, Barbara Pacelli, and Luca Rossi. 2010. Social Network Data and Practices: The Case of Friendfeed. In Advances in Social Computing. LNCS, Vol. 6007. Springer Berlin Heidelberg, 346–353.
[6] Alessandro Colantonio, Roberto Di Pietro, Marinella Petrocchi, and Angelo Spognardi. 2015. Visual detection of singularities in review platforms. In 30th Annual ACM Symposium on Applied Computing. 1294–1295.
[7] Roberto Di Pietro, Marinella Petrocchi, and Angelo Spognardi. 2014. A Lot of Slots - Outliers Confinement in Review-Based Systems. In Web Information Systems Engineering Part I. 15–30.
[8] Ruihai Dong and Barry Smyth. 2016. From More-Like-This to Better-Than-This: Hotel Recommendations from User Generated Reviews. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization (UMAP '16). ACM, New York, NY, USA, 309–310.
[9] Michela Fazzolari, Vittoria Cozza, Marinella Petrocchi, and Angelo Spognardi. 2017. A Study on Text-Score Disagreement in Online Reviews. Cognitive Computation 9, 5 (01 Oct 2017), 689–701.
[10] Andrew J. Flanagin, Miriam J. Metzger, Rebekah Pure, Alex Markov, and Ethan Hartsell. 2014. Mitigating risk in e-commerce transactions: perceptions of information credibility and the role of user-generated ratings in product quality and purchase intention. Electronic Commerce Research 14, 1 (2014), 1–23.
[11] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter 11, 1 (2009), 10–18.
[12] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. 2000. Algorithms for Association Rule Mining: a General Survey and Comparison. SIGKDD Explor. Newsl. 2, 1 (June 2000), 58–64.
[13] Dirk Hovy, Anders Johannsen, and Anders Søgaard. 2015. User Review Sites As a Resource for Large-Scale Sociolinguistic Studies. In 24th International Conference on World Wide Web (WWW '15). 452–461.
[14] Nikolaos Korfiatis and Marios Poulos. 2013. Using online consumer reviews as a source for demographic recommendations: A case study using online travel reviews. Expert Systems with Applications 40, 14 (2013), 5507–5515.
[15] Asher Levi and Osnat Mokryn. 2014. The Social Aspect of Voting for Useful Reviews. In Social Computing, Behavioral-Cultural Modeling and Prediction. LNCS, Vol. 8393. Springer International Publishing, 293–300.
[16] Bing Liu, Wynne Hsu, and Yiming Ma. 1998. Integrating Classification and Association Rule Mining. In KDD. 80–86.
[17] Amanda J. Minnich, Nikan Chavoshi, Abdullah Mueen, Shuang Luan, and Michalis Faloutsos. 2015. TrueView: Harnessing the Power of Multiple Review Sites. In 24th International Conference on World Wide Web (WWW '15). 787–797.
[18] Syed A. Rahman, Tazin Afrin, and Don Adjeroh. 2015. Determinants of User Ratings in Online Business Rating Services. In Social Computing, Behavioral-Cultural Modeling, and Prediction. LNCS, Vol. 9021. Springer International Publishing, 412–420.
[19] Koji Takuma, Junya Yamamoto, Sayaka Kamei, and Satoshi Fujita. 2016. A Hotel Recommendation System Based on Reviews: What Do You Attach Importance To?. In Fourth International Symposium on Computing and Networking, CANDAR 2016, Hiroshima, Japan, November 22-25, 2016. 710–712.
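
As a supplementary illustration of the rule metrics defined in Section 3 (support, confidence, lift as in Equation (1), and conviction as in Equation (2)), the following minimal Python sketch recomputes them from raw counts. It is not the authors' implementation, which relies on Weka; the function names are hypothetical and chosen only for this example. It assumes that the count of the consequence for a CAR can be read from Table 2 (e.g., 84,398 reviews with score 5) and that |X| and |X ∩ Y| are taken from Table 8.

```python
# Illustrative helpers (hypothetical names, not part of Weka or of the paper).

def support(n_xy: int, n_total: int) -> float:
    """Fraction of itemsets that contain both X and Y."""
    return n_xy / n_total


def confidence(n_xy: int, n_x: int) -> float:
    """How often Y appears among the itemsets that contain X."""
    return n_xy / n_x


def lift(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Equation (1): confidence divided by the support of the consequence."""
    return confidence(n_xy, n_x) / (n_y / n_total)


def conviction(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Equation (2); undefined (division by zero) when confidence equals 1."""
    return (1 - n_y / n_total) / (1 - confidence(n_xy, n_x))


# Basket example from Section 3: 100 baskets, {Bread} in 40, {Bread, Steak} in 30.
print(support(30, 100), confidence(30, 40))  # 0.3 0.75

# Rule r14 of Table 8: {geoenter=very high ==> rating=5}.
# n_y is the number of reviews with score 5 (Table 2); n_total is the dataset size.
N_REVIEWS = 189_304
r14 = dict(n_xy=19_120, n_x=37_998, n_y=84_398, n_total=N_REVIEWS)
print(round(lift(**r14), 2), round(conviction(**r14), 2))  # 1.13 1.12
```

Under these assumptions, the recomputed values for r14 agree with the lift and conviction reported in Table 8 after rounding to two decimals; the same check can be repeated for the other CARs by taking the consequence count from Table 2.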