<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Mining implicit data association from Tripadvisor hotel reviews</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Vittoria</forename><surname>Cozza</surname></persName>
							<email>vittoria.cozza@dei.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marinella</forename><surname>Petrocchi</surname></persName>
							<email>marinella.petrocchi@iit.cnr.it</email>
							<affiliation key="aff1">
								<orgName type="institution">IIT-CNR</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Angelo</forename><surname>Spognardi</surname></persName>
							<email>spognardi@di.uniroma1.it</email>
							<affiliation key="aff2">
								<orgName type="department">Dipartimento di Informatica</orgName>
								<orgName type="institution">Sapienza Università di Roma</orgName>
								<address>
									<settlement>Rome</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Mining implicit data association from Tripadvisor hotel reviews</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">873F8733BD09AB2F8F434DE9F34468EA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T23:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we analyse a dataset of hotel reviews. In detail, we enrich the review dataset by extracting additional features, consisting of information on the reviewers' profiles and the reviewed hotels. We argue that the enriched data can yield insights into the factors that most influence consumers when composing reviews (e.g., whether the appreciation for a certain kind of hotel is tied to specific user profiles). Thus, we apply statistical analyses to reveal whether there are specific characteristics of reviewers that are (almost) always related to specific characteristics of hotels. Our experiments are carried out on a very large dataset, consisting of around 190k hotel reviews, collected from the Tripadvisor website.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Social media, forums, and blogs are privileged vehicles for posting and spreading online reviews. The goods and services discussed every day on the Internet belong to the most disparate categories, e.g., food, clothes, music, and toys. In particular, the practice of choosing and booking preferred destinations has been greatly eased by the possibility for users to consult previous feedback about hotels and restaurants. According to comScore Media Metrix<ref type="foot" target="#foot_0">1</ref>, Tripadvisor is the world's largest travel e-advice site, providing advice reported by actual travellers. Tripadvisor counts more than 87 million visitors per month<ref type="foot" target="#foot_1">2</ref>.</p><p>Not only common users, but also service providers have strong motivations to analyse the myriads of posts, tweets, and comments available online. The latter benefit by adjusting, e.g., their product lines and advertisement campaigns, while the former benefit by relying on previous experiences to address their needs and match their expectations. Furthermore, online reviews are a precious source of information, e.g., to unveil implicit and/or unexpected characteristics of the reviewers. As an example, in <ref type="bibr" target="#b12">[13]</ref> the authors investigate if and how the words in a review, and their use, are linked to the reviewer's gender, country, and age.</p><p>In <ref type="bibr" target="#b7">[8]</ref>, the authors present a novel approach to build feature-based user profiles and item descriptions by mining user-generated reviews. Such additional information can be integrated into recommender systems to deliver better recommendations and an improved user experience.</p><p>In our previous work <ref type="bibr" target="#b8">[9]</ref>, we exploited a Tripadvisor dataset in order to investigate how subjectivity of reviewers affects the scores assigned to hotels. 
To that end, we leveraged sentiment analysis techniques to identify mismatches between the text and the score in online review platforms.</p><p>Since several aspects can influence the customer experience (e.g., the hotel price, the presence of restaurants, cafés, and discos in the hotel neighbourhood, the connections with bus/train stations and airports, etc.), in this work we propose an automatic approach, based on association rules, to understand which factors most influence consumers' reviews. We consider a very large dataset consisting of around 190k hotel reviews collected from Tripadvisor, enriching the dataset by extracting a series of hotel-centric and reviewer-centric features. We leverage these features to list correlations among hotel properties, reviewers' characteristics, and the review score. The results are obtained by applying association rule techniques to our dataset. Some findings are expected, such as that the hotels close to entertainment and food areas are ranked with the highest scores. Others are less intuitive, such as that reviewers featuring a very low activity (measured with a lower bound in terms of given reviews), considering their stay in a particular area, very often select hotels with a low number of transportation means in the neighbourhood.</p><p>We argue that, with our approach, sociologists and marketing experts could analyse the results of the association rules to better understand some extra characteristics of the reviewers and their connections with the reviewed service. This kind of analysis paves the way for surveying a larger segment of the population than that usually interviewed through standard polls.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">DATASET</head><p>We grounded our study in a dataset composed of real reviews taken from the Tripadvisor<ref type="foot" target="#foot_2">3</ref> website. In particular, our dataset contains all the reviews that could be accessed on the website between the 26th of June 2013 and the 25th of June 2014 (the date of the newest extracted review) for hotels in New York, Rome, Paris, Rio de Janeiro, and Tokyo. With a straightforward approach, we were able to collect the following pieces of information for each review:</p><p>• the review date, text, and numeric score;</p><p>• the reviewer username, location, and triptype, i.e., the type of trip, one of the following five categories: Family, Friends, Couple, Solo Traveler, and Businessman;</p><p>• the ID of the hotel which the review refers to.</p><p>In addition to the above elements, we collected from Tripadvisor all the hotels of the considered reviews and included in our review dataset some additional data regarding the reviewed hotels. In particular, leveraging the ID of the hotel which the review refers to, we have gathered:</p><p>• the hotel name and full address (where full address includes the street address, the city, and the country);</p><p>• the category of the hotel (number of stars);</p><p>• the number of guest pictures for the hotel.</p><p>It is worth noting that the above lists are not exhaustive, i.e., they do not represent all the information accessible from Tripadvisor. As an example, further information available for a review includes the scores assigned by reviewers to specific aspects of a hotel, like location, cleanliness, sleep quality, rooms, and service. However, for the scope of the current work, we focus on the items summarised for the reader's convenience in Table <ref type="table" target="#tab_0">1</ref>. We exploited such pieces of information to further expand the dataset with enriched features, as described in Section 2.1. 
We have discarded reviews by "Anonymous" users, since they represent users of the platform http://www.daodao.com (the Chinese version of Tripadvisor), where all the reviewers are indiscriminately grouped under this single virtual username. We have further limited our analysis to reviews whose textual part is in English, following the language identification and analysis approach presented in <ref type="bibr" target="#b4">[5]</ref>. While 353,167 reviews were accessible from Tripadvisor in the year under investigation, after the pre-processing the resulting dataset is made up of 189,304 reviews in English, provided by 142,583 registered Tripadvisor users who reviewed 4,019 hotels. Table <ref type="table" target="#tab_0">1</ref> recaps the information extracted from the dataset, while Table <ref type="table" target="#tab_1">2</ref> shows the distribution of the reviews per given score value. As shown, the distribution of values is highly unbalanced, with the highest score being the most frequent in the dataset (reflecting the distribution usually featured by review platforms). Hereafter, we will refer to this dataset as the basic dataset. In the following, we extract hotel-centric and reviewer-centric features to enrich the basic dataset (see Section 2.1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Basic information</head><note type="other">Review</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Rating</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Hotel-centric and reviewer-centric features</head><p>Starting from the information collected in the basic dataset, we have augmented it by performing some further elaboration. In particular, we enriched the data regarding the reviewed hotel with the following features:</p><p>• the popularity, defined as the number of reviews for a given hotel. Although we do not have the list of actual bookings, and Tripadvisor does not require reviewers to prove that they have been guests in the hotel, this feature, when computed on a large number of reviews per hotel, could indirectly act as a quantification of the actual hotel clients;</p><p>• the hotel triptype, defined as the most frequent reviewer triptype for a given hotel (where triptypes are Families, Friends, Couples, Solo Travelers, and Businessmen);</p><p>• the geospatial coordinates (latitude and longitude);</p><p>• three points of interest (POI) features, defined as the number of transportation services, restaurants, and attractions, respectively, within a range of 300 meters around the hotel.</p><p>Popularity and Hotel triptype have been computed by looking at how many and which kind of reviewers have reviewed the hotel. The geospatial coordinates have been calculated with Google Places APIs<ref type="foot" target="#foot_3">4</ref>, starting from the hotel name and full address. Then, latitude and longitude, together with the parameter "radius=300", have been given as input to the Google Radarsearch API<ref type="foot" target="#foot_4">5</ref> to find the number of points of interest (POI) related to transportation, food, and entertainment.</p><p>The data regarding a reviewer, instead, have been enriched with the following features:</p><p>• the reviewers' activity, defined as the number of reviews they have written (within the observation period). 
Our intuition is that this feature could be useful to discriminate between frequent travelers and sporadic ones.</p><p>• the gender of the reviewer. This feature has been extracted with the Namsor Onomastics<ref type="foot" target="#foot_5">6</ref> machine learning tool, which is able to recognise the language behind a name, thus identifying the gender according to the vocabulary of that language with high accuracy <ref type="bibr" target="#b3">[4]</ref>.</p><p>We have used regular expressions to clean the username from numbers and symbols and to split it into two parts (where one is likely to be the name and the other one, when available, the surname). This was possible since, in many cases, the name and surname were separated by a space, or the surname started with an uppercase letter. Some examples of usernames are: "Eldon S", "MeganJones88". We have then called the "onomastics/api/json/gendre" API, a service that takes as input name and surname and returns the recognised gender. Unfortunately, for a subset of reviewers, it was not possible to derive the gender from their usernames. This happened for 9,507 reviewers (corresponding to 6% of the entire reviewers set), who wrote 12,653 reviews. Examples of usernames for which it was not possible to derive the gender are Hope-and-Dreams, mistyrabbit, A TripAdvisor Member, R W, E A, Nickeykol, NawakRed, FreeTravel81. We labeled the gender of these 9,507 reviewers as unknown. It is worth noting that Popularity, Hotel triptype, and Activity have been calculated as the result of queries to the basic dataset, with the aim of making explicit some data that originally were implicit in the available information. The computation of the reviewer gender, the points of interest close to the hotel, and its geospatial coordinates deserves a separate mention. 
As described above, these have been computed relying on external data sources, namely the Google Points of Interest and the Namsor database, which contains 800k names and statistical information about names in each country of the world.</p></div>
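<div xmlns="http://www.tei-c.org/ns/1.0"><p>The username clean-up and split described in this section can be sketched as follows. This is only a minimal Python illustration and not the implementation used for the experiments: the function name split_username and the two regular expressions are assumptions introduced here, and the call to the Namsor gender service is deliberately left out.</p>
<p><code lang="python" xml:space="preserve"><![CDATA[
```python
import re


def split_username(username):
    """Guess (name, surname) from a raw username; surname may be ''.

    Hypothetical helper: the patterns below are illustrative assumptions.
    """
    # Replace every run of digits and symbols with a space, keep letters.
    cleaned = re.sub(r"[^A-Za-z ]+", " ", username).strip()
    parts = cleaned.split()
    if len(parts) >= 2:
        # Name and surname were separated by a space (or by a symbol).
        return parts[0], parts[1]
    if len(parts) == 1:
        # Split where a lowercase letter is followed by an uppercase one,
        # e.g. "MeganJones" becomes "Megan Jones".
        pieces = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", parts[0]).split()
        if len(pieces) >= 2:
            return pieces[0], pieces[1]
        return pieces[0], ""
    return "", ""


if __name__ == "__main__":
    for raw in ["Eldon S", "MeganJones88", "mistyrabbit"]:
        print(raw, "->", split_username(raw))
```
]]></code></p>
<p>A username such as "mistyrabbit" comes back as a single token with an empty surname, which is the kind of input for which a gender could not be derived.</p></div>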
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Features</head><p>Table <ref type="table" target="#tab_2">3</ref> recaps the hotel-centric and reviewer-centric features we used to enrich the basic dataset.</p></div>
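<div xmlns="http://www.tei-c.org/ns/1.0"><p>Popularity, Hotel triptype, and Activity are described as the result of queries to the basic dataset. A minimal Python sketch of such queries is given below; the record layout (dictionaries with the keys hotel_id, username, and triptype) and the function name derive_features are assumptions made for illustration, not the storage format actually used.</p>
<p><code lang="python" xml:space="preserve"><![CDATA[
```python
from collections import Counter, defaultdict


def derive_features(reviews):
    """Compute derived features from a list of review records.

    Each record is assumed to be a dict with the (hypothetical) keys
    'hotel_id', 'username' and 'triptype'.
    """
    # Popularity: number of reviews per hotel.
    popularity = Counter(r["hotel_id"] for r in reviews)
    # Activity: number of reviews written by each reviewer.
    activity = Counter(r["username"] for r in reviews)
    # Hotel triptype: most frequent reviewer triptype per hotel.
    per_hotel = defaultdict(Counter)
    for r in reviews:
        per_hotel[r["hotel_id"]][r["triptype"]] += 1
    hotel_triptype = {
        hotel: counts.most_common(1)[0][0]
        for hotel, counts in per_hotel.items()
    }
    return popularity, activity, hotel_triptype
```
]]></code></p>
<p>When two triptypes are equally frequent for a hotel, this sketch keeps the one encountered first; how such ties were resolved in the experiments is not stated.</p></div>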
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">ASSOCIATION ANALYSIS</head><p>Association rule mining is a well-known and widely applied methodology for discovering frequent patterns, correlations, and causal structures in transaction and relational databases, as well as in other information repositories <ref type="bibr" target="#b11">[12]</ref>. Thus, given a set of items (or itemsets), association rule mining makes it possible to define rules predicting the occurrence of one or more items, given the occurrence of other items in the same itemsets.</p><p>A popular application is basket data analysis, where itemsets are transactions, representing lists of items in the consumers' baskets. An example of a transaction is: {Bread, Steak, Juice, Butter, Chips, Beer}. When many transactions are collected, e.g., in a large database, the methodology makes it possible to automatically find associations like, e.g., {Bread} ⇒ {Steak} (steaks are often purchased with bread). Besides sales transactions, basket analysis can be applied to other situations like click stream tracking, spare parts ordering, and online recommendation engines, to name just a few<ref type="foot">7</ref>.</p><p>An association rule (AR) is generally defined as an implication expression of the form X ⇒ Y, where X and Y are disjoint itemsets. They represent, respectively, the condition and the consequence of the rule.</p><p>The strength of an AR is commonly measured through two metrics: support and confidence. Support gives the fraction of itemsets in the dataset that contain both X and Y. Confidence says how frequently items in Y appear in itemsets that contain X. As an example, suppose we want to know the strength of the rule {Bread} ⇒ {Steak} in a dataset with 100 transactions, corresponding to 100 consumers' baskets. If itemset {Bread, Steak} occurs 30 times and itemset {Bread} occurs 40 times, then the support of the rule is equal to 30/100 = 0.3, while its confidence is 30/40 = 0.75. 
As discussed in <ref type="bibr" target="#b2">[3]</ref>, rules with high values for confidence and support do not always correspond to meaningful ARs, especially when working with real datasets, since data can be unbalanced.<note place="foot" n="7">http://pbpython.com/market-basket-analysis.html</note> For example, a rule could have a very high confidence only because the item in the consequence is very frequent. In this case, the rule is not relevant. Conversely, a rule could have a low confidence because the item in the consequence is very infrequent in general, but it could still be relevant. Considering the above observation, to evaluate the statistical significance of the ARs, two other metrics are often used: lift and conviction.</p><p>Lift is defined as the confidence divided by the support of the consequence:</p><formula xml:id="formula_0">\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cap Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}<label>(1)</label></formula><p>With respect to confidence, the lift measures the importance of the association by also taking into account the support of the consequence.</p><p>Conviction is defined as the ratio of the frequency of itemsets that do not contain the consequence to the frequency of incorrect predictions:</p><formula xml:id="formula_1">\mathrm{conv}(X \Rightarrow Y) = \frac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)}<label>(2)</label></formula><p>For both lift and conviction, values in the interval (0, 1) mean negative dependence, values above 1 mean positive dependence, and a value of 1 means independence.</p><p>When items are also divided according to different classes, it is possible to force the AR analysis to return a specific class in the consequence. The obtained rule is called a "class association rule" (CAR). A CAR is an implication of the form X ⇒ y, where X ⊆ I and y ∈ Y,</p><p>where I stands for the itemsets and Y for the classes. The definition of the aforementioned metrics also holds for CARs. 
The Apriori algorithm <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b15">16]</ref> is one of the most popular algorithms to find frequent itemsets, i.e., itemsets whose support is at least a minimum threshold minsup.</p><p>In this work, we apply association rule mining to the hotel reviews scenario. Each itemset corresponds to a distinct review, and it is a vector whose components are the values of the features extracted and detailed in Section 2. The same features are reported in Tables 4, <ref type="table" target="#tab_3">5</ref>, and <ref type="table" target="#tab_4">6</ref> for the reader's convenience, together with additional information that is useful here. CAR analysis can be applied when also considering the class, which in our scenario corresponds to the review score, a discrete value ranging between 1 and 5.</p><p>To enable the application of the Apriori algorithm, we have first discretised those features that natively ranged over a large set of values. As an example, in Table <ref type="table" target="#tab_3">5</ref>, a very low label for Guest Pictures indicates a hotel with a number of pictures between 0 and 11. Still in that table, a medium label for Popularity means a hotel that has been reviewed n times, where n ranges over [433, 1156]. The values in Table <ref type="table" target="#tab_4">6</ref> should be read as follows: looking at the first line of the "Geo Food" part of the table, our review set contains 37,851 reviews about hotels that have a number of restaurants in the range [0, 37] within a radius of 300 m. 
Indeed, many reviews refer to the same hotel, since the number of reviewed hotels is 4,019 (see Section 2).</p><p>All the tables also report the Frequency, i.e., how many reviews correspond to each value of each feature (for each feature, the values in the Frequency column sum to the total number of reviews considered, 189,304). In order to find ARs and CARs, we applied the Apriori implementation of the Weka framework <ref type="bibr" target="#b10">[11]</ref>. This implementation makes it possible to rank the rules according to different metrics. Among them, we rely on confidence, lift, and conviction. For AR analysis, we generate a large number of rules with lift above 1. For CAR analysis, we generate a large number of rules with confidence above 0.2 and then we compute the lift (since, for CARs, Weka does not natively include the ranking based on lift). We finally select the rules with lift greater than 1.</p></div>
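<div xmlns="http://www.tei-c.org/ns/1.0"><p>The four metrics used in this section can be computed directly from occurrence counts. The sketch below is a minimal Python illustration under stated assumptions: the function name rule_metrics is introduced here, and the count of 50 baskets containing Steak is a made-up value, since the bread and steak example in the text only gives the counts for {Bread} and {Bread, Steak}.</p>
<p><code lang="python" xml:space="preserve"><![CDATA[
```python
def rule_metrics(n_total, n_x, n_y, n_xy):
    """Support, confidence, lift and conviction of a rule X => Y.

    n_total: number of itemsets in the dataset
    n_x:     itemsets containing X
    n_y:     itemsets containing Y
    n_xy:    itemsets containing both X and Y
    """
    support = n_xy / n_total
    confidence = n_xy / n_x
    supp_y = n_y / n_total
    # Lift: confidence divided by the support of the consequence.
    lift = confidence / supp_y
    # Conviction is unbounded when the rule never fails.
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - supp_y) / (1 - confidence)
    return support, confidence, lift, conviction


if __name__ == "__main__":
    # 100 baskets, 40 with Bread, 30 with Bread and Steak (from the text);
    # 50 baskets with Steak is an assumed value for illustration only.
    print(rule_metrics(n_total=100, n_x=40, n_y=50, n_xy=30))
```
]]></code></p>
<p>With the assumed count for Steak, both lift and conviction come out above 1, i.e., a positive dependence between the two items.</p></div>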
<div xmlns="http://www.tei-c.org/ns/1.0"><p>For both the generated ARs and CARs, we then manually select the most interesting rules among those with the highest lift and conviction. Table <ref type="table" target="#tab_5">7</ref> and Table <ref type="table" target="#tab_6">8</ref> report an excerpt of the results for both scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Discussion</head><p>Association analysis results are reported in Table <ref type="table" target="#tab_5">7</ref> and Table <ref type="table" target="#tab_6">8</ref>. Please notice that we only consider those rules that lead to a lift and conviction greater than 1. It is worth noting that |X ∩ Y|, when divided by the size of the dataset, corresponds to the support of the given rule.</p><p>We summarise the main findings as follows. Rule r1 states that those reviewers featuring a very low activity, considering their stay in France, very often select hotels with a low number of transportation means in the neighbourhood. The rule holds for 19,199 reviews, out of a total of 29,837 reviews with the same premises. Rule r2 says that males visiting the US prefer hotels with a very high popularity. Rule r7 says that, when the hotel has few transportation means in the neighbourhood and the number of stars for that hotel is unknown (this may correspond to accommodation facilities like hostels), its rating is equal to 3. Rule r10 states that Japanese people staying in 3-star hotels rate those hotels with a score equal to 4. Rule r14 in Table <ref type="table" target="#tab_6">8</ref> states that hotels close to entertainment venues, which account for 37,998 reviews, are scored with the top score 5 about 50% of the time.</p><p>This kind of study provides a general approach for a preliminary data exploration. While the explanation for certain rules is very intuitive, a well-grounded justification for others is left to experts in the field. We argue that this kind of analysis corresponds to a preliminary step, useful for suggesting which extra features could be exploited to build an enhanced hotel recommendation system. Also, we acknowledge that the analysis is based on the available (direct or indirect) information, obtained from the Tripadvisor website. More detailed features could consider elements like price or number of guests. 
This would make it possible to obtain other interesting rules, which, however, remain an exclusive prerogative of the hoteliers.</p></div>
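<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a sanity check, the figures quoted for rules r1 and r14 can be recomputed from the counts |X| and |X ∩ Y| reported in the tables and from the dataset size of 189,304 reviews. The Python sketch below is illustrative only: the function name check_rule is an assumption, and the support of the consequence is not reported in the tables, so it is inferred here from the rounded lift, which makes the resulting conviction approximate.</p>
<p><code lang="python" xml:space="preserve"><![CDATA[
```python
N_REVIEWS = 189_304  # size of the basic dataset


def check_rule(n_x, n_xy, lift):
    """Recompute support, confidence and conviction of a reported rule.

    n_x and n_xy are the reported |X| and |X ∩ Y|; lift is the reported
    (rounded) lift, used only to infer the support of the consequence.
    """
    confidence = n_xy / n_x
    support = n_xy / N_REVIEWS
    # Inferred value: lift = confidence / supp(Y).
    supp_y = confidence / lift
    conviction = (1 - supp_y) / (1 - confidence)
    return support, confidence, conviction


if __name__ == "__main__":
    # Rule r1: |X| = 29,837, |X ∩ Y| = 19,199, lift = 3.2
    print("r1", check_rule(29_837, 19_199, 3.2))
    # Rule r14: |X| = 37,998, |X ∩ Y| = 19,120, lift = 1.13
    print("r14", check_rule(37_998, 19_120, 1.13))
```
]]></code></p>
<p>For both rules the recomputed confidence and conviction agree, after rounding, with the reported values (0.64 and 2.24 for r1, 0.5 and 1.12 for r14).</p></div>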
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">RELATED WORK</head><p>E-advice technology offers a form of "electronic word-of-mouth", with new potential for gathering valid suggestions that guide the consumer's choice. Extensive and nationally representative surveys have been carried out in the recent past, "to evaluate the specific aspects of ratings information that affect people attitudes toward e-commerce". This is the case, e.g., of the work in <ref type="bibr" target="#b9">[10]</ref>, which highlights how people, while taking into account the average of ratings for a product, still do not consider the number of reviews leading to that average. Recent work showed that, instead of first showing users the reviews with the highest scores, a different order, based, e.g., on the user profile, could be considered <ref type="bibr" target="#b7">[8]</ref>: that work integrates new features based on the user profile into recommender systems, to deliver better recommendations and provide an improved user experience. Similarly, in <ref type="bibr" target="#b18">[19]</ref>, the authors focus on score values given by previous contributors whose preferences are close to the user's preferences. Almost a decade ago, the work in <ref type="bibr" target="#b0">[1]</ref> already applied text mining tools to online reviews to define rule sets that identify contextual information in the texts, going beyond a mere ordering of numerical scores. Similarly to our work, they rely on Tripadvisor, focusing however on text analysis only.</p><p>However, the cited literature proposes systems that recommend a service based on the intrinsic characteristics of that service (e.g., characteristics of the hotel and its facilities). Other works, similar to ours, investigate if, and how, the review data hide social and/or economic information about the reviewers. 
One example is mining reviews to exploit them as a textual resource for large-scale sociolinguistic studies, as done in <ref type="bibr" target="#b12">[13]</ref>. This work leverages the size of the review corpus as a more statistically solid base for the analysis, with respect to manually collected corpora. Since review sites, such as Trustpilot<note place="foot" n="8">https://www.trustpilot.com/</note>, may contain reviewer metadata like, e.g., age, gender, and location, the work highlights gender-specific lexical differences, the distribution of regional markers, spelling variations, and the use of grammatical constructions across the reviewers.</p><p>The work in <ref type="bibr" target="#b16">[17]</ref>, which focuses on review manipulation, exploits reviewer-centric and hotel-centric features to identify outliers: it compares hotel reviews and related features across different review sites, improving the detection of suspicious hotels with respect to checking the reviews on each site in isolation. Relying on visualization tools, the authors of <ref type="bibr" target="#b5">[6]</ref> highlight suspicious changes in review scores, while the work in <ref type="bibr" target="#b6">[7]</ref> proposes new score aggregators to make review systems robust with respect to the injection of fake scores.</p><p>Research effort has also been spent to understand which factors make a review be perceived as useful: in <ref type="bibr" target="#b14">[15]</ref>, the authors highlight how the reviewer history is a dominant factor in whether a review is voted as useful or not. In <ref type="bibr" target="#b13">[14]</ref>, the authors propose to use the reviews as a source for demographic recommendations.</p><p>In this work we enhance the review dataset with additional features based on characteristics of the reviewer (e.g., gender) and the hotel (e.g., popularity and the neighbourhood). 
In contrast, the work in <ref type="bibr" target="#b17">[18]</ref> studies how, independently of the type of service or the type of reviewer, the scores may be affected by external factors, such as the weather conditions and the daylight length of the service cities. We leverage an extensive experimental campaign, addressing around 190k real reviews, which yields statistically sound results. Large-scale data have also been addressed in <ref type="bibr" target="#b12">[13]</ref>, which already targeted users' reviews as a rich source of information for sociolinguistic studies. While they derive correlations between metadata in the reviewers' profiles and the review text to let writing styles emerge, we highlight association evidence among hotel and reviewer features and the reviewer's attitude when scoring the hotel.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSIONS</head><p>We focused on hotel reviews to investigate which factors could impact the scores that reviewers assign to hotels throughout the world. First, we enriched review data with novel hotel-centric and reviewer-centric features, obtained for example through linked data information available from the web; then, we applied association rule mining to focus on the features that possibly motivate the scores.</p><p>The approach can help both consumers and providers: the former could achieve a better awareness of how to read the reviews, the latter of how to improve their services. The providers can also query a very large segment of the population, in an automatic way and without relying on standard interviews.</p><p>The proposed technique is also applicable to a wide range of services: accommodation, car rental, and food services, to cite a few. Since association rule mining is parametric with respect to the itemsets in input, the approach is easily extensible to further features not considered here, such as, e.g., the service price.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">ACKNOWLEDGMENTS</head><p>This research is partly supported by the EU H2020 Program, grant agreement #675320 (NECS: European Network of Excellence in Cybersecurity). Funding has also been received by Fondazione Cassa di Risparmio di Lucca that partially finances the regional project ReviewLand. Vittoria Cozza is also supported by the Starting Grants Project DAKKAR (DAta benchmarK for Keyword-based Access and Retrieval) promoted by University of Padua, Italy and Fondazione Cariparo, Padua, Italy. The first author would like to thank Giorgio Maria Di Nunzio, for his helpful support.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Considered information in the basic dataset</figDesc><table><row><cell>Review</cell><cell>Hotel</cell></row><row><cell>Date</cell><cell>Country</cell></row><row><cell>Text</cell><cell>Name</cell></row><row><cell>Score</cell><cell>Street address</cell></row><row><cell>Reviewer username</cell><cell>City</cell></row><row><cell>Reviewer location</cell><cell>Guest pictures</cell></row><row><cell>Triptype</cell><cell/></row><row><cell>Hotel ID</cell><cell/></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Distribution of the given scores in the dataset</figDesc><table><row><cell>Value</cell><cell>Occurrences</cell></row><row><cell>1</cell><cell>6,504</cell></row><row><cell>2</cell><cell>8,826</cell></row><row><cell>3</cell><cell>24,627</cell></row><row><cell>4</cell><cell>64,949</cell></row><row><cell>5</cell><cell>84,398</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Hotel-centric and reviewer-centric features augmenting the basic Tripadvisor dataset</figDesc><table><row><cell>Hotel</cell><cell>Reviewer</cell></row><row><cell>Popularity</cell><cell>Activity</cell></row><row><cell>Hotel Triptype</cell><cell>Gender</cell></row><row><cell>Geospatial Coordinates</cell><cell/></row><row><cell>Points of Interest</cell><cell/></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5 :</head><label>5</label><figDesc>Discretised features on hotels</figDesc><table><row><cell>Gender</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6 :</head><label>6</label><figDesc>Discretised geolocation-based features</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7 :</head><label>7</label><figDesc>Excerpt of ARs where user features are the premises and the features of the selected hotel are the consequences; results are sorted by decreasing lift</figDesc><table><row><cell>Rule</cell><cell>Condition</cell><cell>Confidence</cell><cell>|X|</cell><cell>|X∩Y|</cell><cell>Lift</cell><cell>Conviction</cell></row><row><cell>r1</cell><cell>{memberActivity=1 country=fr ==&gt; geotransp=low}</cell><cell>0.64</cell><cell>29,837</cell><cell>19,199</cell><cell>3.2</cell><cell>2.24</cell></row><row><cell>r2</cell><cell>{gender=male country=us ==&gt; hotelPopularity=very high}</cell><cell>0.59</cell><cell>44,703</cell><cell>26,505</cell><cell>1.78</cell><cell>1.64</cell></row><row><cell>r3</cell><cell>{memberActivity=1 country=us ==&gt; guestPics=very high}</cell><cell>0.34</cell><cell>61,155</cell><cell>20,926</cell><cell>1.71</cell><cell>1.22</cell></row><row><cell>r4</cell><cell>{memberActivity=1 country=us ==&gt; geoenter=high}</cell><cell>0.54</cell><cell>61,155</cell><cell>20,4316</cell><cell>1.68</cell><cell>1.48</cell></row><row><cell>r5</cell><cell>{gender=male country=us ==&gt; hotelTripType=couple}</cell><cell>0.76</cell><cell>44,703</cell><cell>34,192</cell><cell>1.04</cell><cell>1.11</cell></row><row><cell>r6</cell><cell>{memberActivity=very low revtripType=family ==&gt; hotelTripType=couple}</cell><cell>0.74</cell><cell>27,343</cell><cell>20,362</cell><cell>1.01</cell><cell>1.02</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 8 :</head><label>8</label><figDesc>Excerpt of CARs, where the class is the review rating; results are sorted by decreasing lift</figDesc><table><row><cell>Rule</cell><cell>Condition</cell><cell>Confidence</cell><cell>|X|</cell><cell>|X∩Y|</cell><cell>Lift</cell><cell>Conviction</cell></row><row><cell>r7</cell><cell>{stars=0 country=None guestPics=very low geotransp=low ==&gt; rating=3}</cell><cell>0.25</cell><cell>9,007</cell><cell>2,214</cell><cell>1.89</cell><cell>1.15</cell></row><row><cell>r8</cell><cell>{stars=5 hotelPopularity=medium geofood=very high ==&gt; rating=5}</cell><cell>0.76</cell><cell>2,582</cell><cell>1,962</cell><cell>1.70</cell><cell>2.31</cell></row><row><cell>r9</cell><cell>{memberActivity=very low gender=female guestPics=very high hotelTripType=couple geoenter=very high ==&gt; rating=5}</cell><cell>0.7</cell><cell>2,744</cell><cell>1,918</cell><cell>1.57</cell><cell>1.84</cell></row><row><cell>r10</cell><cell>{stars=3 country=jp ==&gt; rating=4}</cell><cell>0.47</cell><cell>5,265</cell><cell>2,492</cell><cell>1.38</cell><cell>1.25</cell></row><row><cell>r11</cell><cell>{memberActivity star=3 guestPics=low ==&gt; rating=4}</cell><cell>0.46</cell><cell>4,326</cell><cell>1,998</cell><cell>1.35</cell><cell>1.22</cell></row><row><cell>r12</cell><cell>{star=3 geofood=very high ==&gt; rating=4}</cell><cell>0.44</cell><cell>4,312</cell><cell>1,901</cell><cell>1.28</cell><cell>1.17</cell></row><row><cell>r13</cell><cell>{country=jp hotelTripType=business ==&gt; rating=4}</cell><cell>0.44</cell><cell>4,483</cell><cell>1,954</cell><cell>1.27</cell><cell>1.16</cell></row><row><cell>r14</cell><cell>{geoenter=very high ==&gt; rating=5}</cell><cell>0.5</cell><cell>37,998</cell><cell>19,120</cell><cell>1.13</cell><cell>1.12</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.comscore.com/Products/Audience-Analytics/Media-Metrix - All sites last accessed December 23, 2017.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.comscore.com/Insights/Rankings - Statistics updated to June 2017.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot">© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www.tripadvisor.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://developers.google.com/places</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://maps.googleapis.com/maps/api/place/radarsearch</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://api.namsor.com/onomastics/api</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Mining context information from consumer&apos;s Reviews</title>
		<author>
			<persName><forename type="first">Silvana</forename><surname>Aciar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Context-Aware Recommender Systems (CARS) Workshop</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Fast Algorithms for Mining Association Rules in Large Databases</title>
		<author>
			<persName><forename type="first">Rakesh</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ramakrishnan</forename><surname>Srikant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th International Conference on Very Large Data Bases (VLDB &apos;94)</title>
				<meeting>the 20th International Conference on Very Large Data Bases (VLDB &apos;94)<address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="487" to="499" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Dynamic Itemset Counting and Implication Rules for Market Basket Data</title>
		<author>
			<persName><forename type="first">Sergey</forename><surname>Brin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rajeev</forename><surname>Motwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><forename type="middle">D</forename><surname>Ullman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shalom</forename><surname>Tsur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD &apos;97)</title>
				<meeting>the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD &apos;97)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="255" to="264" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Onomastics and Big Data Mining</title>
		<author>
			<persName><forename type="first">Elian</forename><surname>Carsenat</surname></persName>
		</author>
		<idno>CoRR abs/1310.6311</idno>
		<ptr target="http://arxiv.org/abs/1310.6311" />
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Social Network Data and Practices: The Case of Friendfeed</title>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Celli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">Marta L</forename><surname>Di Lascio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matteo</forename><surname>Magnani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barbara</forename><surname>Pacelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luca</forename><surname>Rossi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Social Computing. LNCS</title>
				<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="volume">6007</biblScope>
			<biblScope unit="page" from="346" to="353" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Visual detection of singularities in review platforms</title>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Colantonio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Di Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marinella</forename><surname>Petrocchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Angelo</forename><surname>Spognardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">30th Annual ACM Symposium on Applied Computing</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1294" to="1295" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A Lot of Slots - Outliers Confinement in Review-Based Systems</title>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Di Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marinella</forename><surname>Petrocchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Angelo</forename><surname>Spognardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Web Information Systems Engineering Part I</title>
		<imprint>
			<biblScope unit="page" from="15" to="30" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">From More-Like-This to Better-Than-This: Hotel Recommendations from User Generated Reviews</title>
		<author>
			<persName><forename type="first">Ruihai</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barry</forename><surname>Smyth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization (UMAP &apos;16)</title>
				<meeting>the 2016 Conference on User Modeling Adaptation and Personalization (UMAP &apos;16)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="309" to="310" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A Study on Text-Score Disagreement in Online Reviews</title>
		<author>
			<persName><forename type="first">Michela</forename><surname>Fazzolari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vittoria</forename><surname>Cozza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marinella</forename><surname>Petrocchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Angelo</forename><surname>Spognardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cognitive Computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="689" to="701" />
			<date type="published" when="2017-10-01">01 Oct 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Mitigating risk in e-commerce transactions: perceptions of information credibility and the role of user-generated ratings in product quality and purchase intention</title>
		<author>
			<persName><forename type="first">Andrew</forename><forename type="middle">J</forename><surname>Flanagin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miriam</forename><forename type="middle">J</forename><surname>Metzger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rebekah</forename><surname>Pure</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ethan</forename><surname>Hartsell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronic Commerce Research</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="23" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The WEKA data mining software: an update</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eibe</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bernhard</forename><surname>Pfahringer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Reutemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ian</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGKDD Explorations Newsletter</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="10" to="18" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Algorithms for Association Rule Mining: a General Survey and Comparison</title>
		<author>
			<persName><forename type="first">Jochen</forename><surname>Hipp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ulrich</forename><surname>Güntzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gholamreza</forename><surname>Nakhaeizadeh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explor. Newsl</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="58" to="64" />
			<date type="published" when="2000-06">June 2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">User Review Sites As a Resource for Large-Scale Sociolinguistic Studies</title>
		<author>
			<persName><forename type="first">Dirk</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anders</forename><surname>Johannsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anders</forename><surname>Søgaard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">24th International Conference on World Wide Web (WWW &apos;15)</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="452" to="461" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Using online consumer reviews as a source for demographic recommendations: A case study using online travel reviews</title>
		<author>
			<persName><forename type="first">Nikolaos</forename><surname>Korfiatis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marios</forename><surname>Poulos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="page" from="5507" to="5515" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The Social Aspect of Voting for Useful Reviews</title>
		<author>
			<persName><forename type="first">Asher</forename><surname>Levi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Osnat</forename><surname>Mokryn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Social Computing, Behavioral-Cultural Modeling and Prediction</title>
				<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">8393</biblScope>
			<biblScope unit="page" from="293" to="300" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Integrating Classification and Association Rule Mining</title>
		<author>
			<persName><forename type="first">Bing</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wynne</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yiming</forename><surname>Ma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">KDD</title>
		<imprint>
			<biblScope unit="page" from="80" to="86" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">TrueView: Harnessing the Power of Multiple Review Sites</title>
		<author>
			<persName><forename type="first">Amanda</forename><forename type="middle">J</forename><surname>Minnich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikan</forename><surname>Chavoshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abdullah</forename><surname>Mueen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shuang</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michalis</forename><surname>Faloutsos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">24th International Conference on World Wide Web (WWW &apos;15)</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="787" to="797" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Determinants of User Ratings in Online Business Rating Services</title>
		<author>
			<persName><forename type="first">Syeda</forename><surname>Rahman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tazin</forename><surname>Afrin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Don</forename><surname>Adjeroh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Social Computing, Behavioral-Cultural Modeling, and Prediction. LNCS</title>
				<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">9021</biblScope>
			<biblScope unit="page" from="412" to="420" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A Hotel Recommendation System Based on Reviews: What Do You Attach Importance To?</title>
		<author>
			<persName><forename type="first">Koji</forename><surname>Takuma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Junya</forename><surname>Yamamoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sayaka</forename><surname>Kamei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Satoshi</forename><surname>Fujita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Fourth International Symposium on Computing and Networking, CANDAR 2016</title>
				<meeting><address><addrLine>Hiroshima, Japan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-11-22">November 22-25, 2016</date>
			<biblScope unit="page" from="710" to="712" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
