-

On Finding the Relevant User Reviews for Advancing Conversational Faceted Search

Eleftherios Dimitrakis

0 1

Konstantinos Sgontzos

0 1

Panagiotis Papadakos

Yannis Marketakis

Alexandros Papangelis

Yannis Stylianou

0 2

Yannis Tzitzikas

0 1 0 Computer Science Department - University of Crete , Greece 1 Institute of Computer Science - FORTH-ICS , Greece 2 Speech Technology Group - Toshiba Research Europe

22 31

Faceted Search (FS) is a widely used exploratory search paradigm which is commonly applied over multidimensional or graph data. However sometimes the structured data are not su cient for answering a user's query. User comments (or reviews) is a valuable source of information that could be exploited in such cases for aiding the user to explore the information space and to decide what options suits him/her better (either through question answering or query-oriented sentiment analysis). To this end in this paper we introduce and comparatively evaluate methods for locating the more relevant user comments that are related with the user's focus in the context of a conversational faceted search system. Speci cally we introduce a dictionary-based method, a word embedding-based method, and one combination of them. The analysis and the experimental results showed that the combined method outperforms the other methods, without signi cantly a ecting the overall response time.

Faceted Search (FS) is a widely used exploratory search paradigm. It is used whenever the user wants to nd the desired item from a list of items (either products, hotels, restaurants, publications, etc). Typically FS o ers exploratory search over multidimensional or graph data. However sometimes the structured data are not enough for answering a user's query. User comments (or reviews) is a valuable source of information that could be exploited in such cases for aiding the user to explore the information space and to decide what options suits him/her better. Indeed, user comments/reviews are available in various applications of faceted search, e.g. for hotel booking and in product catalogs.

Enabling the interaction of FS though spoken dialogue, is appropriate for situations where the user cannot (or is not convenient to) use his hands or eyes. In such cases, the user interacts using his voice and provides commands or poses questions. If a question cannot be translated to a query over the structured resources of the dataset, then the system cannot deliver any answer. In such cases it is reasonable to resort to the available unstructured data, i.e. to users' comments and reviews. Figure 1 illustrates the context. The objective is not to provide the user with a direct answer but rst to identify which of the user comments are relevant to the user's question. Direct query answering is reasonable only in cases where, there is a single and credible source of unstructured data (e.g. wikipedia). This is not the case with user comments since they can be numerous, and their content can be con icting. If we manage to nd the relevant comments, then the system could either read these comments to the user, or attempt to apply question answering if the user requests so, or any other kind of analysis, e.g. sentiment analysis as in [ 2, 14 ]. In any case spoken dialogue interaction, poses increased requirements on quality, since the system should not "read" irrelevant comments as reading costs user time.

Multidimensional

or graph data User comments

and reviews

Data Resources

Preference-enriched Faceted Search Finding Related

Comments

Mapping to Spoken Dialogue

Interaction System User

Note that instead of analyzing the user comments for estimating whether a hotel is good or bad as a whole, the interaction that we propose enables the user to get information about the particular aspects or topics that are important for him, e.g. about noise, cleanliness, the quality of the wi , parking, courtesy and helpfulness of sta , etc. The set of such topics is practically endless and we cannot make the assumption that structured data will exist for all such topics. Therefore, it is bene cial to have systems that are able to exploit associated unstructured data, e.g. user comments and reviews. The problem is challenging because user comments are usually short, meaning that it is hard to achieve an acceptable level of recall. In this paper we focus on this problem, and we introduce methods relying on hand crafted and statistical dictionaries for identifying the relevant comments. In addition we describe an evaluation collection that we have created for comparatively evaluating the introduced methods, as well as an ongoing application and evaluation over a bigger and real dataset. In a nutshell, the key contributions of this paper are: (a) we show how the FS interaction can be extended for exploiting unstructured data in the form of user comments and reviews, and (b) we introduce and comparatively evaluate four methods for identifying the more relevant user comments in datasets related to the task of hotel booking. The rest of this paper is organized as follows: Section 2 presents the required background and related work. Section 3 describes the proposed methods. Section 4 reports experimental results. Finally, Section 5 concludes the paper and discusses directions for future research and work.

Background and Related Work

2.1

Background: Faceted Search and PFS

Faceted search is the de-facto standard in e-commerce and tourism services. It is an interaction framework based on a multi-dimensional classi cation of data objects, allowing users to browse and explore the information space in a guided, yet unconstrained way through a simple visual interface [ 15 ]. Features of this framework include: (a) display of current results in multiple categorization schemes (called facets, or dimensions, or just attributes), (b) display of facets and values leading to non-empty results only, (c) display of the count information for each value (i.e. the number of results the user will get by selecting that value), and (d) ability to re ne the focus gradually, i.e. it is a session-based interaction paradigm in contrast to the stateless query-and-response dialogue of most search systems. Faceted search is currently the de facto standard in e-commerce (e.g. eBay, booking.com), and its popularity and adoption is increasing. It has been proposed and applied for web searching, for semantically enriching web search results, for patent-search, as well as for exploring RDF and Linked Data (e.g. see [ 4, 16 ], as well as [ 19 ] for a recent survey). The enrichment of faceted search with preferences, hereafter Preference-enriched Faceted Search, for short PFS, was proposed in [ 12,20 ]. PFS o ers actions that allow the user to order facets, values, and objects using best, worst, prefer to actions (i.e. relative preferences), around to actions (over a speci c value), or actions that order them lexicographically, or based on their values or count values. Furthermore, the user is able to compose object related preference actions, using Priority, Pareto, Pareto Optimal (i.e. skyline) and other. The distinctive features of PFS is that it allows expressing preferences over attributes, whose values can be hierarchically organized (and/or multi-valued), it support preference inheritance, and it o ers scope-based rules for resolving automatically the con icts that may arise. As a result the user is able to restrict his current focus by using the faceted interaction scheme (hard restrictions) that lead to non-empty results, and rank the objects of his focus according to the expressed preferences. Recently, PFS has been used in various domains, e.g. for o ering a exible process for the identi cation of sh species [ 17 ], as a Voting Advice Application [ 18 ], as well as, for data that contain also geographical information [ 6 ]. 2.2

Related Works

Conversational Faceted Search Only a few works exist that involve speech interfaces on top of the faceted search paradigm: [ 3 ] exploits a speech user interface over facets that index audio metadata associated with audio content (that system is used for the Spoken Web, an alternative to WWW based on audio content, and the associated Mediaeval Spoken Web Search Task), while a faceted browser over Linked Data is described in [ 7 ], where commands in natural language are translated to SPARQL queries. To the best of our knowledge though, the only work that combines spoken dialogue systems with faceted search is the one presented in [ 13 ], where the described LD-SDS system is limited to spoken dialogues over structured datasets (expressed in RDF). In this work we extend conversational faceted search for exploiting also available unstructured data (e.g. user reviews). Note that, tackling the same problem using only a single largescale source of unstructured data, e.g. Wikipedia (as described in [ 1 ]), is much easier since in that case we do not have the source selection problem (selection of user comments in our case), and the source contains many and long texts, therefore it is not di cult to achieve a good recall level.

Similar Tasks Two similar tasks, as regards the text size, from the area of Question Answering are: (1) Machine Comprehension (MC) which aims at identifying the answer boundaries from a given text passage and an input question (e.g. [ 1 ] performs MC over Wikipedia), and (2) Answer Sentence Selection which aims at identifying the right sentence from a list of candidate sentences, given an input question (e.g. as in [ 21 ]). 3

The Proposed Approach

In x3.1 we describe an extension of the interaction of PFS for exploiting also associated unstructured data, and in x3.2 we focus on the problem of nding the relevant comments.

3.1 The Interaction

The user interacts with the system using actions corresponding to PFS actions, i.e. actions that correspond either to hard constraints (i.e. lters), or soft constraints (i.e. preferences). We shall use the term ofocus to refer to the restricted set of objects (those after applying all lters), and pfocus to refer to the rst bucket of the focus, that contains the more preferred objects. If the cardinality of either of the above sets is below a con gurable threshold (say 10), then if the user's questions cannot be answered by the structured dataset, the system resorts to the user comments for this. Note that if at some point in the interaction, the user's focus is big (i.e. min(jof ocusj; jof ocusj) > ) and the user asks a question that cannot be answered by the structured dataset, then the system suggests the user to " rst re ne the focus" in the sense that it is not useful to ask questions of the form \quiet hotel in Rome", or \hotels with fast wi in London". In other words, we could say that the system enters this mode in the so-called \End Game" phase of faceted search [ 15 ]. This choice has several bene ts: (a) Applicability: It can applied without requiring the comments to be indexed a priori, and this enables the application of this model over RSS feeds and blog comment hosting services (e.g. Disqus). (b) E ciency: Since the analysis will be done only for the comments of the hotels in the focus, it is feasible to make this analysis at real time. (c) Less Noise, Better Quality: For the same reason, as in (b), the quality of the retrieved comments is expected to be higher in comparison to the quality of retrieval over the entire set of comments (of all hotels).

Finding the Relevant Comments

We shall use a scoring function for estimating the relevance between an input question q and each user review ri, where 1 i . Below we introduce four scoring methods: (I) a Baseline, (II) a WordNet-based, (III) a Word2vec-based, and (IV) a combination of (II) and (III).

WordNet [ 11 ] is lexical database for the English language comprising 166,000 (f; s) pairs, where f is a word-form and s the set of words that have the same sense, that also includes relations between words and senses (like Synonymy, Antonymy, Hypernymy etc.). Word2Vec [ 10 ] is a method for transforming individual words into vectors of low dimensionality (it is low in comparison a jwordsj-dimensionality), e.g. 300, so that their distances reveal their semantic association (these representations are derived by training a two-layer neural network). The motivation for the selection of the above methods is their ability to capture semantically relevant reviews beyond the trivial task of exact string matching, and their rich and domain-agnostic vocabulary.

The process for identifying the more relevant comments, in any of the four I-IV methods, consists of the following steps: 1) For each review ri we split its text into individual sentences and get a set of sentences rij , where 1 j s and s is the number of sentences in each review. In this way, we can score the reviews based on the maximal scored sentence. 2) Apply tokenization, removal of stop-words and punctuations, as well as lemmatization (using Stanford CoreNLP [ 8 ]) both to the input question q and each associated sentence rij of ri. Let denote the result by q words and rij words respectively. 3) Construct the method-related representation of q and each rij (it will be described below). 4) Score and rank each review based on the de ned relevancy formula.

Below we describe the representation and the scoring formula for each method. I) Baseline: Here we just compute the maximum Jaccard Similarity between the q words and the corresponding rij words sets: S(q; ri) = max8rij2ri J accardSim(q words; rij words) II) WordNet: In this method we construct WordNet-based representations for the q and rij sets. Speci cally, for each word in q words and rij words we take the union of the synonyms, antonyms and hypernyms, denoted by wordN et(q) and wordN et(rij ) respectively, as extracted from the WordNet. The nal score is de ned again using the maximum Jaccard Similarity as: S(q; ri) = max8rij2ri W N S(q; rij ), where W N S(q; rij ) = J accardSim(wordN et(q); wordN et(rij )). III) Word2vec: This method exploits the word2vec embeddings available in the GoogleNews 300-dimensional pre-trained model4. Speci cally, we get the 4 https://code.google.com/archive/p/word2vec/ word2vec vector representations of all words in q words and rij words, denoted by word2vec(q) and word2vec(rij) respectively. Then we apply the Word Movers Distance (WMD) [ 5 ] which calculates the minimum distance (in the vector space) between the embedded words of the two sets. The score is de ned as: S(q; ri) = max8rij2ri W M S(q; rij), where W M S(q; rij) = 1 W M Dn(word2vec(q); word2vec(rij)) and W M Dn denotes the normalized distance calculated by the division with the max WMD over all comments.

IV) WordNet and Word2vec: Here we combine the two previous methods through a weighted sum, reaching to the following de nition of score: S(q; ri) = wwN max W N S(q; rij) + ww2v 8rij2ri max W M S(q; rij) 8rij2ri where wwN ; ww2v 2 [0; 1] and wwN + ww2v = 1. 4 4.1

Evaluation Evaluation over the Collection FRUCE

We constructed a small evaluation collection in order to compare the presented methods. The collection consists of 40 hand crafted user reviews/comments related to hotels (c1; :::; c40) and 2 manually crafted queries (q1 and q2) related to the topic of noise. The complete list of comments is web accessible5 and the queries are the following: q1 = \Has anyone reported a problem about noise?", q2 = \Is this hotel quiet?".

For the needs of the evaluation we manually judged the relevance of the collection's reviews to each query. Speci cally, each review ci is labeled with 1 if it is relevant, and with 0 otherwise. The relevant/irrelevant ratio in the collection is 1=3.

Quality. We measured the mean R P recision and mean AveP over the two queries q1 and q2 for all methods. Speci cally, for the IV method we computed various weights combinations and chose the model that achieved the highest mean AveP . Note that methods II and III correspond to the pairs (wwN = 1:0; ww2v = 0:0) and (wwN = 0:0; ww2v = 1:0) respectively. In our case the maximizing weights were found to be wwN = 0:7 and ww2v = 0:3 with mean AveP = 0:569 and mean R P recision = 0:649. The corresponding scores for method II were mean AveP = 0:398 and mean R P recision = 0:449, while method III achieved mean AveP = 0:366 and mean R P recision = 0:4 (the precision of Word2vec-based methods in analogous challenges [ 9 ] is around 55%, i.e. similar to what we measured in our setting). Finally, IV outperforms both II, III, while II slightly outpoints III. As expected, all of the above models outperformed our baseline (mean AveP = 0:05 and mean R P recision = 0:05) as shown in Table 1. We have to stress though that the results of methods II 5 at http://www.ics.forth.gr/isl/sar/resources/dataset/fruce and IV could be further improved by combining other thesaurus with WordNet or an updated version of WordNet, since WordNet currently fails to provide the synonyms, hypernyms and antonyms of many words. Further, since we currently consider all possible senses of a word in the WordNet based approach, we might be introducing wrong terms in the wordN et(q) and wordN et(rij ) set. This problem can possible be avoided with proper sense identi cation methods.

In addition, we plot a 2D diagram for each of the three models II, III, IV (baseline excluded), where the y-axis represents the computed score for (ri; qi) and the x-axis indicates its true binary relevance. The plots are shown in Figure 2. We can observe that the points are not separable by a threshold in any of the gures (parallel line to x-axis ). However, it is obvious that the IV approach clearly improves the separation, preserving higher scores to the true relevant reviews, like III, and lower scores to the non-relevant ones, like II. E ciency. All experiments were performed using a 16GB RAM machine. Regarding speed e ciency, it is worth measuring: a) the required time for loading the appropriate resources (Dataset, WordNet, Word2Vec) (i.e. Init Time), and b) the required time for computing the similarity score of one query-review pair (i.e. Execution Time). Note that the Init Time cost has to be paid only once, while Execution Time a ects the user interaction.

Regarding Init Time, the most time consuming resource is Word2Vec due to its enormous size (491,061 ms), followed by the loading of the FRUCE dataset (39,149 ms). The WordNet dictionary loads almost instantly (63ms). The Execution Time on the other hand is very fast for all methods (only 13 ms on average). We only need about 1.5 seconds for analyzing and scoring 100 reviews with 15 words on average. Table 2 shows the Execution Time for computing the scores of the 40 reviews for all methods (the minimum values are in bold). 4.2

Experiments over a Real Dataset

We also evaluated the proposed methods over a real dataset, that we scrapped from a travel website. This speci c dataset contains information about 382 different hotels located in 4 di erent cities (Kyoto, Tokyo, Osaka, Kobe) of Japan. The extracted data are logically structured in facets so that they can be directly plugged into the system, containing the following types of information: (a) boolean values, used for describing the facilities of a hotel (e.g. free of charge wi , free parking, etc.), (b) numerical values (integers or oats) for describing quantitative values (e.g. price, review rating, distance from various points of interest, etc.), (c) geographic values for describing the location and (d) textual values. In the last category there are also comments that review hotels, which are categorized into comments with a positive and negative aspect. We would like to remark that almost all (more than 23 thousand) review comments that we have extracted contain both a positive and a negative part. Table 3 shows the total number of hotels and the average number of comments per hotel for the 4 di erent cities of Japan. E ciency. The time required to load the user reviews is 186,769 ms. For evaluating the execution time, we measured the required time for analysing and scoring (according to the q1 and q2) 2,000 randomly selected reviews, returning the 10 most highly ranked ones. The minimum, maximum and average times were 21 ms, 6,427 ms and 56 ms respectively (on average each review has 48 words), and the total time was 113,870 ms. It follows that the proposed method is acceptable in terms of e ciency. Speci cally, if we assume that we have 3 hotels in the current user focus and the average number of reviews per hotel is 61 (as shown in Table 3), we can score all reviews in around 10 secs. Quality. Since the reviews are not annotated with binary relevance scores for the two used queries, it is di cult to evaluate the quality of the scoring methods on this collection. Annotating the whole collection is a laborious and time consuming task. However we have started to manually annotate a part of the full reviews for the two queries that we have used in the FRUCE Collection. For the time being, we have marked 71 distinct comments, and identi ed 66 relevant and 76 irrelevant (ci; qi) pairs. The average top-2 precision of the IV method for the 2 queries by considering only the subcollection of 71 human judged comments is 0.5, while the average R-precision (R = 33) is again 0.5. We have noticed that we would get higher results if the comments were clean, in the sense that the collection has several spam comments that a ect negatively the results. Currently, we are in the process of cleaning the collection. 5

Conclusion

In the context of Faceted Search quite often the structured data are not enough for answering a users query. In such cases the system could resort to related textual comments (posed in natural language) for identifying those that could be exploited for helping the user. This requires nding the most relevant comments that (a) are associated with the most preferred objects, and (b) are related to a user's question. Moreover, spoken dialogue interaction poses increased requirements on quality, in order to avoid wasting user's time by reading irrelevant comments. To this end, we introduced a dictionary-based method that uses WordNet, a word embedding-based method, speci cally Word2vec, and one that combines both. The analysis and the experimental results showed that the key result is that without dictionaries (either human-made or statistical ones), the e ectiveness of retrieving the relevant comments is very low even in a small dataset. Speci cally, the baseline method achieved mean AveP = 0:05 and mean R precision = 0:05. However the method that uses both WordNet and Word2vec outperforms every other method with mean AveP = 0:569 and mean R precision = 0:649, taking on average only 13 ms to score a review. We believe that the proposed method can be applied in several domains and for various tasks, from booking services to product selection. As part of our future work we plan to: (a) continue the quality evaluation over the real dataset, (b) extend the system described in [ 13 ] with this functionality, (c) investigate the applicability of comparative opinion mining and query-oriented sentiment analysis, and (d) investigate how we could exploit external sources in cases where even the user comments/reviews are not su cient for answering a user's question.

Chen ,

Fisch ,

Weston , and

Bordes . Reading wikipedia to answer opendomain questions . CoRR, abs/1704.00051 , 2017 .

Cortis ,

Freitas ,

Daudert , M. Hurlimann, M. Zarrouk,

Handschuh , and

Davis . Semeval -2017 task 5: Fine-grained sentiment analysis on nancial microblogs and news . In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017 , Vancouver, Canada, August 3-4 , 2017 , pages 519 { 535 , 2017 .

Diao ,

Mukherjea ,

Rajput , and

Srivastava . Faceted search and browsing of audio content on spoken web . In Proceedings of the 19th ACM international conference on Information and knowledge management , pages 1029 { 1038 . ACM, 2010 .

Ferre . Sparklis: an expressive query builder for sparql endpoints with guidance in natural language . Semantic Web , 8 ( 3 ): 405 { 418 , 2017 .

Kusner ,

Sun ,

Kolkin , and

Weinberger . From word embeddings to document distances . In International Conference on Machine Learning , pages 957 { 966 , 2015 .

Lionakis and

Tzitzikas . Pfsgeo: Preference-enriched faceted search for geographical data . In OTM Confederated International Conferences" On the Move to Meaningful Internet Systems" , pages 125 { 143 . Springer, 2017 .

B. L.

Lopez-Ochoa ,

J. L.

Sanchez-Cervantes ,

Alor-Hernandez ,

M. A.

AbudFigueroa ,

B. A.

Olivares-Zepahua , and L. Rodr guez-Mazahua. An architecture based in voice command recognition for faceted search in linked open datasets . In International Conference on Software Process Improvement , pages 174 { 185 . Springer, 2017 .

8. C. D. Manning , M.

Surdeanu , J.

Bauer , J. R.

Finkel , S.

Bethard , and D.

McClosky . The stanford corenlp natural language processing toolkit . In ACL (System Demonstrations) , pages 55 { 60 , 2014 .

Mikolov ,

Chen , G. Corrado, and

Dean . E cient estimation of word representations in vector space . arXiv preprint arXiv:1301.3781 , 2013 .

10.

Mikolov , I. Sutskever,

Chen ,

G. S.

Corrado , and

Dean . Distributed representations of words and phrases and their compositionality . In C. J. C. Burges , L.

Bottou , M.

Welling , Z.

Ghahramani , and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages 3111 { 3119 . Curran Associates, Inc., 2013 .

11.

G. A.

Miller . Wordnet: A lexical database for english . Commun. ACM , 38 ( 11 ): 39 { 41 , Nov . 1995 .

12.

Papadakos and

Tzitzikas . Comparing the e ectiveness of intentional preferences versus preferences over speci c choices: a user study . International Journal of Information and Decision Sciences , 8 ( 4 ): 378 { 403 , 2016 .

13.

Papangelis ,

Papadakos ,

Kotti ,

Stylianou ,

Tzitzikas , and

Plexousakis . Ld-sds: Towards an expressive spoken dialogue system based on linkeddata . In Search Oriented Conversational AI , SCAI 17 Workshop (co-located with ICTIR 17) , 2017 .

14. G. Petrucci and

Dragoni . An information retrieval-based system for multidomain sentiment analysis . In F. Gandon, E. Cabrio,

Stankovic , and A . Zimmermann, editors, Semantic Web Evaluation Challenges , pages 234 { 243 , Cham , 2015 . Springer International Publishing.

15. G. M. Sacco and Y. Tzitzikas . Dynamic taxonomies and faceted search: theory, practice, and experience , volume 25 . Springer Science & Business Media , 2009 .

16. E. Sherkhonov , B. C.

Grau , E. Kharlamov, and E. V.

Kostylev . Semantic faceted search with aggregation and recursion . In International Semantic Web Conference , pages 594 { 610 . Springer, 2017 .

17.

Tzitzikas ,

Bailly ,

Papadakos ,

Minadakis , and

Nikitakis . Using preference-enriched faceted search for species identi cation . International Journal of Metadata, Semantics and Ontologies , 11 ( 3 ): 165 { 179 , 2016 .

18.

Tzitzikas and

Dimitrakis . Preference-enriched faceted search for voting aid applications . IEEE Transactions on Emerging Topics in Computing, PP(99):1{1 , 2016 .

19.

Tzitzikas ,

Manolis , and

Papadakos . Faceted exploration of rdf/s datasets: a survey . Journal of Intelligent Information Systems , pages 1 { 36 , 2016 .

20.

Tzitzikas and

Papadakos . Interactive exploration of multidimensional and hierarchical information spaces with real-time preference elicitation . Fundamenta Informaticae , 20 :1{ 42 , 2012 .

21. L. Yu , K. M. Hermann , P. Blunsom, and S. Pulman . Deep learning for answer sentence selection . CoRR, abs/1412.1632 , 2014 .