On Finding the Relevant User Reviews for Advancing Conversational Faceted Search

Eleftherios Dimitrakis1,2, Konstantinos Sgontzos1,2, Panagiotis Papadakos1, Yannis Marketakis1, Alexandros Papangelis3, Yannis Stylianou2,3, and Yannis Tzitzikas1,2

1 Institute of Computer Science - FORTH-ICS, Greece {dimitrakis, sgontzos, papadako, marketak, tzitzik}@ics.forth.gr
2 Computer Science Department - University of Crete, Greece
3 Speech Technology Group - Toshiba Research Europe {alex.papangelis, yannis.stylianou}@crl.toshiba.co.uk

Abstract. Faceted Search (FS) is a widely used exploratory search paradigm which is commonly applied over multidimensional or graph data. However, sometimes the structured data are not sufficient for answering a user's query. User comments (or reviews) are a valuable source of information that can be exploited in such cases for helping the user to explore the information space and to decide which options suit him/her better (either through question answering or query-oriented sentiment analysis). To this end, in this paper we introduce and comparatively evaluate methods for locating the user comments that are most relevant to the user's focus, in the context of a conversational faceted search system. Specifically, we introduce a dictionary-based method, a word embedding-based method, and a combination of the two. The analysis and the experimental results show that the combined method outperforms the other methods, without significantly affecting the overall response time.

1 Introduction

Faceted Search (FS) is a widely used exploratory search paradigm. It is used whenever the user wants to find a desired item in a list of items (products, hotels, restaurants, publications, etc.). Typically, FS offers exploratory search over multidimensional or graph data. However, sometimes the structured data are not enough for answering a user's query. User comments (or reviews) are a valuable source of information that can be exploited in such cases for helping the user to explore the information space and to decide which options suit him/her better. Indeed, user comments/reviews are available in various applications of faceted search, e.g. in hotel booking and in product catalogs.

Enabling the interaction of FS through spoken dialogue is appropriate for situations where the user cannot (or it is not convenient to) use his hands or eyes. In such cases, the user interacts using his voice and provides commands or poses questions. If a question cannot be translated to a query over the structured resources of the dataset, then the system cannot deliver any answer. In such cases it is reasonable to resort to the available unstructured data, i.e. to users' comments and reviews. Figure 1 illustrates the context. The objective is not to provide the user with a direct answer, but first to identify which of the user comments are relevant to the user's question. Direct query answering is reasonable only in cases where there is a single and credible source of unstructured data (e.g. Wikipedia). This is not the case with user comments, since they can be numerous and their content can be conflicting. If we manage to find the relevant comments, then the system can either read these comments to the user, attempt question answering if the user requests so, or apply any other kind of analysis, e.g. sentiment analysis as in [2, 14].
In any case, spoken dialogue interaction poses increased requirements on quality, since the system should not "read" irrelevant comments: reading costs user time.

[Fig. 1: Finding Related Comments and Conversational Faceted Search. The figure sketches the data resources (multidimensional or graph data, user comments and reviews), the system components (Preference-enriched Faceted Search, Finding Related Comments, Mapping to Spoken Dialogue), and the user interaction.]

Note that instead of analyzing the user comments for estimating whether a hotel is good or bad as a whole, the interaction that we propose enables the user to get information about the particular aspects or topics that are important for him, e.g. noise, cleanliness, the quality of the wifi, parking, or the courtesy and helpfulness of the staff. The set of such topics is practically endless, and we cannot assume that structured data will exist for all of them. Therefore, it is beneficial to have systems that are able to exploit associated unstructured data, e.g. user comments and reviews. The problem is challenging because user comments are usually short, which makes it hard to achieve an acceptable level of recall.

In this paper we focus on this problem, and we introduce methods relying on hand-crafted and statistical dictionaries for identifying the relevant comments. In addition, we describe an evaluation collection that we have created for comparatively evaluating the introduced methods, as well as an ongoing application and evaluation over a bigger, real dataset. In a nutshell, the key contributions of this paper are: (a) we show how the FS interaction can be extended for exploiting unstructured data in the form of user comments and reviews, and (b) we introduce and comparatively evaluate four methods for identifying the most relevant user comments in datasets related to the task of hotel booking.

The rest of this paper is organized as follows: Section 2 presents the required background and related work. Section 3 describes the proposed methods. Section 4 reports experimental results. Finally, Section 5 concludes the paper and discusses directions for future research and work.

2 Background and Related Work

2.1 Background: Faceted Search and PFS

Faceted search is the de facto standard in e-commerce and tourism services (e.g. eBay, booking.com), and its popularity and adoption keep increasing. It is an interaction framework based on a multi-dimensional classification of data objects, allowing users to browse and explore the information space in a guided, yet unconstrained way through a simple visual interface [15]. Features of this framework include: (a) display of the current results in multiple categorization schemes (called facets, dimensions, or just attributes), (b) display of only those facets and values that lead to non-empty results, (c) display of the count information for each value (i.e. the number of results the user will get by selecting that value), and (d) the ability to refine the focus gradually, i.e. it is a session-based interaction paradigm, in contrast to the stateless query-and-response dialogue of most search systems. Faceted search has been proposed and applied for web searching, for semantically enriching web search results, for patent search, as well as for exploring RDF and Linked Data (e.g. see [4,16], as well as [19] for a recent survey).
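To make features (b)-(d) concrete, the following is a minimal, hypothetical sketch of faceted counting and refinement over a toy set of hotel records; the data, function names and attribute names are illustrative assumptions and not part of any system discussed in this paper.

```python
# Minimal illustration of faceted counting and refinement over assumed toy data.
from collections import Counter

hotels = [
    {"city": "Rome",   "stars": 4, "free_wifi": True},
    {"city": "Rome",   "stars": 3, "free_wifi": False},
    {"city": "London", "stars": 4, "free_wifi": True},
]

def facet_counts(objects, facet):
    """Count information per facet value over the current focus (feature (c))."""
    return Counter(obj[facet] for obj in objects)

def refine(objects, facet, value):
    """Hard restriction: keep only the objects with the selected value (feature (d))."""
    return [obj for obj in objects if obj[facet] == value]

focus = hotels                          # the session starts with all objects
print(facet_counts(focus, "city"))      # only values leading to non-empty results are shown
focus = refine(focus, "city", "Rome")   # gradual refinement of the focus
print(facet_counts(focus, "stars"))
```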
The enrichment of faceted search with preferences, hereafter Preference-enriched Faceted Search (PFS for short), was proposed in [12,20]. PFS offers actions that allow the user to order facets, values, and objects using best, worst and prefer to actions (i.e. relative preferences), around to actions (around a specific value), or actions that order them lexicographically, or based on their values or count values. Furthermore, the user is able to compose object-related preference actions using Priority, Pareto, Pareto Optimal (i.e. skyline) and others. The distinctive features of PFS are that it allows expressing preferences over attributes whose values can be hierarchically organized (and/or multi-valued), it supports preference inheritance, and it offers scope-based rules for resolving automatically the conflicts that may arise. As a result, the user is able to restrict his current focus by using the faceted interaction scheme (hard restrictions that lead to non-empty results), and to rank the objects of his focus according to the expressed preferences. Recently, PFS has been used in various domains, e.g. for offering a flexible process for the identification of fish species [17], as a Voting Advice Application [18], and for data that also contain geographical information [6].

2.2 Related Works

Conversational Faceted Search. Only a few works involve speech interfaces on top of the faceted search paradigm: [3] exploits a speech user interface over facets that index metadata associated with audio content (that system is used for the Spoken Web, an alternative to the WWW based on audio content, and the associated MediaEval Spoken Web Search Task), while a faceted browser over Linked Data is described in [7], where commands in natural language are translated to SPARQL queries. To the best of our knowledge, though, the only work that combines spoken dialogue systems with faceted search is the one presented in [13], where the described LD-SDS system is limited to spoken dialogues over structured datasets (expressed in RDF). In this work we extend conversational faceted search for exploiting also the available unstructured data (e.g. user reviews). Note that tackling the same problem using only a single large-scale source of unstructured data, e.g. Wikipedia (as described in [1]), is much easier, since in that case we do not have the source selection problem (the selection of user comments in our case), and the source contains many and long texts, so it is not difficult to achieve a good recall level.

Similar Tasks. Two similar tasks, as regards text size, from the area of Question Answering are: (1) Machine Comprehension (MC), which aims at identifying the answer boundaries in a given text passage for an input question (e.g. [1] performs MC over Wikipedia), and (2) Answer Sentence Selection, which aims at identifying the right sentence from a list of candidate sentences, given an input question (e.g. as in [21]).

3 The Proposed Approach

In §3.1 we describe an extension of the interaction of PFS for exploiting also the associated unstructured data, and in §3.2 we focus on the problem of finding the relevant comments.

3.1 The Interaction

The user interacts with the system using actions corresponding to PFS actions, i.e. actions that correspond either to hard constraints (i.e. filters) or to soft constraints (i.e. preferences). We shall use the term ofocus to refer to the restricted set of objects (those left after applying all filters), and pfocus to refer to the first bucket of the focus, which contains the most preferred objects. If the cardinality of either of these sets is below a configurable threshold θ (say 10) and the user's question cannot be answered by the structured dataset, the system resorts to the user comments. Note that if at some point in the interaction the user's focus is big (i.e. min(|ofocus|, |pfocus|) > θ) and the user asks a question that cannot be answered by the structured dataset, then the system suggests that the user "first refine the focus", in the sense that it is not useful to pose questions of the form "quiet hotel in Rome" or "hotels with fast wifi in London" over a large focus. In other words, we could say that the system enters this mode in the so-called "End Game" phase of faceted search [15]. This choice has several benefits: (a) Applicability: it can be applied without requiring the comments to be indexed a priori, which enables the application of this model over RSS feeds and blog comment hosting services (e.g. Disqus). (b) Efficiency: since the analysis is done only for the comments of the hotels in the focus, it is feasible to perform it in real time. (c) Less Noise, Better Quality: for the same reason as in (b), the quality of the retrieved comments is expected to be higher in comparison to the quality of retrieval over the entire set of comments (of all hotels).
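A minimal sketch of the above decision logic follows. This is our own illustration, not the code of the system in [13]; the function names, the data layout of the focus objects, and the structured-answering check are assumptions.

```python
THETA = 10  # configurable focus-size threshold (the paper suggests e.g. 10)

def handle_question(question, ofocus, pfocus, answer_from_structured, score_comments):
    """Route a user question to the structured data or to the user comments (Sec. 3.1)."""
    answer = answer_from_structured(question, ofocus)
    if answer is not None:
        return answer                                  # the structured dataset suffices
    if min(len(ofocus), len(pfocus)) > THETA:
        return "Please refine your focus first."       # focus too big: suggest refinement
    comments = [c for hotel in ofocus for c in hotel["comments"]]
    return score_comments(question, comments)[:10]     # ranked by the methods of Sec. 3.2
```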
3.2 Finding the Relevant Comments

We use a scoring function for estimating the relevance between an input question q and each user review r_i, where 1 ≤ i ≤ θ. Below we introduce four scoring methods: (I) a Baseline, (II) a WordNet-based, (III) a Word2vec-based, and (IV) a combination of (II) and (III).

WordNet [11] is a lexical database for the English language comprising 166,000 (f, s) pairs, where f is a word-form and s the set of words that share the same sense; it also includes relations between words and senses (such as Synonymy, Antonymy, Hypernymy, etc.). Word2Vec [10] is a method for transforming individual words into vectors of low dimensionality (low in comparison to a |words|-dimensional representation), e.g. 300, so that their distances reveal their semantic association (these representations are derived by training a two-layer neural network). The motivation for selecting these resources is their ability to capture semantically relevant reviews beyond the trivial task of exact string matching, and their rich and domain-agnostic vocabulary.

The process for identifying the most relevant comments, in any of the four methods I-IV, consists of the following steps:
1) For each review r_i we split its text into individual sentences, obtaining a set of sentences r_{ij}, where 1 ≤ j ≤ s and s is the number of sentences of the review. In this way, we can score each review based on its maximally scored sentence.
2) We apply tokenization, removal of stop-words and punctuation, as well as lemmatization (using Stanford CoreNLP [8]) both to the input question q and to each sentence r_{ij} of r_i. Let us denote the results by q_words and r_{ij}_words respectively.
3) We construct the method-specific representation of q and of each r_{ij} (described below).
4) We score and rank each review based on the corresponding relevance formula.

Below we describe the representation and the scoring formula of each method.

I) Baseline: Here we just compute the maximum Jaccard similarity between q_words and the corresponding r_{ij}_words sets:
S(q, r_i) = max_{r_{ij} ∈ r_i} JaccardSim(q_words, r_{ij}_words)

II) WordNet: In this method we construct WordNet-based representations of the q and r_{ij} sets. Specifically, for each word in q_words and r_{ij}_words we take the union of its synonyms, antonyms and hypernyms, as extracted from WordNet, denoted by wordNet(q) and wordNet(r_{ij}) respectively. The final score is again defined through the maximum Jaccard similarity:
S(q, r_i) = max_{r_{ij} ∈ r_i} WNS(q, r_{ij}), where WNS(q, r_{ij}) = JaccardSim(wordNet(q), wordNet(r_{ij})).
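A minimal sketch of methods I and II is given below, under stated assumptions: it uses NLTK's WordNet interface and plain Python sets for the already preprocessed words, whereas the paper performs the preprocessing with Stanford CoreNLP; the function names and the NLTK substitution are ours, not the authors' implementation.

```python
# Sketch of the Baseline (I) and WordNet-based (II) scoring methods.
# Each review is assumed to be given as a list of sentences, each sentence a set of
# lemmatized words (the output of step 2 above).
from nltk.corpus import wordnet as wn

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def wordnet_expand(words: set) -> set:
    """Union of synonyms, antonyms and hypernyms of every word (all senses considered)."""
    expanded = set(words)
    for w in words:
        for synset in wn.synsets(w):
            expanded.update(l.name() for l in synset.lemmas())                 # synonyms
            expanded.update(a.name() for l in synset.lemmas()
                                     for a in l.antonyms())                    # antonyms
            expanded.update(n for h in synset.hypernyms()
                              for n in h.lemma_names())                        # hypernyms
    return expanded

def score_baseline(q_words: set, review_sentences: list) -> float:
    # S(q, r_i) = max over sentences of JaccardSim(q_words, r_ij_words)
    return max(jaccard(q_words, s) for s in review_sentences)

def score_wordnet(q_words: set, review_sentences: list) -> float:
    # S(q, r_i) = max over sentences of JaccardSim(wordNet(q), wordNet(r_ij))
    q_exp = wordnet_expand(q_words)
    return max(jaccard(q_exp, wordnet_expand(s)) for s in review_sentences)
```

Note that, as the paper points out later (Section 4), considering all senses of a word may introduce noisy terms into the expanded sets.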
III) Word2vec: This method exploits the word2vec embeddings of the GoogleNews 300-dimensional pre-trained model (https://code.google.com/archive/p/word2vec/). Specifically, we get the word2vec vector representations of all words in q_words and r_{ij}_words, denoted by word2vec(q) and word2vec(r_{ij}) respectively. Then we apply the Word Mover's Distance (WMD) [5], which calculates the minimum distance (in the vector space) between the embedded words of the two sets. The score is defined as:
S(q, r_i) = max_{r_{ij} ∈ r_i} WMS(q, r_{ij}), where WMS(q, r_{ij}) = 1 - WMD_n(word2vec(q), word2vec(r_{ij}))
and WMD_n denotes the normalized distance, i.e. the WMD divided by the maximum WMD over all comments.

IV) WordNet and Word2vec: Here we combine the two previous methods through a weighted sum, reaching the following score definition:
S(q, r_i) = w_WN * max_{r_{ij} ∈ r_i} WNS(q, r_{ij}) + w_w2v * max_{r_{ij} ∈ r_i} WMS(q, r_{ij})
where w_WN, w_w2v ∈ [0, 1] and w_WN + w_w2v = 1.
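Continuing the sketch above, a hypothetical implementation of methods III and IV using gensim's pre-trained word2vec interface; the normalization constant max_wmd (the maximum WMD over all comments) and the weights are passed in explicitly, score_wordnet refers to the function sketched after method II, and the use of gensim is our assumption rather than the authors' actual code.

```python
# Sketch of the Word2vec/WMD method (III) and the weighted combination (IV), assuming gensim.
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)   # pre-trained GoogleNews model

def score_word2vec(q_words, review_sentences, max_wmd):
    """WMS(q, r_ij) = 1 - WMD_n; the review score is the max over its sentences."""
    distances = [w2v.wmdistance(list(q_words), list(s)) for s in review_sentences]
    return max(1.0 - d / max_wmd for d in distances)

def score_combined(q_words, review_sentences, max_wmd, w_wn=0.7, w_w2v=0.3):
    """Method IV: weighted sum of the WordNet and Word2vec scores (w_wn + w_w2v = 1).
    The default weights are those found to maximize mean AveP in Section 4.1."""
    return (w_wn * score_wordnet(q_words, review_sentences) +
            w_w2v * score_word2vec(q_words, review_sentences, max_wmd))
```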
4 Evaluation

4.1 Evaluation over the Collection FRUCE

We constructed a small evaluation collection in order to compare the presented methods. The collection consists of 40 hand-crafted user reviews/comments related to hotels (c_1, ..., c_40) and 2 manually crafted queries (q_1 and q_2) related to the topic of noise. The complete list of comments is web accessible (http://www.ics.forth.gr/isl/sar/resources/dataset/fruce) and the queries are the following: q_1 = "Has anyone reported a problem about noise?", q_2 = "Is this hotel quiet?". For the needs of the evaluation we manually judged the relevance of the collection's reviews to each query. Specifically, each review c_i is labeled with 1 if it is relevant and with 0 otherwise. The relevant/irrelevant ratio in the collection is 1/3.

Quality. We measured the mean R-Precision and the mean AveP over the two queries q_1 and q_2 for all methods. For method IV we evaluated various weight combinations and chose the one that achieved the highest mean AveP. Note that methods II and III correspond to the pairs (w_WN = 1.0, w_w2v = 0.0) and (w_WN = 0.0, w_w2v = 1.0) respectively. In our case the maximizing weights were found to be w_WN = 0.7 and w_w2v = 0.3, with mean AveP = 0.569 and mean R-Precision = 0.649. The corresponding scores for method II were mean AveP = 0.398 and mean R-Precision = 0.449, while method III achieved mean AveP = 0.366 and mean R-Precision = 0.4 (the precision of Word2vec-based methods in analogous challenges [9] is around 55%, i.e. similar to what we measured in our setting). Overall, IV outperforms both II and III, while II slightly outperforms III. As expected, all of these models outperformed the baseline (mean AveP = 0.05 and mean R-Precision = 0.05), as shown in Table 1. We have to stress, though, that the results of methods II and IV could be further improved by combining other thesauri with WordNet, or by using an updated version of WordNet, since WordNet currently fails to provide the synonyms, hypernyms and antonyms of many words. Further, since we currently consider all possible senses of a word in the WordNet-based approach, we might be introducing wrong terms in the wordNet(q) and wordNet(r_{ij}) sets. This problem could possibly be avoided with proper sense identification methods.

Table 1: Mean AveP and mean R-Precision of methods I-IV.
Method | Mean AveP | Mean R-Precision
I      | 0.05      | 0.05
II     | 0.398     | 0.449
III    | 0.366     | 0.4
IV     | 0.569     | 0.649

Table 2: Time for computing the scores of the 40 reviews for each method.
Method | Total time (ms) | Aver. time (ms)
I      | 141             | 3
II     | 797             | 19
III    | 47              | 1
IV     | 546             | 13

In addition, we plot a 2D diagram for each of the three models II, III and IV (baseline excluded), where the y-axis represents the computed score of a (r_i, q_i) pair and the x-axis indicates its true binary relevance. The plots are shown in Figure 2. We can observe that in none of the plots are the points separable by a threshold (a line parallel to the x-axis). However, method IV clearly improves the separation, preserving higher scores for the truly relevant reviews, like III, and lower scores for the non-relevant ones, like II.

[Fig. 2: Distribution of query-review pairs as a function of their calculated (floating point) and true binary relevance score for methods II, III and IV.]

Efficiency. All experiments were performed on a machine with 16GB of RAM. Regarding speed, it is worth measuring: (a) the time required for loading the appropriate resources (dataset, WordNet, Word2Vec), i.e. the Init Time, and (b) the time required for computing the similarity score of one query-review pair, i.e. the Execution Time. Note that the Init Time cost has to be paid only once, while the Execution Time affects the user interaction. Regarding Init Time, the most time-consuming resource is Word2Vec, due to its enormous size (491,061 ms), followed by the loading of the FRUCE dataset (39,149 ms). The WordNet dictionary loads almost instantly (63 ms). The Execution Time, on the other hand, is very low for all methods (only 13 ms on average for the combined method IV). We need only about 1.5 seconds for analyzing and scoring 100 reviews with 15 words on average. Table 2 shows the Execution Time for computing the scores of the 40 reviews for all methods.
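For reference, the two evaluation measures reported above can be computed as in the following minimal sketch. This is our own illustration, not the evaluation code used for FRUCE; it assumes that each method produces a ranking of the whole collection, given as a list of binary relevance labels in ranked order.

```python
def average_precision(ranked_labels):
    """AveP of one ranked list of binary labels (1 = relevant), ranking the whole collection."""
    R = sum(ranked_labels)                    # total number of relevant reviews
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel == 1:
            hits += 1
            ap += hits / rank                 # precision at each relevant rank
    return ap / R if R else 0.0

def r_precision(ranked_labels):
    """Precision at rank R, where R is the total number of relevant reviews."""
    R = sum(ranked_labels)
    return sum(ranked_labels[:R]) / R if R else 0.0

# The reported figures are the means of these measures over the two queries q1 and q2.
```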
4.2 Experiments over a Real Dataset

We also evaluated the proposed methods over a real dataset that we scraped from a travel website. This dataset contains information about 382 different hotels located in 4 cities of Japan (Kyoto, Tokyo, Osaka, Kobe). The extracted data are logically structured in facets, so that they can be directly plugged into the system, and contain the following types of information: (a) boolean values, used for describing the facilities of a hotel (e.g. free-of-charge wifi, free parking, etc.), (b) numerical values (integers or floats) for describing quantitative values (e.g. price, review rating, distance from various points of interest, etc.), (c) geographic values for describing the location, and (d) textual values. The last category also includes the comments that review the hotels, which are categorized into comments with a positive and with a negative aspect. We would like to remark that almost all of the more than 23 thousand review comments that we have extracted contain both a positive and a negative part. Table 3 shows the total number of hotels and the average number of comments per hotel for the 4 cities.

Table 3: The Japan hotels dataset, containing more than 23,000 comments.
City  | Hotels | Avg. num. of comments per hotel
Kyoto | 100    | 71
Osaka | 100    | 65
Tokyo | 100    | 71
Kobe  | 82     | 33
Total | 382    | 61

Efficiency. The time required to load the user reviews is 186,769 ms. For evaluating the execution time, we measured the time required for analysing and scoring (according to q_1 and q_2) 2,000 randomly selected reviews, returning the 10 most highly ranked ones. The minimum, maximum and average times per review were 21 ms, 6,427 ms and 56 ms respectively (on average each review has 48 words), and the total time was 113,870 ms. It follows that the proposed method is acceptable in terms of efficiency. Specifically, if we assume that the current user focus contains 3 hotels and the average number of reviews per hotel is 61 (as shown in Table 3), we can score all their reviews in around 10 seconds.

Quality. Since the reviews are not annotated with binary relevance scores for the two queries, it is difficult to evaluate the quality of the scoring methods on this collection; annotating the whole collection is a laborious and time-consuming task. However, we have started to manually annotate a part of the reviews for the two queries used in the FRUCE collection. So far, we have marked 71 distinct comments and identified 66 relevant and 76 irrelevant (c_i, q_i) pairs. The average top-2 precision of method IV for the 2 queries, considering only the subcollection of the 71 human-judged comments, is 0.5, while the average R-Precision (R = 33) is again 0.5. We have noticed that we would get higher results if the comments were clean, in the sense that the collection contains several spam comments that affect the results negatively. Currently, we are in the process of cleaning the collection.

5 Conclusion

In the context of Faceted Search, quite often the structured data are not enough for answering a user's query. In such cases the system could resort to related textual comments (expressed in natural language) and identify those that could be exploited for helping the user. This requires finding the most relevant comments, i.e. comments that (a) are associated with the most preferred objects, and (b) are related to the user's question. Moreover, spoken dialogue interaction poses increased requirements on quality, in order to avoid wasting the user's time by reading irrelevant comments. To this end, we introduced a dictionary-based method that uses WordNet, a word embedding-based method based on Word2vec, and one that combines both. The key result of the analysis and the experiments is that without dictionaries (either human-made or statistical ones), the effectiveness of retrieving the relevant comments is very low even in a small dataset: the baseline method achieved mean AveP = 0.05 and mean R-Precision = 0.05. The method that combines WordNet and Word2vec outperforms every other method, with mean AveP = 0.569 and mean R-Precision = 0.649, while taking on average only 13 ms to score a review. We believe that the proposed method can be applied in several domains and for various tasks, from booking services to product selection.
As part of our future work we plan to: (a) continue the quality evaluation over the real dataset, (b) extend the system described in [13] with this functionality, (c) investigate the applicability of comparative opinion mining and query-oriented sentiment analysis, and (d) investigate how we could exploit external sources in cases where even the user comments/reviews are not sufficient for answering a user's question.

References

1. D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading Wikipedia to answer open-domain questions. CoRR, abs/1704.00051, 2017.
2. K. Cortis, A. Freitas, T. Daudert, M. Hürlimann, M. Zarrouk, S. Handschuh, and B. Davis. SemEval-2017 Task 5: Fine-grained sentiment analysis on financial microblogs and news. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, pages 519-535, 2017.
3. M. Diao, S. Mukherjea, N. Rajput, and K. Srivastava. Faceted search and browsing of audio content on Spoken Web. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1029-1038. ACM, 2010.
4. S. Ferré. Sparklis: An expressive query builder for SPARQL endpoints with guidance in natural language. Semantic Web, 8(3):405-418, 2017.
5. M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957-966, 2015.
6. P. Lionakis and Y. Tzitzikas. PFSgeo: Preference-enriched faceted search for geographical data. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 125-143. Springer, 2017.
7. B. L. López-Ochoa, J. L. Sánchez-Cervantes, G. Alor-Hernández, M. A. Abud-Figueroa, B. A. Olivares-Zepahua, and L. Rodríguez-Mazahua. An architecture based in voice command recognition for faceted search in linked open datasets. In International Conference on Software Process Improvement, pages 174-185. Springer, 2017.
8. C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55-60, 2014.
9. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
10. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc., 2013.
11. G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41, Nov. 1995.
12. P. Papadakos and Y. Tzitzikas. Comparing the effectiveness of intentional preferences versus preferences over specific choices: A user study. International Journal of Information and Decision Sciences, 8(4):378-403, 2016.
13. A. Papangelis, P. Papadakos, M. Kotti, Y. Stylianou, Y. Tzitzikas, and D. Plexousakis. LD-SDS: Towards an expressive spoken dialogue system based on linked data. In Search Oriented Conversational AI (SCAI'17) Workshop, co-located with ICTIR'17, 2017.
14. G. Petrucci and M. Dragoni. An information retrieval-based system for multi-domain sentiment analysis. In Semantic Web Evaluation Challenges, pages 234-243, Cham, 2015. Springer International Publishing.
15. G. M. Sacco and Y. Tzitzikas. Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience, volume 25. Springer Science & Business Media, 2009.
16. E. Sherkhonov, B. C. Grau, E. Kharlamov, and E. V. Kostylev. Semantic faceted search with aggregation and recursion. In International Semantic Web Conference, pages 594-610. Springer, 2017.
17. Y. Tzitzikas, N. Bailly, P. Papadakos, N. Minadakis, and G. Nikitakis. Using preference-enriched faceted search for species identification. International Journal of Metadata, Semantics and Ontologies, 11(3):165-179, 2016.
18. Y. Tzitzikas and E. Dimitrakis. Preference-enriched faceted search for voting aid applications. IEEE Transactions on Emerging Topics in Computing, PP(99):1-1, 2016.
19. Y. Tzitzikas, N. Manolis, and P. Papadakos. Faceted exploration of RDF/S datasets: A survey. Journal of Intelligent Information Systems, pages 1-36, 2016.
20. Y. Tzitzikas and P. Papadakos. Interactive exploration of multidimensional and hierarchical information spaces with real-time preference elicitation. Fundamenta Informaticae, 20:1-42, 2012.
21. L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman. Deep learning for answer sentence selection. CoRR, abs/1412.1632, 2014.