Towards Analogy-based Recommendation: Benchmarking of Perceived Analogy Semantics

Christoph Lofi
Web Information Systems - TU Delft
Mekelweg 4, 2628CD Delft, Netherlands
c.lofi@tudelft.nl

Nava Tintarev
Web Information Systems - TU Delft
Mekelweg 4, 2628CD Delft, Netherlands
n.tintarev@tudelft.nl

ABSTRACT
Requests for recommendation can be seen as a form of query for candidate items, ranked by relevance. Users are, however, often unable to crisply define what they are looking for. One of the core concepts of natural communication for describing and explaining complex information needs in an intuitive fashion is the analogy: e.g., "What is to Christopher Nolan as 2001: A Space Odyssey is to Stanley Kubrick?". Analogies allow users to explore the item space by formulating queries in terms of items rather than explicitly specifying the properties that they find attractive. One of the core challenges hampering research on analogy-enabled queries is that analogy semantics rely on a consensus in human perception, which is not well represented in current benchmark datasets. Therefore, in this paper we introduce a new benchmark dataset focusing on the human aspects of analogy semantics. Furthermore, we evaluate a popular technique for analogy semantics (word2vec neural embeddings) using our dataset. The results show that current word embedding approaches are still not suitable for dealing with deeper analogy semantics. We discuss future directions, including hybrid algorithms that also incorporate structural or crowd-based approaches, and the potential for analogy-based explanations.

KEYWORDS
Analogy-Enabled Recommendation, Relational Similarity, Analogy Benchmarking

ComplexRec 2017, August 31, 2017, Como, Italy. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published on CEUR-WS, Volume 1892.

1 INTRODUCTION
In this paper we explore one method used in natural human interaction for efficiently communicating information needs or for explaining the results of an inquiry, namely analogies. Using analogies in natural speech allows communicating dense information easily and naturally by implying that the "essence" of two concepts is similar, or at least perceived similarly. Thus, analogies can be used to map factual and behavioural properties from one concept (usually the better known one, the source) to another (usually less well known, the target). This is particularly effective for natural querying and explaining when only vague domain knowledge is available.

In this paper we consider two types of analogy queries: 4-term analogy completion queries (like "What is to Christopher Nolan as 2001: A Space Odyssey is to Stanley Kubrick?"), and 4-term analogon ranking (like "What is similar to '2001: A Space Odyssey to Stanley Kubrick'? a) The Fifth Element to Christopher Nolan, b) Memento to Christopher Nolan, c) Dunkirk to Christopher Nolan"). Analogy completion queries might be seen as an extension of classical critiquing in recommender systems, which can be formulated in terms of "like x, but with properties y modified". In critiquing, the feature (price) and the modification (cheaper) need to be explicit, whereas in an analogy the semantics are implicitly given by setting terms in relation, which is interpreted based on both communication partners' conceptualization of that domain (e.g., "The Fifth Element is like 2001: A Space Odyssey but by Scorsese" carries a lot of implicit information).

Analogies are typically represented as rhetorical figures of speech which need to be interpreted and reasoned on using the receiver's background knowledge - which is difficult for information systems to mimic. This complex semantic task is further complicated by the lack of useful benchmark datasets. Most current benchmarks are restricted in scope, usually focusing on syntactic features instead of relevant properties of analogies, such as their suitability for transferring information (which is central when one wants to use analogies for querying, or for explaining recommendations). Therefore, one of our core contributions is an improved benchmark dataset for analogy semantics which focuses specifically on the usefulness of an analogy with respect to querying and explaining, provided as a tool for guiding future research into analogy queries in recommender systems. In this work, we focus on general-domain analogies. This allows us to choose the right technologies for future adaptation in a domain-specific recommender system.

In detail, our contributions are as follows:
• We discuss different properties of analogy semantics, and highlight their importance for querying and explaining recommendations.
• We introduce a new benchmark dataset systematically built on top of existing sets, rectifying many current limitations. Especially, we focus on the perceived difficulty of analogies, and the quality and usefulness of analogies.
• We showcase and discuss the performance of an established word embedding-based algorithm on our test set.

2 DISCUSSIONS ON ANALOGY SEMANTICS
The semantics of analogies have been researched in depth in the fields of philosophy, linguistics, and the cognitive sciences, e.g., [11], [12], or [21]. There have been several models for analogical reasoning from philosophy (like works by Plato, Aristotle, or Kant [13]), while other approaches see analogies as a variant of induction in formal logic [21], or map the structures of relationships and properties of entity pairs (structure-mapping theory [8]). However, those analogy definitions are rather complex and hard to grasp computationally, and thus most recent works on computational analogy processing rely on the simple 4-term analogy model, which is given by two word pairs (the so-called analogons), one pair being the source and one pair being the target. A 4-term analogy holds true if there is a high degree of relational similarity between those two pairs. This is denoted by [a1, a2] :: [b1, b2], where the relationship between a1 and a2 is similar to the relation between b1 and b2, as for example in [Star Wars, Sci-Fi] :: [Forrest Gump, Comedy] (both are defining movies within their genres).
This model has several limitations, as discussed by Lofi in [15]: the semantics of "a high degree of relational similarity" are unclear from an ontological point of view, and the model ignores human perception and abstraction (e.g., analogies are usually not right or wrong, but better or worse, based on how well humans understand them).

Therefore, in this paper we promote an improved interpretation of the 4-term analogy model [15], and assume that there can be multiple relationships between the concepts of an analogon, some of them relevant for the semantics of an analogy (the defining relationships), and some of them not. An analogy holds true if the sets of defining relationships of both analogons show a high degree of relational similarity. To illustrate the difference and the importance of this change in semantics, consider the analogy statement [Stanley Kubrick, 2001: A Space Odyssey] :: [Taxi Driver, Martin Scorsese]. Kubrick is the director of 2001: A Space Odyssey, and Scorsese is the director of Taxi Driver. Both analogons contain the same "movie is directed by person" relationship, and this could be considered a valid analogy with respect to the simple 4-term analogy model. Still, this is a poor analogy statement from a human communication point of view, because 2001: A Space Odyssey is not like Taxi Driver at all. Therefore, this statement describes neither the essence of 2001: A Space Odyssey nor the essence of Taxi Driver particularly well: both movies are iconic for their respective directors, but the statement does not, for example, capture that one movie is in the science fiction genre while the other could be classified as a (violent) crime movie. Understanding which relationships actually define the essence of an analogon from the viewpoint of human perception is a very challenging problem, but this understanding is crucial for judging the usefulness and value of an analogy statement. Furthermore, the degree to which relationships define an analogon may vary between contexts.

In short, there can be better or worse analogies based on two core factors (we will later encode the combined overall quality in an analogy rating): the degree to which the relationships shared by both analogons are defining them (i.e., are the relationships shared between both analogons indeed the defining relationships which describe the intended semantic essence?), and the relational similarity of the shared relationships. Based on these observations, we define two basic types of analogy queries (loosely adopted from [16]) which can be used for analogy-enabled recommender systems:

(1) Analogy completion: ? : [a1, a2] :: [b1, ?]
This query can be used to find the missing concept in a 4-term analogy. It is therefore the most useful query type in future analogy-enabled information systems [15]. Solving this query requires identifying the set of defining relationships between a1 and a2, and then finding a b2 such that the set of defining relationships between b1 and b2 is similar.

(2) Analogon ranking, multiple-choice: ? : [a1, a2] :: ? {[b1, b2], [c1, c2], ...}
A simpler version of the general analogon ranking query are multiple-choice ranking queries as they are, for example, used in the SAT benchmark dataset (discussed below). Here, the set of potential result analogons is restricted, and an algorithm simply needs to rank the provided choices (e.g., [b1, b2]; [c1, c2]) instead of freely discovering the missing analogon.

Previous approaches to processing analogies algorithmically cover prototype systems operating on Linked Open Data (LOD), as for example [5], but also approaches which mine analogies from natural text [18]. A very popular recent trend in Natural Language Processing (NLP) is training neural word embeddings [20]. The popular word2vec implementation [1] learns relational similarity between word pairs from large natural language corpora by exploiting the distributional hypothesis [9] using neural networks. Word embeddings are particularly interesting for use in analogy-enabled recommender systems as they can be easily trained on big text corpora (like, for example, user reviews), and do not require structured ontologies and taxonomies like other approaches (e.g., [22]). In this paper, we evaluate the analogy reasoning performance of word embeddings using our new benchmark dataset, which represents analogy semantics more naturally.
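To make the two query types concrete, the following is a minimal sketch of how both can be served by a pre-trained word embedding using the gensim word2vec API. The model path and the example tokens are illustrative placeholders, and the vector-offset heuristic shown is the standard one popularized for the Mikolov benchmark [19], not necessarily the exact configuration used in our experiments.

```python
# Minimal sketch: answering both analogy query types with word embeddings.
# Model path and example tokens are placeholders; tokens must exist in the
# embedding vocabulary for the lookups below to succeed.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wiki_vectors.bin", binary=True)

# (1) Analogy completion  ? : [a1, a2] :: [b1, ?]
# Vector-offset heuristic: find the b2 maximizing cos(b2, a2 - a1 + b1).
def complete(a1, a2, b1, topn=5):
    return wv.most_similar(positive=[a2, b1], negative=[a1], topn=topn)

# (2) Analogon ranking  ? : [a1, a2] :: ? {[b1, b2], [c1, c2], ...}
# Rank candidate pairs by the cosine similarity between their offset
# vector and the offset vector of the source analogon.
def rank(source, candidates):
    def offset(pair):
        v = wv[pair[1]] - wv[pair[0]]
        return v / np.linalg.norm(v)
    s = offset(source)
    return sorted(candidates, key=lambda c: -float(np.dot(s, offset(c))))

print(complete("kubrick", "odyssey", "nolan"))
print(rank(("kubrick", "odyssey"), [("nolan", "memento"), ("nolan", "dunkirk")]))
```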
3 BENCHMARK DATASETS
For benchmarking the effectiveness of analogy algorithms, there are several established gold-standard datasets for general-knowledge analogies (i.e., not tailored to a specific domain). However, all of them are lacking in some respect, as discussed in the following sections.

3.1 Mikolov Benchmark Dataset & Wordrep
The Mikolov benchmark set [19] is one of the most popular benchmark sets for testing the analogy reasoning capabilities of neural word embeddings, covering 19,558 4-term analogies with 14 distinct relationship types and focusing exclusively on analogy completion queries [a, b] :: [c, ?]. Nine of these relationships focus on grammatical properties (e.g., the relationship "is plural of a noun"), while five relationships are of a semantic nature (e.g., "is capital city of a country", like [Athens, Greece] :: [Oslo, Norway]). The benchmark set is generated by collecting pairs of entities which are members of the selected relationship type from Wikipedia and DBpedia, and then combining all these pairs into 4-term analogy tuples. The Wordrep dataset [7] extends the Mikolov set by adding more challenges, and expands it to 25 different relationship types.

A core weakness of this type of test set is that it does not include any human judgment with respect to the defining relationship types and the usefulness of an analogy for querying or explaining, but instead focuses only on "correct" relationships which usually do not carry any of the subtleties of rhetorical analogies, e.g., "is city in" or "is plural of". In short, the Mikolov test set does not benchmark whether algorithms can capture analogy semantics from a human perspective, but instead focuses purely on relational similarity for a very limited selection of relationship types. Thus, we feel that this benchmark dataset does not represent the challenge of real-world analogy queries well.
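As an aside, running this completion benchmark is straightforward in practice, since gensim bundles an evaluator for exactly this [a, b] :: [c, ?] file format. A sketch follows; the model path is a placeholder, questions-words.txt is the challenge file shipped with the original word2vec release, and evaluate_word_analogies requires a reasonably recent gensim version.

```python
# Sketch: evaluating an embedding on the Mikolov-style completion benchmark
# with gensim's built-in evaluator. Paths are placeholders.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wiki_vectors.bin", binary=True)
score, sections = wv.evaluate_word_analogies("questions-words.txt")
print(f"overall accuracy: {score:.1%}")
for sec in sections:
    # per-relationship-type breakdown (grammatical vs. semantic sections)
    total = len(sec["correct"]) + len(sec["incorrect"])
    if total:
        print(sec["section"], len(sec["correct"]) / total)
```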
3.2 SAT Analogy Challenges
The SAT analogy challenge [14] plays an important role in real-world college admission tests in the United States, assessing a prospective student's vocabulary depth and general analytical skills by focusing on analogon ranking queries. As the challenge's original intent is to assess the vocabulary and reasoning skills of prospective students, it contains many rare words. In order to be able to evaluate the test without dispute, there is only a single correct answer, while all other answers are definitely wrong (and cannot be argued for either). The SAT dataset contains 374 challenges like this one: "legend is to map as: a) subtitle to translation, b) bar to graph, c) figure to blueprint, d) key to chart, e) footnote to information." Here, the correct answer is d), as a key helps to interpret the symbols in a chart just as the legend does for the symbols of a map.

While it is easy to see that this answer is correct when the solution is provided, solving these challenges is a difficult task for aspiring high school students: the correctness rate for the analogy section of SAT tests is usually reported to be around 57% [23]. An interesting aspect of the SAT analogy test set is that a large variety of different relationship types is covered, and that the contained challenges vary considerably in difficulty from a human perspective.

Some analogy challenges are harder than others for the average person, usually depending on the rareness of the used vocabulary and the complexity of the reasoning process required to grasp the intended semantics. Understanding the difficulty of challenges is thus important when judging the effectiveness of an algorithmic solution, as this provides us with a sense of "human-level performance". However, while the SAT dataset can be obtained easily for research purposes, there is no publicly available assessment of the difficulty of the different challenges.

Lofi et al. examined the performance of crowd workers recruited from Amazon Mechanical Turk when faced with SAT challenges [16]. They recruited 99 workers from English-speaking countries, and each challenge was solved by 8 crowd workers (on average, each worker solved 29 challenges). It turned out that workers are rarely in agreement: most challenges (> 67%) received 3 or 4 different answers from just 8 crowd workers. Only 3.6% of the challenges were easy enough to garner unequivocal agreement from all 8 crowd workers, while most challenges only had an agreement of 4 to 5 workers.

In this paper, we use those results to classify each SAT challenge by its crowd difficulty, and make this data publicly available. This allows us to discuss the performance of analogy algorithms in comparison to human performance (see section 4). To this end, we classified each challenge into one of four difficulty groups: easy for challenges which could be handled by most crowd workers (7-8 correct votes), medium (5-6 correct votes), difficult (3-4 correct votes), and advanced (0-2 correct votes). Our SAT difficulty ratings can be downloaded from our companion page [3], and the resulting difficulty distribution is shown in Figure 1.

[Figure 1: Difficulty of challenges for the SAT and AGS-2 datasets]
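The four-bucket classification itself is easy to reproduce from the vote counts; a small sketch, assuming 8 votes per challenge as in the crowd study above:

```python
# Sketch: mapping the number of correct crowd votes (out of 8) for a SAT
# challenge to the four difficulty groups described above.
def sat_difficulty(correct_votes: int) -> str:
    if correct_votes >= 7:
        return "easy"       # 7-8 correct votes
    if correct_votes >= 5:
        return "medium"     # 5-6 correct votes
    if correct_votes >= 3:
        return "difficult"  # 3-4 correct votes
    return "advanced"       # 0-2 correct votes

assert sat_difficulty(8) == "easy" and sat_difficulty(2) == "advanced"
```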
3.3 Analogy Gold Standard AGS-1
Lofi et al. introduced a benchmark dataset which aims at rectifying the shortcomings of the aforementioned sets, the Analogy Gold Standard (AGS) dataset [17]. In this dataset, each challenge can be used to benchmark both completion queries and ranking queries. AGS was systematically built using different seed datasets like the SAT dataset or the WordSim-353 dataset [6]. One of the core contributions of AGS was to include human judgments of how "good" an analogy is from a subjective point of view (as compared to previous test sets, where analogy statements are either correct or wrong). The quality of an analogy is influenced by the relational similarity of the analogons, combined with how well the shared relationships represent the essence of the analogy (see section 2 for a discussion). As previous experiments showed that human workers had problems quantifying those aspects individually, a new analogy rating was introduced for AGS which implicitly encodes both relational similarity and representativeness, i.e., the analogy rating represents how "useful" and "good" the analogy is perceived to be by humans, on a scale of 1 to 5. The judgments of at least 8 humans recruited via crowd-sourcing platforms were used for the AGS analogy rating of each analogon pair in a challenge.

Table 1: Example AGS challenge

Source Analogon   Target Analogon        Rating
sushi : Japan     scallops : Italy       2.57
                  currywurst : Germany   4.00
                  tacos : Mexico         4.67
                  curry : India          4.00
                  sombrero : Mexico      2.00
                  hamburger : ship       1.33

An example AGS challenge is given in Table 1: in this example, while sombreros are typically considered to come from Mexico, and in the source analogon sushi typically comes from Japan (i.e., there is a similar relationship between both analogons), the resulting analogy [sushi, Japan] :: [sombrero, Mexico] still has a low analogy rating, because the defining relationship in the source analogon [sushi, Japan] is usually understood by people as "stereotypical food from a country" - and thus the analogy is deemed not useful by most humans.

3.4 Improved Analogy Gold Standard AGS-2
In this paper we introduce the AGS-2 dataset, which significantly improves on AGS-1 in several respects. The AGS-2 dataset can be obtained from our companion page [3], thus providing tangible benefits to ongoing analogy-enabled information system research. Notably, the core improvements of AGS-2 are as follows:
• Added difficulty rating for each challenge based on the crowd feedback of 5 workers. The scale is similar to our difficulty rating for SAT challenges introduced in section 3.2: advanced, hard, medium, easy. The resulting difficulty distribution of AGS-2 is also shown in Figure 1.
• Extended size and scope: The initial seed analogons for creating AGS-1 were extracted from the Simlex [10] and Wordsim [6] datasets, resulting in 93 challenges overall. For AGS-2, this was extended using suggestions of potentially interesting analogy pairs by a subset of trusted crowd workers. We manually selected a subset of these suggestions, resulting in 168 challenges overall.
• Improved balance of analogy ratings: In AGS-1, challenges could be imbalanced with respect to the analogy ratings (i.e., a given challenge could contain many analogons with high analogy ratings, but only few with low ratings). For AGS-2 we ensure that each challenge covers 1-2 analogons each for high, medium, and low analogy ratings. Note that an analogy with high difficulty and a high analogy rating might still not be understandable by many people (due to using rarely known concepts or requiring complex reasoning), but will still be accepted as a good analogy by the same people after the semantics are explained to them.
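For illustration, an AGS-style challenge (values from Table 1) could be represented as follows. The field names and the aggregation of individual judgments into a mean rating are our own assumptions about the schema, not the released file format.

```python
# Illustrative sketch of one AGS challenge (ratings from Table 1).
# Field names are hypothetical; the released dataset may differ.
from statistics import mean

challenge = {
    "source": ("sushi", "Japan"),
    "targets": [                        # (analogon, analogy rating on 1-5)
        (("scallops", "Italy"), 2.57),
        (("currywurst", "Germany"), 4.00),
        (("tacos", "Mexico"), 4.67),
        (("curry", "India"), 4.00),
        (("sombrero", "Mexico"), 2.00),
        (("hamburger", "ship"), 1.33),
    ],
}

# Each rating is the mean of at least 8 individual 1-5 crowd judgments, e.g.:
print(mean([5, 5, 4, 5, 4, 5, 5, 4]))  # 4.625
```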
4 EVALUATION
In this section, we give a preview of how word embeddings perform on analogy challenges of varying difficulty, as previous work only relied on the aforementioned Mikolov dataset, where word embeddings showed a comparably good accuracy of 53.3%. The embedding we evaluated is gensim's skip-gram word2vec [2], trained on a Wikipedia dump. We use the SAT and AGS-2 datasets, and split the results by the new difficulty measurements introduced in section 3.2.

As the SAT dataset only supports analogy ranking queries, for brevity we only report ranking query results in the following. We followed the test protocol outlined in [17]: we rank each analogon of a challenge by its relational similarity score with respect to the source as computed by word2vec, and consider a challenge successfully solved if the top-ranked analogon is the correct one (SAT) or has an analogy score higher than a chosen threshold (4.0 in our case for AGS-2).
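A sketch of this protocol under our assumptions: rank_fn stands for any function ordering candidate analogons by relational similarity to the source (e.g., the offset-based rank() sketched in section 2), and the helper names and data layout are hypothetical.

```python
# Sketch of the ranking test protocol from [17].
def solved_sat(rank_fn, source, choices, gold):
    # SAT: solved if the top-ranked choice is the single gold answer.
    return rank_fn(source, choices)[0] == gold

def solved_ags2(rank_fn, source, rated_targets, threshold=4.0):
    # AGS-2: solved if the top-ranked analogon carries a human analogy
    # rating above the chosen threshold (4.0 in our experiments).
    ratings = {pair: rating for pair, rating in rated_targets}
    best = rank_fn(source, list(ratings))[0]
    return ratings[best] > threshold

# Accuracy = fraction of solved challenges, reported per difficulty group.
```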
The results are summarized in Figure 2, and are rather disappointing: for SAT, the average accuracy is 23.4% (random guessing achieves 20%, as each challenge has 5 answer options; average human-level performance is 57% [23], while one of the current best performing analogy systems, based on both distributional semantics and structural reasoning using ontologies like DBpedia or WordNet, performs at 56.1% [22]). For AGS-2, the word2vec accuracy is 25% (random guessing achieves 16.7%). Interestingly, the performance of word2vec is consistent across difficulty levels: it performs worse than humans for easy challenges, but comparably well for advanced challenges. From this preliminary result we conclude that the analogy reasoning capabilities of neural embeddings, despite advantages like ease of use and easy training relying only on text collections, fall short of what current anecdotal and empirical evidence suggests. However, specialized analogy reasoning algorithms have been shown to achieve up to 56.1% on SAT [4], and this has mostly been realized by also incorporating ontological knowledge (as suggested by [22]). Unfortunately, this could be challenging for some domain-specific recommender systems where such ontologies are not easily available, thus motivating future research to overcome this issue.

[Figure 2: Accuracy of word2vec solving SAT and AGS-2 challenges]

5 SUMMARY AND OUTLOOK
In this paper, we introduced the challenge of analogy semantics for recommender systems. Analogies are a natural communication paradigm for efficiently stating an information need or explaining a complex concept. This relies on exploiting the background knowledge and reasoning capabilities of both communication partners, a task challenging even for human judges. We introduced the AGS-2 benchmark dataset, which overcomes many shortcomings of previous datasets: it is usable for both analogy ranking and analogy completion queries, and it is based on human perception for rating the quality and difficulty of a benchmark analogy. We evaluated the performance of neural word embeddings (using word2vec as a representative) on previous datasets and on our new benchmark (AGS-2). While the method performed worse than human-level performance for simple queries, it was comparable for more complex queries featuring rare vocabulary and requiring extensive common knowledge.

In our next steps, we plan to investigate the performance of hybrid methods, using both embeddings and structural reasoning, to enable analogy queries for recommender systems. This particularly involves adapting analogy semantics to specific domains like music, books, or movies - while current analogy systems and benchmark datasets (including AGS-2) focus on common-knowledge analogies (which are slightly easier due to the ready availability of both large text corpora and ontologies).

In this context, we also plan to evaluate the performance of analogy-based explanations for supporting people in making decisions about recommended items. This is a particularly interesting challenge, as approaches based only on distributional semantics do implicitly encode analogy semantics, but have no explicit knowledge of the types of relationships which would be required to explain the semantics to a human user, thus again underlining the need for explicit information on the relationships used in an analogy.

REFERENCES
[1] 2013. Word2Vec. https://code.google.com/archive/p/word2vec/. Accessed: 2017-06-01.
[2] 2014. Gensim Word2Vec. https://radimrehurek.com/gensim/models/word2vec.html. Accessed: 2017-06-01.
[3] 2017. Analogy Semantics Companion Page. https://github.com/WISDelft/analogysemantics. Accessed: 2017-08-01.
[4] 2017. SAT Analogy Questions: State of the Art. https://www.aclweb.org/aclwiki/index.php?title=SAT_Analogy_Questions_(State_of_the_art). Accessed: 2017-06-01.
[5] Danushka Bollegala, Tomokazu Goto, Nguyen Tuan Duc, and Mitsuru Ishizuka. 2012. Improving Relational Similarity Measurement using Symmetries in Proportional Word Analogies. Information Processing & Management 49, 1 (2012), 355–369.
[6] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: the concept revisited. In Int. Conf. on World Wide Web (WWW). Hong Kong, China.
[7] Bin Gao, Jiang Bian, and Tie-Yan Liu. 2014. WordRep: A Benchmark for Research on Learning Word Representations. In ICML Workshop on Knowledge-Powered Deep Learning for Text Mining. Beijing, China.
[8] D. Gentner. 1983. Structure-mapping: A theoretical framework for analogy. Cognitive Science 7 (1983), 155–170.
[9] Z. Harris. 1954. Distributional Structure. Word 10 (1954), 146–162.
[10] F. Hill, R. Reichart, and A. Korhonen. 2014. SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Preprint published on arXiv, arXiv:1408.3456 (2014).
[11] Douglas R. Hofstadter. 2001. Analogy as the Core of Cognition. In The Analogical Mind. 499–538.
[12] Esa Itkonen. 2005. Analogy as structure and process: Approaches in linguistics, cognitive psychology and philosophy of science. John Benjamins Pub Co.
[13] Immanuel Kant. 1790. Critique of Judgement.
[14] Michael Littman and Peter Turney. 2016. SAT Analogy Challenge Dataset. https://www.aclweb.org/aclwiki/index.php?title=Analogy_(State_of_the_art).
[15] Christoph Lofi. 2013. Analogy Queries in Information Systems: A New Challenge. Journal of Information & Knowledge Management (JIKM) 12, 3 (2013).
[16] Christoph Lofi. 2013. Just ask a human? – Controlling Quality in Relational Similarity and Analogy Processing using the Crowd. In CDIM Workshop at Database Systems for Business, Technology and Web (BTW). Magdeburg, Germany.
[17] Christoph Lofi, Athiq Ahamed, Pratima Kulkarni, and Ravi Thakkar. 2016. Benchmarking Semantic Capabilities of Analogy Querying Algorithms. In Int. Conf. on Database Systems for Advanced Applications (DASFAA). Dallas, TX, USA.
[18] C. Lofi, C. Nieke, and N. Collier. 2014. Discriminating Rhetorical Analogies in Social Media. In Conf. of the Europ. Chapter of the Association for Computational Linguistics (EACL). Gothenburg, Sweden.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. International Conference on Learning Representations (ICLR) (2013).
[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26 (2013), 3111–3119.
[21] Cameron Shelley. 2003. Multiple Analogies in Science and Philosophy. John Benjamins Pub.
[22] Robert Speer, Joshua Chin, and Catherine Havasi. 2016. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. AAAI Conference on Artificial Intelligence (2016).
[23] P. Turney and M. Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning 60 (2005), 251–278.