=Paper=
{{Paper
|id=Vol-1892/paper2
|storemode=property
|title=Towards Analogy-based Recommendation: Benchmarking of Perceived Analogy Semantics
|pdfUrl=https://ceur-ws.org/Vol-1892/paper2.pdf
|volume=Vol-1892
|authors=Christoph Lofi,Nava Tintarev
|dblpUrl=https://dblp.org/rec/conf/recsys/LofiT17
}}
==Towards Analogy-based Recommendation: Benchmarking of Perceived Analogy Semantics==
Towards Analogy-based Recommendation:
Benchmarking of Perceived Analogy Semantics

Christoph Lofi and Nava Tintarev
Web Information Systems, TU Delft
Mekelweg 4, 2628CD Delft, Netherlands
c.lofi@tudelft.nl, n.tintarev@tudelft.nl
ABSTRACT

Requests for recommendation can be seen as a form of query for candidate items, ranked by relevance. Users are, however, often unable to crisply define what they are looking for. One of the core concepts of natural communication for describing and explaining complex information needs in an intuitive fashion are analogies, e.g., "What is to Christopher Nolan as 2001: A Space Odyssey is to Stanley Kubrick?". Analogies allow users to explore the item space by formulating queries in terms of items rather than explicitly specifying the properties that they find attractive. One of the core challenges hampering research on analogy-enabled queries is that analogy semantics rely on consensus in human perception, which is not well represented in current benchmark datasets. Therefore, in this paper we introduce a new benchmark dataset focusing on the human aspects of analogy semantics. Furthermore, we evaluate a popular technique for analogy semantics (word2vec neural embeddings) using our dataset. The results show that current word embedding approaches are still not suitable to sufficiently deal with deeper analogy semantics. We discuss future directions, including hybrid algorithms which also incorporate structural or crowd-based approaches, and the potential for analogy-based explanations.

KEYWORDS

Analogy-Enabled Recommendation, Relational Similarity, Analogy Benchmarking

1 INTRODUCTION

In this paper we explore one method used in natural human interaction for efficiently communicating information needs or explaining the results of an inquiry, namely analogies. Using analogies in natural speech allows communicating dense information easily and naturally by implying that the "essence" of two concepts is similar, or at least perceived similarly. Thus, analogies can be used to map factual and behavioural properties from one concept (usually the better known one, the source) to another (usually less well known, the target). This is particularly effective for natural querying and explaining when only vague domain knowledge is available.

In this paper we consider two types of analogy queries: 4-term analogy completion queries (like "What is to Christopher Nolan as 2001: A Space Odyssey is to Stanley Kubrick?"), and 4-term analogon ranking (like "What is similar to '2001: A Space Odyssey to Stanley Kubrick'? a) The Fifth Element to Christopher Nolan, b) Memento to Christopher Nolan, c) Dunkirk to Christopher Nolan"). Analogy completion queries might be seen as an extension of classical critiquing in recommender systems, which can be formulated in terms of "like x, but with properties y modified". In critiquing, the feature (price) and the modification (cheaper) need to be explicit, whereas in an analogy the semantics are given implicitly by setting terms in relation, which is interpreted based on both communication partners' conceptualization of that domain (e.g., "The Fifth Element is like 2001: A Space Odyssey but by Scorsese" carries a lot of implicit information).

Analogies are typically rhetorical figures of speech which need to be interpreted and reasoned on using the receiver's background knowledge - which is difficult for information systems to mimic. This complex semantic task is further complicated by the lack of useful benchmark datasets. Most current benchmarks are restricted in scope, usually focusing on syntactic features instead of relevant properties of analogies, such as their suitability for transferring information (which is central when one wants to use analogies for querying, or for explaining recommendations).

Therefore, one of our core contributions is an improved benchmark dataset for analogy semantics which focuses specifically on the usefulness of an analogy with respect to querying and explaining, provided as a tool for guiding future research into analogy queries in recommender systems. In this work, we focus on general-domain analogies. This allows us to choose the right technologies for future adaptation in a domain-specific recommender system. In detail, our contributions are as follows:

• Discuss different properties of analogy semantics, and highlight their importance for querying and explaining recommendations.
• Introduce a new benchmark dataset systematically built on top of existing sets, rectifying many current limitations. Especially, we focus on the perceived difficulty of analogies, and the quality and usefulness of analogies.
• Showcase and discuss the performance of an established word embedding-based algorithm on our test set.

2 DISCUSSIONS ON ANALOGY SEMANTICS

The semantics of analogies have been researched in depth in the fields of philosophy, linguistics, and the cognitive sciences, such as [11], [12], or [21]. There have been several models for analogical reasoning from philosophy (like works by Plato, Aristotle, or Kant [13]), while other approaches see analogies as a variant of induction in formal logic [21], or as mapping the structures of relationships and properties of entity pairs (structure mapping theory [8]). However,
those analogy definitions are rather complex and hard to grasp computationally, and thus most recent works on computational analogy processing rely on the simple 4-term analogy model, which is given by two word pairs (the so-called analogons), one pair being the source and one pair being the target. A 4-term analogy holds true if there is a high degree of relational similarity between those two pairs. This is denoted by [a1, a2] :: [b1, b2], where the relationship between a1 and a2 is similar to the relation between b1 and b2, as for example in [Star Wars, Sci-Fi] :: [Forrest Gump, Comedy] (both are defining movies within their genres). This model has several limitations, as discussed by Lofi in [15]: the semantics of "a high degree of relational similarity" are unclear from an ontological point of view, and the model ignores human perception and abstraction (e.g., analogies are usually not right or wrong, but better or worse based on how well humans understand them).

Therefore, in this paper we promote an improved interpretation of the 4-term analogy model [15], and assume that there can be multiple relationships between the concepts of an analogon, some of them being relevant for the semantics of an analogy (the defining relationships), and some of them not. An analogy holds true if the sets of defining relationships of both analogons show a high degree of relational similarity. To illustrate the difference and importance of this change in semantics, consider the analogy statement [Stanley Kubrick, 2001: A Space Odyssey] :: [Martin Scorsese, Taxi Driver]. Kubrick is the director of 2001: A Space Odyssey, and Scorsese is the director of Taxi Driver. Both analogons contain the same "person is the director of movie" relationship, and this could be considered a valid analogy with respect to the simple 4-term analogy model. Still, this is a poor analogy statement from a human communication point of view because 2001: A Space Odyssey is not like Taxi Driver at all. Therefore, this statement describes neither the essence of 2001: A Space Odyssey nor the essence of Taxi Driver particularly well: both movies are iconic for their respective directors, but the statement does not, for example, capture that one movie is in the science fiction genre while the other could be classified as a (violent) crime movie. Understanding which relationships actually define the essence of an analogon from the viewpoint of human perception is a very challenging problem, but this understanding is crucial for judging the usefulness and value of an analogy statement. Furthermore, the degree to which relationships define an analogon may vary with context. In short, there can be better or worse analogies based on two core factors (we will later encode the combined overall quality in an analogy rating): the degree to which the relationships shared by both analogons are defining them (i.e., whether the relationships shared between both analogons are indeed the defining relationships which describe the intended semantic essence), and the relational similarity of the shared relationships. Based on these observations, we define two basic types of analogy queries (loosely adopted from [16]) which can be used for analogy-enabled recommender systems:

(1) Analogy completion: ? : [a1, a2] :: [b1, ?]
This query can be used to find the missing concept in a 4-term analogy, and is therefore the most useful query type in a future analogy-enabled information system [15]. Solving this query requires identifying the set of defining relationships between a1 and a2, and then finding a b2 such that the set of defining relationships between b1 and b2 is similar.

(2) Analogon ranking, multiple-choice: ? : [a1, a2] :: ? {[b1, b2], [c1, c2], ...}
A simpler version of the general analogon ranking query are multiple-choice ranking queries, as used for example in the SAT benchmark dataset (discussed below). Here, the set of potential result analogons is restricted, and an algorithm simply needs to rank the provided choices (e.g., [b1, b2]; [c1, c2]) instead of freely discovering the missing analogon (a minimal code sketch of both query types follows below).
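To make these query types concrete, the following is a minimal sketch of how both could be served by a pretrained word embedding, using the Gensim word2vec API [2]. The model file name, the vector-offset trick for completion, and the cosine-of-offsets scoring for ranking are our illustrative assumptions; the paper itself does not prescribe a particular implementation.

```python
# Minimal sketch (not the paper's implementation): answering both analogy
# query types with a pretrained embedding. The file name is a placeholder.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def complete_analogy(a1, a2, b1, topn=5):
    """Analogy completion ? : [a1, a2] :: [b1, ?] via the vector offset
    b1 + (a2 - a1), the standard word2vec analogy trick."""
    return model.most_similar(positive=[b1, a2], negative=[a1], topn=topn)

def rank_analogons(source, choices):
    """Multiple-choice analogon ranking: score each candidate pair by the
    cosine similarity between its offset vector and the source pair's
    offset vector (one plausible relational similarity score)."""
    a1, a2 = source
    rel = model[a2] - model[a1]
    def score(pair):
        b1, b2 = pair
        diff = model[b2] - model[b1]
        return float(rel @ diff) / float((rel @ rel) ** 0.5 * (diff @ diff) ** 0.5)
    return sorted(choices, key=score, reverse=True)

# Hypothetical usage (vocabulary coverage depends on the training corpus):
print(complete_analogy("man", "king", "woman"))  # ideally ranks "queen" highly
print(rank_analogons(("sushi", "japan"), [("tacos", "mexico"), ("hamburger", "ship")]))
```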
Previous approaches to processing analogies algorithmically cover prototype systems operating on Linked Open Data (LOD), for example [5], but also approaches which mine analogies from natural text [18]. A very popular recent trend in Natural Language Processing (NLP) is training neural word embeddings [20]. The popular word2vec implementation [1] learns relational similarity between word pairs from large natural language corpora by exploiting the distributional hypothesis [9] using neural networks. Word embeddings are particularly interesting for use in analogy-enabled recommender systems as they can be easily trained on big text corpora (like, for example, user reviews), and do not require structured ontologies and taxonomies like other approaches do (e.g., [22]). In this paper, we evaluate the analogy reasoning performance of word embeddings using our new benchmark dataset, which represents analogy semantics more naturally.

3 BENCHMARK DATASETS

For benchmarking the effectiveness of analogy algorithms, there are several established gold standard datasets for general-knowledge analogies (i.e., not tailored to a specific domain). However, all of them are lacking in some respect, as discussed in the following sections.

3.1 Mikolov Benchmark Dataset & Wordrep

The Mikolov benchmark set [19] is one of the most popular benchmark sets for testing the analogy reasoning capabilities of neural word embeddings, covering 19,558 4-term analogies with 14 distinct relationship types and focusing exclusively on analogy completion queries [a, b] :: [c, ?]. Nine of these relationship types focus on grammatical properties (e.g., the relationship "is plural for a noun"), while five are of a semantic nature (e.g., "is capital city of a country", like [Athens, Greece] :: [Oslo, Norway]). The benchmark set is generated by collecting pairs of entities which are members of the selected relationship type from Wikipedia and DBpedia, and then combining all these pairs into 4-term analogy tuples. The Wordrep dataset [7] extends the Mikolov set by adding more challenges, and expands it to 25 different relationship types.

A core weakness of this type of test set is that it does not include any human judgment with respect to the defining relationship types and the usefulness of an analogy for querying or explaining, but instead focuses only on "correct" relationships which usually do not carry any of the subtleties of rhetorical analogies, e.g., "is city in" or "is plural of". In short, the Mikolov test set does not benchmark whether algorithms can capture analogy semantics from a human perspective, but instead focuses purely on relational similarity for a very limited selection of relationship types. Thus, we feel that this benchmark dataset does not represent the challenge of real-world analogy queries well.
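As a side note, this benchmark format is directly supported by common embedding toolkits. The following is a minimal sketch of scoring an embedding against a Mikolov-style challenge file with Gensim's built-in evaluator; both file names are placeholders (the challenge file being the questions-words.txt distributed with word2vec [1]).

```python
# Minimal sketch: scoring an embedding on Mikolov-style completion queries.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
accuracy, sections = model.evaluate_word_analogies("questions-words.txt")
print(f"overall accuracy: {accuracy:.3f}")
for s in sections:  # per-relationship-type breakdown, e.g. "capital-common-countries"
    print(s["section"], len(s["correct"]), "correct,", len(s["incorrect"]), "incorrect")
```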
Table 1: Example AGS challenge (source analogon: sushi : Japan)

Target Analogon        Rating
scallops : Italy       2.57
currywurst : Germany   4.00
tacos : Mexico         4.67
curry : India          4.00
sombrero : Mexico      2.00
hamburger : ship       1.33
[Figure 1: Difficulty of challenges for SAT and AGS-2 dataset]

3.2 SAT Analogy Challenges

The SAT analogy challenge [14] plays an important role in real-world college admission tests in the United States, assessing a prospective student's vocabulary depth and general analytical skills by focusing on analogon ranking queries. As the challenge's original intent is to assess the vocabulary and reasoning skills of prospective students, it contains many rare words. In order to be able to evaluate the test without dispute, there is only a single correct answer, while all other answers are definitely wrong (and cannot be argued for). The SAT dataset contains 374 challenges like this one: "legend is to map as is: a) subtitle to translation, b) bar to graph, c) figure to blue-print, d) key to chart, e) footnote to information." Here, the correct answer is d), as a key helps to interpret the symbols in a chart just as the legend does with the symbols of a map.

While it is easy to see that this answer is correct when the solution is provided, solving these challenges is a difficult task for aspiring high school students: the correctness rate of the analogy section of SAT tests is usually reported to be around 57% [23]. An interesting aspect of the SAT analogy test set is that a large variety of different relationship types is covered, and that the contained challenges have a high degree of variance in difficulty from a human's perspective.

Some analogy challenges are harder than others for the average person, usually based on the rareness of the used vocabulary and the complexity of the reasoning process required to grasp the intended semantics. Understanding the difficulty of challenges is thus important when judging the effectiveness of an algorithmic solution, as this provides us with a sense for "human-level performance".

However, while the SAT dataset can be obtained easily for research purposes, there is no publicly available assessment of the difficulty of different challenges. Lofi et al. examined the performance of crowd workers recruited from Amazon Mechanical Turk when faced with SAT challenges [16]. They recruited 99 workers from English-speaking countries, and each challenge was solved by 8 crowd workers (on average, each worker solved 29 challenges). It turned out that workers are rarely in agreement, and that most challenges (> 67%) received 3 or 4 different answers from just 8 crowd workers. Only 3.6% of challenges are easy enough to garner unequivocal agreement from all 8 crowd workers, while most challenges only had an agreement of 4 to 5 workers.

In this paper, we use those results to classify each SAT challenge by its crowd difficulty, and make this data publicly available. This allows us to discuss the performance of analogy algorithms in comparison to human performance (see section 4). To this end, we classified each challenge into one of four difficulty groups: easy for challenges which could be handled by most crowd workers (7-8 correct votes), medium (5-6 correct votes), difficult (3-4 correct votes), and advanced (0-2 correct votes). Our SAT difficulty ratings can be downloaded from our companion page [3], and the resulting difficulty distribution is shown in Figure 1.
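Since this grouping is a simple thresholding over the 8 crowd votes per challenge, it can be captured in a few lines. The following sketch (the function name is ours) encodes exactly the bins above:

```python
# Vote-based difficulty binning: each SAT challenge was answered by 8 crowd
# workers, and the number of correct votes determines its difficulty group.

def difficulty_group(correct_votes: int) -> str:
    """Map the number of correct crowd votes (out of 8) to a difficulty bin."""
    if correct_votes >= 7:
        return "easy"        # 7-8 correct votes
    if correct_votes >= 5:
        return "medium"      # 5-6 correct votes
    if correct_votes >= 3:
        return "difficult"   # 3-4 correct votes
    return "advanced"        # 0-2 correct votes

assert difficulty_group(8) == "easy"
assert difficulty_group(4) == "difficult"
```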
3.3 Analogy Gold Standard AGS-1

Lofi et al. introduced a benchmark dataset which aims at rectifying the shortcomings of the aforementioned sets, the Analogy Gold Standard (AGS) dataset [17]. In this dataset, each challenge can be used to benchmark both completion queries and ranking queries. AGS was systematically built using different seed datasets like the SAT dataset or the WordSim-353 dataset [6]. One of the core contributions of AGS was to include human judgments of how "good" an analogy is from a subjective point of view (as compared to previous test sets, where analogy statements are either correct or wrong). The quality of an analogy is influenced by the relational similarity of the analogons, combined with how well the shared relationships represent the essence of the analogy (see section 2 for a discussion). As previous experiments showed that human workers had problems quantifying those aspects individually, a new analogy rating was introduced for AGS which implicitly encodes both relational similarity and representativeness, i.e., the analogy rating represents how "useful" and "good" the analogy is perceived to be by humans on a scale of 1 to 5. The judgments of at least 8 humans recruited via crowd-sourcing platforms were used for the AGS analogy rating of each analogon pair in a challenge.

An example AGS challenge is given in Table 1: in this example, while sombreros are typically considered to come from Mexico, and in the source analogon sushi typically comes from Japan (i.e., there is a similar relationship between both analogons), the resulting analogy [sushi, Japan] :: [sombrero, Mexico] still has a low analogy rating, because the defining relationship in the source analogon [sushi, Japan] is usually understood by people as "stereotypical food from a country" - and thus the analogy is deemed not useful by most humans.
3.4 Improved Analogy Gold Standard AGS-2

In this paper we introduce the AGS-2 dataset, which significantly improves on AGS-1 in several respects. The AGS-2 dataset can be obtained at our companion page [3], thus providing tangible benefits to ongoing analogy-enabled information system research. Notably, the core improvements of AGS-2 are as follows:

• Added difficulty rating for each challenge based on the crowd feedback of 5 workers. The scale is similar to our difficulty rating for SAT challenges introduced in section 3.2: advanced, hard,
medium, easy. The resulting difficulty distribution of AGS-2 is
also shown in Figure 1.
• Extended size and scope: The initial seed analogons for creating
AGS-1 were extracted from the Simlex [10] and Wordsim
datasets [6], resulting in 93 challenges overall. For AGS-2, this
was extended using suggestions of potentially interesting analogy
pairs by a subset of trusted crowd workers. We manually selected
a subset of these suggestions, resulting in 168 challenges overall.
• Improved balance of analogy ratings: In AGS-1, challenges could be imbalanced with respect to the analogy ratings (i.e., a given challenge could contain many analogons with high analogy ratings, but only few with low ratings). For AGS-2 we ensure that each challenge covers 1-2 analogons each for high, medium, and low analogy ratings. Note that an analogy with high difficulty and a high analogy rating might still not be understandable by many people (due to relying on rarely known concepts or complex reasoning), but will still be accepted as a good analogy by the same people after the semantics are explained to them.

4 EVALUATION

In this section, we give a preview of how word embeddings perform on analogy challenges of varying difficulty, as previous work only relied on the aforementioned Mikolov dataset, where they showed a comparably good accuracy of 53.3%. The embedding we evaluated is Gensim's skip-gram word2vec [2], trained on a Wikipedia dump. We use the SAT and AGS-2 datasets, and split the results by the new difficulty measurements introduced in section 3.2.
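As context for reproducing such a setup, the following is a minimal sketch of training a skip-gram word2vec model on a Wikipedia dump with a recent Gensim version (4.x). The dump file name and all hyperparameters are placeholder assumptions; the paper does not report its exact training configuration.

```python
# Minimal sketch (assumed settings): skip-gram word2vec trained on Wikipedia.
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models import Word2Vec

class WikiSentences:
    """Restartable iterable over tokenized Wikipedia articles; Word2Vec needs
    to iterate over the corpus more than once (vocabulary pass + training)."""
    def __init__(self, wiki):
        self.wiki = wiki
    def __iter__(self):
        return iter(self.wiki.get_texts())

wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")  # placeholder dump file

# sg=1 selects the skip-gram architecture (as opposed to CBOW)
model = Word2Vec(sentences=WikiSentences(wiki), sg=1, vector_size=300,
                 window=5, min_count=10, workers=4)
model.wv.save_word2vec_format("embeddings.bin", binary=True)
```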
As the SAT dataset only supports analogy ranking queries, for brevity we only report ranking query results in the following. We followed the test protocol outlined in [17]: we rank each analogon of a challenge by its relational similarity score with respect to the source, as computed by word2vec, and consider a challenge successfully solved if the top-ranked analogon is the correct one (SAT) or has an analogy rating higher than a chosen threshold (4.0 in our case for AGS-2).
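This protocol translates into a short harness. The following is a minimal sketch under our assumptions: rank_analogons is the scoring helper sketched in section 2, the challenge data is taken from Table 1, and the paper's actual evaluation code is not reproduced here.

```python
# Minimal sketch of the ranking evaluation protocol described above, reusing
# the rank_analogons() helper sketched in section 2.

def solved_sat(source, choices, correct):
    """SAT: solved iff the top-ranked choice is the single correct analogon."""
    return rank_analogons(source, choices)[0] == correct

def solved_ags2(source, choices, ratings, threshold=4.0):
    """AGS-2: solved iff the top-ranked analogon's crowd analogy rating
    is higher than the chosen threshold (4.0, as in the paper)."""
    top = rank_analogons(source, choices)[0]
    return ratings[top] > threshold

# Hypothetical usage with part of the AGS-2 challenge from Table 1:
choices = [("tacos", "mexico"), ("sombrero", "mexico"), ("hamburger", "ship")]
ratings = {("tacos", "mexico"): 4.67, ("sombrero", "mexico"): 2.00,
           ("hamburger", "ship"): 1.33}
print(solved_ags2(("sushi", "japan"), choices, ratings))
```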
The results are summarized in Figure 2, and are rather disappointing: for SAT, the average accuracy is 23.4% (random guessing achieves 20% as each challenge has 5 answer options; average human-level performance is 57% [23], while one of the current best performing analogy systems, based on both distributional semantics and structural reasoning using ontologies like DBpedia or WordNet, performs at 56.1% [22]). For AGS-2 the word2vec accuracy is 25% (random guessing achieves 16.7%). Interestingly, the performance of word2vec is consistent with respect to difficulty levels: it performs worse than humans for easy challenges, but comparably well for advanced challenges. From this preliminary result we conclude that the analogy reasoning capabilities of neural embeddings, despite some of their advantages like ease of use and easy training relying just on text collections, are inferior to what current anecdotal and empirical evidence suggests. However, specialized analogy reasoning algorithms have been shown to achieve up to 56.1% on SAT [4], and this has mostly been realized by also incorporating ontological knowledge (as suggested by [22]). Unfortunately, this could be challenging for some domain-specific recommender systems where such ontologies are not easily available, thus promoting future research to overcome this issue.

[Figure 2: Accuracy of w2v solving SAT and AGS-2]

5 SUMMARY AND OUTLOOK

In this paper, we introduced the challenge of analogy semantics for recommender systems. Analogies are a natural communication paradigm to efficiently state an information need or to explain a complex concept. This relies on exploiting the background knowledge and reasoning capabilities of both communication partners, a task challenging even for human judges. We introduced the AGS-2 benchmark dataset, which overcomes many shortcomings of previous datasets, such as being usable for both analogy ranking and analogy completion queries, and being based on human perception for rating the quality and difficulty of a benchmark analogy. We evaluated the performance of neural word embeddings (using word2vec as a representative) on previous datasets and our new benchmark (AGS-2). While the method worked worse than human-level performance for simple queries, it was comparable for more complex queries with rare vocabulary and requiring extensive common knowledge. In our next steps we plan to investigate the performance of hybrid methods, using both embeddings as well as structural reasoning, to enable analogy queries for recommender systems. This particularly involves adapting analogy semantics to specific domains like music, books, or movies - while current analogy systems and benchmark datasets (including AGS-2) focus on common-knowledge analogies (which is slightly easier due to the ready availability of both large text corpora and ontologies).

In this context, we also plan to evaluate the performance of analogy-based explanations for supporting people in making decisions about recommended items. This is a particularly interesting challenge: approaches based only on distributed semantics do implicitly encode analogy semantics, but have no explicit knowledge of the type of relationships which would be required to explain the semantics to a human user, thus again underlining the need for explicit information on the relationships used in an analogy.
REFERENCES

[1] 2013. Word2Vec. https://code.google.com/archive/p/word2vec/. Accessed: 2017-06-01.
[2] 2014. Gensim Word2Vec. https://radimrehurek.com/gensim/models/word2vec.html. Accessed: 2017-06-01.
[3] 2017. Analogy Semantics Companion Page. https://github.com/WISDelft/analogy_semantics. Accessed: 2017-08-01.
[4] 2017. SAT Analogy Questions: State of the Art. https://www.aclweb.org/aclwiki/index.php?title=SAT_Analogy_Questions_(State_of_the_art). Accessed: 2017-06-01.
[5] Danushka Bollegala, Tomokazu Goto, Nguyen Tuan Duc, and Mitsuru Ishizuka. 2012. Improving Relational Similarity Measurement using Symmetries in Proportional Word Analogies. Information Processing & Management 49, 1 (2012), 355–369.
[6] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing Search in Context: The Concept Revisited. In Int. Conf. on World Wide Web (WWW). Hong Kong, China.
[7] Bin Gao, Jiang Bian, and Tie-Yan Liu. 2014. WordRep: A Benchmark for Research on Learning Word Representations. In ICML Workshop on Knowledge-Powered Deep Learning for Text Mining. Beijing, China.
[8] D. Gentner. 1983. Structure-Mapping: A Theoretical Framework for Analogy. Cognitive Science 7 (1983), 155–170.
[9] Z. Harris. 1954. Distributional Structure. Word 10 (1954), 146–162.
[10] F. Hill, R. Reichart, and A. Korhonen. 2014. SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Preprint published on arXiv, arXiv:1408.3456 (2014).
[11] Douglas R. Hofstadter. 2001. Analogy as the Core of Cognition. In The Analogical Mind. 499–538.
[12] Esa Itkonen. 2005. Analogy as Structure and Process: Approaches in Linguistics, Cognitive Psychology and Philosophy of Science. John Benjamins Pub Co.
[13] Immanuel Kant. 1790. Critique of Judgement.
[14] Michael Littman and Peter Turney. 2016. SAT Analogy Challenge Dataset. https://www.aclweb.org/aclwiki/index.php?title=Analogy_(State_of_the_art). Accessed: 2017.
[15] Christoph Lofi. 2013. Analogy Queries in Information Systems: A New Challenge. Journal of Information & Knowledge Management (JIKM) 12, 3 (2013).
[16] Christoph Lofi. 2013. Just Ask a Human? – Controlling Quality in Relational Similarity and Analogy Processing using the Crowd. In CDIM Workshop at Database Systems for Business Technology and Web (BTW). Magdeburg, Germany.
[17] Christoph Lofi, Athiq Ahamed, Pratima Kulkarni, and Ravi Thakkar. 2016. Benchmarking Semantic Capabilities of Analogy Querying Algorithms. In Int. Conf. on Database Systems for Advanced Applications (DASFAA). Dallas, TX, USA.
[18] C. Lofi, C. Nieke, and N. Collier. 2014. Discriminating Rhetorical Analogies in Social Media. In Conf. of the Europ. Chapter of the Association for Computational Linguistics (EACL). Gothenburg, Sweden.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations (ICLR) (2013).
[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26 (2013), 3111–3119.
[21] Cameron Shelley. 2003. Multiple Analogies in Science and Philosophy. John Benjamins Pub.
[22] Robert Speer, Joshua Chin, and Catherine Havasi. 2016. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. AAAI Conference on Artificial Intelligence (2016).
[23] P. Turney and M. Littman. 2005. Corpus-based Learning of Analogies and Semantic Relations. Machine Learning 60 (2005), 251–278.