=Paper= {{Paper |id=Vol-1892/paper2 |storemode=property |title=Towards Analogy-based Recommendation: Benchmarking of Perceived Analogy Semantics |pdfUrl=https://ceur-ws.org/Vol-1892/paper2.pdf |volume=Vol-1892 |authors=Christoph Lofi,Nava Tintarev |dblpUrl=https://dblp.org/rec/conf/recsys/LofiT17 }} ==Towards Analogy-based Recommendation: Benchmarking of Perceived Analogy Semantics== https://ceur-ws.org/Vol-1892/paper2.pdf
                              Towards Analogy-based Recommendation
                                                Benchmarking of Perceived Analogy Semantics

                                 Christoph Lofi                                                                Nava Tintarev
                   Web Information Systems - TU Delft                                              Web Information Systems - TU Delft
                               Mekelweg 4                                                                      Mekelweg 4
                      Delft, Netherlands 2628CD                                                       Delft, Netherlands 2628CD
                            c.lofi@tudelft.nl                                                             n.tintarev@tudelft.nl
ABSTRACT                                                                                 Stanley Kubrick’? a) The Fifth Element to Christopher Nolan, b)
Requests for recommendation can be seen as a form of query for                           Memento to Christopher Nolan, c) Dunkirk to Christopher Nolan).
candidate items, ranked by relevance. Users are however often                           Analogy completion queries might be seen an extension on classical
unable to crisply define what they are looking for. One of the core                      critiquing in recommender systems which can be formulated in
concepts of natural communication for describing and explaining                          terms of “like x, but with properties y modified”. In critiquing the
complex information needs in an intuitive fashion are analogies:                         feature (price) and the modification (cheaper) needs to be explicit,
e.g., “What is to Christopher Nolan as is 2001: A Space Odyssey to                      whereas in an analogy, the semantics are implicitly given by setting
Stanley Kubrick?”. Analogies allow users to explore the item space                       terms in relation, which is interpreted based on both communi-
by formulating queries in terms of items rather than explicitly                          cation partners’ conceptualization of that domain (e.g., “The Fifth
specifying the properties that they find attractive. One of the core                     Element is like 2001: A Space Odyssey but by Scorsese” carries a
challenges which hamper research on analogy-enabled queries is                           lot of implicit information).
that analogy semantics rely on consensus on human perception,                               Analogies typically are represented as rhetorical figures of speech
which is not well represented in current benchmark data sets. There-                    which need to be interpreted and reasoned on using the receiver’s
fore, in this paper we introduce a new benchmark dataset focusing                        background knowledge - which is difficult for information systems
on the human aspects for analogy semantics. Furthermore, we eval-                        to mimic. This complex semantic task is further complicated by the
uate a popular technique for analogy semantics (word2vec neuronal                        lack of useful benchmark datasets. Most current benchmarks are
embeddings) using our dataset. The results show that current word                        restricted in scope, usually focusing on syntactic features instead
embedding approaches are still not not suitable to sufficiently deal                     of relevant properties of analogies as for example their suitability
with deeper analogy semantics. We discuss future directions includ-                      for transferring information (which is central when one wants to
ing hybrid algorithms also incorporating structural or crowd-based                       use analogies for querying, or explaining recommendations).
approaches, and the potential for analogy-based explanations.                               Therefore, one of our core contributions is an improved bench-
                                                                                         mark dataset for analogy semantics which focuses specifically on
KEYWORDS                                                                                 the usefulness of an analogy with respect to querying and explain-
                                                                                         ing, and providing it as a tool for guiding future research into
Analogy-Enabled Recommendation, Relational Similarity, Analogy
                                                                                         analogy queries in recommender systems. In this work, we are
Benchmarking
                                                                                         focusing on general domain analogies. This allows us the choose
                                                                                         the right technologies for future adaption in a domain-specific
1    INTRODUCTION                                                                        recommender system. In detail, our contributions are as follows:
In this paper we explore one method of efficiently communicating                        • Discuss different properties of analogy semantics, and highlight
information needs or for explaining results of an inquiry used in                          their importance for querying and explaining recommendations.
natural human interaction, namely analogies. Using analogies in                         • Introduce a new benchmark dataset systematically built on top
natural speech allows communicating dense information easily and                           of existing sets, rectifying many current limitations. Especially,
naturally by implying that the “essence” of two concepts is similar                        we focus on perceived difficulty of analogies, and the quality and
or at least perceived similarly. Thus, analogies can be used to map                        usefulness of analogies.
factual and behavioural properties from one (usually better known                       • Showcase and discuss the performance of an established word
concept, the source) to another (usually less well known, the target)                      embedding-based algorithm on our test set.
concept. This is particularly effective for natural querying and
explaining when only vague domain knowledge is available.
   In this paper we consider two types of analogy queries, 4-term                       2    DISCUSSIONS ON ANALOGY SEMANTICS
analogy completion queries (like “What is to Christopher Nolan as
                                                                                        The semantics of analogies have been researched in depth in the
is 2001: A Space Odyssey to Stanley Kubrick?”), and 4-term anal-
                                                                                        fields of philosophy, linguistics, and in the cognitive sciences, such
ogon ranking (like “What is similar to ’2001: A Space Odyssey to
                                                                                        as [11], [12], or [21]. There have been several models for analogical
ComplexRec 2017, Como, Italy.                                                           reasoning from philosophy (like works by Plato, Aristotle, or Kant
2017. Copyright for the individual papers remains with the authors. Copying permitted   [13]), while other approaches see analogies as a variant of induction
for private and academic purposes. This volume is published and copyrighted by its
editors. Published on CEUR-WS, Volume 1892..                                            in formal logic [21], or mapping the structures of relationships and
                                                                                        properties of entity pairs (structure mapping theory [8]). However,
ComplexRec 2017, August 31, 2017, Como, Italy.                                                                                                  C. Lofi et al.


those analogy definitions are rather complex and hard to grasp com-               (2) Analogon ranking multiple-choice
putationally, and thus most recent works on computational analogy                      ? : [a 1 , a 2 ] ::?{[b1 , b2 ], [c 1 , c 2 ] ...}
processing rely on the simple 4-term analogy model which is given                      A simpler version of the general analogon ranking query are
by two sets of word pairs (the so-called analogons), with one pair                     multiple choice ranking queries as they are for example used in
being the source and one pair being the target. A 4-term analogy                       the SAT benchmark dataset (discussed below). Here, the set of
holds true if there is a high degree of relational similarity between                  potential result analogons is restricted, and an algorithm would
those two pairs. This is denoted by [a 1 , a 2 ] :: [b1 , b2 ], where the rela-        simply need to rank the provided choices (e.g.,[b1 , b2 ]; [c 1 , c 2 ])
tionship between a 1 and a 2 is similar to the relation between b1 and                 instead of freely discovering the missing analogon.
b2 , as for example in [StarW ars, Sci f i] :: [ForrestGump, Comedy]                 Previous approaches to processing analogies algorithmically
(both are defining movies within their genres). This model has                    cover prototype systems operating on Linked Open Data (LOD), as
several limitations, as is discussed by Lofi in [15]: the semantics of            for example [5], but also approaches which mine analogies from
“a high degree of relational similarity” from an ontological point                natural text [18]. A very popular recent trend in Natural Language
of view is unclear, and the model ignores human perception and                    Processing (NLP) is training neuronal word embeddings [20]. The
abstractions (e.g., analogies are usually not right or wrong, but                 popular word2vec implementation [1] learns relational similarity
better or worse based on how well humans understand them).                        between word pairs from large natural language corpora by ex-
    Therefore, in this paper we promote an improved interpreta-                   ploiting the distributional hypothesis [9] using neuronal networks.
tion of the 4-term analogy model [15], and assume that there                      Word-embeddings are particularly interesting for use in analogy-
can be multiple relationships between the concepts of an anal-                    enabled recommender systems as they can be easily trained on
ogon, some of them being relevant for the semantics of an analogy                 big text corpora (like for example user reviews), and do not re-
(the defining relationships), and some of them not. An analogy                    quire structured ontologies and taxonomies like other approaches
holds true if the sets of defining relationships of both analogons                (e.g., [22]). In this paper, we evaluate the analogy reasoning per-
show a high degree of relational similarity. For illustrating the                 formance of word embeddings using our new benchmark dataset
difference and importance of this change in semantics, consider                   which represents analogy semantics more naturally.
the analogy statement: [StanleyKubrick, 2001 : ASpaceOdyssey] ::
[T axiDriver, MartinScorsese]. Kubrick is the director of 2001: A                 3     BENCHMARK DATASETS
Space Odyssey, and Scorsese is the director of Taxi Driver. Both anal-
                                                                                  For benchmarking the effectiveness of analogy algorithms, there are
ogons contain the same “movie is directed by person” relationship,
                                                                                  several established Gold standard datasets for general-knowledge
and this could be considered a valid analogy with respect to the
                                                                                  analogies (i.e. not tailored to a specific domain). However, all of
simple 4-term analogy model. Still, this is a poor analogy statement
                                                                                  them are lacking in some respect, as discussed in the following
from a human communication point of view because 2001: A Space
                                                                                  sections.
Odyssey is not like Taxi Driver at all. Therefore, this statement
does neither describe the essence of 2001: A Space Odyssey nor
the essence of Taxi Driver particularly well: both movies are iconic              3.1     Mikolov Benchmark Dataset & Wordrep
for their respective directors, but they do not for example describe              The Mikolov Benchmark set [19] is one of the most popular bench-
that one movie is in the science fiction genre, and the other could               mark sets for testing the analogy reasoning capabilities of neuronal
be classified as a (violent) crime movie. Understanding which re-                 word embeddings covering 19,558 4-term analogies with 14 distinct
lationships actually define the essence of an analogon from the                   relationship types, focusing exclusively on analogy completion
viewpoint of human perception is a very challenging problem, but                  queries [a, b] :: [c, ?]. Nine of these relationships focus on grammat-
this understanding is crucial for judging the usefulness and value of             ical properties (e.g., the relationship “is plural for a noun”), while
an analogy statement. Furthermore, the degree to which relation-                  five relationships are of a semantic nature (e.g., “is capital city of a
ships are defining an analogon may vary with different contexts.                  country” like [Athens, Greece] :: [Oslo, Norway] ). The benchmark
In short, there can be better or worse analogies based on two core                set is generated by collecting pairs of entities which are members
factors (we will later encode the combined overall quality with an                of the selected relationship type from Wikipedia and DBpedia, and
analogy rating): the degree of how well the relationships shared                  then combining all these pairs into 4-term analogy tuples. The Wor-
by both analogons are defining them (i.e., are the relationships                  drep dataset [7] extends the Mikolov set by adding more challenges,
shared between both analogons indeed the defining relationships                   and expanding to 25 different relationship types.
which describe the intended semantic essence), and the relational                    A core weakness of this type of test set is that it does not include
similarity of the shared relationships. Based on these observations,              any human judgment with respect to the defining relationship types
we define two basic types of analogy queries (loosely adopted from                and the usefulness as an analogy for querying or explaining, but
[16]) which can be used for analogy-enabled recommender systems:                  instead focuses only on “correct” relationships which do usually
(1) Analogy completion ? : [a 1 , a 2 ] :: [b1 , ?]                               not carry any of the subtleties of rhetorical analogies, as e.g., “is
      This query can be used to find the missing concept in a 4-term              city in” or ”is plural of”. In short, the Mikolov test set does not
      analogy. This is therefore the most useful query type in a future           benchmark if algorithms can capture analogy semantics from a hu-
      analogy-enabled information systems [15]. Solving this query                man perspective, but instead focuses purely on relational similarity
      requires identifying the set of defining relationships between              for a very limited selection of relationship types. Thus, we feel that
     a 1 and a 2 , and then finding a b2 such that the set of defining            this benchmark dataset does not represent the challenge of real
      relationships between b1 and b2 is similar.                                 world analogy queries well.
Towards Analogy-based Recommendation                                                           ComplexRec 2017, August 31, 2017, Como, Italy.

                                                                                            Table 1: Example AGS challenge

                                                                                     Source Analogon      Target Analogon         Rating
                                                                                                          scallops : Italy        2.57
                                                                                                          currywurst : Germany    4.00
                                                                                                          tacos : Mexico          4.67
                                                                                     sushi : Japan
                                                                                                          curry : India           4.00
                                                                                                          sombrero : Mexico       2.00
                                                                                                          hamburger : ship        1.33


Figure 1: Difficulty of challenges for SAT and AGS-2 dataset
                                                                            for challenges which could be handled by most crowd workers
                                                                            (7-8 correct votes), medium (5-6 correct votes), difficult (3-4 correct
3.2    SAT Analogy Challenges                                               votes), and advanced (0-2 correct votes). Our SAT difficulty ratings
The SAT analogy challenge [14] plays an important role in real-             can be downloaded on our companion page [3], and the resulting
world college admission tests in the United States to assess a prospec-     difficulty distribution is shown in Figure 1.
tive student’s vocabulary depth and general analytical skills by
focusing on analogon ranking queries. As the challenge’s original           3.3    Analogy Gold Standard AGS-1
intent is to assess the vocabulary and reasoning skills of prospective
                                                                            Lofi et al. introduced a benchmark dataset which aims at rectifying
students, it contains many rare words. In order to be able to evalu-
                                                                            the shortcomings of aforementioned sets, the Analogy Gold Stan-
ate the test without dispute, there is only a single correct answer
                                                                            dard AGS dataset [17]. In this dataset, each challenge can be used
while all other answers are definitely wrong (and cannot also be
                                                                            to benchmark both completion queries, but also ranking queries.
argued for). The SAT dataset contains 374 challenges like this one:
                                                                            AGS was systematically build using different seed datasets like the
“legend is to map as is: a) subtitle to translation, b) bar to graph, c)
                                                                            SAT dataset or the WordSim-353 dataset [6]. One of the core contri-
figure to blue-print, d) key to chart, e) footnote to information.” Here,
                                                                            bution of AGS was to include human judgments of how “good” an
the correct answer is d) as a key helps to interpret the symbols in a
                                                                            analogy is from a subjective point of view (as compared to previous
chart as does the legend with the symbols of a map.
                                                                            test sets where analogy statements are either correct or wrong).
   While it is easy to see that this answer is correct when the so-
                                                                            The quality of an analogy is influenced by the relational similarity
lution is provided, solving these challenges is a difficult task for
                                                                            of the analogons, combined with how well the shared relationships
aspiring high school students as the correctness rates of the anal-
                                                                            represent the essence of the analogy (see section 2 for a discussion).
ogy section of SAT tests is usually reported to be around 57% [23].
                                                                               As previous experiments showed that human workers had prob-
An interesting aspect of the SAT analogy test set is that a large
                                                                            lems to individually quantify those aspects, for AGS, a new analogy
variety of different relationship types are covered, and that the
                                                                            rating was introduced which implicitly encodes both relational
contained challenges have a high degree of variance of difficulty
                                                                            similarity and representativeness, i.e. the analogy rating represents
from a humans’ perspective.
                                                                            how “useful” and “good” the analogy is perceived by humans on
   Some analogy challenges are harder than others for the average
                                                                            a scale of 1 to 5. The judgments of at least 8 humans recruited via
person, usually based on the rareness of the used vocabulary and the
                                                                            crowd-sourcing platforms was used for the AGS analogy ratings of
complexity of the reasoning process required in order to grasp the
                                                                            each analogon pair in a challenge.
intended semantics. Understanding the difficulty of challenges is thus
                                                                               An example AGS challenge is given in Table 1: in this example,
important when judging the effectiveness of an algorithmic solution,
                                                                            while sombreros are typically considered to come from Mexico,
as this provides us with a sense for ”human-level performance”.
                                                                            and in the source analogon sushi typically comes from Japan (i.e.
   However, while the SAT dataset can be obtained easily for re-
                                                                            there is similar relationship between both analogons), the resulting
search purposes, there is no publicly available assessment of the
                                                                            analogy [sushi, Japan] :: [sombrero, Mexico] still has a low analogy
difficultly of different challenges. Lofi et al examined the perfor-
                                                                            rating because the the defining relationship in the source analogon
mance of crowd workers recruited from Amazon Mechanical Turk
                                                                            [sushi, Japan] is usually understood by people as “stereotypical
when faced with SAT challenges [16]. They recruited 99 workers
                                                                            food from a country” - and thus the analogy is deemed not useful
from English-speaking countries, and each challenge was solved
                                                                            by most humans.
by 8 crowd workers each (in average, each worker solved 29 chal-
lenges.) It turned out that workers are rarely in agreement, and
                                                                            3.4    Improved Analogy Gold Standard AGS-2
that most challenges (> 67%) received 3 or 4 different answers from
just 8 crowd workers. Only 3.6% challenges are easy enough to               In this paper we introduce the AGS-2 dataset which significantly
garner unequivocal agreement from all 8 crowd workers, while                improves on AGS-1 with respect to several aspects. The AGS-2
most challenges only had an agreement of 4 to 5 workers.                    dataset can be obtained at our companion page [3], thus providing
   In this paper, we use those results to classify each SAT challenge       tangible benefits to ongoing analogy-enabled information system
by their crowd difficulty, and make this data publicly available.           research. Notably, the core improvements of AGS-2 are as follows:
This allows us to discuss the performance of analogy algorithms             • Added difficulty rating for each challenge based on the crowd
in comparison to human performance (see section 4). To this end,              feedback of 5 workers. The scale is similar to our difficulty rating
we classified each challenge into one of four difficulty groups, easy         for SAT challenges introduced in section 3.2: advanced, hard,
ComplexRec 2017, August 31, 2017, Como, Italy.                                                                                  C. Lofi et al.


  medium, easy. The resulting difficulty distribution of AGS-2 is
  also shown in Figure 1.
• Extended size and scope: The initial seed analogons for creating
  AGS-1 were extracted from the the Simlex [10] and Wordsim
  datasets [6], resulting in 93 challenges overall. For AGS-2, this
  was extended using suggestions of potentially interesting analogy
  pairs by a subset of trusted crowd workers. We manually selected
  a subset of these suggestions, resulting in 168 challenges overall.
• Improved balance of analogy ratings: In AGS-1, challenges could
  be imbalanced with respect to the analogy ratings (i.e., a given
                                                                             Figure 2: Accuracy of w2v solving SAT and AGS-2
  challenge could contain many analogons with high analogy rat-
  ings, but only few with low ratings). For AGS-2 we ensure that        5   SUMMARY AND OUTLOOK
  each challenge covers 1-2 analogons each for high, medium, and
                                                                        In this paper, we introduced the challenge of analogy semantics
  low analogy ratings. Note that an analogy with high difficulty and
                                                                        for recommender systems. Analogies are a natural communication
  high analogy rating might still not be understandable by many
                                                                        paradigm to efficiently state an information need or to explain a
  people (due to being difficult by using rarely known concepts or
                                                                        complex concept. This relies on exploiting the background knowl-
  complex reasoning), but still will be accepted as a good analogy
                                                                        edge and reasoning capabilities of both communication partners, a
  by the same people after the semantics are explained to them.
                                                                        task challenging even for human judges. We introduced the AGS-2
                                                                        benchmark data set overcoming many shortcomings of previous
                                                                        datasets, such as being usable for both analogy ranking and analogy
4    EVALUATION
                                                                        completion queries, and is based on human perception for rating
In this section, we give a preview on how word-embeddings perform       the quality and difficulty of a benchmark analogy. We evaluated the
on analogy challenges of varying difficulty as previous work only       performance of neuronal word embeddings (using word2vec as a
relied on the aforementioned Mikolov dataset where they showed a        representative) on previous datasets and our new benchmark (AGS-
comparably good accuracy of 53.3%. The embedding we evaluated           2). While the method worked worse than human-level performance
is Gensim’s skip-gram word2vec [2], trained on a Wikipedia dump.        for simple queries, it was comparable for more complex queries
We use the SAT and AGS-2 datasets, and split the results by the         with rare vocabulary and requiring extensive common knowledge.
new difficulty measurements introduced in section 3.2.                      In our next steps we plan to investigate the performance of hybrid
   As the SAT dataset only supports analogy ranking queries, for        methods, using both embeddings as well as structural reasoning to
brevity we only report ranking query results in the following. We       enable analogy queries for recommender systems. This particularly
followed the test protocol outlined in [17]: we rank each analogon      involves adopting analogy semantics to specific domains like music,
of a challenge by their relational similarity score with respect to     books, or movies - while current analogy systems and benchmark
the source as computed by word2vec, and consider a challenge            datasets (including AGS-2) focus on common-knowledge analogies
successfully solved if the top-ranked analogon is the correct one       (which is slightly easier due to the ready availability of both large
(SAT) or has an analogy score higher than a chosen threshold (4.0       text corpora and ontologies).
in our case for AGS-2).                                                     In this context, we also plan to evaluate the performance of
   The results are summarized in Figure 2, and are rather disap-        analogy based explanations for supporting people in making deci-
pointing: for SAT, the average accuracy is 23.4% (random guess-         sions about recommended items. This is a particularly interesting
ing achieves 20% as each challenge has 5 answer options, average        challenge as approaches based only on distributed semantics do im-
human-level performance is 57% [23], while one of the current best      plicitly encode analogy semantics, but have now explicit knowledge
performing analogy systems based on both distributional semantics       on the type of relationships which would be required to explain
and structural reasoning using ontologies (like DBpedia or Word-        the semantics to a human user, thus again underlining the need for
Net) performs at 56.1% [22]). For AGS-2 the word2vec accuracy is        explicit information on the relationships used in an analogy.
25% (random guessing achieves 16.7%). Interestingly, the perfor-
mance of word2vec is consistent with respect to difficulty levels,
and it performs worse than humans for easy challenges, but compa-
rably well for advanced challenges. From this preliminary result we
conclude that the analogy reasoning capabilities of neuronal em-
beddings, despite some of their advantages like ease of use and easy
training just relying on text collections, are inferior than current
anecdotal and empirical evidence suggests. However, specialized
analogy reasoning algorithms have been shown to achieve up to
56.1% on SAT [4], and this has mostly been realized by also incorpo-
rating ontological knowledge (as suggested by [22]). Unfortunately,
this could be challenging for some domain-specific recommender
systems where such ontologies are not easily available, thus pro-
moting future research to overcome this issue.
Towards Analogy-based Recommendation                                                                           ComplexRec 2017, August 31, 2017, Como, Italy.


REFERENCES                                                                              [13] Immanuel Kant. 1790. Critique of Judgement.
 [1] 2013. Word2Vec. https://code.google.com/archive/p/word2vec/. (2013). Accessed:     [14] Michael Littman and Peter Turney. 2016. SAT Aanalogy Challange Dataset. (2016).
     2017-06-01.                                                                             https://www.aclweb.org/aclwiki/index.php?title=Analogy (State of the art)
 [2] 2014. Gensim Word2Vec. https://radimrehurek.com/gensim/models/word2vec.            [15] Christoph Lofi. 2013. Analogy Queries in Information Systems: A New Challenge.
     html. (2014). Accessed: 2017-06-01.                                                     Journal of Information & Knowledge Management (JIKM) 12, 3 (2013).
 [3] 2017. Analogy Semantics Companion Page. https://github.com/WISDelft/               [16] Christoph Lofi. 2013. Just ask a human? – Controlling Quality in Relational
     analogy semantics. (2017). Accessed: 2017-08-01.                                        Similarity and Analogy Processing using the Crowd. In CDIM Workshop at
 [4] 2017. SAT Analogy Questions: State of the Art. https://www.aclweb.org/aclwiki/          Database Systems for Business Technology and Web (BTW). Magdeburg, Germany.
     index.php?title=SAT Analogy Questions (State of the art). (2017). Accessed:        [17] Christoph Lofi, Athiq Ahamed, Pratima Kulkarni, and Ravi Thakkar. 2016. Bench-
     2017-06-01.                                                                             marking Semantic Capabilities of Analogy Querying Algorithms. In Int. Conf. on
 [5] Danushka Bollegala, Tomokazu Goto, Nguyen Tuan Duc, and Mitsuru Ishizuka.               Database Systems for Advanced Applications (DASFAA). Dallas, TX, USA.
     2012. Improving Relational Similarity Measurement using Symmetries in Pro-         [18] C. Lofi, C. Nieke, and N. Collier. 2014. Discriminating Rhetorical Analogies in
     portional Word Analogies. Information Processing & Management 49, 1 (2012),             Social Media. In Conf. of the Europ. Chapter of the Association for Computational
     355–369.                                                                                Linguistics (EACL). Gothenburg, Sweden.
 [6] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan,       [19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
     Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: the concept            estimation of word representations in vector space. International Conference on
     revisited. In Int. Conf. on World Wide Web (WWW). Hong Kong, China.                     Learning Representations (ICLR) (2013).
 [7] Bin Gao, Jiang Bian, and Tie-Yan Liu. 2014. WordRep: A Benchmark for Research      [20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013.
     on Learning Word Representations. In ICML Workshop on Knowledge-Powered                 Distributed Representations of Words and Phrases and their Compositionality.
     Deep Learning for Text Mining. Beijing, China.                                          Advances in Neural Information Processing Systems 21 (2013), 3111–3119.
 [8] D Gentner. 1983. Structure-mapping: A theoretical framework for analogy.           [21] Cameron Shelley. 2003. Multiple Analogies In Science And Philosophy. John
     Cognitive science 7 (1983), 155–170.                                                    Benjamins Pub.
 [9] Z. Harris. 1954. Distributional Structure. Word 10 (1954), 146–162.                [22] Robert Speer, Joshua Chin, and Catherine Havasi. 2016. ConceptNet 5.5: An
[10] F. Hill, R. Reichart, and A. Korhonen. 2014. SimLex-999: Evaluating Seman-              Open Multilingual Graph of General Knowledge. AAAI Conference on Artificial
     tic Models with (Genuine) Similarity Estimation. Preprint published on arXiv.           Intelligence (2016).
     arXiv:1408:3456 (2014).                                                            [23] P. Turney and M. Littman. 2005. Corpus-based learning of analogies and semantic
[11] Douglas R. Hofstadter. 2001. Analogy as the Core of Cognition. In The Analogical        relations. Machine Learning 60 (2005), 251–278.
     Mind. 499–538.
[12] Esa Itkonen. 2005. Analogy as structure and process: Approaches in linguistics,
     cognitive psychology and philosophy of science. John Benjamins Pub Co.