Generalizing Representations of Lexical Semantic Relations

Anupama Chingacham
SFB 1102, Saarland University
Saarbrücken, 66123, Germany
anu.vgopal2009@gmail.com

Denis Paperno
CNRS, LORIA, UMR 7503
Vandoeuvre-lès-Nancy, F-54500, France
denis.paperno@loria.fr

Abstract

English. We propose a new method for unsupervised learning of embeddings for lexical relations in word pairs. The model is trained on predicting the contexts in which a word pair appears together in corpora, then generalized to account for new and unseen word pairs. This allows us to overcome the data sparsity issues inherent in existing relation embedding learning setups without the need to go back to the corpora to collect additional data for new pairs.

Italiano. Proponiamo un nuovo metodo per l'apprendimento non supervisionato delle rappresentazioni delle relazioni lessicali fra coppie di parole (word pair embeddings). Il modello viene allenato a prevedere i contesti in cui compare una coppia di parole, e successivamente viene generalizzato a coppie di parole nuove o non attestate. Questo ci consente di superare i problemi dovuti alla scarsità di dati tipica dei sistemi di apprendimento di rappresentazioni, senza la necessità di tornare ai corpora per raccogliere dati per nuove coppie di parole.

1 Introduction

In this paper we address the problem of unsupervised learning of lexical relations between any two words. We take the approach of unsupervised representation learning from distribution in corpora, as familiar from word embedding methods, and enhance it with an additional technique to overcome data sparsity.

Word embedding models promise to learn word meaning from easily available text data in an unsupervised fashion, and indeed the resulting vectors contain a lot of information about the semantic properties of words and the objects they refer to, cf. for instance Herbelot and Vecchi (2015). Based on the distributional hypothesis coined by Z. S. Harris (1954), word embedding models, which construct word meaning representations as numeric vectors based on co-occurrence statistics over the word's contexts, have been gaining ground due to their quality and simplicity. Produced by efficient and robust implementations such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), modern word vector models are able to predict whether two words are related in meaning, reaching human performance on benchmarks like WordSim353 (Agirre et al., 2009) and MEN (Bruni et al., 2014).

On the other hand, lexical knowledge includes not only properties of individual words but also relations between words. To some extent, lexical semantic relations can be recovered from word representations via the vector offset method (illustrated in the sketch at the end of this section), as evidenced by various applications including analogy solving; but already on this task the method has multiple drawbacks (Linzen, 2016) and a better unsupervised alternative exists (Levy and Goldberg, 2014).

Just like a word representation is inferred from the contexts in which the word occurs, information about the relation in a given word pair can be extracted from the statistics of contexts in which the two words of the pair appear together. In our model, we use this principle to learn high-quality pair embeddings from frequent noun pairs, and on their basis, build a way to construct a relation representation for an arbitrary pair.

Note that we approach the problem from the viewpoint of learning general-purpose semantic knowledge. Our goal is to provide a vector representation for an arbitrary pair of words w1, w2. This is a more general task than relation extraction, which aims at identifying the semantic relation between the two words in a particular context. Modeling such general relational knowledge is crucial for natural language understanding in realistic settings. It may be especially useful for recovering the notoriously difficult bridging relations in discourse, since they involve understanding implicit links between words in the text.

Representations of word relations have applications in many NLP tasks. For example, they could be extremely useful for resolving bridging, especially of the lexical type (Rösiger et al., 2018). But in order to be useful in practice, word relation models must generalize to rare or unseen cases.
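For concreteness, the vector offset method mentioned above can be illustrated with a short sketch. The toy vectors and vocabulary below are invented purely for illustration and are not taken from any trained model.

import numpy as np

# Toy illustration of the vector offset method: an analogy a:b :: c:? is
# answered by the word whose vector is closest to v_b - v_a + v_c.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def offset_analogy(a, b, c):
    """Return the word maximising cos(v_b - v_a + v_c, v_word), excluding the inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(offset_analogy("man", "king", "woman"))   # prints "queen" on the toy data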
2 Related Work

Our project is related to the task of relation extraction, which has been the focus of various complex models (Mintz et al., 2009; Zelenko et al., 2003), including recurrent (Takase et al., 2016) and convolutional neural network architectures (Xu et al., 2015; Nguyen and Grishman, 2015; Zeng et al., 2014), although simple averaging or summation of the context word vectors seems to produce good results for the task (Fan et al., 2015; Hashimoto et al., 2015). The latter work by Hashimoto et al. bears the greatest resemblance to the approach to learning semantic relation representations that we utilize here. Hashimoto et al. train noun embeddings on the task of predicting words occurring in between the two nouns in text corpora and use these embeddings, along with averaging-based context embeddings, as input to relation classification.

There are numerous studies dedicated to characterizing relations in word pairs abstracted away from the specific context in which the word pair appears. Much of this literature focuses on one specific lexical semantic relation at a time. Among these, lexical entailment (hypernymy) has probably been the most popular since Hearst (1992), with various representation learning approaches specifically targeting lexical entailment (Fu et al., 2014; Anh et al., 2016; Roller and Erk, 2016; Bowman, 2016; Kruszewski et al., 2015); the antonymy relation has also received considerable attention (Ono et al., 2015; Pham et al., 2015; Shwartz et al., 2016; Santus et al., 2014). Another line of work, representing the compositionality of word meaning over syntactic structures (like Adjective-Noun pairs), is yet another approach towards semantic relation representations (Baroni and Zamparelli, 2010; Guevara, 2010).

The kind of relation representations we aim at learning are meant to encode general relational knowledge and are produced in an unsupervised way, even though they can be useful for the identification of specific relations like hypernymy and for relation extraction from text occurrences (Jameel et al., 2018). The latter paper documents a model that produces word pair embeddings by concatenating GloVe-based word vectors with relation embeddings trained to predict the contexts in which the two words of the pair co-occur. The main issue with Jameel et al.'s models is scalability: as the authors admit, it is prohibitively expensive to collect all the data needed to train all the relation embeddings. Instead, their implementation requires, for each individual word pair, going back to the training corpus via an inverted index and collecting the data needed to estimate the embedding of the pair. This strategy might not be efficient for practical applications.

3 Proposed Model

We propose a simple solution to the scalability problem inherent in learning word relation embeddings from joint co-occurrence data, which also allows the model to generalize to word pairs that never occur together in the corpus, or occur too rarely to accumulate significant relational information. The model is trained in two steps.

First, we apply the skip-gram with negative sampling algorithm to learn relation vectors for pairs of nouns n1, n2 with high individual and joint occurrence frequencies. In our experiments, all word pairs with pair frequency above 100 and individual word frequencies above 500 are considered frequent pairs. To estimate the SkipRel vector of a pair, we adapted the learning objective of skip-gram with negative sampling, maximizing

\log \sigma({v'_c}^{\top} u_{n_1:n_2}) + \sum_{i=1}^{k} \mathbb{E}_{c^*_i \sim P_n(c)} \left[ \log \sigma(-{v'_{c^*_i}}^{\top} u_{n_1:n_2}) \right]    (1)

where u_{n_1:n_2} is the SkipRel embedding of a word pair, v'_c is the embedding of a context word occurring between n1 and n2, and k is the number of negative samples.
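To make this first training step concrete, the following is a minimal sketch of one stochastic update on the objective in Eq. (1). It is an illustrative re-implementation under simplifying assumptions (a toy vocabulary, a uniform noise distribution instead of P_n(c), invented variable names), not the released training code.

import numpy as np

rng = np.random.default_rng(0)
dim, k, lr = 50, 5, 0.05                      # embedding size, negatives, learning rate

pair_vocab = {("tea", "cup"): 0}              # frequent noun pairs n1:n2 (toy example)
ctx_vocab = {w: i for i, w in enumerate(["in", "a", "of", "with", "hot"])}

U = rng.normal(scale=0.1, size=(len(pair_vocab), dim))   # SkipRel vectors u_{n1:n2}
V = rng.normal(scale=0.1, size=(len(ctx_vocab), dim))    # context vectors v'_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(pair_id, ctx_id):
    """One gradient-ascent step on Eq. (1) for a (pair, mid-context word) instance."""
    negatives = rng.integers(0, len(ctx_vocab), size=k)   # uniform noise for brevity
    u = U[pair_id]
    g = 1.0 - sigmoid(V[ctx_id] @ u)          # gradient factor for the positive context
    du = g * V[ctx_id]
    V[ctx_id] += lr * g * u
    for n in negatives:                        # negative samples push u away from v'_n
        gn = sigmoid(V[n] @ u)
        du -= gn * V[n]
        V[n] -= lr * gn * u
    U[pair_id] += lr * du

# one pass over toy data: words observed between the nouns of the pair "tea:cup"
for ctx_word in ["in", "a", "hot", "of"]:
    sgns_step(pair_vocab[("tea", "cup")], ctx_vocab[ctx_word])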
High-quality SkipRel embeddings can only be obtained for noun pairs that co-occur frequently. To allow the model to generalize to noun pairs that do not co-occur in our corpus, we estimated an interpolation ũ_{n1:n2} of the word pair embedding

\tilde{u}_{n_1:n_2} = \mathrm{ReLU}(A v_{n_1} + B v_{n_2})    (2)

where v_{n1}, v_{n2} are pretrained word embeddings for the two nouns and the matrices A, B encode systematic correspondences between the embeddings of a word and the relations it participates in. The matrices A, B were estimated using stochastic gradient descent with the objective of minimizing the squared error with respect to the SkipRel vectors of frequent noun pairs n1, n2:

\frac{1}{|P|} \sum_{n_1:n_2 \in P} \left\| \tilde{u}_{n_1:n_2} - u_{n_1:n_2} \right\|^2    (3)

We call ũ_{n1:n2} the generalized SkipRel embedding (g-SkipRel) for the noun pair n1, n2. RelWord, the proposed relation embedding, is the concatenation of the g-SkipRel vector ũ_{n1:n2} and the Diff vector v_{n1} − v_{n2}.
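The second step and the resulting RelWord representation can be sketched as follows. The dimensionalities, learning rate, and lookup names (`word_vec`-style stand-ins replaced here by random toy vectors) are assumptions made for illustration; this is not the released implementation.

import numpy as np

rng = np.random.default_rng(1)
d_word, d_rel, lr = 300, 400, 0.01

A = rng.normal(scale=0.01, size=(d_rel, d_word))   # word-to-relation maps of Eq. (2)
B = rng.normal(scale=0.01, size=(d_rel, d_word))

def relu(x):
    return np.maximum(0.0, x)

def g_skiprel(v1, v2):
    """Generalized SkipRel embedding for an arbitrary noun pair (Eq. 2)."""
    return relu(A @ v1 + B @ v2)

def sgd_step(v1, v2, u_target):
    """One SGD step on the squared error of Eq. (3) for one frequent pair."""
    global A, B
    z = A @ v1 + B @ v2
    err = relu(z) - u_target          # prediction error against the SkipRel vector
    grad_z = err * (z > 0)            # backpropagate through the ReLU
    A -= lr * np.outer(grad_z, v1)
    B -= lr * np.outer(grad_z, v2)

def relword(v1, v2):
    """RelWord = concatenation of the g-SkipRel vector and the Diff vector."""
    return np.concatenate([g_skiprel(v1, v2), v1 - v2])

# toy usage with random stand-ins for pretrained vectors and a SkipRel target
v_n1, v_n2 = rng.normal(size=d_word), rng.normal(size=d_word)
u_pair = rng.normal(size=d_rel)
for _ in range(10):
    sgd_step(v_n1, v_n2, u_pair)
print(relword(v_n1, v_n2).shape)      # (d_rel + d_word,) = (700,)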
4 Experimental setup

We trained relation vectors on the ukWaC corpus (Baroni et al., 2009), containing 2 billion tokens of web-crawled English text. SkipRel is trained on noun pair instances separated by at most 10 context tokens, with an embedding size of 400 and a mini-batch size of 32. Frequency filtering is performed to control the size of the pair vocabulary (|P|). Frequent pairs are pre-selected using pair and word frequency thresholds. For pretrained word embeddings we used the best model from Baroni et al. (2014).

The experimental setup is built and maintained on GPU clusters provided by Grid'5000 (Cappello et al., 2005). The code for model implementation and evaluation is publicly available at https://github.com/Chingcham/SemRelationExtraction

5 Evaluation

If our relation representations are rich enough in the information they encode, they will prove useful for any relation classification task regardless of the nature of the classes involved. We evaluate the model with a supervised softmax classifier on two labeled multiclass datasets, BLESS (Baroni and Lenci, 2011) and EVALuation1.0 (Santus et al., 2015), as well as the binary classification EACL antonym-synonym dataset (Nguyen et al., 2017). The BLESS set consists of 26k concept-relatum triples spanning 8 classes of semantic relation, and EVALuation1.0 has 7.5k instances spanning 9 unique relation types. From the EACL 2017 dataset, we used a list of 4062 noun pairs.

Since we aim at recognizing whether the information relevant for relation identification is present in the representations in an easily accessible form, we choose to employ a simple, one-layer softmax classifier. The classifier was trained for 100 epochs, and the learning rate for the model is defined through cross-validation. L2 regularization is employed to avoid over-fitting, and the L2 factor is decided through empirical analysis. The classifier is trained with mini-batches of size 16 for BLESS and EVALuation1.0 and 8 for EACL 2017. SGD is utilized for optimizing model weights.

To demonstrate the efficiency of RelWord vectors, we contrast them with the simpler representations of (g-)SkipRel and with Diff, the difference of the two word vectors in a pair, which is a commonly used simple method. We also include two simple baselines: random choice between the classes and the constant classifier that always predicts the majority class.
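As an illustration of this probing setup, the sketch below trains a linear classifier with SGD and L2 regularization on top of a chosen relation representation. It uses scikit-learn's one-vs-rest logistic loss as a stand-in for the paper's one-layer softmax, and the helper names and hyperparameter values are invented, so it should be read as an approximation of the setup rather than the actual evaluation code.

import numpy as np
from sklearn.linear_model import SGDClassifier

def featurize(pairs, rep_fn):
    """Stack one relation vector per word pair; rep_fn is the representation
    under evaluation (Diff, SkipRel, g-SkipRel, or RelWord)."""
    return np.vstack([rep_fn(w1, w2) for w1, w2 in pairs])

def evaluate(train_pairs, y_train, test_pairs, y_test, rep_fn):
    X_train = featurize(train_pairs, rep_fn)
    X_test = featurize(test_pairs, rep_fn)
    clf = SGDClassifier(loss="log_loss",   # logistic loss as a softmax-layer stand-in
                        penalty="l2",      # L2 regularization against over-fitting
                        alpha=1e-4,        # L2 factor; the paper tunes this empirically
                        max_iter=100)      # 100 epochs
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)       # classification accuracy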
6 Results

All models outperform the baselines by a wide margin (Table 1). The RelWord model compares favorably with the other options, outperforming them on the EVAL and EACL datasets and being on par with the vector difference model on BLESS. This result signifies a success of our generalization strategy, because in each dataset only a minority of examples had pair representations directly trained from corpora; most RelWord vectors were interpolated from word embeddings.

Model       BLESS   EVAL    EACL
Diff        81.15   57.83   71.25
g-SkipRel   59.07   48.06   70.31
RelWord     80.94   59.05   73.88
Random      12.5    11.11   50
Majority    24.71   25.67   50.4

Table 1: Semantic relation classification accuracy.

Now let us restrict our attention to word pairs that frequently co-occur (Table 2). Note that the composition of classes, and by consequence the majority baseline, is different from Table 1, so the accuracy figures in the two tables are not directly comparable. For these frequent pairs we can rely on SkipRel relation vectors that have been estimated directly from corpora and have a higher quality; we also use SkipRel vectors instead of g-SkipRel as a component of RelWord. We note that for these pairs the performance of the Diff method dropped uniformly. This presumably happened in part because the classifier could no longer rely on the information about the relative frequencies of the two words which is implicitly present in Diff representations; for example, it is possible that antonyms have more similar frequencies than synonyms in the EACL dataset. For BLESS and EVAL, the drop in the performance of Diff could have happened in part because the classes that include more frequent pairs, such as isa, antonyms and synonyms, are inherently harder to distinguish than classes that tend to contain rare pairs. In contrast, the comparative effectiveness of RelWord is more pronounced after frequency filtering. The usefulness of relation embeddings is especially impressive for the EACL dataset. In this case, vanilla SkipRel emerges as the best model, confirming that word embeddings per se are not particularly useful for detecting the synonymy-antonymy distinction on this subset of EACL, getting an accuracy just above the majority baseline, while pair embeddings go a long way.

Model       BLESS   EVAL    EACL
Diff        77.13   44.61   66.07
SkipRel     73.37   48.40   83.03
RelWord     83.27   54.47   79.46
Random      12.5    11.11   50
Majority    33.22   26.37   63.63

Table 2: Semantic relation classification accuracy for frequent pairs.

Finally, quantitative evaluation in terms of classification accuracy or other measures does not fully characterize the relative performance of the models; among other things, certain types of misclassification might be worse than others. For example, a human annotator would rarely confuse synonyms with antonyms, while mistaking hasa for hasproperty could be a common point of disagreement between annotators. To do a qualitative analysis of the errors made by different models, we selected the elements of the EVAL test partition where Diff and RelWord make distinct predictions that are both different from the gold standard label. For each of the 53 examples of this kind, we manually annotated which model's prediction is more acceptable according to a human's judgment. In a majority of cases (28) the RelWord model makes a prediction that is more human-like than that of Diff. For example, RelWord predicts that shade is part of shadow rather than its synonym (gold label); indeed, any part of a shadow can be called shade. The Diff model in this case and in many other examples bets on the antonym class, which does not make any sense semantically; the reason why antonym is a common false label is probably that it is simply the second biggest class in the dataset. The examples where Diff makes a more meaningful error than RelWord are less numerous (6 out of 53). There are also 15 examples where both systems' predictions are equally bad (for example, for Nice, France, Diff predicts the isa label and RelWord predicts synonym) and 4 examples where the two predictions are equally reasonable. For more examples, see Table 3. We note that sometimes our model's prediction seems more correct than the gold standard, for example in assigning the hasproperty rather than the isa label to the pair human, male.

pair              gold         Diff         RelWord
bottle, can       antonym      hasproperty  hasa
race, time        hasproperty  hasa         antonym
balloon, hollow   hasproperty  antonym      hasa
clear, settle     isa          antonym      synonym
develop, grow     isa          antonym      synonym
exercise, move    entails      antonym      isa
fact, true        hasproperty  antonym      synonym
human, male       isa          synonym      hasproperty
respect, see      isa          antonym      synonym
slice, hit        isa          antonym      synonym

Table 3: Ten random examples in which RelWord and Diff make different errors. In the first one, the two models make predictions of comparable quality. In the second one, Diff makes a more intuitive error. In the remaining examples, RelWord's prediction is comparatively more adequate.

7 Conclusion

The proposed model is simple in design and training, learning word relation vectors based on co-occurrence with unigram contexts and extending to rare or unseen word pairs via a non-linear mapping. Despite its simplicity, the model is capable of capturing lexical relation patterns in vector representations. Most importantly, RelWord extends straightforwardly to novel word pairs in a manner that does not require recomputing co-occurrence counts from the corpus as in related approaches (Jameel et al., 2018). This allows for an easy integration of the pretrained model into various downstream applications.

In our evaluation, we observed that learning word pair relation embeddings improves on the semantic information already present in word embeddings. With respect to certain semantic relations like synonymy, the performance of relation embeddings is comparable to that of word embeddings, but with the additional cost of training a representation for a significant number of word pairs. For other relation types like antonymy or hypernymy, in which words differ semantically but share similar contexts, learned word pair relation embeddings have an edge over those derived from word embeddings via simple subtraction. While in practice one has to make a choice based on the task requirements, it is generally beneficial to combine both types of relation embeddings for best results, in a model like RelWord.

Our current model employs pretrained word embeddings and learns the word pair embeddings and a word-to-relation embedding mapping separately. In the future, we plan to train a version of the model end-to-end, with word embeddings and the mapping trained simultaneously. As the literature suggests (Hashimoto et al., 2015; Takase et al., 2016), such joint training might not only benefit the model but also improve the performance of the resulting word embeddings on other tasks.

Acknowledgments

This research is supported by CNRS PEPS grant ReSeRVe. We thank Roberto Zamparelli, Germán Kruszewski, Luca Ducceschi and anonymous reviewers who gave feedback on previous versions of this work.
References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Tuan Luu Anh, Yi Tay, Siu Cheung Hui, and See Kiong Ng. 2016. Learning term embeddings for taxonomic relation identification using dynamic weighting neural network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 403–413.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, GEMS '11, pages 1–10, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1183–1193, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247.

Samuel Ryan Bowman. 2016. Modeling natural language semantics in learned representations. Ph.D. thesis, Stanford University.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Franck Cappello, Eddy Caron, Michel J. Daydé, Frédéric Desprez, Yvon Jégou, Pascale Vicat-Blanc Primet, Emmanuel Jeannot, Stéphane Lanteri, Julien Leduc, Nouredine Melab, Guillaume Mornet, Raymond Namyst, Benjamin Quétier, and Olivier Richard. 2005. Grid'5000: a large scale and highly reconfigurable grid experimental testbed. In GRID, pages 99–106. IEEE Computer Society.

Miao Fan, Kai Cao, Yifan He, and Ralph Grishman. 2015. Jointly embedding relations and mentions for knowledge population. arXiv preprint arXiv:1504.01683.
Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1199–1209.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, GEMS '10, pages 33–37, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2015. Task-oriented learning of word embeddings for semantic relation classification. arXiv preprint arXiv:1503.00095.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. Technical Report S2K-92-09.

Aurélie Herbelot and Eva Maria Vecchi. 2015. Building a shared world: Mapping distributional to model-theoretic semantic spaces. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 22–32.

Shoaib Jameel, Zied Bouraoui, and Steven Schockaert. 2018. Unsupervised learning of distributional relation vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–33. Association for Computational Linguistics.

Germán Kruszewski, Denis Paperno, and Marco Baroni. 2015. Deriving Boolean structures from distributional vectors. Transactions of the Association for Computational Linguistics, 3:375–388.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. arXiv preprint arXiv:1606.07736.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, pages 1003–1011, Stroudsburg, PA, USA. Association for Computational Linguistics.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Phil Blunsom, Shay B. Cohen, Paramveer S. Dhillon, and Percy Liang, editors, VS@HLT-NAACL, pages 39–48. The Association for Computational Linguistics.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2017. Distinguishing antonyms and synonyms in a pattern-based neural network. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 76–85, Valencia, Spain.

Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word embedding-based antonym detection using thesauri and distributional information. In HLT-NAACL, pages 984–989.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Nghia The Pham, Angeliki Lazaridou, Marco Baroni, et al. 2015. A multitask objective to inject lexical contrast into distributional semantics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 21–26.

Stephen Roller and Katrin Erk. 2016. Relations such as hypernymy: Identifying and exploiting Hearst patterns in distributional vectors for lexical entailment. CoRR, abs/1605.05433.

Ina Rösiger, Arndt Riester, and Jonas Kuhn. 2018. Bridging resolution: Task definition, corpus resources and rule-based experiments. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3516–3528.
Enrico Santus, Qin Lu, Alessandro Lenci, and Chu-Ren Huang. 2014. Unsupervised antonym-synonym discrimination in vector space.

Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications, pages 64–69.

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2016. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. arXiv preprint arXiv:1612.04460.

Sho Takase, Naoaki Okazaki, and Kentaro Inui. 2016. Modeling semantic compositionality of relational patterns. Engineering Applications of Artificial Intelligence, 50:256–264.

Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015. Semantic relation classification via convolutional neural networks with simple negative sampling. CoRR, abs/1506.07650.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research, 3:1083–1106.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convolutional deep neural network. In COLING, pages 2335–2344.