Generalizing Representations of Lexical Semantic Relations

Anupama Chingacham
SFB 1102, Saarland University
Saarbrücken, 66123, Germany
anu.vgopal2009@gmail.com

Denis Paperno
CNRS, LORIA, UMR 7503
Vandoeuvre-lès-Nancy, F-54500, France
denis.paperno@loria.fr

Abstract

English. We propose a new method for unsupervised learning of embeddings for lexical relations in word pairs. The model is trained on predicting the contexts in which a word pair appears together in corpora, then generalized to account for new and unseen word pairs. This allows us to overcome the data sparsity issues inherent in existing relation embedding learning setups without the need to go back to the corpora to collect additional data for new pairs.

Italiano. Proponiamo un nuovo metodo per l'apprendimento non supervisionato delle rappresentazioni delle relazioni lessicali fra coppie di parole (word pair embeddings). Il modello viene allenato a prevedere i contesti in cui compare una coppia di parole, e successivamente viene generalizzato a coppie di parole nuove o non attestate. Questo ci consente di superare i problemi dovuti alla scarsità di dati tipica dei sistemi di apprendimento di rappresentazioni, senza la necessità di tornare ai corpora per raccogliere dati per nuove coppie di parole.

1 Introduction

In this paper we address the problem of unsupervised learning of lexical relations between any two words. We take the approach of unsupervised representation learning from distribution in corpora, as familiar from word embedding methods, and enhance it with an additional technique to overcome data sparsity.

Word embedding models promise to learn word meaning from easily available text data in an unsupervised fashion, and indeed the resulting vectors contain a lot of information about the semantic properties of words and the objects they refer to, cf. for instance Herbelot and Vecchi (2015). Based on the distributional hypothesis coined by Z. S. Harris (1954), word embedding models, which construct word meaning representations as numeric vectors based on co-occurrence statistics over the word's contexts, have been gaining ground due to their quality and simplicity. Produced by efficient and robust implementations such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), modern word vector models are able to predict whether two words are related in meaning, reaching human performance on benchmarks like WordSim353 (Agirre et al., 2009) and MEN (Bruni et al., 2014).

On the other hand, lexical knowledge includes not only properties of individual words but also relations between words. To some extent, lexical semantic relations can be recovered from word representations via the vector offset method (illustrated in the sketch at the end of this section), as evidenced by various applications including analogy solving; but already on this task the method has multiple drawbacks (Linzen, 2016) and a better unsupervised alternative exists (Levy and Goldberg, 2014).

Just like a word representation is inferred from the contexts in which the word occurs, information about the relation in a given word pair can be extracted from the statistics of contexts in which the two words of the pair appear together. In our model, we use this principle to learn high-quality pair embeddings from frequent noun pairs, and on their basis, build a way to construct a relation representation for an arbitrary pair.

Note that we approach the problem from the viewpoint of learning general-purpose semantic knowledge. Our goal is to provide a vector representation for an arbitrary pair of words w1, w2. This is a more general task than relation extraction, which aims at identifying the semantic relation between the two words in a particular context. Modeling such general relational knowledge is crucial for natural language understanding in realistic settings. It may be especially useful for recovering the notoriously difficult bridging relations in discourse, since they involve understanding implicit links between words in the text.

Representations of word relations have applications in many NLP tasks. For example, they could be extremely useful for resolving bridging, especially of the lexical type (Rösiger et al., 2018). But in order to be useful in practice, word relation models must generalize to rare or unseen cases.
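For concreteness, the vector offset method mentioned above can be illustrated with a short sketch. The toy vectors and vocabulary below are invented purely for illustration and are not taken from any trained model.

import numpy as np

# Toy illustration of the vector offset method: an analogy a:b :: c:? is
# answered by the word whose vector is closest to v_b - v_a + v_c.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def offset_analogy(a, b, c):
    """Return the word maximising cos(v_b - v_a + v_c, v_word), excluding the inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(offset_analogy("man", "king", "woman"))   # prints "queen" on the toy data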
2 Related Work

Our project is related to the task of relation extraction, which has been the focus of various complex models (Mintz et al., 2009; Zelenko et al., 2003), including recurrent (Takase et al., 2016) and convolutional neural network architectures (Xu et al., 2015; Nguyen and Grishman, 2015; Zeng et al., 2014), although simple averaging or summation of the context word vectors seems to produce good results for the task (Fan et al., 2015; Hashimoto et al., 2015). The latter work by Hashimoto et al. bears the greatest resemblance to the approach to learning semantic relation representations that we utilize here. Hashimoto et al. train noun embeddings on the task of predicting words occurring in between the two nouns in text corpora and use these embeddings, along with averaging-based context embeddings, as input to relation classification.

There are numerous studies dedicated to characterizing relations in word pairs abstracted away from the specific context in which the word pair appears. Much of this literature focuses on one specific lexical semantic relation at a time. Among these, lexical entailment (hypernymy) has probably been the most popular since Hearst (1992), with various representation learning approaches specifically targeting lexical entailment (Fu et al., 2014; Anh et al., 2016; Roller and Erk, 2016; Bowman, 2016; Kruszewski et al., 2015); the antonymy relation has also received considerable attention (Ono et al., 2015; Pham et al., 2015; Shwartz et al., 2016; Santus et al., 2014). Another line of work, representing the compositionality of word meaning over syntactic structures (like Adjective-Noun pairs), is yet another approach towards semantic relation representations (Baroni and Zamparelli, 2010; Guevara, 2010).

The kind of relation representations we aim at learning are meant to encode general relational knowledge and are produced in an unsupervised way, even though they can be useful for the identification of specific relations like hypernymy and for relation extraction from text occurrences (Jameel et al., 2018). The latter paper documents a model that produces word pair embeddings by concatenating GloVe-based word vectors with relation embeddings trained to predict the contexts in which the two words of the pair co-occur. The main issue with Jameel et al.'s models is scalability: as the authors admit, it is prohibitively expensive to collect all the data needed to train all the relation embeddings. Instead, their implementation requires, for each individual word pair, going back to the training corpus via an inverted index and collecting the data needed to estimate the embedding of the pair. This strategy might not be efficient for practical applications.

3 Proposed Model

We propose a simple solution to the scalability problem inherent in learning word relation embeddings from joint co-occurrence data, which also allows the model to generalize to word pairs that never occur together in the corpus, or occur too rarely to accumulate significant relational information. The model is trained in two steps.

First, we apply the skip-gram with negative sampling algorithm to learn relation vectors for pairs of nouns n1, n2 with high individual and joint occurrence frequencies. In our experiments, all word pairs with pair frequency above 100 and individual word frequencies above 500 are considered frequent pairs. To estimate the SkipRel vector of a pair, we adapted the learning objective of skip-gram with negative sampling, maximizing

\log \sigma({v'_c}^{\top} u_{n_1:n_2}) + \sum_{i=1}^{k} \mathbb{E}_{c^*_i \sim P_n(c)} \left[ \log \sigma(-{v'_{c^*_i}}^{\top} u_{n_1:n_2}) \right]    (1)

where u_{n_1:n_2} is the SkipRel embedding of a word pair, v'_c is the embedding of a context word occurring between n1 and n2, and k is the number of negative samples.
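To make this first training step concrete, the following is a minimal sketch of one stochastic update on the objective in Eq. (1). It is an illustrative re-implementation under simplifying assumptions (a toy vocabulary, a uniform noise distribution instead of P_n(c), invented variable names), not the released training code.

import numpy as np

rng = np.random.default_rng(0)
dim, k, lr = 50, 5, 0.05                      # embedding size, negatives, learning rate

pair_vocab = {("tea", "cup"): 0}              # frequent noun pairs n1:n2 (toy example)
ctx_vocab = {w: i for i, w in enumerate(["in", "a", "of", "with", "hot"])}

U = rng.normal(scale=0.1, size=(len(pair_vocab), dim))   # SkipRel vectors u_{n1:n2}
V = rng.normal(scale=0.1, size=(len(ctx_vocab), dim))    # context vectors v'_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(pair_id, ctx_id):
    """One gradient-ascent step on Eq. (1) for a (pair, mid-context word) instance."""
    negatives = rng.integers(0, len(ctx_vocab), size=k)   # uniform noise for brevity
    u = U[pair_id]
    g = 1.0 - sigmoid(V[ctx_id] @ u)          # gradient factor for the positive context
    du = g * V[ctx_id]
    V[ctx_id] += lr * g * u
    for n in negatives:                        # negative samples push u away from v'_n
        gn = sigmoid(V[n] @ u)
        du -= gn * V[n]
        V[n] -= lr * gn * u
    U[pair_id] += lr * du

# one pass over toy data: words observed between the nouns of the pair "tea:cup"
for ctx_word in ["in", "a", "hot", "of"]:
    sgns_step(pair_vocab[("tea", "cup")], ctx_vocab[ctx_word])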
High-quality SkipRel embeddings can only be obtained for noun pairs that co-occur frequently. To allow the model to generalize to noun pairs that do not co-occur in our corpus, we estimated an interpolation ũ_{n1:n2} of the word pair embedding

\tilde{u}_{n_1:n_2} = \mathrm{ReLU}(A v_{n_1} + B v_{n_2})    (2)

where v_{n1}, v_{n2} are pretrained word embeddings for the two nouns and the matrices A, B encode systematic correspondences between the embeddings of a word and the relations it participates in. The matrices A, B were estimated using stochastic gradient descent with the objective of minimizing the squared error with respect to the SkipRel vectors of frequent noun pairs n1, n2:

\frac{1}{|P|} \sum_{n_1:n_2 \in P} \left\| \tilde{u}_{n_1:n_2} - u_{n_1:n_2} \right\|^2    (3)

We call ũ_{n1:n2} the generalized SkipRel embedding (g-SkipRel) for the noun pair n1, n2. RelWord, the proposed relation embedding, is the concatenation of the g-SkipRel vector ũ_{n1:n2} and the Diff vector v_{n1} − v_{n2}.
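The second step and the resulting RelWord representation can be sketched as follows. The dimensionalities, learning rate, and lookup names (`word_vec`-style stand-ins replaced here by random toy vectors) are assumptions made for illustration; this is not the released implementation.

import numpy as np

rng = np.random.default_rng(1)
d_word, d_rel, lr = 300, 400, 0.01

A = rng.normal(scale=0.01, size=(d_rel, d_word))   # word-to-relation maps of Eq. (2)
B = rng.normal(scale=0.01, size=(d_rel, d_word))

def relu(x):
    return np.maximum(0.0, x)

def g_skiprel(v1, v2):
    """Generalized SkipRel embedding for an arbitrary noun pair (Eq. 2)."""
    return relu(A @ v1 + B @ v2)

def sgd_step(v1, v2, u_target):
    """One SGD step on the squared error of Eq. (3) for one frequent pair."""
    global A, B
    z = A @ v1 + B @ v2
    err = relu(z) - u_target          # prediction error against the SkipRel vector
    grad_z = err * (z > 0)            # backpropagate through the ReLU
    A -= lr * np.outer(grad_z, v1)
    B -= lr * np.outer(grad_z, v2)

def relword(v1, v2):
    """RelWord = concatenation of the g-SkipRel vector and the Diff vector."""
    return np.concatenate([g_skiprel(v1, v2), v1 - v2])

# toy usage with random stand-ins for pretrained vectors and a SkipRel target
v_n1, v_n2 = rng.normal(size=d_word), rng.normal(size=d_word)
u_pair = rng.normal(size=d_rel)
for _ in range(10):
    sgd_step(v_n1, v_n2, u_pair)
print(relword(v_n1, v_n2).shape)      # (d_rel + d_word,) = (700,)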
4 Experimental setup

We trained relation vectors on the ukWaC corpus (Baroni et al., 2009), containing 2 billion tokens of web-crawled English text. SkipRel is trained on noun pair instances separated by at most 10 context tokens, with an embedding size of 400 and a mini-batch size of 32. Frequency filtering is performed to control the size of the pair vocabulary (|P|). Frequent pairs are pre-selected using pair and word frequency thresholds. For pretrained word embeddings we used the best model from Baroni et al. (2014).

The experimental setup is built and maintained on GPU clusters provided by Grid'5000 (Cappello et al., 2005). The code for model implementation and evaluation is publicly available at https://github.com/Chingcham/SemRelationExtraction

5 Evaluation

If our relation representations are rich enough in the information they encode, they will prove useful for any relation classification task regardless of the nature of the classes involved. We evaluate the model with a supervised softmax classifier on two labeled multiclass datasets, BLESS (Baroni and Lenci, 2011) and EVALuation1.0 (Santus et al., 2015), as well as the binary classification EACL antonym-synonym dataset (Nguyen et al., 2017). The BLESS set consists of 26k concept-relatum triples spanning 8 classes of semantic relation, and EVALuation1.0 has 7.5k instances spanning 9 unique relation types. From the EACL 2017 dataset, we used a list of 4062 noun pairs.

Since we aim at recognizing whether the information relevant for relation identification is present in the representations in an easily accessible form, we choose to employ a simple, one-layer softmax classifier. The classifier was trained for 100 epochs, and the learning rate for the model is defined through cross-validation. L2 regularization is employed to avoid over-fitting, and the L2 factor is decided through empirical analysis. The classifier is trained with mini-batches of size 16 for BLESS and EVALuation1.0 and 8 for EACL 2017. SGD is utilized for optimizing model weights.

To demonstrate the efficiency of RelWord vectors, we contrast them with the simpler representations of (g-)SkipRel and with Diff, the difference of the two word vectors in a pair, which is a commonly used simple method. We also include two simple baselines: random choice between the classes and the constant classifier that always predicts the majority class.
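As an illustration of this probing setup, the sketch below trains a linear classifier with SGD and L2 regularization on top of a chosen relation representation. It uses scikit-learn's one-vs-rest logistic loss as a stand-in for the paper's one-layer softmax, and the helper names and hyperparameter values are invented, so it should be read as an approximation of the setup rather than the actual evaluation code.

import numpy as np
from sklearn.linear_model import SGDClassifier

def featurize(pairs, rep_fn):
    """Stack one relation vector per word pair; rep_fn is the representation
    under evaluation (Diff, SkipRel, g-SkipRel, or RelWord)."""
    return np.vstack([rep_fn(w1, w2) for w1, w2 in pairs])

def evaluate(train_pairs, y_train, test_pairs, y_test, rep_fn):
    X_train = featurize(train_pairs, rep_fn)
    X_test = featurize(test_pairs, rep_fn)
    clf = SGDClassifier(loss="log_loss",   # logistic loss as a softmax-layer stand-in
                        penalty="l2",      # L2 regularization against over-fitting
                        alpha=1e-4,        # L2 factor; the paper tunes this empirically
                        max_iter=100)      # 100 epochs
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)       # classification accuracy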
6 Results

All models outperform the baselines by a wide margin (Table 1). The RelWord model compares favorably with the other options, outperforming them on the EVAL and EACL datasets and being on par with the vector difference model on BLESS. This result signifies a success of our generalization strategy, because in each dataset only a minority of examples had pair representations directly trained from corpora; most RelWord vectors were interpolated from word embeddings.

Model       BLESS   EVAL    EACL
Diff        81.15   57.83   71.25
g-SkipRel   59.07   48.06   70.31
RelWord     80.94   59.05   73.88
Random      12.5    11.11   50
Majority    24.71   25.67   50.4

Table 1: Semantic relation classification accuracy.

Now let us restrict our attention to word pairs that frequently co-occur (Table 2). Note that the composition of classes, and by consequence the majority baseline, is different from Table 1, so the accuracy figures in the two tables are not directly comparable. For these frequent pairs we can rely on SkipRel relation vectors that have been estimated directly from corpora and have a higher quality; we also use SkipRel vectors instead of g-SkipRel as a component of RelWord. We note that for these pairs the performance of the Diff method dropped uniformly. This presumably happened in part because the classifier could no longer rely on the information about the relative frequencies of the two words which is implicitly present in Diff representations; for example, it is possible that antonyms have more similar frequencies than synonyms in the EACL dataset. For BLESS and EVAL, the drop in the performance of Diff could have happened in part because the classes that include more frequent pairs, such as isa, antonyms and synonyms, are inherently harder to distinguish than classes that tend to contain rare pairs. In contrast, the comparative effectiveness of RelWord is more pronounced after frequency filtering. The usefulness of relation embeddings is especially impressive for the EACL dataset. In this case, vanilla SkipRel emerges as the best model, confirming that word embeddings per se are not particularly useful for detecting the synonymy-antonymy distinction on this subset of EACL, getting an accuracy just above the majority baseline, while pair embeddings go a long way.

Model       BLESS   EVAL    EACL
Diff        77.13   44.61   66.07
SkipRel     73.37   48.40   83.03
RelWord     83.27   54.47   79.46
Random      12.5    11.11   50
Majority    33.22   26.37   63.63

Table 2: Semantic relation classification accuracy for frequent pairs.

Finally, quantitative evaluation in terms of classification accuracy or other measures does not fully characterize the relative performance of the models; among other things, certain types of misclassification might be worse than others. For example, a human annotator would rarely confuse synonyms with antonyms, while mistaking hasa for hasproperty could be a common point of disagreement between annotators. To do a qualitative analysis of the errors made by different models, we selected the elements of the EVAL test partition where Diff and RelWord make distinct predictions that are both different from the gold standard label. For each of the 53 examples of this kind, we manually annotated which model's prediction is more acceptable according to a human's judgment. In a majority of cases (28) the RelWord model makes a prediction that is more human-like than that of Diff. For example, RelWord predicts that shade is part of shadow rather than its synonym (gold label); indeed, any part of a shadow can be called shade. The Diff model in this case and in many other examples bets on the antonym class, which does not make any sense semantically; the reason why antonym is a common false label is probably that it is simply the second biggest class in the dataset. The examples where Diff makes a more meaningful error than RelWord are less numerous (6 out of 53). There are also 15 examples where both systems' predictions are equally bad (for example, for Nice, France, Diff predicts the isa label and RelWord predicts synonym) and 4 examples where the two predictions are equally reasonable. For more examples, see Table 3. We note that sometimes our model's prediction seems more correct than the gold standard, for example in assigning the hasproperty rather than the isa label to the pair human, male.

pair              gold         Diff         RelWord
bottle, can       antonym      hasproperty  hasa
race, time        hasproperty  hasa         antonym
balloon, hollow   hasproperty  antonym      hasa
clear, settle     isa          antonym      synonym
develop, grow     isa          antonym      synonym
exercise, move    entails      antonym      isa
fact, true        hasproperty  antonym      synonym
human, male       isa          synonym      hasproperty
respect, see      isa          antonym      synonym
slice, hit        isa          antonym      synonym

Table 3: Ten random examples in which RelWord and Diff make different errors. In the first one, the two models make predictions of comparable quality. In the second one, Diff makes a more intuitive error. In the remaining examples, RelWord's prediction is comparatively more adequate.

7 Conclusion

The proposed model is simple in design and training, learning word relation vectors based on co-occurrence with unigram contexts and extending to rare or unseen word pairs via a non-linear mapping. Despite its simplicity, the model is capable of capturing lexical relation patterns in vector representations. Most importantly, RelWord extends straightforwardly to novel word pairs in a manner that does not require recomputing co-occurrence counts from the corpus as in related approaches (Jameel et al., 2018). This allows for an easy integration of the pretrained model into various downstream applications.

In our evaluation, we observed that learning word pair relation embeddings improves on the semantic information already present in word embeddings. With respect to certain semantic relations like synonymy, the performance of relation embeddings is comparable to that of word embeddings, but with the additional cost of training a representation for a significant number of word pairs. For other relation types like antonymy or hypernymy, in which words differ semantically but share similar contexts, learned word pair relation embeddings have an edge over those derived from word embeddings via simple subtraction. While in practice one has to make a choice based on the task requirements, it is generally beneficial to combine both types of relation embeddings for best results, in a model like RelWord.

Our current model employs pretrained word embeddings and learns the word pair embeddings and a word-to-relation embedding mapping separately. In the future, we plan to train a version of the model end-to-end, with word embeddings and the mapping trained simultaneously. As the literature suggests (Hashimoto et al., 2015; Takase et al., 2016), such joint training might not only benefit the model but also improve the performance of the resulting word embeddings on other tasks.

Acknowledgments

This research is supported by CNRS PEPS grant ReSeRVe. We thank Roberto Zamparelli, Germán Kruszewski, Luca Ducceschi and anonymous reviewers who gave feedback on previous versions of this work.
References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Tuan Luu Anh, Yi Tay, Siu Cheung Hui, and See Kiong Ng. 2016. Learning term embeddings for taxonomic relation identification using dynamic weighting neural network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 403–413.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, GEMS '11, pages 1–10, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1183–1193, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247.

Samuel Ryan Bowman. 2016. Modeling natural language semantics in learned representations. Ph.D. thesis, Stanford University.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Franck Cappello, Eddy Caron, Michel J. Daydé, Frédéric Desprez, Yvon Jégou, Pascale Vicat-Blanc Primet, Emmanuel Jeannot, Stéphane Lanteri, Julien Leduc, Nouredine Melab, Guillaume Mornet, Raymond Namyst, Benjamin Quétier, and Olivier Richard. 2005. Grid'5000: a large scale and highly reconfigurable grid experimental testbed. In GRID, pages 99–106. IEEE Computer Society.

Miao Fan, Kai Cao, Yifan He, and Ralph Grishman. 2015. Jointly embedding relations and mentions for knowledge population. arXiv preprint arXiv:1504.01683.
Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1199–1209.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, GEMS '10, pages 33–37, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2015. Task-oriented learning of word embeddings for semantic relation classification. arXiv preprint arXiv:1503.00095.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. Technical Report S2K-92-09.

Aurélie Herbelot and Eva Maria Vecchi. 2015. Building a shared world: Mapping distributional to model-theoretic semantic spaces. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 22–32.

Shoaib Jameel, Zied Bouraoui, and Steven Schockaert. 2018. Unsupervised learning of distributional relation vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–33. Association for Computational Linguistics.

Germán Kruszewski, Denis Paperno, and Marco Baroni. 2015. Deriving Boolean structures from distributional vectors. Transactions of the Association for Computational Linguistics, 3:375–388.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. arXiv preprint arXiv:1606.07736.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, pages 1003–1011, Stroudsburg, PA, USA. Association for Computational Linguistics.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Phil Blunsom, Shay B. Cohen, Paramveer S. Dhillon, and Percy Liang, editors, VS@HLT-NAACL, pages 39–48. The Association for Computational Linguistics.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2017. Distinguishing antonyms and synonyms in a pattern-based neural network. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 76–85, Valencia, Spain.

Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word embedding-based antonym detection using thesauri and distributional information. In HLT-NAACL, pages 984–989.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Nghia The Pham, Angeliki Lazaridou, Marco Baroni, et al. 2015. A multitask objective to inject lexical contrast into distributional semantics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 21–26.

Stephen Roller and Katrin Erk. 2016. Relations such as hypernymy: Identifying and exploiting Hearst patterns in distributional vectors for lexical entailment. CoRR, abs/1605.05433.

Ina Rösiger, Arndt Riester, and Jonas Kuhn. 2018. Bridging resolution: Task definition, corpus resources and rule-based experiments. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3516–3528.
Enrico Santus, Qin Lu, Alessandro Lenci, and Chu-Ren Huang. 2014. Unsupervised antonym-synonym discrimination in vector space.

Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications, pages 64–69.

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2016. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. arXiv preprint arXiv:1612.04460.

Sho Takase, Naoaki Okazaki, and Kentaro Inui. 2016. Modeling semantic compositionality of relational patterns. Engineering Applications of Artificial Intelligence, 50:256–264.

Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015. Semantic relation classification via convolutional neural networks with simple negative sampling. CoRR, abs/1506.07650.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research, 3:1083–1106.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convolutional deep neural network. In COLING, pages 2335–2344.