Affect Enriched Word Embeddings for News Information Retrieval

Tommaso Teofili          Niyati Chhaya
Adobe                    Adobe
teofili@adobe.com        nchhaya@adobe.com

Abstract

Distributed representations of words have been shown to improve the effectiveness of IR systems in many sub-tasks such as query expansion, retrieval and ranking. Algorithms like word2vec, GloVe and others are also key factors in improvements across different NLP tasks. One common issue with such embedding models is that words like happy and sad appear in similar contexts and are therefore wrongly clustered close together in the embedding space. In this paper we leverage Aff2Vec, a set of word embedding models that include affect information, to better capture the affective aspect of news text and achieve better results in information retrieval tasks; such embeddings are also less affected by the synonym/antonym issue. We evaluate their effectiveness on two IR tasks (query expansion and ranking) over the New York Times dataset (TREC Core '17), comparing them against other word-embedding-based models and classic ranking models.

Copyright (c) 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org

1 Introduction

Distributed representations of words, also known as word embeddings, have played a key role in various downstream NLP tasks. Such vector representations place semantically similar words close to each other in the embedding space, allowing for efficient and effective estimation of word similarity. Word2vec [MCCD13] and GloVe [PSM14] are among the most widely adopted word embedding models because of their effectiveness in capturing word semantics. One advantage of using word embeddings in information retrieval is that they are more effective in capturing query intent and document topics than the local vector representations traditionally used in IR (such as TF-IDF vectors). Text tokens in IR don't always overlap with exact words; tokens often coincide with subwords (e.g. generated by stemmers), n-grams, shingles, etc. Therefore word embeddings are also often referred to as term embeddings in the context of IR. Term embeddings can be used to rank documents against queries: a dense vector representation of the query is derived and scored against corresponding dense vector representations of the documents in the IR system. Query and document vector representations are generated by aggregating the term or word embeddings associated with the terms of the query and document texts. Word embeddings can also be used for query expansion: term embeddings are used to find good expansion candidates from a global vocabulary of terms (by comparing word vectors), and the enriched queries are then used to retrieve documents. Most recent well-performing word embedding models are trained in an unsupervised manner, learning word representations from their surrounding contexts. However, one issue with word embeddings is that words with roughly opposite meanings can occur in very similar contexts, so that, for example, 'happy' and 'sad' may lie closer in the embedding space than they should; see related efforts in [CLC+15] and [NWV16]. In order to mitigate this semantic understanding issue, we propose to use affect-enriched word embedding models (also known as Aff2Vec [KCC18]) for IR tasks, as they outperform baseline word embedding models on word-similarity and sentiment analysis tasks. Our contribution is the usage of Aff2Vec models as term embeddings for information retrieval in the news domain. Beyond the synonym/antonym issue, we expect Aff2Vec models to work well for news IR because of their capability of better capturing writers' affective attitude towards the articles' text (see Section 1.1). We present experiments on standard IR datasets, empirically establishing the utility of the proposed approach.
1.1 Affect scores in news datasets

In order to assess the potential applicability of Aff2Vec embeddings in the context of information retrieval, we run a preliminary evaluation of the amount of formality, politeness and frustration contained in text collections commonly used in information retrieval experiments. For this purpose we leverage the affect scoring algorithm used for building Aff2Vec embeddings, and extract mean affect scores for formality, politeness and frustration on each dataset. The evaluation involves two collections of news: the Washington Post articles from the TREC Core 2018 track and the New York Times articles from the TREC Core 2017 track. We also extract affect scores from the ClueWeb09 dataset [CHYZ09], containing the text of HTML pages crawled from the Web, and from the CACM dataset, a collection of titles and abstracts from the CACM journal. Results are reported in Table 1.

Dataset     formality  politeness  frustration
NYT         0.7087     0.6291      0.6248
WP          0.7788     0.7456      0.6510
CACM        0.3619     0.1229      0.3511
ClueWeb09   0.4319     0.2708      0.6216

Table 1: Mean affect scores on some common IR datasets

The formality, politeness and frustration scores extracted for the New York Times and Washington Post articles are generally higher than those extracted for the CACM and ClueWeb09 datasets, with the exception of the frustration score for ClueWeb09, which is very close to the frustration score extracted for the NYT articles. These results suggest that Aff2Vec embeddings should work well in the news domain, as they are built to appropriately capture such affective aspects of information.
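The exact affect scoring algorithm is the one used to build Aff2Vec and is not reproduced here; the following is only a minimal sketch of how dataset-level mean affect scores of this kind could be computed, assuming a word-level affect lexicon with formality, politeness and frustration dimensions (the tab-separated lexicon format and the load_lexicon helper are hypothetical).

```python
# Minimal sketch, not the paper's scorer: average per-dimension affect scores
# over all lexicon words found in a corpus. The lexicon format and
# load_lexicon() helper are hypothetical.
from collections import defaultdict
import re

def load_lexicon(path):
    """Hypothetical loader returning {word: {affect_dimension: score}}."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, formality, politeness, frustration = line.rstrip("\n").split("\t")
            lexicon[word] = {"formality": float(formality),
                             "politeness": float(politeness),
                             "frustration": float(frustration)}
    return lexicon

def mean_affect_scores(documents, lexicon):
    """Mean score per affect dimension over all matched tokens in `documents`."""
    totals, counts = defaultdict(float), defaultdict(int)
    for doc in documents:
        for token in re.findall(r"[a-z']+", doc.lower()):
            for dimension, value in lexicon.get(token, {}).items():
                totals[dimension] += value
                counts[dimension] += 1
    return {d: totals[d] / counts[d] for d in totals}

# Example: scores = mean_affect_scores(nyt_articles, load_lexicon("affect.tsv"))
```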
2 Related work

Dict2vec [TGH17b] builds word embeddings using online dictionaries, optimizing an objective function where each word embedding is built via positive sampling of strongly correlated words and negative sampling of weakly correlated ones [TGH17a]. In [ZC17], embeddings are optimized using different objective functions in a supervised manner, based on lists of queries and related relevant and non-relevant results. In [FFJ+16], word vectors in combination with bilingual dictionaries are used to extract synonyms, which can then be used to expand queries. Documents are represented as bags of vectors generated as mixtures of distributions in [RPMG16]. Efforts like [CLC+15] and [NWV16] are related to our work in that they can be incorporated in the usage of term embeddings for IR tasks. For our ranking scenario, [RGMJ16] is relevant as documents and queries are represented by mixtures of Gaussians over word embeddings, each Gaussian centered around a centroid learned via e.g. a k-means algorithm. The likelihood of a query with respect to a document is measured by the distance of the query vector from each centroid the document belongs to, using centroid similarity or average inter-similarity.

2.1 Aff2Vec: Affect-enriched embeddings [KCC18]

Word representations have historically captured only semantic or contextual information, ignoring other subtle word relationships such as differences in sentiment. Affect refers to the experience of an emotion or a feeling [Pic97]. Words such as 'glad', 'awesome', 'happy', 'disgust' or 'sad' can be referred to as affective words. Aff2Vec introduces a post-training approach that adds 'emotion' sensitivity, i.e. affect information, to word embeddings. Aff2Vec leverages existing affect lexicons such as Warriner's lexicon [WKB13], a list of nearly 14,000 English words tagged with valence (V), arousal (A), and dominance (D) scores. The affect-enriched embeddings introduced by Aff2Vec are either built on top of vanilla word embeddings, i.e. word2vec, GloVe, or paragram, or introduced along with counter-fitting [MOT+16] or retrofitting [FDJ+15]. In this work we also leverage these enriched vector spaces in order to evaluate their performance on standard IR tasks, namely query expansion and ranking.

3 Word embeddings for query expansion

We leverage word embeddings to perform query expansion in a way similar to [RPMG16]. For each query term q contained in the query text Q, the word embedding model is used to fetch the nearest neighbour w_e of the query term vector w_q in the embedding space such that cos(w_e, w_q) > t, where t is the minimum cosine similarity required to consider the word e associated with the vector w_e a good expansion for the word q associated with the vector w_q. Upon successful retrieval of an expansion for at least one term q in a query, a new "alternative" query A, in which q is substituted by e, is created. Consequently the query executed on the IR system becomes a boolean query of the form Q OR A. If more than one query term has a valid expansion fetched from the embedding model, all possible combinations of query terms and their expansion terms are generated. For example, given the query "recent research about AI", if the term embeddings output nearest(recent) = latest with cos(recent, latest) = 0.8, which is larger than a threshold of 0.75, the output query is composed of two optional clauses: "recent research about AI" OR "latest research about AI".
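The expansion procedure above can be sketched as follows; the use of gensim's KeyedVectors to hold the pretrained term embeddings and the Lucene-style OR query string at the end are assumptions made for illustration, not the paper's actual implementation (which runs on Anserini/Lucene4IR).

```python
# Sketch of embedding-based query expansion (Section 3), assuming a gensim
# KeyedVectors model holds the pretrained term embeddings.
from itertools import product
from gensim.models import KeyedVectors

def expand_query(query, vectors, t=0.75):
    """Return the original query plus alternative queries where terms are
    replaced by their nearest neighbour when cosine similarity exceeds t."""
    candidates = []
    for term in query.split():
        # Each term keeps itself as an option, plus (at most) one expansion.
        options = [term]
        if term in vectors:
            neighbour, similarity = vectors.most_similar(term, topn=1)[0]
            if similarity > t:
                options.append(neighbour)
        candidates.append(options)
    # All combinations of original/expanded terms; the original query comes first.
    return [" ".join(combo) for combo in product(*candidates)]

# vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
# clauses = expand_query("recent research about AI", vectors)
# lucene_query = " OR ".join(f'"{clause}"' for clause in clauses)
# e.g. '"recent research about AI" OR "latest research about AI"'
```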
4 Word embeddings for ranking

In order to use word embedding models for ranking we chose the averaged word embeddings approach (also known as AWE). Each document and query vector is calculated by aggregating the word vectors of the words in the document and query texts. The query/document score is measured by the cosine similarity between the respective averaged vectors, as in other research works such as [MNCC16, RGMJ16, RMLH17, GSS17]. In our experiments we used each word's TF-IDF score to normalize (divide) the aggregated word embeddings of the query and document vectors. We observed that using this technique to smooth the sum of the word vectors, instead of just dividing it by the number of words (the plain mean), resulted in better ranking results. This seems in line with the findings of [SLMJ15], which indicate that cosine similarity may be polluted by term frequencies when comparing word embeddings.
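A minimal sketch of this AWE scoring follows. The paper's TF-IDF-based normalization is only loosely specified, so the generic weights argument below merely stands in for it (with weights=None the function falls back to the plain mean), and the gensim KeyedVectors interface is again an assumption.

```python
# Minimal AWE ranking sketch (Section 4): query and documents are mapped to
# aggregated word vectors and scored by cosine similarity. `weights` is a
# placeholder for the paper's TF-IDF-based normalization, whose exact form
# is not fully specified here.
import numpy as np

def awe_vector(text, vectors, weights=None):
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(vectors.vector_size)
    if weights is None:
        return np.mean([vectors[t] for t in tokens], axis=0)
    # Weighted aggregation, e.g. with weights derived from TF-IDF statistics.
    return np.sum([weights.get(t, 1.0) * vectors[t] for t in tokens], axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def awe_rank(query, documents, vectors, weights=None):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    q = awe_vector(query, vectors, weights)
    scored = [(doc_id, cosine(q, awe_vector(text, vectors, weights)))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```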
5 Experiments

We compare the usage of Aff2Vec word embeddings for the ranking and query expansion tasks against both vanilla embedding models (like word2vec and GloVe) and enriched models like the Dict2vec models [TGH17a]. We also present experiments with the Aff2Vec variants: counter-fitted and retrofitted models enriched with affect information. All the models used in our experiments are pretrained. To set up our evaluations we use two open source toolkits, Anserini [YFL17] and Lucene4IR [AMH+17], both based on Apache Lucene [BMII12]. We run ranking and query expansion experiments on the New York Times articles from the TREC Core '17 track [AHK+17], since it is a relevant dataset for the news domain. For the sake of generalizability, we also conduct the same evaluations over the CACM dataset [Fox83], a "classic" IR dataset. For query expansion we include an evaluation using WordNet [Mil95] in order to provide an expansion baseline not based on word embeddings.
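The NDCG and MAP figures reported in the following tables are obtained by scoring the runs produced by these toolkits against the TREC relevance judgments. As a hedged illustration only (pytrec_eval is not the toolchain used in this work, which relies on the trec_eval-style evaluation bundled with Anserini and Lucene4IR), a run could be scored as follows:

```python
# Hedged sketch of scoring a retrieval run with MAP and NDCG using pytrec_eval;
# the toy qrels/run dictionaries below are illustrative only.
import pytrec_eval

# qrels: {query_id: {doc_id: relevance_grade}}, run: {query_id: {doc_id: score}}
qrels = {"301": {"NYT0001": 1, "NYT0002": 0}}
run = {"301": {"NYT0001": 12.3, "NYT0002": 7.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg"})
per_query = evaluator.evaluate(run)

# Average the per-query measures to obtain run-level MAP and NDCG.
mean_map = sum(q["map"] for q in per_query.values()) / len(per_query)
mean_ndcg = sum(q["ndcg"] for q in per_query.values()) / len(per_query)
print(f"MAP={mean_map:.4f} NDCG={mean_ndcg:.4f}")
```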
5.1 Results

Table 2 shows the performance of the ranking experiments on the NYT dataset using different embeddings. We observe that the usage of term embeddings does not give benefits in many cases: the classic BM25 and query likelihood retrieval models provide better NDCG than almost all the models except the affect-enriched ones. A GloVe retrofitted affect-enriched embedding model is the top performer on the NDCG measure. On the other hand, none of the term embedding rankings could outperform BM25 on the mean average precision measure.

Model                         NDCG    MAP
BM25                          0.4334  0.1977
QL                            0.4325  0.1913
NON ENRICHED MODELS
GloVe                         0.4292  0.1883
GloVe.42B.300d                0.4003  0.1690
GloVe.6B.100d                 0.4291  0.1911
GloVe.6B.200d                 0.4314  0.1964
GloVe.6B.300d                 0.4316  0.1946
GloVe.6B.50d                  0.4078  0.1760
GloVe-Twitter-100             0.4212  0.1849
GloVe-Twitter-200             0.4242  0.1873
GloVe-Twitter-50              0.4128  0.1798
GloVe-Twitter-25              0.3541  0.1377
w2v-GoogleNews-300            0.4294  0.1922
dict2vec-dim100               0.4101  0.1885
dict2vec-dim200               0.4155  0.1891
dict2vec-dim300               0.4151  0.1899
ENRICHED MODELS
counterfit-GloVe              0.3980  0.1720
GloVe-retrofitted             0.4216  0.1861
paragram-counterfit           0.3840  0.1580
paragram-74627                0.4337  0.1937
paragram-retrofitted          0.3969  0.1703
paragram-retrofitted-74627    0.3963  0.1698
w2v-76427                     0.4328  0.1969
w2v-counterfit-header         0.3972  0.1721
w2v-retrofitted               0.4341  0.1914
AFFECT ENRICHED MODELS
counterfit-GloVe-affect       0.4311  0.1753
GloVe-affect                  0.4594  0.1926
GloVe-retrofitted-affect-555  0.4693  0.1948
paragram-affect               0.4619  0.1969
paragram-counterfit-affect    0.4339  0.1788
w2v-affect                    0.4592  0.1926
w2v-counterfit-affect         0.4309  0.1766
w2v-retrofitted-affect        0.4601  0.1911

Table 2: Ranking experiments on TREC Core '17

Table 3 shows the performance of the query expansion experiments on the NYT dataset using different embeddings. We observe that the classic BM25 and query likelihood retrieval models provide better NDCG than almost all the models except some of the affect-enriched ones. This is in line with what we observed for the ranking task on the same dataset. A GloVe retrofitted affect-enriched embedding model is the top performer for both NDCG and MAP.

Model                         MAP     NDCG
BM25                          0.1977  0.4334
QL                            0.1913  0.4325
NON ENRICHED MODELS
GloVe                         0.1951  0.4337
GloVe.42B.300d                0.1947  0.4308
GloVe.6B.100d                 0.1903  0.4291
GloVe.6B.200d                 0.1947  0.4308
GloVe.6B.300d                 0.1947  0.4308
GloVe.6B.50d                  0.1799  0.4119
GloVe-Twitter-100             0.1863  0.4218
GloVe-Twitter-200             0.1863  0.4218
GloVe-Twitter-25              0.1391  0.3488
GloVe-Twitter-50              0.1812  0.4147
w2v-GoogleNews-300            0.1947  0.4308
dict2vec-dim100               0.1995  0.4335
dict2vec-dim200               0.1959  0.4315
dict2vec-dim300               0.1957  0.4315
WordNet                       0.1977  0.4334
ENRICHED MODELS
counterfit-GloVe              0.1801  0.4027
GloVe-retrofitted             0.1940  0.4264
paragram-counterfit           0.1663  0.3906
paragram-74627                0.2005  0.4365
paragram-retrofitted          0.1798  0.4012
paragram-retrofitted-74627    0.1798  0.4012
w2v-76427                     0.1964  0.4318
w2v-counterfit-header         0.1734  0.3991
w2v-retrofitted               0.1967  0.4368
AFFECT ENRICHED MODELS
GloVe-affect                  0.1947  0.4308
counterfit-GloVe-affect       0.1810  0.4044
GloVe-retrofitted-affect-555  0.2021  0.4421
paragram-affect               0.1977  0.4309
paragram-counterfit-affect    0.1844  0.4094
w2v-affect                    0.1940  0.4305
w2v-counterfit-affect         0.1762  0.4029
w2v-retrofitted-affect        0.1971  0.4345

Table 3: Query expansion experiments on TREC Core '17

Table 4 shows the performance of the ranking experiments on the CACM dataset using different embeddings. We observe that the usage of term embeddings generally leads to consistently higher NDCG and MAP. In particular, the paragram embedding models report the best results, with affect-enriched paragram embeddings reporting both the best NDCG and the best MAP, about 0.02 higher than the non affect-enriched paragram embeddings on both measures.

Model                         NDCG    MAP
BM25                          0.3805  0.1947
QL                            0.3621  0.2056
NON ENRICHED MODELS
GloVe.42B.300d                0.3638  0.2007
GloVe.6B.100d                 0.4440  0.2722
GloVe.6B.200d                 0.4452  0.2732
GloVe.6B.300d                 0.4450  0.2730
GloVe.6B.50d                  0.4437  0.2720
GloVe-Twitter-100             0.5109  0.3260
GloVe-Twitter-200             0.5138  0.3292
GloVe-Twitter-25              0.5309  0.3217
GloVe-Twitter-50              0.4682  0.2715
w2v-GoogleNews-300            0.3697  0.1960
GloVe                         0.4483  0.2760
ENRICHED MODELS
counterfit-GloVe              0.4563  0.2680
GloVe-retrofitted             0.4507  0.2787
w2v-76427                     0.4920  0.3033
w2v-counterfit-header         0.4085  0.2225
w2v-retrofitted               0.3993  0.2350
paragram-counterfit           0.5675  0.3722
paragram-74627                0.5539  0.3541
paragram-retrofitted          0.5263  0.3467
paragram-retrofitted-74627    0.5380  0.3633
AFFECT ENRICHED MODELS
counterfit-GloVe-affect       0.4247  0.2383
GloVe-affect                  0.4326  0.2553
w2v-affect                    0.3900  0.2080
w2v-counterfit-affect         0.3791  0.2006
w2v-retrofitted-affect        0.3555  0.1986
paragram-affect               0.5848  0.3986
paragram-counterfit-affect    0.5860  0.3996

Table 4: Ranking experiments on CACM

Table 5 shows the performance of the query expansion experiments on the CACM dataset using different embeddings. Here too the usage of term embeddings generally leads to consistently higher NDCG and MAP. While we expected the best results from the Aff2Vec models, it turned out that the "vanilla" word2vec model trained on the Google News corpus outperformed all the others in both NDCG and MAP. The best performing enriched model is a retrofitted word2vec model, whereas among the affect-enriched models the GloVe retrofitted one provides the best results.

Model                         NDCG    MAP
BM25                          0.3805  0.1947
QL                            0.3621  0.2056
NON ENRICHED MODELS
WordNet                       0.4014  0.2146
GloVe.42B.300d                0.4657  0.2701
GloVe.6B.100d                 0.4646  0.2635
GloVe.6B.200d                 0.4633  0.2631
GloVe.6B.300d                 0.4724  0.2707
GloVe.6B.50d                  0.4575  0.2588
GloVe-Twitter-100             0.4500  0.2576
GloVe-Twitter-200             0.4454  0.2524
GloVe-Twitter-25              0.4215  0.2373
GloVe-Twitter-50              0.4422  0.2528
w2v-GoogleNews-300            0.4824  0.2828
GloVe                         0.4635  0.2685
ENRICHED MODELS
counterfit-GloVe              0.4622  0.2661
GloVe-retrofitted             0.4676  0.2723
w2v-76427                     0.4366  0.2518
w2v-counterfit-header         0.4557  0.2629
w2v-retrofitted               0.4738  0.2816
paragram-counterfit           0.4661  0.2716
paragram-74627                0.4626  0.2712
paragram-retrofitted          0.4470  0.2636
paragram-retrofitted-74627    0.4486  0.2646
AFFECT ENRICHED MODELS
counterfit-GloVe-affect       0.4622  0.2673
GloVe-affect                  0.4694  0.2734
GloVe-retrofitted-affect-555  0.4722  0.2799
w2v-affect                    0.4609  0.2643
w2v-counterfit-affect         0.4579  0.2674
w2v-retrofitted-affect        0.4667  0.2744
paragram-affect               0.4426  0.2586
paragram-counterfit-affect    0.4634  0.2723

Table 5: Query expansion experiments on CACM

6 Conclusions

We present extensive experiments to evaluate the impact of affect-enriched word embeddings on information retrieval over a news corpus, namely ranking and query expansion implemented with open-source toolkits. We show that affect-enriched models yield a significant improvement in ranking over baseline/vanilla embeddings (~20%) as well as over other enriched embeddings (~2-10%). In the case of query expansion, an improvement is observed for the NYT dataset, but vanilla embeddings report the highest values on the CACM dataset. We believe the semantic structure and vocabulary distribution of the CACM dataset causes this behavior. We plan to extend this work first towards understanding the role of semantic information in expansion tasks, and then towards building fusion approaches combining enriched word vectors with standard IR baselines.
References

[AHK+17] James Allan, Donna Harman, Evangelos Kanoulas, Dan Li, Christophe Van Gysel, and Ellen Voorhees. TREC 2017 common core track overview. In Proc. TREC, 2017.

[AMH+17] Leif Azzopardi, Yashar Moshfeghi, Martin Halvey, Rami S. Alkhawaldeh, Krisztian Balog, Emanuele Di Buccio, Diego Ceccarelli, Juan M. Fernández-Luna, Charlie Hull, Jake Mannix, et al. Lucene4IR: Developing information retrieval evaluation resources using Lucene. In ACM SIGIR Forum, volume 50, pages 58-75. ACM, 2017.

[BMII12] Andrzej Bialecki, Robert Muir, Grant Ingersoll, and Lucid Imagination. Apache Lucene 4. In SIGIR 2012 Workshop on Open Source Information Retrieval, page 17, 2012.

[CHYZ09] Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. ClueWeb09 data set, 2009.

[CLC+15] Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. Revisiting word embedding for contrasting meaning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 106-115, 2015.

[FDJ+15] Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606-1615, 2015.

[FFJ+16] Linnea Fornander, Marc Friberg, Vida Johansson, V. Lindh-Håård, Pontus Ohlsson, and Ida Palm. Generating synonyms using word vectors and an easy-to-read corpus. 2016.

[Fox83] Edward A. Fox. Characterization of two new experimental collections in computer and information science containing textual and bibliographic concepts. Technical report, Cornell University, 1983.

[GSS17] Lukas Galke, Ahmed Saleh, and Ansgar Scherp. Word embeddings for practical information retrieval. In 47. Jahrestagung der Gesellschaft für Informatik, Informatik 2017, Chemnitz, Germany, September 25-29, 2017, pages 2155-2167, 2017.

[KCC18] Sopan Khosla, Niyati Chhaya, and Kushal Chawla. Aff2Vec: Affect-enriched distributional word representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2204-2218, 2018.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[Mil95] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

[MNCC16] Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.

[MOT+16] Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL-HLT, pages 142-148, 2016.

[NWV16] Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. arXiv preprint arXiv:1605.07766, 2016.

[Pic97] Rosalind W. Picard. Affective Computing. MIT Press, Cambridge, MA, USA, 1997.

[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.

[RGMJ16] Dwaipayan Roy, Debasis Ganguly, Mandar Mitra, and Gareth J. F. Jones. Representing documents and queries as sets of word embedded vectors for information retrieval. arXiv preprint arXiv:1606.07869, 2016.

[RMLH17] Navid Rekabsaz, Bhaskar Mitra, Mihai Lupu, and Allan Hanbury. Toward incorporation of relevant documents in word2vec. arXiv preprint arXiv:1707.06598, 2017.

[RPMG16] Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608, 2016.

[SLMJ15] Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298-307, 2015.

[TGH17a] Julien Tissier, Christophe Gravier, and Amaury Habrard. Dict2vec: Learning word embeddings using lexical dictionaries. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 254-263, 2017.

[TGH17b] Julien Tissier, Christophe Gravier, and Amaury Habrard. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 254-263. Association for Computational Linguistics, 2017.

[WKB13] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191-1207, 2013.

[YFL17] Peilin Yang, Hui Fang, and Jimmy Lin. Anserini: Enabling the use of Lucene for information retrieval research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1253-1256. ACM, 2017.

[ZC17] Hamed Zamani and W. Bruce Croft. Relevance-based word embedding. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 505-514. ACM, 2017.