<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Affect Enriched Word Embeddings for News Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Teofili</string-name>
          <email>teofili@adobe.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niyati Chhaya</string-name>
          <email>nchhaya@adobe.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>In: A. Aker, D. Albakour, A. Barron-Ceden~o, S. Dori-Hacohen,</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adobe</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the, NewsIR'19 Workshop at SIGIR</institution>
          ,
          <addr-line>Paris, France, 25-July-2019, published at http://ceur-ws.org</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Distributed representations of words have been shown to improve the effectiveness of IR systems in many sub-tasks like query expansion, retrieval and ranking. Algorithms like word2vec, GloVe and others are also key factors in many improvements across different NLP tasks. One common issue with such embedding models is that words like happy and sad appear in similar contexts and hence are wrongly placed close together in the embedding space. In this paper we leverage Aff2Vec, a set of word embedding models which include affect information, in order to better capture the affect aspect in news text and achieve better results in information retrieval tasks; such embeddings are also less affected by the synonym/antonym issue. We evaluate their effectiveness on two IR related tasks (query expansion and ranking) over the New York Times dataset (TREC Core '17), comparing them against other word embedding based models and classic ranking models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Distributed representations of words, also known as
word embeddings, have played a key role in various
downstream NLP tasks. Such vector representations
place vectors of semantically similar words close in the
embedding space, allowing for efficient and effective
estimation of word similarity. Word2vec [MCCD13] and
GloVe [PSM14] are among the most widely adopted
word embedding models because of their effectiveness
in capturing word semantics. One of the advantages
of using word embeddings in information retrieval is
that they are more effective in capturing query intent
and document topics than the local vector
representations traditionally used in IR (like TF-IDF
vectors). Text tokens in IR don't always overlap with
exact words; tokens often coincide with subwords (e.g.
generated by stemmers), n-grams, shingles, etc.
Therefore word embeddings are also often referred to as term
embeddings in the context of IR. Term embeddings can be used to
rank documents against queries; in such a context a dense vector
representation for the query is derived and scored against
corresponding dense vector representations for the documents in
the IR system. Query and document vector representations are
generated by aggregating the term or word embeddings associated
with the respective terms from the query and document texts.
Word embeddings can also be used in the query expansion task:
term embeddings are used in such contexts to find good expansion
candidates from a global vocabulary of terms (by comparing word
vectors), and the enriched queries are then used to retrieve the
documents. Most recent well-performing word embedding models are
generated in an unsupervised manner by learning word
representations from their surrounding contexts. However, one
issue with word embeddings is that words with roughly opposite
meanings can have very similar contexts, so that, for example,
`happy' and `sad' may lie closer than they should in the
embedding space; see related efforts in [CLC+15] and [NWV16]. In
order to mitigate this semantic understanding issue, we propose
to use affect-enriched word embedding models (also known as
Aff2Vec [KCC18]) for IR tasks, as they outperform baseline word
embedding models on word-similarity and sentiment analysis
tasks. Our contribution is the usage of Aff2Vec models as term
embeddings for information retrieval in the news domain.
Beyond the synonym/antonym issue, we expect Aff2Vec models to
work well for news IR because of their capability of better
capturing writers' affective attitude towards articles' text
(see Section 1.1). We present experiments against standard IR
datasets, empirically establishing the utility of the proposed
approach.</p>
      <sec id="sec-1-1">
        <title>Affect scoring of IR datasets</title>
        <p>In order to assess the potential applicability of Aff2Vec
embeddings in the context of information retrieval, we run a
preliminary evaluation of the amount of formality, politeness
and frustration contained in common text collections used in
information retrieval experiments. For this purpose we leverage
the affect scoring algorithm that is used for building Aff2Vec
embeddings, and extract mean affect scores for formality,
politeness and frustration on each dataset. Such an evaluation
involves two collections of news: the datasets from the TREC
Core 2018 track (Washington Post articles) and the TREC Core
2017 track (New York Times articles). We also extract affect
scores from the ClueWeb09 dataset [CHYZ09], containing text of
HTML pages crawled from the Web, and from the CACM dataset, a
collection of titles and abstracts from the CACM journal.
Results are reported in Table 1.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Mean formality and politeness scores per dataset.</p></caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>formality</th><th>politeness</th></tr>
            </thead>
            <tbody>
              <tr><td>NYT</td><td>0.7087</td><td>0.6291</td></tr>
              <tr><td>WP</td><td>0.7788</td><td>0.7456</td></tr>
              <tr><td>CACM</td><td>0.3619</td><td>0.1229</td></tr>
              <tr><td>ClueWeb09</td><td>0.4319</td><td>0.2708</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The scores for formality, politeness and frustration
extracted on the New York Times and Washington Post articles are
generally higher than the ones extracted for the CACM and
ClueWeb09 datasets, except for the frustration score reported
for ClueWeb09, which is very close to the frustration score
extracted for the NYT articles. These results suggest that
Aff2Vec embeddings should work well on the news domain, as they
are built to appropriately capture such affective aspects of
information.</p>
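        <p>As a rough illustration of this preliminary evaluation, the sketch below computes mean lexicon-based affect scores over a collection. It is a simplified stand-in for the actual Aff2Vec scoring algorithm: the tab-separated lexicon format and the helper names are hypothetical assumptions, not part of the original method.</p>
        <preformat>
# Minimal sketch of mean affect scoring; NOT the actual Aff2Vec scoring
# algorithm. Assumes a hypothetical TSV lexicon whose rows look like
# "word\tformality\tpoliteness\tfrustration".
import csv

def load_affect_lexicon(path):
    lexicon = {}
    with open(path, newline="") as f:
        for word, formality, politeness, frustration in csv.reader(f, delimiter="\t"):
            lexicon[word] = (float(formality), float(politeness), float(frustration))
    return lexicon

def mean_affect_scores(documents, lexicon):
    # Average each affect dimension over all lexicon words found in the collection.
    totals, hits = [0.0, 0.0, 0.0], 0
    for doc in documents:
        for token in doc.lower().split():
            if token in lexicon:
                for i, score in enumerate(lexicon[token]):
                    totals[i] += score
                hits += 1
    return [t / hits for t in totals] if hits else totals
        </preformat>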
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Aff2Vec: Affect-enriched embeddings [KCC18]</title>
      <p>Word representations have historically captured only
semantic or contextual information, ignoring other subtle word
relationships such as differences in sentiment. Affect refers to
the experience of a feeling or an emotion [Pic97]. Words such as
`glad', `awesome', `happy', `disgust' or `sad' can be referred
to as affective words. Aff2Vec introduces a post-training
approach that adds `emotion'-sensitivity, or affect information,
to word embeddings. Aff2Vec leverages an existing affect lexicon
such as Warriner's lexicon [WKB13], which has a list of almost
14,000 English words tagged with valence (V), arousal (A), and
dominance (D) scores. The affect-enriched embeddings introduced
by Aff2Vec are either built on top of vanilla word embeddings,
i.e. word2vec, GloVe, or paragram, or introduced along with
counter-fitting [MOT+16] or retrofitting [FDJ+15]. In this work,
we leverage these enriched vector spaces in order to evaluate
their performance on standard IR tasks, namely query expansion
and ranking.</p>
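      <p>The exact post-training objectives are described in [KCC18]; as one simplified flavor of affect enrichment, the sketch below appends rescaled V/A/D scores from a Warriner-style lexicon to each pretrained vector. The concatenation strategy and the neutral-midpoint default are illustrative assumptions, not the full Aff2Vec method.</p>
      <preformat>
import numpy as np

def affect_enrich(embeddings, vad_lexicon, weight=1.0):
    # embeddings: dict word -> np.ndarray (pretrained word vectors)
    # vad_lexicon: dict word -> (valence, arousal, dominance), Warriner's 1-9 scale
    enriched = {}
    for word, vec in embeddings.items():
        v, a, d = vad_lexicon.get(word, (5.0, 5.0, 5.0))  # neutral midpoint if missing
        affect = weight * (np.array([v, a, d]) - 5.0) / 4.0  # rescale 1-9 to [-1, 1]
        enriched[word] = np.concatenate([vec, affect])
    return enriched
      </preformat>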
      <sec id="sec-2-1">
        <title>Word embeddings for query expansion</title>
        <p>We leverage word embeddings to perform query expansion
in a way similar to [RPMG16]. For each query term q contained in
the query text Q, the word embedding model is used to fetch the
nearest neighbour w<sub>e</sub> of the query term vector
w<sub>q</sub> in the embedding space, such that
cos(w<sub>e</sub>, w<sub>q</sub>) &gt; t, where t is the minimum
cosine similarity required between two embeddings for the word e
associated with the vector w<sub>e</sub> to be considered a good
expansion of the word q associated with w<sub>q</sub>. Upon
successful retrieval of an expansion for at least one term q in
a query, a new "alternative" query A, where q is substituted by
e, is created. Consequently, the query to be executed on the IR
system becomes a boolean query of the form Q OR A. If more than
one query term has a valid expansion fetched from the embedding
model, all possible combinations of query terms and their
expansion terms are generated. For example, given the query
"recent research about AI", if the term embeddings indicate that
nearest(recent) = latest with cos(recent, latest) = 0.8, above
the threshold of 0.75, the output query will be composed of two
optional clauses: "recent research about AI" OR "latest research
about AI".</p>
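        <p>A minimal sketch of this expansion procedure is shown below, assuming a gensim-style KeyedVectors model exposing most_similar; the helper name and default parameter values are our own, since the paper does not specify an implementation.</p>
        <preformat>
import itertools

def expand_query(query, model, t=0.75, topn=3):
    # For each query term, collect nearest neighbours above the cosine
    # threshold t, then emit every combination as an optional OR clause.
    candidates = []
    for q in query.split():
        options = [q]  # the original term is always kept
        try:
            options += [e for e, cos in model.most_similar(q, topn=topn) if cos > t]
        except KeyError:  # out-of-vocabulary term: leave it unexpanded
            pass
        candidates.append(options)
    alternatives = [" ".join(combo) for combo in itertools.product(*candidates)]
    return '"' + '" OR "'.join(alternatives) + '"'

# expand_query("recent research about AI", model) may yield, e.g.:
# "recent research about AI" OR "latest research about AI"
        </preformat>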
      </sec>
      <sec id="sec-2-2">
        <title>Related work</title>
        <p>Dict2vec [TGH17b] builds word embeddings using online
dictionaries, optimizing an objective function where each word
embedding is built via positive sampling of strongly correlated
words and negative sampling of weakly correlated ones [TGH17a].
In [ZC17], embeddings are optimized using different objective
functions in a supervised manner, based on lists of queries and
their relevant and non-relevant results. In [FFJ+16], word
vectors in combination with bilingual dictionaries are used to
extract synonyms that can then be used to expand queries.
Documents are represented as bags of vectors generated as
mixtures of distributions in [RPMG16]. Efforts like [CLC+15] and
[NWV16] are related to our work in that they can be incorporated
into the usage of term embeddings for IR tasks. For our ranking
scenario, [RGMJ16] is relevant, as documents and queries are
represented by mixtures of Gaussians over word embeddings, each
Gaussian centered around a centroid learned via e.g. a k-means
algorithm. The likelihood of a query with respect to a document
is measured by the distance of the query vector from each
centroid the document belongs to, using centroid similarity or
average inter-similarity.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Word embeddings for ranking</title>
        <p>In order to use word embedding models for ranking we
chose the averaged word embeddings approach (also known as AWE).
Each document and query vector is calculated by averaging the
word vectors related to the words in the document and query
texts. The query / document score is measured by the cosine
similarity between the respective averaged vectors, as in other
research works like [MNCC16, RGMJ16, RMLH17, GSS17]. In our
experiments we used each word's TF-IDF value to normalize
(divide) the averaged word embeddings for query and document
vectors. We observed that using this technique to smooth the sum
of the word vectors, instead of just dividing it by the number
of words (the plain mean), resulted in better ranking results.
This seems in line with the findings from [SLMJ15], which
indicate that cosine similarity may be polluted by term
frequencies when comparing word embeddings.</p>
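        <p>A sketch of this TF-IDF-smoothed averaging follows, under our reading that each word vector is weighted by its TF-IDF value and the sum is divided by the total weight rather than the word count; the IDF statistics and embeddings are assumed to be supplied by the caller, and the exact normalization used with Anserini/Lucene4IR may differ.</p>
        <preformat>
import numpy as np

def awe_vector(text, embeddings, idf, dim=300):
    # TF-IDF-weighted average of word vectors instead of the plain mean.
    tokens = text.lower().split()
    vec, total_weight = np.zeros(dim), 0.0
    for tok in set(tokens):
        if tok in embeddings:
            w = tokens.count(tok) * idf.get(tok, 0.0)  # tf * idf
            vec += w * embeddings[tok]
            total_weight += w
    return vec / total_weight if total_weight else vec

def awe_score(query, doc, embeddings, idf, dim=300):
    # Cosine similarity between the aggregated query and document vectors.
    q = awe_vector(query, embeddings, idf, dim)
    d = awe_vector(doc, embeddings, idf, dim)
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d / denom) if denom else 0.0
        </preformat>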
      </sec>
      <sec id="sec-2-4">
        <title>Experiments</title>
        <p>We compare the usage of Aff2Vec word embeddings in the
ranking and query expansion tasks against both vanilla embedding
models (like word2vec and GloVe) and enriched models like the
Dict2vec models [TGH17a]. We also present experiments with
variants of Aff2Vec: counter-fitted and retrofitted models with
enriched affect information. All the models used in our
experiments are pretrained. To set up our evaluations we use two
open source toolkits, Anserini [YFL17] and Lucene4IR [AMH+17],
both based on Apache Lucene [BMII12]. We run ranking and query
expansion experiments on the New York Times articles from the
TREC Core '17 track [AHK+17], since it is a relevant dataset for
the news domain. For the sake of generalizability, we also
conduct the same evaluations over the CACM dataset [Fox83], a
"classic" dataset for IR. For query expansion we include an
evaluation using WordNet [Mil95] in order to provide an
expansion baseline not based on word embeddings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We observe that the classic BM25 and query likelihood
retrieval models provide better NDCG than almost all the
embedding based models, except some of the affect-enriched ones.
This is in line with what we observed for the ranking task on
the same dataset. A GloVe retrofitted affect-enriched embedding
model is the top performing one for both NDCG and MAP. For the
ranking task, affect-enriched models achieve the best results,
with affect-enriched paragram embeddings reporting both the best
NDCG and MAP, 0.02 better than the non affect-enriched paragram
embeddings in both metrics.</p>
      <p>We present extensive experiments to evaluate the impact
of affect-enriched word embeddings for information retrieval
over a news corpus, namely ranking and query expansion
implemented using open-source toolkits. We show that using
affect-enriched models yields a significant improvement for
ranking against baseline/vanilla embeddings (~20%) as well as
against other enriched embeddings (~2-10%). In the case of query
expansion, an improvement is observed for the NYT dataset, but
vanilla GloVe embeddings report the highest values for the CACM
dataset. We believe the semantic structure and vocabulary
distribution of the CACM dataset results in this behavior. We
plan to extend this work first towards understanding the role of
semantic information in expansion tasks and then towards
building fusion approaches that leverage enriched word vectors
together with standard IR baselines.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [AHK+17]
          <string-name>
            <surname>James</surname>
            <given-names>Allan</given-names>
          </string-name>
          , Donna Harman, Evangelos Kanoulas,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          , Christophe Van Gysel,
          <string-name>
            <given-names>and Ellen</given-names>
            <surname>Vorhees</surname>
          </string-name>
          .
          <article-title>Trec 2017 common core track overview</article-title>
          .
          <source>In Proc. TREC</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [AMH+17]
          <string-name>
            <surname>Leif</surname>
            <given-names>Azzopardi</given-names>
          </string-name>
          , Yashar Moshfeghi, Martin Halvey, Rami S Alkhawaldeh, Krisztian Balog, Emanuele Di Buccio, Diego Ceccarelli,
          <string-name>
            <surname>Juan M Fernandez-Luna</surname>
            , Charlie Hull,
            <given-names>Jake</given-names>
          </string-name>
          <string-name>
            <surname>Mannix</surname>
          </string-name>
          , et al.
          <article-title>Lucene4ir: Developing information retrieval evaluation resources using lucene</article-title>
          .
          <source>In ACM SIGIR Forum</source>
          , volume
          <volume>50</volume>
          , pages
          <fpage>58</fpage>
          {
          <fpage>75</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>and bibliographic concepts</source>
          .
          <source>Technical report</source>
          , Cornell University,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Lukas</given-names>
            <surname>Galke</surname>
          </string-name>
          , Ahmed Saleh, and
          <string-name>
            <given-names>Ansgar</given-names>
            <surname>Scherp</surname>
          </string-name>
          .
          <article-title>Word embeddings for practical information retrieval</article-title>
          .
          <source>In 47. Jahrestagung der Gesellschaft fur Informatik</source>
          ,
          <source>Informatik</source>
          <year>2017</year>
          , Chemnitz, Germany,
          <source>September 25-29</source>
          ,
          <year>2017</year>
          , pages
          <fpage>2155</fpage>
          {
          <fpage>2167</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Sopan</given-names>
            <surname>Khosla</surname>
          </string-name>
          , Niyati Chhaya, and
          <string-name>
            <given-names>Kushal</given-names>
            <surname>Chawla</surname>
          </string-name>
          .
          <article-title>A 2vec: A ect{enriched distributional word representations</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <volume>2204</volume>
          {
          <fpage>2218</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [MCCD13]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <article-title>Je rey Dean. E cient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Mil95]
          <article-title>George A Miller. Wordnet: a lexical database for english</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ):
          <volume>39</volume>
          {
          <fpage>41</fpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [MNCC16]
          <string-name>
            <given-names>Bhaskar</given-names>
            <surname>Mitra</surname>
          </string-name>
          , Eric Nalisnick, Nick Craswell, and
          <string-name>
            <given-names>Rich</given-names>
            <surname>Caruana</surname>
          </string-name>
          .
          <article-title>A dual embedding space model for document ranking</article-title>
          .
          <source>arXiv preprint arXiv:1602.01137</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [MOT+16]
          <string-name>
            <surname>Nikola</surname>
            <given-names>Mrksic</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diarmuid</surname>
            <given-names>OSeaghdha</given-names>
          </string-name>
          , Blaise Thomson, Milica Gasic,
          <string-name>
            <surname>Lina</surname>
            <given-names>RojasBarahona</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei-Hao</surname>
            <given-names>Su</given-names>
          </string-name>
          , David Vandyke,
          <string-name>
            <surname>Tsung-Hsien Wen</surname>
            , and
            <given-names>Steve</given-names>
          </string-name>
          <string-name>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Counter- tting word vectors to linguistic constraints</article-title>
          .
          <source>In Proceedings of NAACLHLT</source>
          , pages
          <volume>142</volume>
          {
          <fpage>148</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [NWV16]
          <article-title>[Pic97] [PSM14] Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction</article-title>
          .
          <source>arXiv preprint arXiv:1605.07766</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          MIT Press, Cambridge, MA, USA,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Je</surname>
            rey Pennington, Richard Socher, and
            <given-names>Christopher</given-names>
          </string-name>
          <string-name>
            <surname>Manning</surname>
          </string-name>
          . Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          {
          <fpage>1543</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [RGMJ16]
          <string-name>
            <given-names>Dwaipayan</given-names>
            <surname>Roy</surname>
          </string-name>
          , Debasis Ganguly, Mandar Mitra, and Gareth JF Jones.
          <article-title>Representing documents and queries as sets of word embedded vectors for information retrieval</article-title>
          .
          <source>arXiv preprint arXiv:1606.07869</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [RMLH17]
          <string-name>
            <given-names>Navid</given-names>
            <surname>Rekabsaz</surname>
          </string-name>
          , Bhaskar Mitra, Mihai Lupu, and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <article-title>Toward incorporation of relevant documents in word2vec</article-title>
          .
          <source>arXiv preprint arXiv:1707.06598</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [RPMG16]
          <string-name>
            <given-names>Dwaipayan</given-names>
            <surname>Roy</surname>
          </string-name>
          , Debjyoti Paul, Mandar Mitra, and
          <string-name>
            <given-names>Utpal</given-names>
            <surname>Garain</surname>
          </string-name>
          .
          <article-title>Using word embeddings for automatic query expansion</article-title>
          .
          <source>arXiv preprint arXiv:1606.07608</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [SLMJ15]
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Schnabel</surname>
          </string-name>
          , Igor Labutov, David Mimno,
          <string-name>
            <given-names>and Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Evaluation methods for unsupervised word embeddings</article-title>
          .
          <source>In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <volume>298</volume>
          {
          <fpage>307</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [TGH17a]
          <article-title>Julien Tissier, Christophe Gravier, and Amaury Habrard. Dict2vec: Learning word embeddings using lexical dictionaries</article-title>
          .
          <source>In Conference on Empirical Methods in Natural Language Processing (EMNLP</source>
          <year>2017</year>
          ), pages
          <fpage>254</fpage>
          {
          <fpage>263</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [TGH17b]
          <article-title>Julien Tissier, Christopher Gravier, and Amaury Habrard. Dict2vec : Learning word embeddings using lexical dictionaries</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <volume>254</volume>
          {
          <fpage>263</fpage>
          . Association for Computational Linguistics,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [WKB13]
          <article-title>[YFL17] [ZC17] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. Norms of valence, arousal, and dominance for 13,915 english lemmas</article-title>
          .
          <source>Behavior Research Methods</source>
          ,
          <volume>45</volume>
          (
          <issue>4</issue>
          ):
          <volume>1191</volume>
          {
          <fpage>1207</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>Anserini: Enabling the use of lucene for information retrieval research</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <volume>1253</volume>
          {
          <fpage>1256</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <article-title>Relevance-based word embedding</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <volume>505</volume>
          {
          <fpage>514</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>