=Paper=
{{Paper
|id=Vol-2079/paper11
|storemode=property
|title=On Temporally Sensitive Word Embeddings for News Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-2079/paper11.pdf
|volume=Vol-2079
|authors=Taewon Yoon,Sung-Hyon Myaeng,Hyun-Wook Woo,Seung-Wook Lee,Sang-Bum Kim
|dblpUrl=https://dblp.org/rec/conf/ecir/YoonMWLK18
}}
==On Temporally Sensitive Word Embeddings for News Information Retrieval==
Tae-Won Yoon (School of Computing, KAIST, Daejeon, South Korea, dbsus13@kaist.ac.kr)
Sung-Hyon Myaeng (School of Computing, KAIST, Daejeon, South Korea, myaeng@kaist.ac.kr)
Hyun-Wook Woo (Naver Corp., Seongnam-si, South Korea, hw.woo@navercorp.com)
Seung-Wook Lee (Naver Corp., Seongnam-si, South Korea, swook.lee@navercorp.com)
Sang-Bum Kim (Naver Corp., Seongnam-si, South Korea, sangbum.kim@navercorp.com)

Abstract

Word embedding is one of the hot issues in recent natural language processing (NLP) and information retrieval (IR) research because it has the potential to represent text at a semantic level. Current word embedding methods take advantage of term proximity relationships in a large corpus to generate a vector representation of a word in a semantic space. We argue that the semantic relationships among terms should change as time goes by, especially for news IR. With unusual and unprecedented events reported in news articles, for example, the word co-occurrence statistics in the time period covering the events would change non-trivially, affecting the semantic relationships of some words in the embedding space and hence news IR. With the hypothesis that news IR would benefit from changing word embeddings over time, this paper reports our initial investigation along this line. We constructed a news retrieval collection based on mobile search and conducted a retrieval experiment to compare embeddings constructed from two sets of news articles covering two disjoint time spans. The collection comprises the 500 most frequent queries and their clicked news articles in July 2017, provided by Naver Corp. The experimental result shows that word embeddings need to be built in a temporally sensitive way for news IR.

Copyright (c) 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26 March 2018, published at http://ceur-ws.org

1 Introduction

Representing words and texts as vectors has drawn much attention in the natural language processing (NLP) and information retrieval (IR) areas. Various embedding methods for words, sentences, and paragraphs have emerged to represent them in a low-dimensional vector space so that their semantic relationships can be computed [MSC+13, PSM14]. Mikolov et al. [MSC+13] proposed two efficient word-level embedding models, Skip-gram and CBOW, both using an objective function that predicts the relationships of words within a sentence. A different approach, based on matrix factorization over a word-word co-occurrence matrix, was proposed by Pennington et al. [PSM14].

One of the most important issues in building an embedding model is choosing an appropriate corpus for training. There have been several studies on the effect of the type and domain of the training corpus. Lai et al. [LLHZ16] tested five different embedding models with three corpora from different domains (a Wikipedia dump, the NYT corpus, and the IMDB corpus) on eight different tasks. They conclude that the influence of the domain is dominant in most tasks, proving the importance of choosing the right domain. Diaz et al. [DMC16] also showed the importance of using a corpus from the same domain in a query expansion task by comparing different embedding spaces, one trained globally and the other trained on a local, task-specific corpus. They used Skip-gram and Glove as embedding models and five different local corpora for retrieval and embedding training. They found that a locally trained embedding model works much better than a globally trained one in the query expansion task.

Word embeddings may not reflect the dynamic nature of word meanings if a static collection is used for training. It is natural that new words coined with technological advances or emerging cultures can change the word embedding space. Especially in a news corpus that describes new events and contemporary issues, changes in word statistics would be more pronounced, and the word embedding space should change accordingly. With extensive coverage of an unusual real-life event in news articles, such as the Las Vegas shooting in 2017, the semantic distance between terms like "Las Vegas" and "gun control", for example, would become much closer, at least for the time being. We argue that capturing this type of word meaning dynamics should improve news IR and recommendation tasks.

While the aforementioned research showed the importance of considering the domain of the corpus, there has not been much work investigating the importance of the publication time of the corpus for retrieval tasks. As time goes by, the meaning of a word and its relationships to other words change, too. Kulkarni et al. [KARPS15] show that the meanings and usage of words change over time. They analyze the change of word meanings and the relationships between words across time frames. However, they focus on a computational approach to detecting statistically significant linguistic shifts and did not apply the results to retrieval tasks.

We examined the importance of the time periods of the news corpora used for word embedding training by conducting a similarity-based news retrieval experiment based on three different corpora (Korean Wikipedia articles, and news articles from March and July 2017) and two commonly used word embedding models. A news retrieval collection was developed by extracting the 500 most frequently asked queries in July 2017 and their clicked news articles from the click-through data. For evaluation, we used a news retrieval task based on inverse-document-frequency-weighted word centroid similarities (CentIDF), proposed by Brokos et al. [BMA16]. For each query in the retrieval experiment, we ranked the news documents by the cosine similarity between the query embedding and each document embedding and compared the result against the gold standard constructed from the click-through data.
2 Models and Dataset

2.1 Embedding Models

We employed the two most well-known word embedding models: word2vec (skip-gram version) proposed by Mikolov et al. [MSC+13] and Glove by Pennington et al. [PSM14].

Word2vec. This model has two different versions, CBOW and Skip-gram, both of which use the context words of a target word to compute its semantics. CBOW uses the context words as the input and attempts to predict the target word from them. Skip-gram, on the other hand, calculates the probability of the context words occurring given the target word. For optimization, negative sampling or the hierarchical softmax function can be used. Negative sampling is an optimization method that updates not all the words but randomly sampled ones. Hierarchical softmax is a method that keeps the mutual-appearance information of all words in a binary tree to reduce the calculation cost. In our work, we used Skip-gram with negative sampling. (We also tested the CBOW model, but the result is omitted because it shows a similar tendency.)

Glove. This model is based on factorizing a word-word co-occurrence matrix; it converts the word-word co-occurrence information into vectors. After training, the dot product of two word vectors is proportional to the logarithm of the co-occurrence probability of the two words. According to Pennington et al. [PSM14], the Glove model shows superior results in word analogy tasks and is better at preserving semantic word relationships than syntactic ones.
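To make these modeling choices concrete, the following is a minimal sketch (not the authors' code) of how the variants described above map onto gensim's Word2Vec flags. The toy corpus and all token names are invented for illustration, and gensim 4.x parameter names are assumed (older versions use `size` instead of `vector_size`).

```python
from gensim.models import Word2Vec

# Hypothetical toy corpus; real training uses noun sequences from news articles.
toy_corpus = [
    ["las_vegas", "shooting", "gun_control", "debate"],
    ["las_vegas", "casino", "tourism", "hotel"],
    ["gun_control", "law", "debate", "senate"],
]

# Skip-gram with negative sampling (sg=1, hs=0, negative>0),
# the configuration used in the paper.
sg_model = Word2Vec(toy_corpus, sg=1, hs=0, negative=5,
                    vector_size=100, window=5, min_count=1)

# CBOW with hierarchical softmax, for contrast (sg=0, hs=1, negative=0).
cbow_model = Word2Vec(toy_corpus, sg=0, hs=1, negative=0,
                      vector_size=100, window=5, min_count=1)

print(sg_model.wv.similarity("las_vegas", "gun_control"))
```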
2.2 Dataset

Click-through data. In order to evaluate the performance of multiple sets of word embeddings on the retrieval task, we employed a news corpus with news click-through data provided by Naver Corp. (https://www.navercorp.com/en/index.nhn), the biggest portal service provider in South Korea, serving around 42 million users. The news click-through data covers all the mobile search clicks that took place between July 1 and July 9, 2017. The number of records, or clicks, is 53,472,390. The details of the test collection constructed from the click-through data are given in section 3.2.1 below.

July news corpus. This corpus was generated from the news click-through data and used for training. All the clicked news articles were collected regardless of the number of clicks. When the embeddings were constructed, only the nouns extracted from the news text were used. This corpus shares both its domain and its collection time with the retrieval evaluation collection. It consists of 6,011,811 unique news articles with 1,232,910 tokens. (All the datasets used in this paper are in Korean; they are used after extracting nouns with the morphological analyzer provided by Naver Corp. The example terms given in this paper are English translations.)

March news corpus. We collected news articles clicked in March, four months earlier than the period of the evaluation corpus, so that we can examine how the time difference affects the word embedding result in the news domain. Like the July corpus, only the nouns extracted by the morphological analyzer were used. This corpus has the same domain as the retrieval evaluation collection but a different time period. It consists of 10,398,040 unique news articles with 1,381,901 tokens.

Wiki corpus. In order to reconfirm the importance of the training data domain, especially for news IR, we also built a collection of general articles from Korean Wikipedia and Namu-wiki, the most widely used online encyclopedic wiki collections in Korea. Like the news corpora, only the nouns were extracted and used for the word embeddings. A Wikipedia dump (389,584 articles) and a Namu-wiki dump (533,406 articles) were downloaded in December 2017 and March 2017, respectively. Given that the test corpus was based on the queries in July, searching the Wikipedia documents generated at a later time, up to December, gives the effect of searching future data (see Figure 1). While this may seem irrational for news search, it should not affect the experimental result, in that the Wikipedia articles are not very sensitive to time and the number of future articles is relatively small. Namu-wiki played a more dominant role than Wikipedia in that it contains more articles with longer text per article; the Namu-wiki corpus is four times bigger than the Wikipedia corpus. The resulting corpus contains 922,990 articles with 2,167,577 tokens in total.

[Figure 1: The time periods of the corpora used for the experiment. Even though one third of the Wikipedia documents were created after the test set, the future documents amount to only about one tenth of the entire wiki corpus, because Wikipedia makes up only about 30% of the whole collection and only the part written in the months after July lies in the future.]

Table 1: The dataset used for comparisons. All the data were collected in 2017.

| Name        | Domain | Collection Time | # Articles | # Tokens  |
|-------------|--------|-----------------|------------|-----------|
| Wiki corpus | Wiki   | March, December | 922,990    | 2,167,577 |
| March news  | News   | March           | 10,398,040 | 1,381,901 |
| July news   | News   | July            | 6,011,811  | 1,232,910 |
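The morphological analyzer used for noun extraction is proprietary to Naver Corp., so the sketch below uses the open-source KoNLPy Okt tagger purely as a stand-in; the input file name and one-article-per-line format are hypothetical.

```python
from konlpy.tag import Okt

tagger = Okt()

def article_to_nouns(text):
    # Keep only the noun tokens, as done for every corpus in the paper.
    return tagger.nouns(text)

# Hypothetical input format: one news article per line.
with open("clicked_articles_july.txt", encoding="utf-8") as f:
    corpus = [article_to_nouns(line) for line in f]
```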
3 Experiment

The main goal of the experiment is to gain insight into the need to use word embeddings computed from different time periods for news IR, which usually seeks contemporary information, by comparing the word embedding results from the three different types of corpora on a simple news retrieval task. As such, we do not attempt to compare these embedding-based retrieval results against word-based or embedding-based state-of-the-art IR methods. We keep the retrieval process as simple as possible so that we can observe the effect of the different embedding methods on retrieval without interference from other factors devised for retrieval effectiveness.

3.1 Training and Parameter Settings

For training the word embeddings, we used the Python gensim library (https://radimrehurek.com/gensim/) for word2vec and the author-provided code (https://github.com/stanfordnlp/GloVe) for Glove. The parameters for the Skip-gram model are: 300 for the vector dimension, 5 words for the context window size, and 0.0001 for the learning rate; all words that appear fewer than 3 times were ignored. For Glove, we trained with 300 for the vector dimension, 15 for the context window size, and 15 for the maximum number of iterations; all words that appear fewer than 5 times were dropped.
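As a concrete illustration, here is a sketch of a Skip-gram training run with the hyperparameters reported above. The negative-sample count is not reported in the paper and is assumed here; gensim 4.x parameter names are used, and the output path is hypothetical. The Glove training with the Stanford reference code is not shown.

```python
from gensim.models import Word2Vec

# `corpus` is the list of noun-token lists from the preprocessing sketch.
model = Word2Vec(
    corpus,
    sg=1,             # Skip-gram
    negative=5,       # negative sampling; sample count assumed, not reported
    vector_size=300,  # vector dimension reported in the paper
    window=5,         # context window size reported in the paper
    alpha=0.0001,     # learning rate as reported in the paper
    min_count=3,      # drop words appearing fewer than 3 times
)
model.wv.save("w2v_july.kv")  # hypothetical output path
```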
3.2 Evaluation via News Retrieval

3.2.1 Evaluation Set

Based on past research claiming that click-through data can be an alternative way to evaluate retrieval performance [J+03, LFZ+07], we selected the 500 most frequently occurring queries from the news click-through data introduced in section 2.2. Each query was searched at least 6,000 times, 36,521 times on average, and up to about one million times. By taking the union of the clicked news articles, the resulting test collection consists of 500 queries and 17,530 documents that were clicked at least twice by the users who entered the queries. After excluding the news articles that were clicked just once, a query has 33.5 relevant documents on average, with a maximum of 439.

3.2.2 Experimental Setup and Evaluation Metrics

To generate a vector for a query or a news article, we used the TF-IDF weighted word centroid calculation method (CentIDF) proposed by Brokos et al. [BMA16]. (CentIDF is known to be better than the arithmetic mean; the unweighted method was also tried, but without any gain.) A document vector \vec{t} is computed as follows:

\[
\vec{t} = \frac{\sum_{j=1}^{|V|} TF(w_j, t) \cdot IDF(w_j) \cdot \vec{w}_j}{\sum_{j=1}^{|V|} TF(w_j, t) \cdot IDF(w_j)}
\]

where |V| is the vocabulary size of the document t, w_j is the word at the j-th position in t, and \vec{w}_j is its word embedding.

After generating document and query vectors, the news articles are ranked by cosine similarity to each query vector, and the ranked list is used as the search result for the query. For comparisons among the different embedding results, we use three commonly used evaluation metrics based on binary relevance decisions: precision at 10, mean average precision (MAP), and NDCG at 10.
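A minimal numpy sketch of the CentIDF ranking just described, under assumed data structures (token lists for queries and documents, plus word-to-vector and word-to-IDF mappings); this is an illustration, not the authors' implementation.

```python
import numpy as np

def centidf_vector(tokens, wv, idf):
    """TF-IDF weighted centroid of the word vectors of `tokens`.

    wv:  mapping word -> embedding vector (e.g., gensim KeyedVectors)
    idf: mapping word -> inverse document frequency
    """
    words = [w for w in tokens if w in wv and w in idf]
    if not words:
        return None
    tf = {w: words.count(w) for w in set(words)}
    numer = sum(tf[w] * idf[w] * np.asarray(wv[w]) for w in tf)
    denom = sum(tf[w] * idf[w] for w in tf)
    return numer / denom

def rank_documents(query_tokens, docs, wv, idf):
    """Return document indices sorted by cosine similarity to the query centroid."""
    q = centidf_vector(query_tokens, wv, idf)
    scores = []
    for i, doc in enumerate(docs):
        d = centidf_vector(doc, wv, idf)
        if q is None or d is None:
            scores.append((0.0, i))
        else:
            cos = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
            scores.append((cos, i))
    return [i for _, i in sorted(scores, key=lambda s: s[0], reverse=True)]
```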
3.2.3 Analysis of Retrieval Performance

The overall comparisons among the three corpora are summarized in Table 2 for the two embedding models. For the Skip-gram model, the MAP of the model trained on the July corpus is 5.5% better than that of the model trained on the March corpus, even though the time difference is only four months. The improvement is as high as 12% when compared to the model trained on the general corpus (the wiki corpus), i.e., on a different document type and domain. For the Glove model, the MAP of the model trained on the July corpus is about 5.5% better than those of the models trained on the general corpus and on the March corpus. This strongly suggests that it is critical to build embeddings with a corpus from a similar time period for news retrieval.

The Skip-gram model is more sensitive to the domain than the Glove model. This is because the Glove model is better at extracting semantic relationships among words than syntactic ones, so the stylistic differences between the Wiki corpus and the March news corpus (which has no temporal advantage) matter less. For the Skip-gram model, on the contrary, the writing style of the Namu-wiki corpus, which is sometimes informal with miscellaneous information and Internet slang, makes the Wiki corpus result worse than the March corpus. This suggests that it is also critical to build embeddings with a corpus of a similar domain and writing style when the Skip-gram model is used.

An important finding is that regardless of the metric used, the July corpus gave the best results. While this is somewhat expected at an abstract level, it provides an important insight into the use of embeddings for IR. Using embeddings as opposed to words would increase recall, perhaps at the expense of lower precision, because of flexible matches. However, the experimental result shows increased precision when a more contemporary corpus is used for embedding construction. This suggests that the embeddings constructed from the same time period better reflect the semantics of the words used by the users. Given that the embeddings capture the context of a target word, two words appearing in close proximity in a corpus would share similar semantics. This has the effect of retrieving news articles that may not contain the exact query word (hence higher recall) and of reinforcing their relevance with matched related words from the right context (hence higher precision).

Table 2: Evaluating embedding models on the news retrieval task. Both CentIDF and the arithmetic mean are used for sentence embedding. The best result for each metric (bold-faced in the original paper) comes from Skip-gram trained on the July corpus with CentIDF.

CentIDF
| Model                 | Precision@10 | NDCG@10 | MAP    |
|-----------------------|--------------|---------|--------|
| Glove (wikipedia)     | 0.7114       | 0.7654  | 0.6192 |
| Glove (March)         | 0.7046       | 0.7600  | 0.6188 |
| Glove (July)          | 0.7300       | 0.7776  | 0.6533 |
| Skip-gram (wikipedia) | 0.6915       | 0.7509  | 0.5939 |
| Skip-gram (March)     | 0.7203       | 0.7719  | 0.6317 |
| Skip-gram (July)      | 0.7399       | 0.7841  | 0.6666 |

Arithmetic Mean
| Model                 | Precision@10 | NDCG@10 | MAP    |
|-----------------------|--------------|---------|--------|
| Glove (wikipedia)     | 0.6015       | 0.6518  | 0.5138 |
| Glove (March)         | 0.6023       | 0.6529  | 0.5263 |
| Glove (July)          | 0.6612       | 0.7018  | 0.5948 |
| Skip-gram (wikipedia) | 0.5658       | 0.5193  | 0.4763 |
| Skip-gram (March)     | 0.6706       | 0.7147  | 0.5866 |
| Skip-gram (July)      | 0.7090       | 0.7491  | 0.6404 |
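For reference, the per-query computations behind the three metrics in Table 2 can be sketched as follows, assuming binary relevance as in the paper; `ranked` is the ranked list of document ids from the retrieval step and `relevant` is the set of clicked documents for the query. This is a generic illustration, not the authors' evaluation code; means over the 500 queries give the table entries.

```python
import math

def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    # Mean of precision values at each rank where a relevant doc appears.
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k=10):
    # Binary-gain DCG at k, normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```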
3.2.4 Qualitative Analysis

In order to better understand the effect of the different corpora on the embeddings, and potentially on retrieval, we picked two time-sensitive queries corresponding to two separate sensational incidents in Korea between July 1 and July 9, and, for each of the three corpora, computed the cosine similarity between the embedding of the query and those of other words to rank them. The first query was related to a claim made by several parents that McDonald's hamburgers caused "hamburger disease" (hemolytic uremic syndrome) (http://koreaherald.com/view.php?ud=20170705000868); the other was the kidnapping and murder of an eight-year-old elementary school girl by teenagers (http://koreaherald.com/view.php?ud=20170330000938). Table 3 shows the top ten closest words under each corpus for the two queries.

Table 3: Top ten similar terms obtained from the three corpora for the two sample queries "Hamburger disease (hemolytic uremic syndrome)" and "Incheon kid murder". The expected intent-aware words are marked with '*'.

query: "Hamburger disease (hemolytic uremic syndrome)"
| Rank | Wiki corpus         | March corpus             | July corpus       |
|------|---------------------|--------------------------|-------------------|
| 1    | Swing-top           | 215.8g                   | Hematotoxic*      |
| 2    | Substitute (food)   | Burger*                  | Hemolytic*        |
| 3    | Cancer              | Synchytrium endobioticum | Uremic*           |
| 4    | Soy-sauce bottle    | Burger King              | Basedow's disease |
| 5    | Celiac sprue        | Maclab                   | Chagas disease    |
| 6    | Basedow's disease   | Beef                     | Maclab            |
| 7    | Taste               | Mayagbingsso             | 215.8g            |
| 8    | Bread               | Fast (food)*             | Haemolyticity*    |
| 9    | Parkinson's disease | BigKing                  | McDonald*         |
| 10   | DOMDOM (burger)     | Kim Kyo Bun              | Uremicity*        |

query: "Incheon kid murder"
| Rank | Wiki corpus   | March corpus           | July corpus      |
|------|---------------|------------------------|------------------|
| 1    | Jung Duk Soon | Bupyeong               | Murderer*        |
| 2    | Park Nari     | Kidnap (while sleeping)| Elementary girl* |
| 3    | Lee Duek Hwa  | Before murder*         | Final Verdict*   |
| 4    | Woo Jung Sun  | Doodle (river)         | Killer*          |
| 5    | Yang Jiseung  | Taheutajeu             | Don-Am dong      |
| 6    | Wentu Antu    | After murder*          | Park Chun Pung   |
| 7    | Gak Jae Eun   | Palda (mountain)       | Incite Criminal* |
| 8    | Oh Jong Guen  | Siha (lake)            | John Odgren      |
| 9    | Song Yung Cil | Elementary girl*       | Live-in lover    |
| 10   | Lee Wan Hue   | Re-phase               | Kidnap*          |

For the "Hamburger disease" query, the result of Skip-gram trained on the wiki corpus consists of words that are generally related to each of the query words. Some are related to food (e.g., "Swing-top", "Substitute (food)", "Soy-sauce bottle", "Taste", "Bread"), while others are related to a disease (e.g., "Cancer", "Basedow's disease", "Celiac sprue", "Parkinson's disease"). But none of them are directly relevant to the intent of the query, unlike terms such as "Hemolytic", "Uremic", and "McDonald". The result does not even contain words about "Burger" itself, only words about the general notions of "Food" and "Disease". It is obvious that the embeddings constructed from the Wiki corpus would bring in noise for news retrieval.

The result under the March corpus is completely different in the sense that words about "Hamburger" were picked up, so the embedding space is much more focused on contemporary issues in general. Since the "Hamburger disease" event had not yet occurred in March, however, none of the words are relevant to the query. It is very clear that the model trained on the July corpus gave the best result, including the six intent-aware words marked with an asterisk.

For the "Incheon kid murder" query, the Skip-gram model trained on the wiki corpus gives a result consisting of perpetrators and victims of murders in Korea, especially in Incheon, which would be good search terms if the intent were to retrieve general information rather than news about this specific event; this is because the corpus contains articles about individual murder cases. The March corpus, on the other hand, gave completely different words related to descriptions of various murder cases, such as "Kidnap (while sleeping)", "Before murder", and "After murder", contributing to the better retrieval result in the experiment. The model trained on the July corpus shows the most meaningful result, containing six intent-aware words that would help retrieve relevant news articles.

While anecdotal, the examples in Table 3 constitute a strong indication that it is critical in news IR to use a corpus that coincides in time with time-sensitive queries. The embedding space would be entirely different from that of the same news corpus covering a different time period, giving very different similarity relationships among words. As another example, we tested "presidential impeachment" as a query, which was a very sensational incident in March. We observe that the Skip-gram result trained on the wiki corpus contains words unrelated to the query, such as words about presidential impeachments that took place in other countries (e.g., "Dilma Vana Rousseff", the former president of Brazil). The result under the March corpus is slightly better than the result under the July corpus, since that incident took place in that specific time period.
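The per-corpus neighbor lists in Table 3 can be reproduced, in outline, by querying each trained embedding space for its nearest terms. The sketch below uses hypothetical model file names, and "hamburger_disease" is only an English placeholder for the actual Korean query term.

```python
from gensim.models import KeyedVectors

# Hypothetical file names for the three embedding spaces.
for name in ("w2v_wiki.kv", "w2v_march.kv", "w2v_july.kv"):
    wv = KeyedVectors.load(name)
    print(name)
    for term, score in wv.most_similar("hamburger_disease", topn=10):
        print(f"  {term}\t{score:.3f}")
```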
4 Conclusion and Future Work

Given that timeliness is a rather unique aspect of news IR, word embeddings should be constructed in such a way that they reflect the evolving word-to-word relationships caused by emerging events and issues. Beginning with this hypothesis, we set out to build embeddings based on news corpora from different time periods, as well as on an encyclopedic corpus as a baseline for comparison, expecting that word embeddings constructed from a temporally close corpus would help retrieve more relevant news articles than those based on temporally disparate documents.

We conducted an experiment with a newly constructed news IR collection and a simple retrieval process using cosine similarity over word-embedding matches, along with a qualitative analysis of the pseudo-expansion of query terms. The result clearly shows that it is worth constructing and using a corpus of temporally close news articles for news IR, especially when word embeddings are used. The qualitative analysis of the two sample queries strongly suggests that the semantic relationships among words change appropriately with different corpora, so that useful terms can be automatically generated for query expansion when the temporal and domain aspects of the corpora match those of the queries.

The initial result reported in this paper needs to be extended in a number of different ways. Just to name a few, we first need to be able to suggest the appropriate time periods at which a new embedding space must be created for news IR. Another immediate question is how we can avoid constructing new embeddings from scratch when we already have the embeddings for a series of past time spans. We are currently in the process of utilizing the past click-through data to capture the dynamic meaning changes across time periods.

Acknowledgment

This research was supported by the Naver Corp. and by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science & ICT (2017M3C4A7065963). Any opinions, findings, and conclusions expressed in this material do not necessarily reflect those of the sponsors.

References

[BMA16] Georgios-Ioannis Brokos, Prodromos Malakasiotis, and Ion Androutsopoulos. Using centroids of word embeddings and word mover's distance for biomedical document retrieval in question answering. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing (BioNLP@ACL 2016), Berlin, Germany, pages 114-118, 2016.

[DMC16] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, Berlin, Germany, 2016.

[J+03] Thorsten Joachims et al. Evaluating retrieval performance using clickthrough data. 2003.

[KARPS15] Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web (WWW), pages 625-635, 2015.

[LFZ+07] Yiqun Liu, Yupeng Fu, Min Zhang, Shaoping Ma, and Liyun Ru. Automatic search engine performance evaluation with click-through data analysis. In Proceedings of the 16th International Conference on World Wide Web, pages 1133-1134. ACM, 2007.

[LLHZ16] Siwei Lai, Kang Liu, Shizhu He, and Jun Zhao. How to generate a good word embedding. IEEE Intelligent Systems, 31(6):5-14, 2016.

[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.

[PSM14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.