<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Temporally Sensitive Word Embeddings for News Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tae-Won Yoon Sung-Hyon Myaeng</string-name>
          <email>dbsus13@kaist.ac.kr myaeng@kaist.ac.kr Seung-Wook Lee Naver Corp. Seongnam-si, South Korea swook.lee@navercorp.com</email>
          <email>myaeng@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sang-Bum Kim</string-name>
          <email>sangbum.kim@navercorp.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hyun-Wook Woo</string-name>
          <email>hw.woo@navercorp.com</email>
          <email>swook.lee@navercorp.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez,</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18, Workshop at ECIR</institution>
          ,
          <addr-line>Grenoble, France, 26-March-2018, published at http://ceur-ws.org</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Naver Corp.</institution>
          ,
          <addr-line>Seongnam-si</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computing School of Computing, KAIST KAIST</institution>
          ,
          <addr-line>Daejeon</addr-line>
          ,
          <country>South</country>
          <addr-line>Korea Daejeon</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Word embedding is one of the hot issues in recent natural language processing (NLP) and information retrieval (IR) research because it has the potential to represent text at a semantic level. Current word embedding methods take advantage of term proximity relationships in a large corpus to generate a vector representation of a word in a semantic space. We argue that the semantic relationships among terms should change as time goes by, especially for news IR. With unusual and unprecedented events reported in news articles, for example, the word co-occurrence statistics in the time period covering the events would change non-trivially, affecting the semantic relationships of some words in the embedding space and hence news IR. With the hypothesis that news IR would benefit from changing word embeddings over time, this paper reports our initial investigation along this line. We constructed a news retrieval collection based on mobile search and conducted a retrieval experiment to compare the embeddings constructed from two sets of news articles covering two disjoint time spans. The collection is comprised of the 500 most frequent queries and their clicked news articles in July 2017, provided by Naver Corp. The experimental result shows that there is a need for word embeddings to be built in a temporally sensitive way for news IR.</p>
      </abstract>
      <conference>
        <conf-name>NewsIR'18, Workshop at ECIR</conf-name>
        <conf-date>26 March 2018</conf-date>
        <conf-loc>Grenoble, France</conf-loc>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Word embedding is one of the hot issues in
recent natural language processing (NLP) and
information retrieval (IR) research because it
has a potential to represent text at a semantic
level. Current word embedding methods take
advantage of term proximity relationships in
a large corpus to generate a vector
representation of a word in a semantic space. We
argue that the semantic relationships among
terms should change as time goes by,
especially for news IR. With unusual and
unprecedented events reported in news articles, for
example, the word co-occurrence statistics in the
time period covering the events would change
non-trivially, a ecting the semantic
relationships of some words in the embedding space
and hence news IR. With a hypothesis that
news IR would bene t from changing word
embeddings over time, this paper reports our
initial investigation along the line. We
constructed a news retrieval collection based on
mobile search and conducted a retrieval
experiment to compare the embeddings constructed
Copyright c 2018 for the individual papers by the papers'
authors. Copying permitted for private and academic purposes.
This volume is published and copyrighted by its editors.
from two sets of news articles covering two
disjoint time spans. The collection is comprised
of 500 most frequent queries and their clicked
news articles in July, 2017, provided by Naver
Corp. The experimental result shows there is
a need for word embeddings to be built in a
temporally sensitive way for news IR.
1</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>The method of representing words and texts as vectors has drawn much attention in the natural language processing (NLP) and information retrieval (IR) areas. Various embedding methods for words, sentences, and paragraphs have emerged to represent them in a low-dimensional vector space so that their semantic relationships can be computed [MSC+13, PSM14]. Mikolov et al. [MSC+13] proposed two efficient word-level embedding models, Skip-gram and CBOW, both using an objective function to predict the relationship of words in a sentence. A different approach, based on factorizing a word-word co-occurrence matrix, was proposed by Pennington et al. [PSM14].</p>
      <p>One of the most important issues in building an embedding model is choosing an appropriate corpus for training. There have been several studies on the effect of employing corpora of different types and domains for training embeddings. Lai et al. [LLHZ16] tested five different embedding models with corpora from three different domains (a Wikipedia dump, the NYT corpus, and the IMDB corpus) on eight different tasks. They conclude that the influence of the domain is dominant in most tasks, proving the importance of choosing the right domain. Diaz et al. [DMC16] also showed the importance of using a corpus from the same domain in a query expansion task by comparing different embedding spaces, one trained globally and the other trained on a local task-specific corpus. Using Skip-gram and GloVe as embedding models and five different local corpora for retrieval and embedding training, they found that a locally trained embedding model works much better than a globally trained one in the query expansion task.</p>
      <p>Word embeddings may not reflect the dynamic nature of word meanings if a static collection is used for training. It is natural that new words coined with technological advances or emerging cultures can change the word embedding space. Especially in a news corpus that describes new events and contemporary issues, changes in word statistics would be more pronounced, and the word embedding space should change accordingly. With extensive coverage of an unusual real-life event in news articles, such as the terror attack in Las Vegas in 2017, the semantic distance between terms like Las Vegas and gun control, for example, would become much closer, at least for the time being. We argue that capturing this type of word meaning dynamics should improve news IR and recommendation tasks.</p>
      <p>While the aforementioned research showed the importance of considering the domain of the corpus, there has not been much work investigating the importance of the publication time of the corpus for retrieval tasks. As time goes by, the meaning of a word and its relationships to other words change, too. Kulkarni et al. [KARPS15] show that the meaning and usage of words change over time; they analyze the change of word meanings and the relationships between words across time frames. However, they focus on a computational approach to detect statistically significant linguistic shifts and did not apply the results to retrieval tasks.</p>
      <p>We examined the importance of the time periods of the news corpora used for word embedding training by conducting a similarity-based news retrieval experiment based on three different corpora (Korean Wikipedia articles and news articles from March and from July 2017) and two commonly used word embedding models. A news retrieval collection was developed by extracting the 500 most frequently asked queries in July 2017 and their clicked news articles from the click-through news data. For evaluation, we used a news retrieval task based on inverse document frequency weighted word centroid similarities (CentIDF), proposed by Brokos et al. [BMA16]. For each query in the retrieval experiment, we ranked the news documents based on the cosine similarity between the query embedding and a document embedding and compared the result against the gold standard constructed from the click-through data.</p>
    </sec>
    <sec id="sec-3">
      <title>Models and Dataset</title>
      <sec id="sec-3-1">
        <title>Embedding Models</title>
        <p>We employed the two most well-known word embedding models: word2vec (Skip-gram version) proposed by Mikolov et al. [MSC+13] and GloVe by Pennington et al. [PSM14].</p>
        <p>Word2vec. This model has two different versions, CBOW and Skip-gram, both of which use the context words of a target word to compute its semantics. CBOW uses the context words as the input and attempts to predict the target word from them. Skip-gram, on the other hand, predicts the context words given the target word. For optimization, either negative sampling or a hierarchical softmax function can be used. Negative sampling is an optimization method that updates not all output words but only randomly sampled ones. Hierarchical softmax is a method that organizes the output vocabulary into a binary tree to reduce the calculation cost. In our work, we used Skip-gram with negative sampling. (We also tested the CBOW model, but the results are omitted because they show a similar tendency.)</p>
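        <p>As a concrete illustration, the following minimal sketch (our own, not the authors' code) trains Skip-gram with negative sampling on pre-tokenized noun sequences; the gensim 4.x API is assumed, and the toy corpus is a placeholder.</p>
        <preformat>
# Minimal sketch: Skip-gram with negative sampling (gensim 4.x API
# assumed; the toy corpus is a placeholder, not the paper's data).
from gensim.models import Word2Vec

corpus = [
    ["las_vegas", "shooting", "gun_control"],
    ["las_vegas", "casino", "tourism"],
]  # one list of extracted nouns per article

model = Word2Vec(
    sentences=corpus,
    sg=1,             # 1 = Skip-gram (0 = CBOW)
    negative=5,       # negative sampling with 5 noise words per example
    hs=0,             # hierarchical softmax disabled
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # keep every word in this toy corpus
)
print(model.wv.most_similar("las_vegas", topn=2))
        </preformat>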
        <p>GloVe. This model is based on factorizing a word-word co-occurrence matrix: it converts the word-word co-occurrence information into vectors. After training, the dot product of two word vectors is proportional to the logarithm of the co-occurrence probability of the two words. According to Pennington et al. [PSM14], the GloVe model shows superior results in word analogy tasks and is better at preserving semantic word relationships than syntactic ones.</p>
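        <p>For reference, a sketch of loading the text-format vectors produced by the GloVe reference implementation so they can be used in the same pipeline; the gensim 4.x API (whose no_header option accepts the header-less GloVe format) and the file name are assumptions.</p>
        <preformat>
# Sketch: loading GloVe text-format vectors (no header line) with
# gensim 4.x; "glove_vectors.txt" is a hypothetical placeholder.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "glove_vectors.txt", binary=False, no_header=True)
print(vectors.similarity("news", "article"))
        </preformat>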
      </sec>
      <sec id="sec-3-2">
        <title>Dataset</title>
        <p>Click-through data. In order to evaluate the performance of multiple sets of word embeddings for the retrieval task, we employed a news corpus with news click-through data provided by Naver Corp. (https://www.navercorp.com/en/index.nhn), the biggest portal service provider in South Korea, serving around 42 million users. The news click-through data covers all the mobile search clicks that took place between July 1 and July 9, 2017. The number of records, or clicks, is 53,472,390. The details of the test collection constructed from the click-through data are in section 3.2.1 below.</p>
        <p>July news corpus. This corpus was generated from the news click-through data and used for training. All the clicked news articles were collected regardless of the number of clicks. When the embeddings were constructed, only the nouns extracted from the news text were used. This corpus shares the same domain and collection time with the retrieval evaluation collection. It consists of 6,011,811 unique news articles with 1,232,910 tokens. (All the datasets used in this paper are in Korean; they are used after extracting nouns with the morphological analyzer provided by Naver Corp., and the term examples given in this paper are English translations.)</p>
        <p>March news corpus. We collected the news articles clicked in March, four months earlier than the period of the evaluation corpus, so that we can examine how the time difference affects the word embedding result in the news domain. Like the July corpus, only the nouns extracted by a morphological analyzer were used. This corpus has the same domain as the retrieval evaluation collection but a different time period. It consists of 10,398,040 unique news articles with 1,381,901 tokens.</p>
        <p>Wiki corpus. In order to reaffirm the importance of the training data domain, especially for news IR, we also built a collection of general articles from Korean Wikipedia and Namu-wiki, which are the most widely used online encyclopedic wiki collections in Korea. Like the news corpora, only the nouns were extracted and used for word embeddings. A Wikipedia dump (389,584 articles) and a Namu-wiki dump (533,406 articles) were downloaded in December 2017 and March 2017, respectively. Given that the test corpus was based on the queries in July, searching the Wikipedia documents generated at a later time, up to December, gives the effect of searching future data (see Fig. 1). While this may seem irrational for news search, it should not affect the experimental result in that the Wikipedia articles are not very sensitive to time and the number of future articles is relatively small. Namu-wiki played a more dominant role than Wikipedia in that the former contains more articles with longer text per article; the total size of the Namu-wiki corpus is four times that of the Wikipedia corpus. The resulting corpus contains 922,990 articles with 2,167,577 tokens in total.</p>
        <p>The main goal of the experiment is to gain insight into the need to use word embeddings computed from different time periods for news IR, which usually seeks contemporary information, by comparing the word embedding results from the three different types of corpora for a simple news retrieval task. As such, we do not attempt here to compare these embedding-based retrieval results against either word-based or embedding-based state-of-the-art IR methods. We make the retrieval process as simple as possible so that we can observe the effect of different embedding methods on the retrieval process without interference from other factors that have been devised for retrieval effectiveness.</p>
        <p>For training the word embeddings, we used the Python gensim library (https://radimrehurek.com/gensim/) for word2vec and the author-provided code (https://github.com/stanfordnlp/GloVe) for GloVe. The parameters for the Skip-gram model are: 300 for the vector dimension, 5 words for the context window size, and 0.0001 for the learning rate. All words that appear fewer than 3 times were ignored. For GloVe, we trained with 300 for the vector dimension, 15 for the context window size, and 15 for the maximum number of iterations. All words that appear fewer than 5 times were dropped.</p>
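        <p>A sketch of the Skip-gram setup with the parameters stated above; gensim 4.x parameter names are assumed, and the corpus file name is a hypothetical placeholder.</p>
        <preformat>
# Skip-gram training with the parameters stated in the text (gensim
# 4.x names assumed; "july_news_nouns.txt" is a hypothetical file with
# one space-separated noun sequence per line).
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="july_news_nouns.txt",
    sg=1,             # Skip-gram with negative sampling (hs=0 default)
    vector_size=300,  # vector dimension
    window=5,         # context window size
    alpha=0.0001,     # learning rate
    min_count=3,      # ignore words appearing fewer than 3 times
)
model.save("w2v_july.model")
        </preformat>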
        <p>Based on past research claiming that click-through data can be an alternative way to evaluate retrieval performance [J+03, LFZ+07], we selected the 500 most frequently occurring queries from the news click-through data introduced in section 2.2. The queries were searched at least 6,000 times each, 36,521 times on average, and up to about one million times. By taking the union of the clicked news articles, the resulting test collection consists of 500 queries and 17,530 documents that were clicked at least twice by the users who entered the queries to the search engine. After excluding the news articles that were clicked just once, a query has 33.5 relevant documents on average, with a maximum of 439.</p>
        <p>To generate a vector for a query or a news article, we used the TF-IDF weighted word centroid calculation method (CentIDF) proposed by Brokos et al. [BMA16]. (CentIDF is known to be better than the plain arithmetic mean; an unweighted centroid was also tried, but without any gain.) A text vector is computed as follows:</p>
        <disp-formula>
          <tex-math><![CDATA[
\vec{t} = \frac{\sum_{j=1}^{|V|} TF(w_j, t)\, IDF(w_j)\, \vec{w}_j}{\sum_{j=1}^{|V|} TF(w_j, t)\, IDF(w_j)}
          ]]></tex-math>
        </disp-formula>
        <p>where |V| is the vocabulary size of the text t, w<sub>j</sub> is the word at the j-th position in t, TF(w<sub>j</sub>, t) is its term frequency in t, IDF(w<sub>j</sub>) is its inverse document frequency, and the weighted word embeddings are averaged to produce the text vector.</p>
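        <p>The equation translates directly into code; the following sketch (our own, not the authors' implementation) computes the CentIDF vector of a tokenized text, assuming precomputed embedding and IDF lookups.</p>
        <preformat>
# CentIDF sketch: the IDF-weighted centroid of the word embeddings of
# a text. `embeddings` (word -> np.ndarray) and `idf` (word -> float)
# are assumed to be precomputed from the collection.
import numpy as np
from collections import Counter

def centidf_vector(tokens, embeddings, idf, dim=300):
    numerator = np.zeros(dim)
    denominator = 0.0
    for word, freq in Counter(tokens).items():  # freq = TF(w_j, t)
        if word in embeddings and word in idf:
            weight = freq * idf[word]           # TF(w_j, t) * IDF(w_j)
            numerator += weight * embeddings[word]
            denominator += weight
    return numerator / denominator if denominator > 0 else numerator
        </preformat>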
        <p>After generating the document and query vectors, the news articles are ranked according to their cosine similarity with each query vector. The ranked list of news articles is used as the search result for the query. For comparisons among the different embedding results, we use three commonly used evaluation metrics: precision at 10, mean average precision (MAP), and NDCG at 10, based on binary relevance judgments.</p>
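        <p>For concreteness, a sketch of the ranking step and of precision at 10 under binary relevance, with document and query vectors assumed to come from the CentIDF computation above.</p>
        <preformat>
# Sketch: rank documents by cosine similarity to the query vector and
# score the ranking with precision at 10 against the clicked set.
import numpy as np

def cosine(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / norm if norm else 0.0

def rank_documents(query_vec, doc_vecs):
    # doc_vecs: dict mapping a document id to its CentIDF vector
    return sorted(doc_vecs,
                  key=lambda d: cosine(query_vec, doc_vecs[d]),
                  reverse=True)

def precision_at_10(ranked_ids, relevant_ids):
    return sum(1 for d in ranked_ids[:10] if d in relevant_ids) / 10.0
        </preformat>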
      </sec>
      <sec id="sec-3-3">
        <title>Analysis of Retrieval Performance</title>
        <p>The overall comparisons among the three different corpora are summarized in Table 2 for the two embedding models. For the Skip-gram model, the MAP of the model trained on the July corpus is 5.5% better than that of the model trained on the March corpus, although the time difference is only four months. The improvement is as high as 12% when compared to the result trained on a general corpus (the wiki corpus), i.e. on a different document type or domain. For the GloVe model, the MAP of the model trained on the July corpus is about 5.5% better than both the model trained on the general corpus and the model trained on the March corpus. This strongly suggests that it is critical to build embeddings with a corpus from a similar time period for news retrieval.</p>
        <p>The Skip-gram model is more sensitive to the domain than the GloVe model. This is because the GloVe model is better at extracting semantic relationships among words than syntactic ones; that is, the stylistic differences between the Wiki corpus and the March news corpus (which has no temporal benefit) matter less for it. For the Skip-gram model, on the contrary, the writing style of the Namu-wiki corpus, which is sometimes informal with miscellaneous information and Internet slang, makes the Wiki corpus result worse than the March corpus. This suggests that it is critical to build embeddings with a corpus of a similar domain and writing style when the Skip-gram model is used.</p>
        <p>An important finding is that, regardless of the metric used, the July corpus gave the best results. While this is somewhat expected at an abstract level, it provides an important insight on the use of embeddings for IR. Using embeddings as opposed to words would increase recall, perhaps at the expense of lower precision, because of flexible matches. However, the experimental result shows increased precision with a more contemporary corpus used for embedding construction. This suggests that the embeddings constructed from the same time period better reflect the semantics of the words used by the users. Given that embeddings capture the context of a target word, two words appearing in close proximity in a corpus would share similar semantics. This would have the effect of retrieving news articles that may not contain the exact query word (hence higher recall) and of reinforcing their relevance with the matched related words of the right context (hence higher precision).</p>
        <p>In order to better understand the effect of different corpora on embeddings and potentially on retrieval, we picked two time-sensitive queries corresponding to two separate sensational incidents in Korea between July 1 and July 9, and, for each of the three corpora, computed the cosine similarity between the embedding of each query and those of other words to rank them. The first query was related to a claim made by several parents that McDonald's hamburgers caused a "hamburger disease" (hemolytic uremic syndrome) (http://koreaherald.com/view.php?ud=20170705000868), and the other to a second incident (http://koreaherald.com/view.php?ud=20170330000938).</p>
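        <p>A sketch of this qualitative comparison: for each corpus-specific model, list the vocabulary words closest to a query term (model file names and the query token are hypothetical placeholders).</p>
        <preformat>
# Sketch: nearest-neighbor words of a query term under embeddings
# trained on different corpora (gensim 4.x assumed; file names and the
# query token are hypothetical placeholders).
from gensim.models import Word2Vec

for corpus_name in ("wiki", "march", "july"):
    model = Word2Vec.load(f"w2v_{corpus_name}.model")
    if "hamburger_disease" in model.wv:
        print(corpus_name,
              model.wv.most_similar("hamburger_disease", topn=5))
        </preformat>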
      </sec>
      <sec id="sec-4">
        <title>Conclusion</title>
        <p>Given that timeliness is a rather unique aspect of news IR, word embeddings should be constructed in such a way that they reflect the evolving word-to-word relationships caused by emerging events and issues. Beginning with this hypothesis, we set out to build embeddings based on news corpora of different time periods, as well as on an encyclopedic corpus as a baseline for comparison, expecting to see that word embeddings constructed from a temporally close corpus would help retrieve more relevant news articles than those based on temporally disparate documents.</p>
        <p>We conducted an experiment with a newly constructed news IR collection and a simple retrieval process using the cosine similarity measure for word embedding matches, as well as a qualitative analysis of the pseudo-expansion of query terms. The results clearly show that it is worth constructing and using a corpus of temporally close news articles for news IR, especially when word embeddings are used. The qualitative analysis of the two sample queries strongly suggests that the semantic relationships among words change appropriately with different corpora, so that useful terms can be automatically generated for query expansion if the temporal and domain aspects of the corpora match those of the queries.</p>
        <p>The initial results reported in this paper need to be expanded in a number of different ways. Just to name a few, we first need to be able to suggest the appropriate time periods by which a new embedding space must be created for news IR. Another immediate question is in what ways we can avoid constructing new embeddings from scratch when we have the embeddings for a series of past time spans. We are currently in the process of utilizing past click-through data to capture the dynamic meaning changes across time periods.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Acknowledgment</title>
        <p>This research was supported by the Naver Corp. and the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science &amp; ICT (2017M3C4A7065963). Any opinions, findings, and conclusions expressed in this material do not necessarily reflect the views of the sponsors.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [KARPS15]
          <string-name>
            <given-names>Vivek</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          , Rami Al-Rfou,
          <string-name>
            <given-names>Bryan</given-names>
            <surname>Perozzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Skiena</surname>
          </string-name>
          .
          <article-title>Statistically signi cant detection of linguistic change</article-title>
          .
          <source>In Proceedings of the 24th International Conference on World Wide Web(WWW)</source>
          , pages
          <fpage>625</fpage>
          {
          <fpage>635</fpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Yiqun</given-names>
            <surname>Liu</surname>
          </string-name>
          , Yupeng Fu, Min Zhang, Shaoping Ma, and
          <string-name>
            <given-names>Liyun</given-names>
            <surname>Ru</surname>
          </string-name>
          .
          <article-title>Automatic search engine performance evaluation with click-through data analysis</article-title>
          .
          <source>In Proceedings of the 16th international conference on World Wide Web</source>
          , pages
          <volume>1133</volume>
          {
          <fpage>1134</fpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Siwei</given-names>
            <surname>Lai</surname>
          </string-name>
          , Kang Liu, Shizhu He, and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>How to generate a good word embedding</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>31</volume>
          (
          <issue>6</issue>
          ):5{
          <fpage>14</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [MSC+13]
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Je</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>3111</volume>
          {
          <fpage>3119</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [PSM14]
          <article-title>Je rey Pennington</article-title>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          {
          <fpage>1543</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
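      <ref id="ref6">
        <mixed-citation>[BMA16] <string-name><given-names>Georgios-Ioannis</given-names> <surname>Brokos</surname></string-name>, Polyvios Malakasiotis, and Ion Androutsopoulos. <article-title>Using centroids of word embeddings and word mover's distance for biomedical document retrieval in question answering</article-title>. <source>In Proceedings of the 15th Workshop on Biomedical Natural Language Processing (BioNLP)</source>, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[DMC16] <string-name><given-names>Fernando</given-names> <surname>Diaz</surname></string-name>, Bhaskar Mitra, and Nick Craswell. <article-title>Query expansion with locally-trained word embeddings</article-title>. <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)</source>, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[J+03] <string-name><given-names>Thorsten</given-names> <surname>Joachims</surname></string-name> et al. <article-title>Evaluating retrieval performance using clickthrough data</article-title>, <year>2003</year>.</mixed-citation>
      </ref>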
    </ref-list>
  </back>
</article>