<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporal Random Indexing: a Tool for Analysing Word Meaning Variations in News</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>name.surname@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annalina Caputo</string-name>
          <email>name.surname@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>name.surname@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, Univ. of Bari Aldo Moro</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>324</volume>
      <abstract>
        <p>The availability of data spanning different epochs has inspired a new analysis of cultural, social, and linguistic phenomena from a temporal perspective. This paper describes the application of Temporal Random Indexing (TRI) to the news context. TRI is able to build geometrical spaces of word meanings that consider several periods of time. Hence, TRI enables the analysis of the evolution in time of the meaning of a word. We propose some examples of application of TRI to the analysis of word meanings in the news scenario; this analysis enables the detection of linguistic variations that emerge in specific time intervals and that can be related to particular events.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The analysis of word meaning variations over time
periods is a crucial task for identifying changes in social
and cultural phenomena. The diachronic analysis of
a language makes it possible to discover linguistic variations over
time. Generally, a diachronic analysis is performed over
a long time interval, since linguistic variations happen
quite slowly. However, this is not the case for fast
data-streaming scenarios like the Web, and in particular
social media such as Twitter or Facebook, where
sociocultural and linguistic phenomena quickly rise and fall.
Although the news scenario is generally characterized
by the use of a regular language, the large number of
events that occur along the timeline causes sudden
topic shifts, making the analysis of this data similar
to the data-streaming scenario.</p>
      <p>In this paper we describe a technique called
Temporal Random Indexing (TRI) that we have
successfully applied to several diachronic analyses of the
language [BCS15]. TRI is able to build several
geometrical spaces of word meanings, called Distributional
Semantic Models (DSM), one for each time interval,
by skimming through huge corpora of text in order to
learn the context of usage of words over time. In the
resulting spaces, semantic similarity between words is
expressed by the closeness between word-points. Thus,
the semantic similarity can be computed as the cosine
of the angle between the two vectors that represent
the words. We show how to adopt TRI as a tool to
discover particular phenomena in news data-streaming
and how to link these linguistic changes to interesting
events reported in the news content.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>TRI is based on Random Indexing (RI) [Sah05], a
dimensionality reduction methodology and
computational framework for distributional semantics. Given
a term-term co-occurrence matrix A, RI builds a new,
lower-dimensional matrix B in which the Euclidean distance
between points is approximately preserved. Formally, given a
corpus D of n documents, and a vocabulary V of m words
extracted from D, we perform two steps: 1) assign a random vector
r_i to each word w_i ∈ V; 2) compute a semantic vector
sv_i for each word w_i as the sum of all random
vectors assigned to the words co-occurring with w_i in a given
context. The context is the set of c words that
precede and follow w_i. The second step is defined by the
following equation:
<disp-formula><label>(1)</label><tex-math>sv_i = \sum_{d \in D} \; \sum_{\substack{-c &lt; j &lt; +c \\ j \neq i}} r_j</tex-math></disp-formula>
The set of semantic vectors assigned to the words in V
represents the WordSpace.</p>
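      <p>The two RI steps can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the toy documents, and the choice of sparse random vectors with +1/-1 entries are assumptions made for illustration, and the context is taken to be the c tokens on each side of a word.</p>
      <preformat>
```python
import random
from collections import defaultdict


def random_vector(dim, nonzero, rng):
    """Sparse random vector with `nonzero` entries set to +1 or -1 (assumed scheme)."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec


def build_wordspace(documents, random_vectors, c):
    """Step 2: sum the random vectors of the words within c positions of each occurrence."""
    dim = len(next(iter(random_vectors.values())))
    space = defaultdict(lambda: [0] * dim)
    for tokens in documents:
        for i, word in enumerate(tokens):
            lo = max(0, i - c)
            hi = min(len(tokens), i + c + 1)
            for j in range(lo, hi):
                if j != i:
                    r_j = random_vectors[tokens[j]]
                    sv = space[word]
                    for k in range(dim):
                        sv[k] += r_j[k]
    return dict(space)


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [["volkswagen", "emissions", "scandal"], ["emissions", "test", "scandal"]]
rng = random.Random(0)
vocab = sorted({w for d in docs for w in d})
# Step 1: one random vector per vocabulary word.
rvecs = {w: random_vector(20, 2, rng) for w in vocab}
space = build_wordspace(docs, rvecs, c=1)
```
      </preformat>
      <p>With c=1, the semantic vector of "volkswagen" in this toy corpus is exactly the random vector of its only neighbour, "emissions".</p>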
      <p>The classical RI does not take into account temporal
information, but it can be easily adapted to our
purposes by applying the methodology proposed in [JS09].
Specifically, if the corpus D of n documents is
annotated with metadata containing information about the
publication date, we can split the collection into p
subsets D_1, D_2, ..., D_p, where p is the number of
different time periods we want to analyse. The first step of
the classical RI is unchanged in TRI: a random vector
is assigned to each word in the whole vocabulary V.
This is the strength of the approach: the use
of the same random vectors for all the spaces makes
them comparable. The second step is similar to the
one proposed for RI, but it takes into account the
temporal information: a different WordSpace T_k is built
for each time period D_k. Hence, the semantic
vector for a word in a given time interval is the result of
its co-occurrences with other words in the same time
interval, while the use of the same random vectors for
building the word representations over different time
spans guarantees their comparability along the
timeline. This means that a vector in the WordSpace T_1
can be compared with vectors in the space T_2.</p>
      <p>Let T_k be a period that ranges from t_{k_start} to t_{k_end},
where t_{k_start} &lt; t_{k_end}. In order to build the WordSpace
T_k we consider only the documents d_k whose
publication date falls within the time interval T_k, as follows:
<disp-formula><label>(2)</label><tex-math>sv_i^{T_k} = \sum_{d_k \in D_k} \; \sum_{\substack{-c &lt; j &lt; +c \\ j \neq i}} r_j</tex-math></disp-formula>
Using this approach we can build a WordSpace for each
time period T_k over a corpus D tagged with
information about the publication year. The word w_i has a
distinct semantic vector sv_i^{T_k} for each time period T_k,
built by accumulating random vectors according to the
co-occurring words in that period. The main strength
of TRI lies in the use of the same random
vectors to build different WordSpaces: semantic vectors
in different time periods remain comparable because
they are linear combinations of the same random
vectors.</p>
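      <p>The temporal variant can be sketched in the same spirit. The code below is an illustrative assumption, not the code of the TRI project: the helper names, the toy documents, and the +1/-1 sparse vectors are ours. The point it shows is that random vectors are generated once for the whole vocabulary, while semantic vectors are accumulated separately for each period.</p>
      <preformat>
```python
import random
from collections import defaultdict


def make_random_vectors(vocabulary, dim, nonzero, seed):
    """One sparse random vector per word, shared by every time period."""
    rng = random.Random(seed)
    vectors = {}
    for word in sorted(vocabulary):
        vec = [0] * dim
        for pos in rng.sample(range(dim), nonzero):
            vec[pos] = rng.choice((-1, 1))
        vectors[word] = vec
    return vectors


def build_temporal_wordspaces(dated_docs, random_vectors, dim, c):
    """Return {period: {word: semantic vector}} using only that period's documents."""
    spaces = defaultdict(lambda: defaultdict(lambda: [0] * dim))
    for period, tokens in dated_docs:
        space = spaces[period]
        for i, word in enumerate(tokens):
            for j in range(max(0, i - c), min(len(tokens), i + c + 1)):
                if j != i:
                    sv = space[word]
                    for k, value in enumerate(random_vectors[tokens[j]]):
                        sv[k] += value
    return {p: dict(s) for p, s in spaces.items()}


# Hypothetical (period, tokens) pairs standing in for dated news articles.
dated_docs = [
    (1, ["volkswagen", "unveils", "car"]),
    (2, ["volkswagen", "emissions", "scandal"]),
]
vocab = {w for _, doc in dated_docs for w in doc}
rvecs = make_random_vectors(vocab, dim=20, nonzero=2, seed=0)
spaces = build_temporal_wordspaces(dated_docs, rvecs, dim=20, c=1)
```
      </preformat>
      <p>Here the vector of "volkswagen" differs between period 1 and period 2 because its neighbours differ, yet both vectors are sums of the same shared random vectors and can therefore be compared directly.</p>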
    </sec>
    <sec id="sec-3">
      <title>Case study</title>
      <p>The main goal of this case study is to show how to
adopt TRI (available as an open-source project at
https://github.com/pippokill/tri) to discover interesting phenomena in the
news scenario. Specifically, we can analyse the
similarity between the vector representations of a term
across different time periods in order to detect changes
in the usage of the term. Then, we can scrutinise both
the neighbour terms and the news related to such a
term during the period of time when the similarity
has changed in order to understand whether a specific event
occurred.</p>
      <p>[Table 1, terms in extraction order: allegations, called, corruption, made, apology, met, became, case, initially, forced]</p>
      <p>We adopt the Signal Media One-Million News
Articles dataset, which consists of 1 million articles scraped
during the time interval 1-30 September 2015. News articles
are extracted from Reuters, in addition to local news
sources and blogs. We split the dataset into five time
periods of about one week: 1-6, 7-13, 14-20, 21-27, and
28-30. The split reflects the start and end of weeks
in the month of September 2015. Then, for each
period we build a WordSpace exploiting TRI. In
particular, we analyse the 150,000 most frequent words in the
whole corpus and we set the vector dimension to 500,
using two non-zero elements in each random vector.</p>
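      <p>As a small illustration of the split, the following sketch maps a publication date to one of the five periods listed above. The function name and the use of None for dates outside September 2015 are assumptions made for this example.</p>
      <preformat>
```python
from datetime import date

# Period boundaries (day of month, September 2015) as listed in the text.
PERIODS = [(1, 6), (7, 13), (14, 20), (21, 27), (28, 30)]


def period_of(published):
    """Return the 1-based period index for a date in September 2015, else None."""
    if (published.year, published.month) != (2015, 9):
        return None
    for index, (start, end) in enumerate(PERIODS, start=1):
        if published.day in range(start, end + 1):
            return index
    return None


fourth = period_of(date(2015, 9, 21))
```
      </preformat>
      <p>A document dated 21 September 2015 falls in the fourth period, the one in which the Volkswagen case is discussed later in this section.</p>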
      <p>In each time interval, we try to discover terms that
change their semantics with respect to the previous
periods. Formally, given two time periods T_h and T_k,
where T_h precedes T_k, and a term t_i, we can simply
compute the cosine similarity between the semantic
vector of t_i in T_h and the semantic vector of t_i in T_k,
sim(sv_i^{T_h}, sv_i^{T_k}). The similarity is a good indicator
of the variation of semantics of the term t_i: a low
similarity suggests a meaning shift. Using this approach
we can rank all terms according to their similarity in
ascending order. The top terms in the ranking are good
candidates for further analysis. However, in order to limit
our analysis to those terms that frequently occur in
the whole collection, the similarity scores have been
multiplied by the term document frequency. By
looking at these rankings, we discover that the word "scandal"
had a semantic shift between the 3rd and the 4th week,
as shown in Table 1.</p>
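      <p>The ranking step can be sketched as follows. This is a simplified illustration under our own assumptions: the two WordSpaces are given as plain dictionaries with made-up vectors, the function names are hypothetical, and the document-frequency weighting mentioned above is left out.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_shift(space_h, space_k):
    """Terms present in both periods, least similar (largest shift) first."""
    shared = set(space_h).intersection(space_k)
    scores = {t: cosine(space_h[t], space_k[t]) for t in shared}
    return sorted(scores.items(), key=lambda item: item[1])


# Toy semantic vectors for two consecutive periods (values are made up).
space_h = {"scandal": [1, 0, 0], "news": [0, 1, 1]}
space_k = {"scandal": [0, 1, 0], "news": [0, 1, 1]}
ranking = rank_by_shift(space_h, space_k)
```
      </preformat>
      <p>In this toy setting "scandal" comes first in the ranking because its two vectors are orthogonal, while "news" keeps the same vector in both periods.</p>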
      <p>Another interesting analysis is the variation in
similarity values between pairs of words over time: an
upsurge in similarity reflects an increase in
co-occurrences of the two words in similar contexts.
Figure 1 reports the similarity between "scandal" and
"Volkswagen" over time. The plot shows a spike in the
similarity value starting from the fourth time interval
(21-27 September), which corresponds to the scandal
about the Volkswagen diesel emission.</p>
      <p>[Figure 1: similarity between "scandal" and "Volkswagen" over time; x-axis: time.period (1 to 5)]</p>
      <p>Semantic vectors can be exploited to implement a
semantic information retrieval system [BCS11]. The
idea is to provide a vector representation for both
documents and queries. In particular, given a text W (e.g.
a document or a query) composed of k terms, we can
build the vector representation of W as the sum
of the semantic vectors of the k terms occurring in W. Formally,
given the sequence W = t_1 t_2 ... t_k of the k terms in W, its
vector representation is w = sv_{t_1} + sv_{t_2} + ... + sv_{t_k}.
Using the same approach we can build the vector
representation q for a query Q. Then the similarity
between a query Q and a document D is given by the
cosine similarity between q and d. TRI provides
a different WordSpace for each time period T_k. Then, the
vector representation of a document published during
the period T_k is built by exploiting only the
semantic vectors of the corresponding WordSpace. At query
time, the query representation is built by taking into
account the semantic vectors of the time period we
want to search.</p>
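      <p>A minimal sketch of this retrieval scheme follows. It is only an illustration: the function names, the three-dimensional toy WordSpace, and the two toy documents are assumptions, and terms missing from the WordSpace are simply ignored.</p>
      <preformat>
```python
import math


def text_vector(terms, space, dim):
    """Sum of the semantic vectors of the terms found in the given WordSpace."""
    total = [0] * dim
    for term in terms:
        for k, value in enumerate(space.get(term, ())):
            total[k] += value
    return total


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_terms, documents, space, dim):
    """Rank (doc_id, terms) pairs by cosine similarity to the query vector."""
    q = text_vector(query_terms, space, dim)
    scored = [
        (doc_id, cosine(q, text_vector(terms, space, dim)))
        for doc_id, terms in documents
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy WordSpace for a single period; the values are made up for illustration.
space_t3 = {
    "scandal": [1, 1, 0],
    "volkswagen": [1, 0, 0],
    "recall": [1, 1, 0],
    "recipe": [0, 0, 1],
}
docs_t3 = [("a", ["volkswagen", "recall"]), ("b", ["recipe"])]
results = search(["scandal", "volkswagen"], docs_t3, space_t3, dim=3)
```
      </preformat>
      <p>In this toy example document "a" is ranked first although it does not contain the term "scandal", because "recall" has the same semantic vector as "scandal" in the chosen WordSpace.</p>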
      <p>As shown in Figure 1, in the third time period the
similarity between "scandal" and "Volkswagen" starts
to increase. We investigate this phenomenon
from the information retrieval point of view. Table 2
reports the first three snippets retrieved by the query
composed of "scandal" and "Volkswagen" in the third time
period. The column VSM reports results obtained with a
classical vector space model implemented by Lucene,
while the column TRI reports results obtained by TRI.</p>
      <p>[Table 2, snippet text in extraction order; the only surviving column header, "TRI", precedes the first snippet: (1) Volkswagen to recall 500,000... a device that disguises pollution levels...; (2) EPA, California investigate Volkswagen for clean air violations...; (3) Volkswagen Ordered to Recall Half a Million Cars After It Cheated on Smog Checks...; (4) Volkswagen emissions cheating... investigations over an emissions scandal...; (5) The reinvention of Volkswagen. In the Volkswagen Group, there is a sense...]</p>
      <p>The VSM model gives more importance to
documents that contain both terms: this is the case of the
first two documents, while the third document is not
relevant at all. The first three documents retrieved
by TRI are all relevant for the given query since they
all talk about events related to the Volkswagen diesel
emission scandal. However, it is interesting to notice
that none of these documents explicitly contains the word
"scandal". These results can be explained by the nature
of the semantic search, which does not rely on string
matching, but rather ranks documents on
the basis of their proximity to the semantic vector
scandal + Volkswagen taken from the third time
period. Semantic search based on TRI opens new
opportunities for implementing effective semantic search
engines that take into account word meaning variation
over time. We plan to investigate this aspect in depth
in future research.</p>
      <p>[BCS15] Pierpaolo Basile, Annalina Caputo, and
Giovanni Semeraro. Temporal random indexing:
A system for analysing word meaning over
time. IJCL, 1(1):55-68, 12 2015.</p>
      <p>[JS09]</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>