<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Looking for the Best Historical Window for Assessing Semantic Similarity Using Human Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Martinez-Gil</string-name>
          <email>jorge.martinez-gil@scch.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Pichler</string-name>
          <email>mario.pichler@scch.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorena Paoletti</string-name>
          <email>lorena.paoletti@scch.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Software Competence Center</institution>
          ,
          <addr-line>Hagenberg, Softwarepark 21, 4232</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Software Competence Center</institution>
          ,
          <addr-line>Hagenberg, Softwarepark 21, 4232</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe a way to benefit from broad cultural trends through the quantitative analysis of a vast digital book collection representing the digested history of humanity. Our research has revealed that appropriately comparing the occurrence patterns of words in certain periods of human literature can help us to accurately determine the semantic similarity between these words by means of computers, without requiring human intervention. Preliminary results seem to be promising.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge integration</kwd>
        <kwd>semantic similarity</kwd>
        <kwd>culturomics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>It is widely accepted that the meanings of words evolve
over time. However, it is still unclear whether word occurrences
in human literature throughout history can be meaningful for
computing word semantic similarity. By semantic similarity
measurement we mean the research challenge whereby two terms are
assigned a score based on the likeness of their meaning. Automatic
measurement of semantic similarity is considered to be of great
importance for many computer-related fields, since a wide variety
of techniques rely on it. The reason is that textual semantic
similarity measures can be used for understanding beyond the
literal lexical representation of words and phrases. For example,
it is possible to automatically identify that a specific term
(e.g., Finance) yields matches on similar terms (e.g., Economics,
Economic Affairs, Financial Affairs, etc.). This capability of
understanding beyond the lexical representation of words makes
semantic similarity methods of great importance to the Linked Data
community. For example, the ontology alignment problem can be
addressed by means of methods of this kind.
The traditional approach to this problem has consisted of using
manually compiled dictionaries to determine the semantic
similarity between terms, but an important problem remains open:
there is a gap between dictionaries and the language used by
people. The reason is a balance that every dictionary must strike:
to be comprehensive enough to be a useful reference, but concise
enough to be practically usable. For this reason, many infrequent
words are usually omitted. Therefore, how can we measure semantic
similarity in situations where terms are not covered by a
dictionary? We think Culturomics could be an answer.</p>
      <p>Culturomics consists of collecting and analyzing data from
the study of human culture. Michel et al. [8] established this
discipline by means of their seminal work, in which they presented
a corpus of digitized texts containing 5.2 million books, which
represent about 4 percent of all books ever printed. This study of
human culture through digitized books has had a strong positive
impact on our core research since its inception. In a previous
work [7], the idea of word co-occurrence in human literature for
supporting semantic correspondence discovery was explored. Now, we
go a step further with a much more complete framework that is able
to improve our past results. The main contributions presented in
this work are:
1. We propose to use culturomics to try to determine the semantic
similarity between words1 by comparing their occurrence patterns
in human literature by means of an appropriate statistical
analysis.
2. We evaluate a pool of quantitative algorithms for time series
comparison to determine which methods are most appropriate in this
context. These algorithms are applied to some statistical
transformations which can help to reduce noise.
3. We try to determine the best historical time period for
computing semantic similarity using human literature.</p>
      <p>The rest of this paper is organized as follows: Section 2
describes related approaches proposed in the literature. Section 3
describes the key ideas needed to understand our contribution.
Section 4 presents a qualitative evaluation of our method, and
finally, we draw conclusions and outline future lines of research.
1 We focus on the English language only.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>In the past, there have been great efforts in finding new
semantic similarity measures, mainly due to their fundamental
importance in many fields of modern computer science. The
detection of different formulations of the same concept is a key
method in many computer-related fields. To name only a few,
clustering [3], service matchmaking [1], web data integration [6],
and schema matching [2] all rely on good performance when
determining the meaning of data.</p>
      <p>If we focus on the field of semantic change, we can see
how authors define it as a change of one or more meanings of a
word over time. Developing automatic methods for identifying
changes in word meaning can therefore be useful both for
theoretical linguistics and for a variety of applications which
depend on lexical information. Some works have explored this path:
for instance, [10] investigated significant changes in the
distribution of terms in the Google N-gram corpus and their
relationships with emotion words, and [5] presented an approach
for the automatic detection of semantic change of words based on
distributional similarity models. Our approach is different in the
sense that we compute semantic similarity using a specific
historical window.</p>
    </sec>
    <sec id="sec-3">
      <title>3. CONTRIBUTION</title>
      <p>Our contribution is an analysis of books published
throughout history. The aim is to build novel measures which can
determine the semantic similarity of words in an automatic way.
The main reason for preferring this paradigm over a traditional
approach based on dictionaries is obvious: according to the book
library digitized by Google2, the number of words in the English
lexicon is currently above a million. Therefore, the data sets we
are using contain more words than any dictionary. For instance,
Webster's Dictionary3 currently lists far fewer than 400,000
single-word forms [8].</p>
      <p>We have chosen ten well-known algorithms for time series
comparison. This pool includes distance measures (Euclidean,
Chebyshev, Jaccard, and Manhattan), similarity measures (Cosine,
Dynamic Time Warping, Roberts, and Ruzicka), and correlation
coefficients (Pearson and Spearman) [4]. We provide a brief
description of each of these algorithms below. We consider x and y
to be the time series representations of the two words to be
compared.</p>
      <p>1. Cosine similarity is a measure between two time series
which determines the cosine of the angle between them.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \) (1)</p>
      <p>2. Euclidean distance computes the euclidean distance
between each two points along the time series.</p>
      <p>\( \mathrm{sim}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \) (2)</p>
      <p>3. Chebyshev distance takes the maximum absolute
difference between two corresponding points of the time series.</p>
      <p>\( \mathrm{sim}(x, y) = \max_{i=1}^{n} |x_i - y_i| \) (3)</p>
      <p>4. Dynamic Time Warping uses a dynamic programming
technique to determine the best alignment that will produce the
optimal distance.</p>
      <p>\( \mathrm{sim}(x, y) = \sum_{i=1,\,k=1}^{n,\,m} |x_{i_k} - y_{i_k}| \) (4)</p>
      <p>5. Jaccard distance measures the similarity of two sets by
comparing the size of the overlapping points against the size of
the two time series.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} (x_i \wedge y_i)}{\sum_{i=1}^{n} (x_i \vee y_i)} \) (5)</p>
      <p>6. Manhattan distance computes the sum of the absolute
values of the differences between the corresponding points of the
time series.</p>
      <p>\( \mathrm{sim}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \) (6)</p>
      <p>7. Pearson correlation determines the ratio between the
covariance and the product of the standard deviations of the two
time series.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \) (7)</p>
      <p>8. Roberts similarity examines the sum of each two
corresponding points, weighted by the ratio of their minimum and
maximum.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} (x_i + y_i) \frac{\min\{x_i, y_i\}}{\max\{x_i, y_i\}}}{\sum_{i=1}^{n} (x_i + y_i)} \) (8)</p>
      <p>9. Ruzicka similarity divides the sum of the minima of
each pair of corresponding points by the sum of their maxima.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} \max(x_i, y_i)} \) (9)</p>
      <p>10. Spearman's rank correlation is a statistical measure
that tries to find out whether there is a monotonic relationship
between the two time series.</p>
      <p>\( \mathrm{sim}(x, y) = 1 - \frac{6 \sum_{i} (x_i - y_i)^2}{N (N^2 - 1)} \) (10)</p>
      <p>2 http://books.google.com/ngrams</p>
      <p>3 http://www.merriam-webster.com</p>
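      <p>For illustration, the following minimal Python sketch implements three of the measures above over word time series represented as lists of yearly frequencies; the representation and all names are our own assumptions, not code from the original study.</p>
      <preformat>
import math

def euclidean(x, y):
    # Eq. (2): square root of the summed squared differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine(x, y):
    # Eq. (1): cosine of the angle between the two series.
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = (math.sqrt(sum(xi * xi for xi in x)) *
           math.sqrt(sum(yi * yi for yi in y)))
    return num / den

def pearson(x, y):
    # Eq. (7): covariance over the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (math.sqrt(sum((xi - mx) ** 2 for xi in x)) *
           math.sqrt(sum((yi - my) ** 2 for yi in y)))
    return num / den
      </preformat>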
      <p>Therefore, our contribution is a framework where the
problem is addressed from different perspectives: a) algorithms
for comparing time series similarity, b) statistical
transformations of time series using reduction, baseline removal,
rescaling, and smoothing techniques, and c) the search for the
most appropriate time window, that is, the range of years which
helps us to perform the most accurate predictions.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Working with statistical transformations</title>
      <p>Working with time series presents a number of problems,
since two similar time series can show the same pattern but
different occurrence volumes. This can be solved by means of
normalization techniques. However, there are some algorithms on
which normalization has no effect, for instance the Cosine
measure, which captures the angle between the two vectors of
numeric values.</p>
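      <p>As a minimal sketch (our own illustration; the paper does not prescribe a concrete normalization): dividing a series by its total volume is a pure rescaling and therefore does not change the angle measured by the Cosine algorithm, whereas min-max rescaling also shifts the series and can affect the other measures.</p>
      <preformat>
def volume_normalize(series):
    # Pure rescaling: divides every point by the total volume, so the
    # direction of the vector (and hence cosine similarity) is preserved.
    total = sum(series)
    return [v / total for v in series]

def min_max_rescale(series):
    # Maps the series onto [0, 1]; this shifts as well as scales,
    # so it can change the outcome of the other measures.
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]
      </preformat>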
      <p><bold>3.1.1 Smoothing of the original time series</bold></p>
      <p>Smoothing a time series consists of creating an
approximating function to capture important patterns while leaving
out noise and other disturbing phenomena. Therefore, smoothing is
a widely used technique for reducing or canceling the effect of
random variations. This technique, when properly applied, reveals
more clearly the underlying trend of the time series. We want to
run the algorithms on smoothed data because this kind of technique
can help us to obtain cleaner time series and, therefore, results
that reflect trends more clearly.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Looking for the best time window</title>
      <p>The methods presented so far can give us some advice about
which direction should be explored. However, these results are far
from optimal. One of the main reasons is that we have only focused
on a fixed time period. In order to overcome this limitation, we
have designed an algorithm that tries to capture the optimal time
window for solving the Miller-Charles benchmark data set.</p>
    </sec>
    <sec id="sec-6">
      <title>4. EVALUATION</title>
      <p>We report our results using the 1-gram data set offered by
Google4. The data is in the range between 1800 and 2000. The
reason is that there are not enough books before 1800 to reliably
quantify many of the queries from the data sets we are using; on
the other hand, after the year 2000 the quality of the corpus is
lower, since it is subject to many changes. Results are obtained
according to the Miller-Charles data set [9]. The rationale behind
this way of evaluating quality is that the results obtained by
means of artificial techniques can be compared to human
judgments.</p>
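      <p>Assuming, as is usual for the Miller-Charles benchmark, that quality is measured as the correlation between machine scores and the human ratings, the evaluation loop can be sketched as follows; the benchmark entries and function names are placeholders, and pearson() refers to the sketch in Section 3.</p>
      <preformat>
def evaluate(measure, benchmark, series_of):
    # benchmark: iterable of (word1, word2, human_rating) triples.
    # series_of: maps a word to its time series for the chosen window.
    human, machine = [], []
    for w1, w2, rating in benchmark:
        human.append(rating)
        machine.append(measure(series_of(w1), series_of(w2)))
    # Correlate machine scores with human judgments (see Section 3).
    return pearson(machine, human)
      </preformat>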
    </sec>
    <sec id="sec-7">
      <title>4.1 Evaluation with classic algorithms</title>
      <p>Table 1 shows the results over the raw data. The Euclidean
distance presents the best performance. However, the scores
obtained are very low. This is the reason we propose to apply some
statistical transformations.</p>
    </sec>
    <sec id="sec-8">
      <title>4.2 Statistical transformations over the time series</title>
      <p>4 https://books.google.com/ngrams</p>
      <sec id="sec-8-1">
        <title>Algorithm</title>
        <p>Cosine
Chebyshev</p>
        <p>DTW
Euclidean</p>
        <p>Jaccard
Manhattan</p>
        <p>Pearson
Roberts
Ruzicka
Spearman
Score
0.28
0.29
0.35
0.32
nosense
0.26
0.28
0.23
0.24
0.36</p>
      <p>One of the best-known smoothing methods is the Moving
Average (MA) technique, which takes a certain number of past
periods, adds them together, and divides by the number of periods.
Table 5 shows the results when using time series smoothed with MA
for periods of 5, 10, 20, and 50 years respectively. Another
popular smoothing method is the Exponential Moving Average (EMA)
technique, which applies more weight to recent data; the weighting
applied to the most recent data depends on the number of periods.
Table 6 shows the results when using time series smoothed with EMA
for periods of 5, 10, 20, and 50 years.</p>
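      <p>Both smoothing techniques can be sketched as follows (our own illustration; the alpha = 2 / (N + 1) weighting for EMA is a common convention that the paper does not state explicitly).</p>
      <preformat>
def moving_average(series, period):
    # Each point becomes the mean of the current and up to
    # period - 1 preceding observations.
    out = []
    for i in range(len(series)):
        window = series[max(0, i - period + 1):i + 1]
        out.append(sum(window) / len(window))
    return out

def exponential_moving_average(series, period):
    # More weight on recent data; the weight depends on the period.
    alpha = 2.0 / (period + 1)
    ema = [series[0]]
    for v in series[1:]:
        ema.append(alpha * v + (1 - alpha) * ema[-1])
    return ema
      </preformat>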
    </sec>
    <sec id="sec-9">
      <title>4.3 Best historical window</title>
      <p>Until now, we have only focused on the fixed time period
between 1800 and 2000. In order to overcome this limitation, we
have designed an algorithm that tries to capture the optimal time
window for solving the Miller-Charles benchmark data set. The
algorithm we have designed is able to test every possible
configuration of time window (with a minimum size of 2 years),
computational algorithm, and statistical transformation of the
data. This means we have automatically tested 2,412,000 different
configurations (20,100 different windows over 12 different
statistical transformations using 10 different algorithms). The
best results we have achieved are summarized in Table 7. We can
see that using the Pearson correlation coefficient between the
years 1935 and 1942 on raw data, or between 1806 and 1820 over a
moving average of five years, allows us to solve the
Miller-Charles benchmark data set [9] with high accuracy.</p>
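      <p>The exhaustive search can be sketched as follows; the helper names refer to the earlier sketches and are our own assumptions. Enumerating all ordered pairs of distinct years in 1800-2000 yields exactly the 20,100 windows mentioned above.</p>
      <preformat>
from itertools import combinations

def best_window(series_by_word, benchmark, algorithms, transformations):
    # algorithms / transformations: dicts mapping a name to a function.
    best_config, best_score = None, -1.0
    for start, end in combinations(range(1800, 2001), 2):  # 20,100 windows
        for tname, transform in transformations.items():
            for aname, measure in algorithms.items():
                def series_of(word):
                    # Slice the word's yearly series to the window,
                    # then apply the statistical transformation.
                    return transform(series_by_word[word][start - 1800:end - 1799])
                score = evaluate(measure, benchmark, series_of)
                if score > best_score:
                    best_config, best_score = (start, end, tname, aname), score
    return best_config, best_score
      </preformat>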
      <sec id="sec-9-1">
        <title>Algorithm</title>
        <p>Cosine
Chebyshev</p>
        <p>DTW
Euclidean</p>
        <p>Jaccard
Manhattan</p>
        <p>Pearson
Roberts
Ruzicka
Spearman
Score
0.28
0.41
0.35
0.30
0.37
0.22
0.28
0.28
0.26
0.37</p>
      <p>This means that our hypothesis, stating that an
appropriate combination of algorithms, statistical
transformations, and time windows can lead to positive results, is
confirmed.</p>
    </sec>
    <sec id="sec-10">
      <title>5. CONCLUSIONS</title>
      <p>We have described how we have performed a quantitative
analysis of a vast digital book collection, representing a
significant sample of the history of literature, to solve problems
related to semantic similarity. In fact, we have shown that
appropriately choosing a combination of quantitative algorithms
for comparing the time series representing occurrence patterns,
statistical transformations of the source data which can help to
reduce noise, and the right time window can provide very accurate
results when measuring semantic similarity between single
words.</p>
      <sec id="sec-10-1">
        <title>Algorithm</title>
        <p>Cosine
Chebyshev</p>
        <p>
          DTW
Euclidean
Jaccard
Manhattan
Pearson
Roberts
Ruzicka
Spearman
M(
          <xref ref-type="bibr" rid="ref5">5</xref>
          )
0.27
0.27
0.22
0.30
0.20
0.29
0.23
0.10
0.11
0.08
M(
          <xref ref-type="bibr" rid="ref10">10</xref>
          )
0.25
0.27
0.24
0.29
0.21
0.29
0.21
0.10
0.11
0.08
M(20)
0.24
0.25
0.24
0.29
0.19
0.28
0.18
0.10
0.11
0.08
M(50)
0.25
0.22
0.25
0.28
0.31
0.27
0.15
0.11
0.12
0.09
        </p>
      <sec id="sec-10-2">
        <title>Algorithm</title>
        <p>Pearson
Pearson
Pearson</p>
      </sec>
      <sec id="sec-10-3">
        <title>Data</title>
        <p>
          Raw Data
MA(50)
EMA(
          <xref ref-type="bibr" rid="ref5">5</xref>
          )
        </p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgments</title>
      <p>We thank the reviewers for their useful comments. This
work has been funded by the project ACEPROM (Proj. Nr. 841284) of
the Austrian Research Promotion Agency (FFG).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bianchini</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Antonellis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melchiori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Flexible Semantic-Based Service Matchmaking and Discovery</article-title>
          .
          <source>World Wide Web</source>
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>227</fpage>
          -
          <lpage>251</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Castano</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montanelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lorusso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Instance Matching for Ontology Population</article-title>
          .
          <source>SEBD</source>
          <year>2008</year>
          :
          <fpage>121</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>De Virgilio</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miscione</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Cluster-Based Exploration for Effective Keyword Search over Semantic Datasets</article-title>
          .
          <source>ER</source>
          <year>2009</year>
          :
          <fpage>205</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Deza</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <source>Encyclopedia of Distances</source>
          . Springer (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Gulordava</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus</article-title>
          .
          <source>GEMS</source>
          <year>2011</year>
          :
          <fpage>67</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Martinez-Gil</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aldana-Montes</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          <article-title>Semantic similarity measurement using historical google search patterns</article-title>
          .
          <source>Inf. Syst. Frontiers</source>
          <volume>15</volume>
          (
          <issue>3</issue>
          ):
          <fpage>399</fpage>
          -
          <lpage>410</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Martinez-Gil</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pichler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Analysis of word co-occurrence in human literature for supporting semantic correspondence discovery</article-title>
          .
          <source>I-KNOW</source>
          <year>2014</year>
          :
          <fpage>1:1</fpage>
          -
          <lpage>1:7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pickett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoiberg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clancy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norvig</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orwant</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>Quantitative analysis of culture using millions of digitized books</article-title>
          ,
          <source>Science</source>
          <volume>331</volume>
          (
          <issue>6014</issue>
          ):
          <fpage>176</fpage>
          -
          <lpage>182</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charles</surname>
            ,
            <given-names>W.G.</given-names>
          </string-name>
          <article-title>Contextual correlates of semantic similarity</article-title>
          .
          <source>Language and Cognitive Processes</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Behind the Times: Detecting Epoch Changes using Large Corpora</article-title>
          .
          <source>IJCNLP</source>
          <year>2013</year>
          :
          <fpage>347</fpage>
          -
          <lpage>355</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>