<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards automatic language evolution tracking: A study on word sense tracking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nina Tahmasebi</string-name>
          <email>tahmasebi@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Risse</string-name>
          <email>risse@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Dietze</string-name>
          <email>dietze@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center</institution>
          ,
          <addr-line>Hanover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowing about language evolution can significantly help to reveal lost information and help access documents containing language that has long since been forgotten. In this position paper we will report on our methods for finding word senses and show how these can be used to reveal important information about their evolution over time. We discuss the weaknesses of current approaches and outline future work to overcome these weaknesses.</p>
      </abstract>
      <kwd-group>
        <kwd>language evolution</kwd>
        <kwd>word sense discrimination</kwd>
        <kwd>clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This work is motivated by the goal of ensuring the accessibility, and especially the
interpretability, of long-term archives in order to secure knowledge for future
generations. Language evolves over time: new terms are created, existing
terms change their meanings and others disappear. The available technology for
accessing digital archives works well as long as the user is aware of the language
evolution. But how should a young scholar find out that the term fireman was
used in the 19th century to describe a firefighter?</p>
      <p>Etymological dictionaries can be used to address the issue of language
evolution by providing mappings or expanding queries. However, such dictionaries
have several drawbacks. Firstly, they are rare and general: few domain-specific
etymological dictionaries, such as Medline [AS05] for the medical domain, are
available. Secondly, most of these dictionaries are created manually [oed,Mil95].</p>
      <p>New kinds of digital archives and collections, e.g., Web archives, will
increasingly store user-generated content (e.g., blogs, tweets and forums), which follows
few norms. Slang and gadget names are used frequently but rarely make it into
a formal dictionary. To make matters worse, these terms change at a rapid pace.
Due to the change rate, as well as the huge amount of data stored in archives,
it will not be possible to manually create and maintain entries and mappings
for term evolution. Instead, there will be an increasing need to find and handle
changes in language in an automatic way.</p>
      <p>This work is partly funded by the European Commission under ARCOMEM (ICT
270239).</p>
      <p>Since automatic approaches for finding word senses within a collection of
text already exist, namely word sense discrimination (WSD), these are natural
starting points towards an automatic method for detecting language evolution.
In [TNTR10] we presented our processing method for WSD and analyzed its
applicability to historic document collections. In this paper we focus on how
WSD can be used to reveal important information about language evolution over
time. We discuss the weaknesses of current approaches and outline the open issues
that need to be addressed to overcome these weaknesses.</p>
      <p>In the next section we discuss the method used for finding word senses.
In Section 3 we present our experiments with word sense discrimination to find
language evolution. The paper finishes with conclusions and an outlook on future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>Automatically Detecting Word Senses</title>
      <p>In this paper we understand a word sense as a description of the
meaning of a term in the context of the analyzed collection. To find word
senses in large text collections, automated methods are needed. For
this reason we use word sense discrimination, an unsupervised learning method
that groups words representing the same sense. The process consists of three
main steps:
        <list list-type="order">
          <list-item><p>Pre-processing</p></list-item>
          <list-item><p>Co-occurrence graph creation</p></list-item>
          <list-item><p>Word sense clustering</p></list-item>
        </list>
      </p>
      <p><bold>Pre-processing.</bold> We pre-process text by performing an initial cleaning of the data
using regular expressions and applying the OCR error correction method described
in [Nik10]. Next we extract nouns and noun phrases of size two, from here on called terms,
from the cleaned text. We use two part-of-speech taggers, namely TreeTagger
(http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/) and Lingua::EN::Tagger
(http://search.cpan.org/~acoburn/Lingua-EN-Tagger-0.15/Tagger.pm), to identify and lemmatize terms.
These are added to a dictionary corresponding to the corpora in which the terms were found.</p>
      <p><bold>Co-occurrence graph creation.</bold> After creating the dictionary, a co-occurrence graph
is created. All terms that are separated by an “and”, an “or” or a comma are
considered co-occurring. For example, if the sequence “... sports like tennis, football
and rugby ...” is found, the terms “tennis”, “football” and “rugby” are
considered co-occurring. Within the graph, each term is represented as a node, and
linked nodes represent co-occurring terms. Finally, the graph is filtered so that only
co-occurrences that occur at least three times in the collection are kept. This threshold
was identified during past experiments and aims at reducing the level of noise
and removing the most spurious connections.</p>
      <p><bold>Word sense clustering.</bold> The clustering step is the core step of word sense
discrimination. The curvature clustering algorithm proposed by [DES04] computes
the clustering coefficient [WS98], also called the curvature value, for each node
to cluster the graph.</p>
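      <p>The co-occurrence graph creation step described above can be sketched as follows. This is a minimal Python illustration under simplifying assumptions and not necessarily the implementation used in our experiments: every lowercase word is treated as a term (no part-of-speech tagging, lemmatization or noun phrase detection), and the function names cooccurring_groups and build_graph are hypothetical. The min_count parameter corresponds to the filter that keeps only co-occurrences seen at least three times.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter
from itertools import combinations

SEPARATORS = {",", "and", "or"}  # tokens that link co-occurring terms


def cooccurring_groups(text):
    """Yield lists of terms joined by a comma, 'and' or 'or'.

    Assumption: every lowercase word counts as a term; the paper instead
    uses lemmatized nouns and noun phrases found by part-of-speech taggers.
    """
    tokens = re.findall(r"[a-z]+|[,.;:!?]", text.lower())
    group, after_separator = [], False
    for token in tokens:
        if token in SEPARATORS:
            after_separator = bool(group)
        elif not token.isalpha():  # other punctuation ends an enumeration
            if len(group) > 1:
                yield group
            group, after_separator = [], False
        elif after_separator:
            group.append(token)
            after_separator = False
        else:  # a word that does not follow a separator starts a new group
            if len(group) > 1:
                yield group
            group = [token]
    if len(group) > 1:
        yield group


def build_graph(documents, min_count=3):
    """Return an adjacency map {term: set of co-occurring terms}."""
    counts = Counter()
    for text in documents:
        for group in cooccurring_groups(text):
            for a, b in combinations(sorted(set(group)), 2):
                counts[(a, b)] += 1
    graph = {}
    for (a, b), n in counts.items():
        if n >= min_count:  # drop rare, likely spurious co-occurrences
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph
```
]]></preformat>
      <p>Applied to a collection in which the sequence “sports like tennis, football and rugby” occurs three times, build_graph links tennis, football and rugby with each other, while a pair that occurs only once is filtered out.</p>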
      <p>All nodes with a curvature value below a certain threshold are removed. These
nodes correspond to terms that (1) have no significant sense in the collection or
(2) are ambiguous, that is, they connect parts of the graph that would otherwise
not be connected. By removing those terms, the graph falls apart into connected
components that correspond to the cluster cores. For example, the term rock is likely to
connect terms related to its stone sense with terms from its music sense that
would otherwise not be connected. To also capture the ambiguous terms, each
cluster core is extended with all terms that co-occur with the terms in that
core. In this paper we use a curvature threshold of 0.3.</p>
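      <p>The clustering step can be illustrated with the following Python sketch. It assumes the graph is given as an adjacency map from each term to the set of its co-occurring terms, computes the clustering coefficient as the fraction of neighbour pairs that are themselves linked, and assigns a curvature of 0 to nodes with fewer than two neighbours. The function names and the handling of such low-degree nodes are our assumptions for illustration; see [DES04] for the original algorithm.</p>
      <preformat><![CDATA[
```python
from itertools import combinations


def curvature(graph, node):
    """Clustering coefficient: linked neighbour pairs / possible pairs."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumption: too few neighbours to form a triangle
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)


def curvature_clusters(graph, threshold=0.3):
    """Return clusters as sets of terms (cluster core plus its neighbours)."""
    # Step 1: keep only nodes whose curvature reaches the threshold.
    kept = {n for n in graph if curvature(graph, n) >= threshold}
    clusters, seen = [], set()
    for start in kept:
        if start in seen:
            continue
        # Step 2: a connected component of the reduced graph is a core.
        core, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in core:
                continue
            core.add(node)
            stack.extend((graph[node] & kept) - core)
        seen |= core
        # Step 3: extend the core with every term co-occurring with it,
        # which brings removed ambiguous terms back into the cluster.
        extended = set(core)
        for node in core:
            extended |= graph[node]
        clusters.append(extended)
    return clusters
```
]]></preformat>
      <p>In a toy graph where rock is linked to one term of a stone, gravel, sand triangle and to one term of a jazz, pop, blues triangle, rock has curvature 0 and is removed, the two triangles become separate cluster cores, and the extension step adds rock to both clusters.</p>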
    </sec>
    <sec id="sec-3">
      <title>Towards Word Sense Evolution</title>
      <sec id="sec-2-1">
        <title>Data</title>
        <p>For our experiments we use The Times Archive [Tim08] (London) because of its
long time span. The corpus consists of articles from 1785 to 1985 and contains
7.8 million articles scanned from microfilm in 2001. The articles contain some
OCR errors, with the amount decreasing over time. A more in-depth description of the
corpus can be found in [TNTR10]. More than half of the errors were corrected
during the initial cleaning of the data (Step 1 in Section 2); however, a large
number still remain. The size of the resulting co-occurrence graphs follows the amount of
errors in the data: the graphs are larger if the articles contain fewer errors. The number
of clusters that can be found per year is highly dependent on the graph size; on
average we find 575 clusters per year and 7.5 terms per cluster.</p>
      </sec>
      <sec id="sec-3-1">
        <title>Experiments</title>
        <p>In this study we manually choose terms for which we have reason to believe
there has been evolution. We look at the frequency of the terms and extract all
available clusters. Our aim is to see how much can be revealed with respect to
language evolution by examining word sense clusters.</p>
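        <p>Extracting all available clusters for a chosen term can be sketched as below. The sketch assumes the clustering output is stored as a mapping from year to a list of clusters, each cluster being a set of terms; this data layout and the function name clusters_for_term are hypothetical, and the years in the usage note are invented for illustration.</p>
        <preformat><![CDATA[
```python
def clusters_for_term(clusters_by_year, term):
    """Map each year to the terms clustered with `term`, one set per cluster.

    Years in which no cluster contains the term are left out, so gaps in
    the result show periods without any cluster for the term.
    """
    result = {}
    for year in sorted(clusters_by_year):
        companions = [
            set(cluster) - {term}
            for cluster in clusters_by_year[year]
            if term in cluster
        ]
        if companions:
            result[year] = companions
    return result
```
]]></preformat>
        <p>For instance, given {1850: [{"travel", "literature", "science"}], 1950: [{"travel", "hotel", "sightseeing"}]}, calling clusters_for_term with the term travel returns the companion terms per year in chronological order, which is the view we inspect manually in the remainder of this section.</p>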
        <p><bold>St. Petersburg.</bold> The city of St. Petersburg (referred to only as Petersburg from
now on) was founded in 1703 as “Sankt-Piter-Burh” and soon after renamed
“Saint Petersburg”. From 1914 to 1924 it was named “Petrograd” and afterwards
“Leningrad”; since 1991 the name is again “Saint Petersburg”.</p>
        <p>In Figure 1 the term frequencies of the city names from The Times Archive
are shown. Petersburg is first mentioned in 1805 and then occasionally until
1838, after which it figures frequently in the corpus. The first mention of
Petrograd is in 1914, corresponding well to the name change. Starting in 1923 the
frequency of Petrograd decreases, and the term is mentioned only occasionally after 1939.
Leningrad is mentioned for the first time in 1920 and then again between 1924 and 1985.</p>
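        <p>A yearly term frequency and the year of first mention can be computed along the following lines. The normalization is an assumption made for this sketch: the frequency of a term in a year is taken as its share of all tokens in that year, expressed in percent. The input layout (pairs of year and token list) and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def term_frequency_by_year(articles, term):
    """Return {year: percentage of that year's tokens equal to `term`}.

    `articles` is an iterable of (year, tokens) pairs, tokens being a list
    of already normalized terms.
    """
    hits, totals = Counter(), Counter()
    for year, tokens in articles:
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    return {
        year: 100.0 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]  # skip years without any tokens
    }


def first_mention(frequencies):
    """Return the earliest year with a non-zero frequency, or None."""
    years = [year for year, value in frequencies.items() if value > 0]
    return min(years) if years else None
```
]]></preformat>
        <p>Comparing first_mention for Petersburg, Petrograd and Leningrad gives the kind of dates discussed above; peaks in the returned series still need to be interpreted manually.</p>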
        <p>[Figure 1: Term frequency from The Times Archive for St. Petersburg. Series: Petersburg, Petrograd, Leningrad. Vertical axis: term frequency in percent.]</p>
        <p>In Table 1 we see some clusters for Petersburg (1856-1913), Petrograd
(1914-1918) and Leningrad (1928-1978). There is little in the clusters to indicate that
all three terms represent the same city. However, clusters for Petrograd exist only
between 1914 and 1918; together with the term frequency, this can be
seen as a hint that the city of Petrograd existed only temporarily. From the term
frequencies we can see that Petersburg loses in frequency as Petrograd gains.
The clusters change as well: there are no clusters for Petersburg after
Petrograd has been introduced. The name change between Petrograd and Leningrad
does not follow the same characteristics, as the first cluster with Leningrad
appears 10 years after the last one with Petrograd.</p>
        <p>The peaks in the frequencies do not offer any hints of evolution for the term.
Instead, the peak in 1905 for Petersburg is most likely induced by the
general strike of October 1905, and the peaks for Petrograd (1915-1917) and Leningrad
(1941) correspond to World War I (WWI) and World War II (WWII).</p>
        <p><bold>Travel.</bold> The term travel has no name change but rather a concept change. In
Figure 2 we see the frequency of travel and traveller from The Times Archive. For
travel we find that the frequency increases around 1912 and remains significantly
higher until 1985, with some dips for WWI, WWII, the 1960s and 1979.</p>
        <p>[Figure 2: Term frequency from The Times Archive for Travel. Series: Travel, Traveller. Horizontal axis: year (1785 to 1985); vertical axis: term frequency in percent.]</p>
        <p>To find the sense of travel we look at Table 2, where a subset of the clusters
is shown. Until 1906 we find that the term has been clustered with other terms
like literature, science, art, book, all indicating that travel was a topic reserved for
the privileged few and mostly accessible in books for the rest. However, starting
in 1906 we find travel clustered with terms like full board, hotel, sightseeing, sea
side, indicating that the concept of travel became more concrete and accessible
in everyday life. This change coincides with a higher frequency of travel in the
corpus, and the clusters clearly show us that change has occurred.</p>
        <p>A similar shift in concept can also be seen in clusters concerning travellers.
In Table 3 we see that the type of people who travelled changes. The first two
clusters containing the term yellow admiral refer to the classic “The Wags, or the
Camp of Pleasure” by Charles Dibdin. As with the senses of travel, the traveller
transforms from being a salesman, clerk or merchant to something more concrete,
appearing with terms like visa, passport, ticket, commuter.</p>
        <p><bold>Flight.</bold> The terms aeroplane and aircraft correspond to man-made devices and
were introduced in The Times Archive before WWI. In Figure 3 we find the
term frequencies. Both terms exhibit peaks during WWI and WWII, but after
WWII aircraft gains in popularity while aeroplane is forgotten.</p>
        <p>[Figure 3: Term frequency from The Times Archive for Airplane. Series: Aeroplane, Aircraft, Flight. Horizontal axis: year (1785 to 1985); vertical axis: term frequency in percent.]</p>
        <p>The term flight, however, was already present before the introduction of the
flying machines. In Figure 3 we see that it was present in the collection as early
as 1785. Together with aircraft and aeroplane, the term flight increases in
frequency before WWI. During WWII the term aeroplane is more or less replaced
by aircraft. During this period, the term flight keeps a high frequency, which
indicates that it is related to the concept of flight and not to a specific term.</p>
        <p>In Table 4 we can follow the evolution of the concept of flight. Between
1826 and 1833 the terms robson, flight, organ builder correspond to the names Flight &amp;
Robson, who were indeed (church) organ builders. From 1869 to 1895 the clusters
contain hurdle race, flight, yard and indicate the flight over a hurdle. From 1938 to 1957
flight is clustered with terms like direction, length, spin, pace, which refer to the
flight of a cricket ball. Starting in 1973 we find flight clustered with terms that
represent its most common use today, a flight in a holiday sense.</p>
        <p>Looking at the examples presented in the previous sections, we find that they
differ in character. For the St. Petersburg example, we find limited relation
between the term frequencies and name changes. Instead, peaks in the frequency
correspond to events. For the clusters, we also find little evidence of change.
Though clusters containing a city name only exist when the city name is active,
the clusters cannot directly be used to map city names automatically.</p>
        <p>One explanation for the lack of relation can be that the clusters do not
correspond to true word senses. Clustering algorithms aimed at capturing
entity descriptions might instead result in clusters that provide a better basis for
finding the name changes automatically. Another possible explanation is related
to the specific characteristics of individual datasets, which might be more or less
suitable for deriving information about particular types of entities.</p>
        <p>The travel example, however, is representative of concept evolution rather
than name change. Here we find a strong relation between increased frequency
and changed meaning. Based on the flight example we recognize two aspects.
The terms aeroplane and aircraft appeared with the invention of the devices and
their introduction into daily life. The term flight, however, changed or added
a meaning. Also, the relation between the increase in frequency and the change in
meaning for flight is strong. The flight example falls into the same category as
Internet and surfing, where Internet was the invention and surfing the term that
changed or added a sense as a consequence. More in-depth analysis is required to
see if these relationships can be identified in an automatic fashion.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>In this study we exploited automatically identified word senses and term
frequencies to investigate if language evolution could be detected. We found that
concept evolution is well represented in the word senses, and word sense tracking
can thus be used for this type of language evolution tracking. However, word
senses and frequency information were not sufficient to automatically find terms
that replace each other over time (e.g., St. Petersburg → Petrograd). We found
that frequency bursts can be caused both by language evolution and by events;
however, event-driven bursts are not represented in our clusters and need to be
detected using supplementary methods. As part of future work we will focus on
finding more clusters to overcome the cluster sparseness and on classifying reasons
for frequency bursts, e.g., strikes, fires and political events.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank Times Newspapers Limited for providing the archive of
The Times for our research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [AS05]
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Abecker</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ljiljana</given-names>
            <surname>Stojanovic</surname>
          </string-name>
          .
          <article-title>Ontology evolution: Medline case study</article-title>
          .
          In
          <source>Proceedings of Wirtschaftsinformatik 2005: eEconomy, eGovernment, eSociety</source>
          , pages
          <fpage>1291</fpage>
          -
          <lpage>1308</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [DES04]
          <string-name>
            <given-names>Beate</given-names>
            <surname>Dorow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jean-Pierre</given-names>
            <surname>Eckmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Sergi</surname>
          </string-name>
          .
          <article-title>Using curvature and markov clustering in graphs for lexical acquisition and word sense discrimination</article-title>
          .
          In
          <source>Workshop MEANING-2005</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Mil95]
          <string-name>
            <given-names>George A.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>WordNet: A Lexical Database for English</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>38</volume>
          :
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Nik10]
          <string-name>
            <given-names>Kai</given-names>
            <surname>Niklas</surname>
          </string-name>
          .
          <article-title>Unsupervised post-correction of OCR errors</article-title>
          .
          <source>Master's thesis</source>
          , Leibniz Universität Hannover,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Tim08]
          <source>The Times of London</source>
          ,
          <year>2008</year>
          . http://archive.timesonline.co.uk/tol/archive/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [TNTR10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tahmasebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Niklas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Theuerkauf</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Risse</surname>
          </string-name>
          .
          <article-title>Using Word Sense Discrimination on Historic Document Collections</article-title>
          . In
          <source>Proc. of 10th ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , Surfers Paradise, Gold Coast, Australia,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [WS98]
          <string-name>
            <given-names>D.J.</given-names>
            <surname>Watts</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Strogatz</surname>
          </string-name>
          .
          <article-title>Collective dynamics of “small-world” networks</article-title>
          .
          <source>Nature</source>
          ,
          <volume>393</volume>
          :
          <fpage>440</fpage>
          -
          <lpage>442</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>