
Towards automatic language evolution tracking: A study on word sense tracking ⋆

Nina Tahmasebi

tahmasebi@L3S.de

Thomas Risse

risse@L3S.de

Stefan Dietze

dietze@L3S.de

L3S Research Center, Hanover, Germany

Knowing about language evolution can significantly help to reveal lost information and help access documents containing language that has long since been forgotten. In this position paper we will report on our methods for finding word senses and show how these can be used to reveal important information about their evolution over time. We discuss the weaknesses of current approaches and outline future work to overcome these weaknesses.

Keywords: language evolution, word sense discrimination, clustering

1 Introduction

This work is motivated by the goal to ensure the accessibility and especially the interpretability of long term archives in order to secure knowledge for future generations. Language is evolving over time; new terms are created, existing terms change their meanings and others disappear. The available technology for accessing digital archives works well as long as the user is aware of the language evolution. But how should a young scholar find out that the term fireman was used in the 19th century to describe a firefighter?

Etymological dictionaries can be used to address the issue of language evolution by providing mappings or expanding queries. However, such dictionaries have several drawbacks. Firstly, they are rare and general. Few domain specific etymological dictionaries, such as Medline [AS05] for the medical domain, are available. Secondly, most of these dictionaries are created manually [oed,Mil95].

New kinds of digital archives and collections, e.g. Web archives, will increasingly store user-generated content (e.g., blogs, tweets, forums, etc.), which follows few norms. Slang and gadget names are used frequently but rarely make it into a formal dictionary. To make matters worse, these terms change at a rapid pace. Due to the change rate, as well as the huge amount of data stored in archives, it will not be possible to manually create and maintain entries and mappings for term evolution. Instead, there will be an increasing need to find and handle changes in language in an automatic way.

⋆ This work is partly funded by the European Commission under ARCOMEM (ICT 270239).

Since automatic approaches for finding word senses within a collection of text already exist, namely word sense discrimination (WSD), these are natural starting points towards an automatic method for detecting language evolution. In [TNTR10] we presented our processing method for WSD and analyzed its applicability on historic document collections. In this paper we will focus on how WSD can be used to reveal important information about language evolution over time. We discuss the weaknesses of current approaches and outline open issues to overcome these weaknesses.

In the next section we discuss the method used for finding word senses. In Section 3 we present our experiments with word sense discrimination to find language evolution. The paper finishes with conclusions and an outlook on future work.

2 Automatically Detecting Word Senses

In this paper we understand a word sense as a description of the meaning of a term in the context of the analyzed collection. In order to find word senses in large text collections, automated methods need to be exploited. For this reason we use word sense discrimination as an unsupervised learning method for grouping words that represent the same sense. The process consists of three main steps:

1. Pre-processing
2. Co-occurrence graph creation
3. Word sense clustering

Pre-processing. We pre-process text by performing an initial cleaning of the data using regular expressions and apply an OCR error correction method described in [Nik10]. Next we extract nouns and noun phrases of size two, from here on called terms, from the cleaned text. We use two part-of-speech taggers, namely TreeTagger¹ and Lingua::EN::Tagger², to identify and lemmatize terms. These are added to a dictionary corresponding to the corpora in which the terms were found.

Co-occurrence graph creation. After creating the dictionary, a co-occurrence graph is created. All terms that are separated by "and", "or" or a comma are considered co-occurring. For example, if the sequence ". . . sports like tennis, football and rugby . . ." is found, the terms "tennis", "football" and "rugby" are considered co-occurring. Within the graph, each term is represented as a node, and linked nodes represent co-occurring terms. Finally, the graph is filtered and only co-occurrences that occur at least three times in the collection are kept. This threshold was identified during past experiments and aims at reducing the level of noise and removing the most spurious connections.

¹ http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
² http://search.cpan.org/~acoburn/Lingua-EN-Tagger-0.15/Tagger.pm

Word sense clustering. The clustering step is the core step of word sense discrimination. The curvature clustering algorithm proposed by [DES04] computes the clustering coefficient [WS98], also called the curvature value, for each node to cluster the graph.
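For reference, the clustering coefficient of [WS98] that serves as the curvature value can be written as follows. The symbols are introduced here for illustration only: $k_v$ is the number of neighbours of node $v$ and $t_v$ is the number of links among those neighbours (the triangles through $v$). Setting the value to 0 for nodes with fewer than two neighbours is an assumption of this sketch. The block assumes the amsmath package.

```latex
\[
  \operatorname{curv}(v) =
  \begin{cases}
    \dfrac{2\, t_v}{k_v\,(k_v - 1)} & \text{if } k_v \ge 2,\\[1ex]
    0 & \text{otherwise.}
  \end{cases}
\]
```

A node whose neighbours are all linked to each other has curvature 1, while a node whose neighbours share no links has curvature 0.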

All nodes with a curvature value below a certain threshold are removed. These nodes correspond to terms that (1) have no significant sense in the collection or (2) are ambiguous, that is, they connect parts of the graph that would otherwise not be connected. By removing those terms, the graph falls apart into connected components that correspond to the cluster cores. For example, the term rock is likely to connect terms related to its stone sense with terms from its music sense that would otherwise not be connected. To also capture the ambiguous terms, each cluster core is extended with all terms that co-occur with the terms in the core. In this paper we use a curvature threshold of 0.3.
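The graph creation and clustering steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: all function names are hypothetical, terms are restricted to single lower-case words from a given dictionary (no part-of-speech tagging, no two-word noun phrases), and a node with fewer than two neighbours is given curvature 0.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

SEPARATORS = {",", "and", "or"}


def coordinated_groups(text, terms):
    """Yield lists of known terms that are joined only by 'and', 'or' or commas."""
    tokens = re.findall(r"[a-z]+|,", text.lower())
    group, after_separator = [], False
    for token in tokens:
        if token in terms:
            if group and after_separator:
                group.append(token)      # continue the current enumeration
            else:
                if len(group) > 1:
                    yield group
                group = [token]          # start a new enumeration
            after_separator = False
        elif token in SEPARATORS and group:
            after_separator = True
        else:                            # any other word ends the enumeration
            if len(group) > 1:
                yield group
            group, after_separator = [], False
    if len(group) > 1:
        yield group


def build_graph(texts, terms, min_count=3):
    """Link two terms if they co-occur in at least `min_count` enumerations."""
    pair_counts = Counter()
    for text in texts:
        for group in coordinated_groups(text, terms):
            pair_counts.update(combinations(sorted(set(group)), 2))
    graph = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def curvature(graph, node):
    """Clustering coefficient: share of neighbour pairs that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0                       # assumption: undefined case treated as 0
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * linked / (k * (k - 1))


def curvature_clusters(graph, threshold=0.3):
    """Drop low-curvature nodes, take connected components, then extend them."""
    kept = {n for n in graph if curvature(graph, n) >= threshold}
    clusters, seen = [], set()
    for start in sorted(kept):
        if start in seen:
            continue
        core, stack = set(), [start]
        while stack:                     # depth-first search restricted to `kept`
            node = stack.pop()
            if node in core:
                continue
            core.add(node)
            stack.extend((graph[node] & kept) - core)
        seen |= core
        # Extend the core with every term that co-occurs with a core term.
        clusters.append(core.union(*(graph[n] for n in core)))
    return clusters


texts = ["sports like tennis, football and rugby"] * 3
graph = build_graph(texts, {"sports", "tennis", "football", "rugby"})
clusters = curvature_clusters(graph)  # one cluster: tennis, football, rugby
```

Because the extension step uses the unfiltered graph, an ambiguous term that was removed for low curvature, such as rock in the example above, can reappear in several clusters at once.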

3 Towards Word Sense Evolution

3.1 Data

For our experiments we use The Times Archive [Tim08] (London) because of its long time span. The corpus consists of articles from 1785 to 1985 and contains 7.8 million articles scanned from microfilm in 2001. The articles contain a number of OCR errors, which decreases over time. A more in-depth description of the corpus can be found in [TNTR10]. More than half of the errors were corrected during the initial cleaning of the data (Step 1 in Section 2); however, a large number still remain. The size of the resulting co-occurrence graphs follows the amount of errors in the data: the graphs are larger if the articles contain fewer errors. The number of clusters that can be found per year is highly dependent on the graph size; on average we find 575 clusters per year and 7.5 terms per cluster.

3.2 Experiments

In this study we manually choose terms for which we have reason to believe there has been evolution. We look at the frequency of the terms and extract all available clusters. Our aim is to see how much can be revealed with respect to language evolution by examining word sense clusters.
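The frequency curves used in the experiments could be computed along the following lines. This is a minimal sketch: the normalisation is not spelled out in the text, so it is assumed here that the frequency of a term in a year is its count divided by the total number of extracted terms in that year, expressed in percent. The function name and the input layout (pairs of year and term list) are hypothetical, and the toy input is made up.

```python
from collections import Counter


def yearly_term_frequency(documents, term):
    """Return {year: percentage of that year's terms that equal `term`}.

    `documents` is an iterable of (year, list_of_terms) pairs. Normalising by
    the yearly term total is an assumption, not taken from the paper.
    """
    hits = Counter()
    totals = Counter()
    for year, terms in documents:
        totals[year] += len(terms)
        hits[year] += sum(1 for t in terms if t == term)
    return {year: 100.0 * hits[year] / totals[year]
            for year in sorted(totals) if totals[year] > 0}


# Toy input for illustration only; these are not corpus counts.
toy_documents = [
    (1913, ["petersburg", "moscow", "petersburg", "paris"]),
    (1915, ["petrograd", "moscow", "paris", "london"]),
]
print(yearly_term_frequency(toy_documents, "petrograd"))  # {1913: 0.0, 1915: 25.0}
```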

St. Petersburg. The city of St. Petersburg (referred to only as Petersburg from now on) was founded in 1703 as “Sankt-Piter-Burh” and soon after renamed to “Saint Petersburg”. From 1914 to 1924 it was named “Petrograd” and afterwards “Leningrad”, and since 1991 the name is again “Saint Petersburg”.

In Figure 1 the term frequencies of the city names from The Times Archive are shown. Petersburg was first mentioned in 1805 and then occasionally until 1838, after which it figured frequently in the corpus. The first mention of Petrograd was in 1914, corresponding well to the name change. Starting in 1923 the frequency of Petrograd decreases, and the term is mentioned only occasionally after 1939. Leningrad is first mentioned in 1920 and then again between 1924 and 1985.

Figure 1: Term Frequency from The Times Archive for St. Petersburg (series: Petersburg, Petrograd, Leningrad; frequency in percent).

In Table 1 we see some clusters for Petersburg (1856-1913), Petrograd (1914-1918) and Leningrad (1928-1978). There is little in the clusters to indicate that all three terms represent the same city. However, clusters for Petrograd exist only between 1914 and 1918, and together with the term frequency this can be seen as a hint that the city of Petrograd existed only temporarily. From the term frequencies we can see that Petersburg loses in frequency as Petrograd gains. The clusters also change, and there are no clusters for Petersburg after Petrograd has been introduced. The name change from Petrograd to Leningrad does not follow the same pattern, as the first cluster with Leningrad appears 10 years after the last one with Petrograd.

The peaks in the frequencies do not offer any hints of evolution for the term. Instead the peak in 1905 for Petersburg is most likely induced by the general strike of October 1905, while the peaks for Petrograd (1915-1917) and Leningrad (1941) correspond to World War I (WWI) and World War II (WWII).

Travel. The term travel has undergone no name change but rather a concept change. In Figure 2 we see the frequency of travel and traveller from The Times Archive. For travel we find that the frequency increases around 1912 and remains significantly higher until 1985, with some dips for WWI, WWII, the 1960s and 1979.

Figure 2: Term Frequency from The Times Archive for Travel (series: Travel, Traveller; x-axis: Year, 1785-1985; frequency in percent).

To find the sense of travel we look at Table 2, where a subset of the clusters is shown. Until 1906 we find that the term has been clustered with other terms like literature, science, art, book, all indicating that travel was a topic reserved for the privileged few and mostly accessible in books for the rest. However, starting in 1906 we find travel clustered with terms like full board, hotel, sightseeing, sea side, indicating that the concept of travel became more concrete and accessible in everyday life. This change coincides with a higher frequency of travel in the corpus, and the clusters clearly show us that change has occurred.
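The comparison of clusters before and after 1906 was done by manual inspection. A hedged sketch of how such an inspection could be lined up programmatically is given below. The function name, the (year, set of terms) cluster layout and the toy years are assumptions for illustration; only the term names are taken from the text above.

```python
from collections import Counter


def companions_by_period(clusters, term, split_year):
    """Count the terms sharing a cluster with `term` before and from `split_year`."""
    before, after = Counter(), Counter()
    for year, members in clusters:
        if term not in members:
            continue
        target = before if year < split_year else after
        target.update(members - {term})
    return before, after


# Toy input: the term names come from the text, the years are made up.
toy_clusters = [
    (1850, {"travel", "literature", "science", "art"}),
    (1870, {"travel", "book", "art"}),
    (1910, {"travel", "hotel", "sightseeing"}),
    (1930, {"travel", "full board", "hotel", "sea side"}),
]
before, after = companions_by_period(toy_clusters, "travel", 1906)
only_before = sorted(set(before) - set(after))  # companions that disappear
only_after = sorted(set(after) - set(before))   # companions that are new
```

Terms that appear only on one side of the split year are candidates for a change in sense, although deciding whether such a change is meaningful still requires interpretation.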

A similar shift in concept can also be seen in clusters concerning travellers. In Table 3 we see that the type of people who travelled changes. The first two clusters containing the term yellow admiral refer to the classic “The Wags, or the Camp of Pleasure” by Charles Dibdin. As with the senses of travel, the traveller transforms from being a salesman, clerk or merchant to being more concrete, with terms like visa, passport, ticket, commuter.

Flight. The terms aeroplane and aircraft correspond to man-made devices and were introduced in The Times Archive before WWI. In Figure 3 we find the term frequencies. Both terms exhibit peaks during WWI and WWII, but after WWII aircraft gains in popularity while aeroplane is forgotten.

Figure 3: Term Frequency from The Times Archive for Airplane (series: Aeroplane, Aircraft, Flight; x-axis: Year, 1785-1985; frequency in percent).

The term flight, however, was present already before the introduction of the flying machines. In Figure 3 we see that it was present in the collection already in 1785. Together with aircraft and aeroplane, the term flight increases in frequency before WWI. During WWII the term aeroplane is more or less replaced by aircraft. During this period, the term flight keeps a high frequency, which indicates that it is related to the concept of flight and not to a specific term.

In Table 4 we can follow the evolution of the concept of flight. Between 1826 and 1833 the terms robson, flight, organ builder correspond to the names Flight & Robson, who were indeed (church) organ builders. From 1869 to 1895 the clusters contain hurdle race, flight, yard and indicate the flight over a hurdle. From 1938 to 1957 flight is clustered with terms like direction, length, spin, pace, which refer to the flight of a cricket ball. Starting in 1973 we find flight clustered with terms that represent its most common use today, a flight in a holiday sense.

Looking at the examples presented in the previous sections, we find that they differ in character. For the St. Petersburg example, we find a limited relation between the term frequencies and name changes. Instead, peaks in the frequency correspond to events. For the clusters, we also find little evidence of change. Though clusters containing a city name only exist when the city name is active, the clusters cannot directly be used to map city names automatically.

One explanation for the lack of relation can be that the clusters do not correspond to true word senses. Instead, clustering algorithms aimed at capturing entity descriptions might result in clusters that provide a better basis for finding the name changes automatically. Another possible explanation is related to the specific characteristics of individual datasets, which might be more or less suitable for deriving information about particular types of entities.

The travel example, however, is representative of concept evolution rather than name change. Here we find a strong relation between increased frequency and changed meaning. Based on the flight example we recognized two aspects. The terms aeroplane and aircraft appeared with the invention and its introduction into daily life. The term flight, however, changed or added a meaning. Also, the relation between the increase in frequency and the change in meaning for flight is strong. The flight example falls in the same category as Internet and surfing, where Internet was the invention and surfing the term that changed or added a sense as a consequence. More in-depth analysis is required to see if these relationships can be identified in an automatic fashion.

4 Conclusions and Future Work

In this study we exploited automatically identified word senses and term frequencies to investigate whether language evolution could be detected. We found that concept evolution is well represented in the word senses, and word sense tracking can thus be used for this type of language evolution tracking. However, word senses and frequency information were not sufficient to automatically find terms that replace each other over time (e.g., St. Petersburg → Petrograd). We found that frequency bursts can be caused both by language evolution and by events; however, event-driven bursts are not represented in our clusters and need to be detected using supplementary methods. As part of future work we will focus on finding more clusters to overcome the cluster sparseness and on classifying reasons for frequency bursts, e.g., strikes, fires and political events.

5 Acknowledgments

We would like to thank Times Newspapers Limited for providing the archive of The Times for our research.

References

[AS05] Andreas Abecker and Ljiljana Stojanovic. Ontology evolution: Medline case study. In Proceedings of Wirtschaftsinformatik 2005: eEconomy, eGovernment, eSociety, pages 1291-1308, 2005.

[DES04] Beate Dorow, Jean-Pierre Eckmann, and Danilo Sergi. Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination. In Workshop MEANING-2005, 2004.

[Mil95] George Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38:39-41, 1995.

[Nik10] Kai Niklas. Unsupervised post-correction of OCR errors. Master's thesis, Leibniz Universität Hannover, 2010.

[Tim08] The Times of London, 2008. http://archive.timesonline.co.uk/tol/archive/.

[TNTR10] Tahmasebi, Niklas, Theuerkauf, and Risse. Using Word Sense Discrimination on Historic Document Collections. In Proc. of 10th ACM/IEEE Joint Conference on Digital Libraries (JCDL), Surfers Paradise, Gold Coast, Australia, 2010.

[WS98] D.J. Watts and Strogatz. Collective dynamics of “small-world” networks. Nature, 393:440-442, 1998.