<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Looking for the Best Historical Window for Assessing Semantic Similarity Using Human Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Martinez-Gil</string-name>
          <email>jorge.martinez-gil@scch.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Pichler</string-name>
          <email>mario.pichler@scch.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorena Paoletti</string-name>
          <email>lorena.paoletti@scch.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Software Competence Center</institution>
          ,
          <addr-line>Hagenberg, Softwarepark 21, 4232</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Software Competence Center</institution>
          ,
          <addr-line>Hagenberg, Softwarepark 21, 4232</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe a way to benefit from broad cultural trends through the quantitative analysis of a vast digital book collection representing the digested history of humanity. Our research has revealed that appropriately comparing the occurrence patterns of words in certain periods of human literature can help us to accurately determine the semantic similarity between these words by means of computers, without requiring human intervention. Preliminary results seem to be promising.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge integration</kwd>
        <kwd>semantic similarity</kwd>
        <kwd>culturomics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>It is widely accepted that the meanings of words evolve
over time. However, it is still unclear whether word occurrences
in human literature throughout history can be meaningful for
computing word semantic similarity. By semantic similarity
measurement we mean the research challenge whereby two terms are
assigned a score based on the likeness of their meaning. Automatic
measurement of semantic similarity is considered to be of great
importance for many computer-related fields, since a wide variety
of techniques rely on it. The reason is that textual semantic
similarity measures can be used for understanding beyond the
literal lexical representation of words and phrases. For example,
it is possible to automatically identify that a specific term
(e.g., Finance) yields matches on similar terms (e.g., Economics,
Economic Affairs, Financial Affairs, etc.). This capability of
understanding beyond the lexical representation of words makes
semantic similarity methods of great importance to the Linked Data
community. For example, the ontology alignment problem can be
addressed by means of methods of this kind.
The traditional approach to this problem has consisted of using
manually compiled dictionaries to determine the semantic
similarity between terms, but an important problem remains open:
there is a gap between dictionaries and the language used by
people. The reason is a balance that every dictionary must strike:
to be comprehensive enough to be a useful reference, but concise
enough to be practically usable. For this reason, many infrequent
words are usually omitted. Therefore, how can we measure semantic
similarity in situations where terms are not covered by a
dictionary? We think Culturomics could be an answer.</p>
      <p>Culturomics consists of collecting and analyzing data from
the study of human culture. Michel et al. [8] established this
discipline by means of their seminal work, in which they presented
a corpus of digitized texts containing 5.2 million books, which
represent about 4 percent of all books ever printed. This study of
human culture through digitized books has had a strong positive
impact on our core research since its inception. In a previous
work [7], the idea of word co-occurrence in human literature for
supporting semantic correspondence discovery was explored. Now, we
go a step further with a much more complete framework that is able
to improve our past results. The main contributions presented in
this work are:
1. We propose to use culturomics to try to determine the semantic
similarity between words1 by comparing their occurrence patterns
in human literature by means of an appropriate statistical
analysis.
2. We evaluate a pool of quantitative algorithms for time series
comparison to determine which methods are most appropriate in this
context. These algorithms are applied to some statistical
transformations which can help to reduce noise.
3. We try to determine the best historical time period for
computing semantic similarity using human literature.</p>
      <p>The rest of this paper is organized as follows: Section 2
describes related approaches proposed in the literature. Section 3
describes the key ideas needed to understand our contribution.
Section 4 presents a qualitative evaluation of our method, and
finally, we draw conclusions and outline future lines of research.
1 We focus on the English language only.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>In the past, there have been great efforts in finding new
semantic similarity measures, mainly due to their fundamental
importance in many fields of modern computer science. The
detection of different formulations of the same concept is a key
method in many computer-related fields. To name only a few,
clustering [3], service matchmaking [1], web data integration [6],
and schema matching [2] all rely on good performance when
determining the meaning of data.</p>
      <p>If we focus on the field of semantic change, we can see
how authors define it as a change of one or more meanings of a
word over time. Developing automatic methods for identifying
changes in word meaning can therefore be useful both for
theoretical linguistics and for a variety of applications which
depend on lexical information. Some works have explored this path:
for instance, [10] investigated significant changes in the
distribution of terms in the Google N-gram corpus and their
relationships with emotion words, and [5] presented an approach
for the automatic detection of semantic change of words based on
distributional similarity models. Our approach is different in the
sense that we compute semantic similarity using a specific
historical window.</p>
    </sec>
    <sec id="sec-3">
      <title>3. CONTRIBUTION</title>
      <p>Our contribution is an analysis of books published
throughout history. The aim is to build novel measures which can
determine the semantic similarity of words in an automatic way.
The main reason for preferring this paradigm over a traditional
approach based on dictionaries is obvious: according to the book
library digitized by Google2, the number of words in the English
lexicon is currently above a million. Therefore, the data sets we
are using contain more words than any dictionary. For instance,
Webster's Dictionary3 currently lists far fewer than 400,000
single-word forms [8].</p>
      <p>We have chosen ten well-known algorithms for time series
comparison. This pool includes distance measures (Euclidean,
Chebyshev, Jaccard, and Manhattan), similarity measures (Cosine,
Dynamic Time Warping, Roberts, and Ruzicka), and correlation
coefficients (Pearson and Spearman) [4]. We provide a brief
description of each of these algorithms below. We consider x and y
to be the time series representations of the two words to be
compared.</p>
      <p>1. Cosine similarity is a measure between two time series
which determines the cosine of the angle between them.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \) (1)</p>
      <p>2. Euclidean distance computes the euclidean distance
between each two points along the time series.</p>
      <p>\( \mathrm{sim}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \) (2)</p>
      <p>3. Chebyshev distance takes the maximum absolute
difference between two corresponding points of the time series.</p>
      <p>\( \mathrm{sim}(x, y) = \max_{i=1}^{n} |x_i - y_i| \) (3)</p>
      <p>4. Dynamic Time Warping uses a dynamic programming
technique to determine the best alignment that will produce the
optimal distance.</p>
      <p>\( \mathrm{sim}(x, y) = \sum_{i=1,\,k=1}^{n,\,m} |x_{i_k} - y_{i_k}| \) (4)</p>
      <p>5. Jaccard distance measures the similarity of two sets by
comparing the size of the overlapping points against the size of
the two time series.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} (x_i \wedge y_i)}{\sum_{i=1}^{n} (x_i \vee y_i)} \) (5)</p>
      <p>6. Manhattan distance computes the sum of the absolute
values of the differences between the corresponding points of the
time series.</p>
      <p>\( \mathrm{sim}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \) (6)</p>
      <p>7. Pearson correlation determines the ratio between the
covariance and the product of the standard deviations of the two
time series.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \) (7)</p>
      <p>8. Roberts similarity examines the sum of each two
corresponding points, weighted by the ratio of their minimum and
maximum.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} (x_i + y_i) \frac{\min\{x_i, y_i\}}{\max\{x_i, y_i\}}}{\sum_{i=1}^{n} (x_i + y_i)} \) (8)</p>
      <p>9. Ruzicka similarity divides the sum of the minima of
each pair of corresponding points by the sum of their maxima.</p>
      <p>\( \mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} \max(x_i, y_i)} \) (9)</p>
      <p>10. Spearman's rank correlation is a statistical measure
that tries to find out whether there is a monotonic relationship
between the two time series.</p>
      <p>\( \mathrm{sim}(x, y) = 1 - \frac{6 \sum_{i} (x_i - y_i)^2}{N (N^2 - 1)} \) (10)</p>
      <p>2 http://books.google.com/ngrams</p>
      <p>3 http://www.merriam-webster.com</p>
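      <p>For illustration, the following minimal Python sketch implements three of the measures above over word time series represented as lists of yearly frequencies; the representation and all names are our own assumptions, not code from the original study.</p>
      <preformat>
import math

def euclidean(x, y):
    # Eq. (2): square root of the summed squared differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine(x, y):
    # Eq. (1): cosine of the angle between the two series.
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = (math.sqrt(sum(xi * xi for xi in x)) *
           math.sqrt(sum(yi * yi for yi in y)))
    return num / den

def pearson(x, y):
    # Eq. (7): covariance over the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (math.sqrt(sum((xi - mx) ** 2 for xi in x)) *
           math.sqrt(sum((yi - my) ** 2 for yi in y)))
    return num / den
      </preformat>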
      <p>Therefore, our contribution is a framework where the
problem is addressed from different perspectives: a) algorithms
for comparing time series similarity, b) statistical
transformations of time series using reduction, baseline removal,
rescaling, and smoothing techniques, and c) the search for the
most appropriate time window, that is, the range of years which
helps us to perform the most accurate predictions.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Working with statistical transformations</title>
      <p>Working with time series presents a number of problems,
since two similar time series can show the same pattern but
different occurrence volumes. This can be solved by means of
normalization techniques. However, there are some algorithms on
which normalization has no effect, for instance the Cosine
measure, which captures the angle between the two vectors of
numeric values.</p>
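      <p>As a minimal sketch (our own illustration; the paper does not prescribe a concrete normalization): dividing a series by its total volume is a pure rescaling and therefore does not change the angle measured by the Cosine algorithm, whereas min-max rescaling also shifts the series and can affect the other measures.</p>
      <preformat>
def volume_normalize(series):
    # Pure rescaling: divides every point by the total volume, so the
    # direction of the vector (and hence cosine similarity) is preserved.
    total = sum(series)
    return [v / total for v in series]

def min_max_rescale(series):
    # Maps the series onto [0, 1]; this shifts as well as scales,
    # so it can change the outcome of the other measures.
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]
      </preformat>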
      <p><bold>3.1.1 Smoothing of the original time series</bold></p>
      <p>Smoothing a time series consists of creating an
approximating function to capture important patterns while leaving
out noise and other disturbing phenomena. Therefore, smoothing is
a widely used technique for reducing or canceling the effect of
random variations. This technique, when properly applied, reveals
more clearly the underlying trend of the time series. We want to
run the algorithms on smoothed data because this kind of technique
can help us to obtain cleaner time series and, therefore, results
that reflect trends more clearly.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Looking for the best time window</title>
      <p>The methods presented so far can give us some advice about
which direction should be explored. However, these results are far
from optimal. One of the main reasons is that we have only focused
on a fixed time period. In order to overcome this limitation, we
have designed an algorithm that tries to capture the optimal time
window for solving the Miller-Charles benchmark data set.</p>
    </sec>
    <sec id="sec-6">
      <title>4. EVALUATION</title>
      <p>We report our results using the 1-gram data set offered by
Google4. The data is in the range between 1800 and 2000. The
reason is that there are not enough books before 1800 to reliably
quantify many of the queries from the data sets we are using; on
the other hand, after the year 2000 the quality of the corpus is
lower, since it is subject to many changes. Results are obtained
according to the Miller-Charles data set [9]. The rationale behind
this way of evaluating quality is that the results obtained by
means of artificial techniques can be compared to human
judgments.</p>
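      <p>Assuming, as is usual for the Miller-Charles benchmark, that quality is measured as the correlation between machine scores and the human ratings, the evaluation loop can be sketched as follows; the benchmark entries and function names are placeholders, and pearson() refers to the sketch in Section 3.</p>
      <preformat>
def evaluate(measure, benchmark, series_of):
    # benchmark: iterable of (word1, word2, human_rating) triples.
    # series_of: maps a word to its time series for the chosen window.
    human, machine = [], []
    for w1, w2, rating in benchmark:
        human.append(rating)
        machine.append(measure(series_of(w1), series_of(w2)))
    # Correlate machine scores with human judgments (see Section 3).
    return pearson(machine, human)
      </preformat>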
    </sec>
    <sec id="sec-7">
      <title>4.1 Evaluation with classic algorithms</title>
      <p>Table 1 shows the results over the raw data. The Euclidean
distance presents the best performance. However, the scores
obtained are very low. This is the reason we propose to apply some
statistical transformations.</p>
    </sec>
    <sec id="sec-8">
      <title>4.2 Statistical transformations over the time series</title>
      <p>4 https://books.google.com/ngrams</p>
      <sec id="sec-8-1">
        <title>Algorithm</title>
        <p>Cosine
Chebyshev</p>
        <p>DTW
Euclidean</p>
        <p>Jaccard
Manhattan</p>
        <p>Pearson
Roberts
Ruzicka
Spearman
Score
0.28
0.29
0.35
0.32
nosense
0.26
0.28
0.23
0.24
0.36</p>
      <p>One of the best-known smoothing methods is the Moving
Average (MA) technique, which takes a certain number of past
periods, adds them together, and divides by the number of periods.
Table 5 shows the results when using time series smoothed with MA
for periods of 5, 10, 20, and 50 years respectively. Another
popular smoothing method is the Exponential Moving Average (EMA)
technique, which applies more weight to recent data; the weighting
applied to the most recent data depends on the number of periods.
Table 6 shows the results when using time series smoothed with EMA
for periods of 5, 10, 20, and 50 years.</p>
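      <p>Both smoothing techniques can be sketched as follows (our own illustration; the alpha = 2 / (N + 1) weighting for EMA is a common convention that the paper does not state explicitly).</p>
      <preformat>
def moving_average(series, period):
    # Each point becomes the mean of the current and up to
    # period - 1 preceding observations.
    out = []
    for i in range(len(series)):
        window = series[max(0, i - period + 1):i + 1]
        out.append(sum(window) / len(window))
    return out

def exponential_moving_average(series, period):
    # More weight on recent data; the weight depends on the period.
    alpha = 2.0 / (period + 1)
    ema = [series[0]]
    for v in series[1:]:
        ema.append(alpha * v + (1 - alpha) * ema[-1])
    return ema
      </preformat>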
    </sec>
    <sec id="sec-9">
      <title>4.3 Best historical window</title>
      <p>Until now, we have only focused on the fixed time period
between 1800 and 2000. In order to overcome this limitation, we
have designed an algorithm that tries to capture the optimal time
window for solving the Miller-Charles benchmark data set. The
algorithm we have designed is able to test every possible
configuration of time window (with a minimum size of 2 years),
computational algorithm, and statistical transformation of the
data. This means we have automatically tested 2,412,000 different
configurations (20,100 different windows over 12 different
statistical transformations using 10 different algorithms). The
best results we have achieved are summarized in Table 7. We can
see that using the Pearson correlation coefficient between the
years 1935 and 1942 on raw data, or between 1806 and 1820 over a
moving average of five years, allows us to solve the
Miller-Charles benchmark data set [9] with high accuracy.</p>
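      <p>The exhaustive search can be sketched as follows; the helper names refer to the earlier sketches and are our own assumptions. Enumerating all ordered pairs of distinct years in 1800-2000 yields exactly the 20,100 windows mentioned above.</p>
      <preformat>
from itertools import combinations

def best_window(series_by_word, benchmark, algorithms, transformations):
    # algorithms / transformations: dicts mapping a name to a function.
    best_config, best_score = None, -1.0
    for start, end in combinations(range(1800, 2001), 2):  # 20,100 windows
        for tname, transform in transformations.items():
            for aname, measure in algorithms.items():
                def series_of(word):
                    # Slice the word's yearly series to the window,
                    # then apply the statistical transformation.
                    return transform(series_by_word[word][start - 1800:end - 1799])
                score = evaluate(measure, benchmark, series_of)
                if score > best_score:
                    best_config, best_score = (start, end, tname, aname), score
    return best_config, best_score
      </preformat>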
      <sec id="sec-9-1">
        <title>Algorithm</title>
        <p>Cosine
Chebyshev</p>
        <p>DTW
Euclidean</p>
        <p>Jaccard
Manhattan</p>
        <p>Pearson
Roberts
Ruzicka
Spearman
Score
0.28
0.41
0.35
0.30
0.37
0.22
0.28
0.28
0.26
0.37</p>
      <p>This means that our hypothesis, stating that an
appropriate combination of algorithms, statistical
transformations, and time windows can lead to positive results, is
confirmed.</p>
    </sec>
    <sec id="sec-10">
      <title>5. CONCLUSIONS</title>
      <p>We have described how we have performed a quantitative
analysis of a vast digital book collection, representing a
significant sample of the history of literature, to solve problems
related to semantic similarity. In fact, we have shown that
appropriately choosing a combination of quantitative algorithms
for comparing the time series representing occurrence patterns,
statistical transformations of the source data which can help to
reduce noise, and the right time window can provide very accurate
results when measuring semantic similarity between single
words.</p>
      <sec id="sec-10-1">
        <title>Algorithm</title>
        <p>Cosine
Chebyshev</p>
        <p>
          DTW
Euclidean
Jaccard
Manhattan
Pearson
Roberts
Ruzicka
Spearman
M(
          <xref ref-type="bibr" rid="ref5">5</xref>
          )
0.27
0.27
0.22
0.30
0.20
0.29
0.23
0.10
0.11
0.08
M(
          <xref ref-type="bibr" rid="ref10">10</xref>
          )
0.25
0.27
0.24
0.29
0.21
0.29
0.21
0.10
0.11
0.08
M(20)
0.24
0.25
0.24
0.29
0.19
0.28
0.18
0.10
0.11
0.08
M(50)
0.25
0.22
0.25
0.28
0.31
0.27
0.15
0.11
0.12
0.09
        </p>
      <sec id="sec-10-2">
        <title>Algorithm</title>
        <p>Pearson
Pearson
Pearson</p>
      </sec>
      <sec id="sec-10-3">
        <title>Data</title>
        <p>
          Raw Data
MA(50)
EMA(
          <xref ref-type="bibr" rid="ref5">5</xref>
          )
        </p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgments</title>
      <p>We thank the reviewers for their useful comments. This
work has been funded by the project ACEPROM (Proj. Nr. 841284) of
the Austrian Research Promotion Agency (FFG).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bianchini</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Antonellis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melchiori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Flexible Semantic-Based Service Matchmaking and Discovery</article-title>
          .
          <source>World Wide Web</source>
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>227</fpage>
          -
          <lpage>251</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Castano</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montanelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lorusso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Instance Matching for Ontology Population</article-title>
          .
          <source>SEBD</source>
          <year>2008</year>
          :
          <fpage>121</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>De Virgilio</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miscione</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Cluster-Based Exploration for Effective Keyword Search over Semantic Datasets</article-title>
          .
          <source>ER</source>
          <year>2009</year>
          :
          <fpage>205</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Deza</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <source>Encyclopedia of Distances</source>
          . Springer (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Gulordava</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus</article-title>
          .
          <source>GEMS</source>
          <year>2011</year>
          :
          <fpage>67</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Martinez-Gil</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aldana-Montes</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          <article-title>Semantic similarity measurement using historical google search patterns</article-title>
          .
          <source>Inf. Syst. Frontiers</source>
          <volume>15</volume>
          (
          <issue>3</issue>
          ):
          <fpage>399</fpage>
          -
          <lpage>410</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Martinez-Gil</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pichler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Analysis of word co-occurrence in human literature for supporting semantic correspondence discovery</article-title>
          .
          <source>I-KNOW</source>
          <year>2014</year>
          :
          <fpage>1:1</fpage>
          -
          <lpage>1:7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pickett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoiberg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clancy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norvig</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orwant</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>Quantitative analysis of culture using millions of digitized books</article-title>
          ,
          <source>Science</source>
          <volume>331</volume>
          (
          <issue>6014</issue>
          ):
          <fpage>176</fpage>
          -
          <lpage>182</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charles</surname>
            ,
            <given-names>W.G.</given-names>
          </string-name>
          <article-title>Contextual correlates of semantic similarity</article-title>
          .
          <source>Language and Cognitive Processes</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Behind the Times: Detecting Epoch Changes using Large Corpora</article-title>
          .
          <source>IJCNLP</source>
          <year>2013</year>
          :
          <fpage>347</fpage>
          -
          <lpage>355</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>