=Paper=
{{Paper
|id=Vol-1495/paper_5
|storemode=property
|title=Measuring the Relatedness between Documents in Comparable Corpora
|pdfUrl=https://ceur-ws.org/Vol-1495/paper_5.pdf
|volume=Vol-1495
|dblpUrl=https://dblp.org/rec/conf/tia/CostaPM15
}}
==Measuring the Relatedness between Documents in Comparable Corpora==
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
Measuring the Relatedness between Documents in Comparable Corpora
Hernani Costa (a), Gloria Corpas Pastor (a) and Ruslan Mitkov (b)
(a) LEXYTRAD, University of Malaga, Spain
(b) RIILP, University of Wolverhampton, UK
{hercos,gcorpas}@uma.es, r.mitkov@wlv.ac.uk
Abstract

This paper aims at investigating the use of textual distributional similarity measures in the context of comparable corpora. We address the issue of measuring the relatedness between documents by extracting, measuring and ranking their common content. For this purpose, we designed and applied a methodology that combines available natural language processing technology with statistical methods. Our findings showed that a list of common entities and a simple, yet robust set of distributional similarity measures were enough to describe and assess the degree of relatedness between the documents. Moreover, our method demonstrated high performance in the task of filtering out documents with a low level of relatedness. By way of example, one of the measures achieved 100%, 100%, 95% and 90% precision when 5%, 10%, 15% and 20% of noise was injected, respectively.

1 Introduction

Comparable corpora(1) can be considered an important resource for several research areas such as Natural Language Processing (NLP), terminology, language teaching, and automatic and assisted translation, amongst other related areas. Nevertheless, an inherent problem for those who deal with comparable corpora on a daily basis is the uncertainty about the data they are dealing with. Indeed, little work has been done on semi- or fully automatically characterising such linguistic resources, and attempting a meaningful description of their content is often a perilous task (Corpas Pastor and Seghiri, 2009). Usually, a corpus is given a short description such as “casual speech transcripts” or “tourism specialised comparable corpus”. Yet, such tags are of little use to users seeking a representative and/or high-quality domain-specific corpus. Apart from the usual description that comes along with the corpus, such as the number of documents, tokens and types, the source(s), the creation date, the usage policies, etc., nothing is said about how similar the documents are or how to retrieve the most related ones. As a result, most of the resources at our disposal are built and shared without a deep analysis of their content, and those who use them blindly trust the name of the person or research group behind their compilation, without knowing anything about the relatedness of the documents. Although some tasks require documents with a high degree of relatedness to each other, the literature is scarce on this matter.

Accordingly, this work explores this niche by taking advantage of several textual Distributional Similarity Measures (DSMs) presented in the literature. Firstly, we selected a specialised corpus on the tourism and beauty domain that was manually compiled by researchers in the area of translation and interpreting studies. Then, we designed and applied a methodology that exploits available NLP technology together with statistical methods to assess how the documents correlate with each other in the corpus. Our assumption is that the amount of information contained in a document can be evaluated by summing the amount of information contained in the member words.

(1) I.e. corpora that include similar types of original texts in one or more languages, compiled using the same design criteria (cf. EAGLES, 1996; Corpas Pastor, 2001).
For this purpose, a list of common entities was used as a unit of measurement capable of identifying the amount of information shared between the documents. Our hypothesis is that this approach will allow us to compute the relatedness between documents, to describe and characterise the corpus itself, and to rank the documents by their degree of relatedness. In order to evaluate how the DSMs perform the task of ranking documents based on their similarity and filtering out the unrelated ones, we introduced noisy documents, i.e. out-of-domain documents, into the corpus in hand.

The remainder of the paper is structured as follows. Section 2 introduces some fundamental concepts related to DSMs, i.e. it explains the theoretical foundations, the related work and the DSMs exploited in this experiment. Then, Section 3 presents the corpora used in this work. After applying the methodology described in Section 4, Section 5 presents and discusses the obtained results in detail. Finally, Section 6 presents the final remarks and highlights our future work.

2 Distributional Similarity Measures

Information Retrieval (IR) (Singhal, 2001) is the task of locating specific information within a collection of documents or other natural language resources according to some request. This field is rich in statistical methods that use words and their (co-)occurrence to retrieve documents or sentences from large data sets. In simple words, these IR methods aim to find the most frequently used words and treat the rate of usage of each word in a given text as a quantitative attribute. Then, these words serve as features for a given statistical method. Following Harris’ distributional hypothesis (Harris, 1970), which assumes that similar words tend to occur in similar contexts, these statistical methods are suitable, for instance, for finding similar sentences based on the words they contain (Costa et al., 2015) and for automatically extracting or validating semantic entities from corpora (Costa et al., 2010; Costa, 2010; Costa et al., 2011). To this end, it is assumed that the amount of information contained in a document can be evaluated by summing the amount of information contained in the document words. In turn, the amount of information conveyed by a word can be represented by means of the weight assigned to it (Salton and Buckley, 1988).

Having this in mind, we took advantage of two IR measures commonly used in the literature, the Spearman’s Rank Correlation Coefficient (SCC) and the Chi-Square (χ²), to compute the similarity between documents written in the same language (see sections 2.1 and 2.2). Both measures are particularly useful for this task because they are independent of text size (mostly because both use a list of the common entities), and they are language-independent.

The SCC distributional measure has been shown to be effective in determining similarity between sentences, documents and even corpora of varying sizes (Kilgarriff, 2001; Costa et al., 2015; Costa, 2015). It is particularly useful, for instance, for measuring the textual similarity between documents because it is easy to compute and is independent of text size, as it can directly compare ranked lists for large and small texts.

The χ² similarity measure has also shown its robustness and high performance. By way of example, χ² has been used to analyse the conversation component of the British National Corpus (Rayson et al., 1997), to compare both documents and corpora (Kilgarriff, 2001; Costa, 2015), and to identify topic-related clusters in imperfectly transcribed documents (Ibrahimov et al., 2002). It is a simple statistical measure that permits assessing whether the relationship between two variables in a sample is due to chance or is systematic.

Bearing this in mind, distributional similarity measures in general, and SCC and χ² in particular, have a wide range of applications (Kilgarriff, 2001; Costa et al., 2015; Costa, 2015). Indeed, this work aims at proving that these simple, yet robust and high-performance measures allow us to describe the relatedness between documents in specialised corpora and to rank them according to their similarity.

2.1 Spearman’s Rank Correlation Coefficient (SCC)

In this work, the SCC is adopted and calculated as in Kilgarriff (2001). Firstly, a list of the common entities(2) L between two documents dl and dm is compiled, where L_{dl,dm} ⊆ (dl ∩ dm).

(2) In this work, the term ‘entity’ refers to “single words”, which can be a token, a lemma or a stem.
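The compilation of such a common-entity list, with each entity’s frequency in both documents, can be sketched as follows. This is an illustrative reconstruction in Python, not the authors’ implementation; all names are ours, and the inputs are assumed to be already tokenised (or lemmatised/stemmed) and stopword-filtered:

```python
from collections import Counter

def common_entities(doc_a, doc_b):
    """Build the list L of common entities between two documents.

    doc_a, doc_b: lists of entities (tokens, lemmas or stems).
    Returns {entity: (freq_in_a, freq_in_b)} for every entity that
    occurs at least once in both documents.
    """
    freq_a, freq_b = Counter(doc_a), Counter(doc_b)
    shared = freq_a.keys() & freq_b.keys()  # entities present in both
    return {e: (freq_a[e], freq_b[e]) for e in shared}

# toy example with two tiny "documents"
a = ["spa", "hotel", "beach", "spa", "massage"]
b = ["hotel", "spa", "wellness", "hotel"]
L = common_entities(a, b)  # {'spa': (2, 1), 'hotel': (1, 2)}
```

The number of entries in this list is the NCE feature used later in the paper; its keys with their paired frequencies feed the SCC and χ² computations.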
It is possible to use the top n most common entities or all the common entities between two documents, where n corresponds to the total number of common entities considered, |L|, i.e. {n | n ∈ ℕ₀, n ≤ |L|}; in this work we use all the common entities for each document pair, i.e. n = |L|. Then, for each document the list of common entities (e.g. L_dl and L_dm) is ranked by frequency in ascending order (RL_dl and RL_dm), where the entity with the lowest frequency receives the numerical ranking position 1 and the entity with the highest frequency receives the numerical ranking position n. Finally, for each common entity {e_1, ..., e_n} ∈ L, the difference in the rank orders for the entity in each document is computed, and then normalised as the sum of the squares of these differences (Σ_{i=1}^{n} s_i²). The final SCC equation is presented in expression 1, where {SCC | SCC ∈ ℝ, −1 ≤ SCC ≤ 1}.

    SCC(dl, dm) = 1 − (6 · Σ_{i=1}^{n} s_i²) / (n³ − n)    (1)

2.2 Chi-Square (χ²)

The Chi-Square (χ²) measure also uses a list of common entities (L). Similarly to the SCC, it is possible to use the top n most common entities or all the common entities between two documents, and again, we use all the common entities for each document pair, i.e. n = |L|. The number of occurrences of a common entity in L that would be expected in each document is calculated from the frequency lists. If the sizes of the documents dl and dm are Nl and Nm, and the entity e_i has the observed frequencies O(e_i, dl) and O(e_i, dm), then the expected values are

    E(e_i, dl) = Nl · (O(e_i, dl) + O(e_i, dm)) / (Nl + Nm)  and  E(e_i, dm) = Nm · (O(e_i, dl) + O(e_i, dm)) / (Nl + Nm).

Equation 2 presents the χ² formula, where O is the observed frequency and E the expected frequency. The resulting χ² score should be interpreted as the inter-document distance between two documents. It is also important to mention that {χ² | χ² ∈ ℝ, −1 ≤ χ² < ∞}, which means that the more unrelated the common entities in L are, the lower the χ² score will be.

    χ²(dl, dm) = Σ (O − E)² / E    (2)

3 Corpora

INTELITERM(3) is a specialised comparable corpus composed of documents collected from the Internet. It was manually compiled by researchers with the purpose of building a representative corpus (Biber, 1988, p. 246) for the tourism and beauty domain. It contains documents in four different languages (English, Spanish, Italian and German). Some of the texts are translations of each other (parallel), yet the majority is composed of original texts. The corpus is composed of several subcorpora, divided by language and, within each language, further into translated and original texts. For the purpose of this work, only the original documents in English, Spanish and Italian were used, which from now on will be referred to as int_en, int_es and int_it, respectively.

In order to analyse how the DSMs perform the task of ranking documents based on their similarity and filtering out the unrelated ones, it is necessary to introduce noisy documents, i.e. out-of-domain documents, into the various subcorpora. To do that, we chose the well-known Europarl(4) corpus (Koehn, 2005), a parallel corpus composed of proceedings of the European Parliament. As mentioned further in section 5.2, we added different amounts of noise to the various subcorpora, more precisely 5%, 10%, 15% and 20%. These noisy documents were randomly selected from the “one per day” Europarl v.7 for the three working languages: English, Spanish and Italian (eur_en, eur_es and eur_it, respectively).

All the statistical information about both the INTELITERM subcorpora and the sets of noisy documents (up to 20%), randomly selected for each working language, is presented in Table 1.

    SubC.     nDocs   types   tokens   types/tokens
    int_en    151     11.6k   496.2k   0.023
    eur_en    30      3.4k    29.8k    0.116
    int_es    224     13.2k   207.3k   0.063
    eur_es    44      5.6k    43.5k    0.129
    int_it    150     19.9k   386.2k   0.052
    eur_it    30      4.7k    29.6k    0.159

Table 1: Statistical information per subcorpus.

(3) http://www.lexytrad.es/proyectos.html
(4) http://www.statmt.org/europarl/
In detail, this table shows: the number of documents (nDocs); the number of types (types); the number of tokens (tokens); and the ratio of types to tokens (types/tokens) per subcorpus. These values were obtained using the AntConc 3.4.3 software (Anthony, 2014), a corpus analysis toolkit for concordancing and text analysis.

4 Methodology

This section describes the methodology employed to calculate and rank documents based on their similarity using Distributional Similarity Measures (DSMs). All the tools, libraries and frameworks used for the purpose in hand are also pointed out.

1) Data preprocessing: firstly, all the INTELITERM documents were processed with the OpenNLP(5) Sentence Detector and Tokeniser. Then, the annotation process was done with the TT4J(6) library, a Java wrapper around the popular TreeTagger (Schmid, 1995) – a tool specifically designed to annotate text with part-of-speech and lemma information. Regarding the stemming, we used the Porter stemmer algorithm provided by the Snowball(7) library. A method to remove punctuation and special characters within the words was also implemented. Finally, in order to get rid of the noise, a stopword list(8) was compiled to filter out the most frequent words in the corpus. Once a document is processed and its sentences are tokenised, lemmatised and stemmed, our system creates a new output file with all this new information, i.e. a new document containing the original, the tokenised, the lemmatised and the stemmed text. Using the stopword list mentioned above, a Boolean vector describing whether each entity is a stopword or not is also added to the document. This way, the system is able to use only the tokens, lemmas and stems that are not stopwords.

2) Identifying the list of common entities between documents: in order to identify a list of common entities (from now on we will use the acronym NCE), a co-occurrence matrix was built for each pair of documents. Only those entities that have at least one occurrence in both documents are considered. As required by the DSMs (see section 2), their frequency in both documents is also stored within this matrix (L_{dl,dm} = {e_i, (f(e_i, dl), f(e_i, dm)); e_j, (f(e_j, dl), f(e_j, dm)); ...; e_n, (f(e_n, dl), f(e_n, dm))}, where f represents the frequency of an entity in a document). With the purpose of analysing and comparing the performance of different DSMs, three different lists were created to be used as input features: the first one using the Number of Common Tokens (NCT), another using the Number of Common Lemmas (NCL) and the third one using the Number of Common Stems (NCS).

3) Computing the similarity between documents: the similarity between documents was calculated by applying three different DSMs (DSMs = {DSM_NCE, DSM_SCC, DSM_χ²}, where NCE, SCC and χ² refer to the Number of Common Entities, the Spearman’s Rank Correlation Coefficient and the Chi-Square, respectively), each one calculated using the three different input features (NCT, NCL and NCS).

4) Computing the document final score: the document final score DSM(dl) is the mean of the similarity scores of the document with all the other documents in the collection, i.e. DSM(dl) = (Σ_{i=1}^{n−1} DSM_i(dl, d_i)) / (n − 1), where n corresponds to the total number of documents in the collection and DSM_i(dl, d_i) is the resulting similarity score between the document dl and each of the other documents in the collection.

5) Ranking documents: finally, the documents were ranked in descending order according to their DSM scores (i.e. NCE, SCC or χ²).

(5) https://opennlp.apache.org
(6) http://reckart.github.io/tt4j/
(7) http://snowball.tartarus.org
(8) Freely available for download at https://github.com/hpcosta/stopwords.
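As a concrete illustration of step 3, the two measures defined in expressions 1 and 2 (sections 2.1 and 2.2) might be sketched as below. This is our reconstruction in Python, not the authors’ implementation; all names are ours, and ties in the frequency ranking are broken arbitrarily, since the paper does not specify a tie-handling policy:

```python
def scc(common):
    """Spearman's Rank Correlation Coefficient (expression 1).

    common: {entity: (freq_in_dl, freq_in_dm)}, the common-entity list L.
    Ranks are assigned by ascending frequency in each document
    (lowest frequency gets rank 1); ties are broken arbitrarily.
    """
    n = len(common)
    if n < 2:
        return 1.0  # guard: expression 1 is undefined for n < 2
    ranks = {}
    for side in (0, 1):  # rank entities separately within each document
        ordered = sorted(common, key=lambda e: common[e][side])
        for rank, e in enumerate(ordered, start=1):
            ranks.setdefault(e, []).append(rank)
    s2 = sum((r_l - r_m) ** 2 for r_l, r_m in ranks.values())
    return 1 - (6 * s2) / (n ** 3 - n)

def chi_square(common, size_dl, size_dm):
    """Chi-Square score (expression 2): sum of (O - E)^2 / E over both
    documents, with expected counts E derived from the pooled observed
    frequencies and the document sizes Nl and Nm."""
    total = size_dl + size_dm
    score = 0.0
    for o_l, o_m in common.values():
        pooled = o_l + o_m
        e_l = size_dl * pooled / total
        e_m = size_dm * pooled / total
        score += (o_l - e_l) ** 2 / e_l + (o_m - e_m) ** 2 / e_m
    return score
```

With identical frequency rankings the SCC yields 1, and with fully reversed rankings it yields −1, matching the stated range of expression 1.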
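Steps 4 and 5, computing each document’s final score as the mean of its pairwise similarities and then ranking by that score, can be sketched as follows (again an illustrative reconstruction; the `nce` measure and all names are ours):

```python
from itertools import combinations

def rank_documents(docs, measure):
    """Steps 4-5: score each document by the mean of its pairwise
    similarity with every other document, then rank in descending order.

    docs: {doc_id: list of entities}; measure: f(doc_a, doc_b) -> float.
    Returns a list of (doc_id, final_score) pairs, best first.
    """
    ids = list(docs)
    pair = {}
    for a, b in combinations(ids, 2):  # each unordered pair once
        s = measure(docs[a], docs[b])
        pair[a, b] = pair[b, a] = s
    final = {a: sum(pair[a, b] for b in ids if b != a) / (len(ids) - 1)
             for a in ids}
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

# toy example using DSM_NCE (number of common entities) as the measure
nce = lambda a, b: len(set(a) & set(b))
docs = {"d1": ["spa", "hotel", "beach"],
        "d2": ["spa", "hotel"],
        "d3": ["tax", "law"]}
ranking = rank_documents(docs, nce)  # the unrelated d3 ends up last
```

Any of the three DSMs (NCE, SCC or χ²) can be plugged in as `measure`; only the pairwise scores change, not the ranking mechanics.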
5 Results and Analysis

This experiment is divided into two parts. In the first part (section 5.1), we describe the corpus in hand by applying three different Distributional Similarity Measures (DSMs): the Number of Common Entities (NCE), the Spearman’s Rank Correlation Coefficient (SCC) and the Chi-Square (χ²). As input features to the DSMs, three different lists of entities were used, i.e. the Number of Common Tokens (NCT), the Number of Common Lemmas (NCL) and the Number of Common Stems (NCS). By way of example, Table 2 shows the NCT between documents, the SCC and χ² scores and averages (av), along with the associated standard deviations (σ), per measure and subcorpus. Figure 1 presents the resulting average scores per document in box plot format for all the combinations of DSM vs. feature. Each box plot displays the full range of variation (from min to max), the likely range of variation (the interquartile range, or IQR), the median, and the high maximums and low minimums (also known as outliers). It is important to mention that for the first part of this experiment (section 5.1) we did not use a sample, but instead the entire INTELITERM subcorpora in their original size and form, which means that all the obtained results and observations come from the entire population, in this case the English (int_en), Spanish (int_es) and Italian (int_it) subcorpora (for more details about the subcorpora see section 3). Regarding the second part of this experiment, we used the same subcorpora, but an additional percentage of documents was added to them in order to test how the DSMs perform the task of filtering out these noisy documents, i.e. out-of-domain documents (see section 5.2). In detail, Figure 2 shows how the average scores decrease when injecting noisy documents, and Table 3 presents how the DSMs performed when that noise was injected.

5.1 Describing the Corpus

The first observation we can make from Figure 1 is that the distributions between the features are quite similar (see for instance Figures 1a, 1d and 1g). This means that it is possible to achieve acceptable results using only raw words (i.e. tokens). Stems and lemmas require more processing power and time to be used as features – especially lemmas, due to the part-of-speech tagger dependency and the time-consuming process implied. In general, we can say that the scores for each subcorpus are symmetric (roughly the same on each side when cut down the middle), which means that the data is normally distributed. There are some exceptions that we will discuss along this section. Another interesting observation is related to the high Number of Common Tokens (NCT) in English (int_en) when compared with Italian and Spanish (int_it and int_es, respectively); see Table 2 and Figure 1a. Later in this section, we will try to explain this phenomenon.

    SubC.     Stats   NCT      SCC    χ²
    int_en    av      163.70   0.42   279.39
              σ       83.87    0.05   177.45
    int_es    av      31.97    0.41   40.92
              σ       23.48    0.07   38.21
    int_it    av      101.08   0.39   201.97
              σ       55.71    0.05   144.68

Table 2: Average and standard deviation of the common tokens, SCC and χ² scores between documents per subcorpus.

Although the NCT per document is higher on average for the int_en subcorpus, the interquartile range (IQR) is larger than for the other subcorpora (see Table 2 and Figure 1a), which means that the middle 50% of the data is more spread out and thus the average NCT per document is more variable. Moreover, the long whiskers (the lines extending vertically from the box) in Figure 1a also indicate variability outside the upper and lower quartiles. Therefore, we can say that int_en contains a wide variety of documents and consequently some of them are only roughly correlated with the rest of the subcorpus. Nevertheless, the data is skewed left, and the long whisker outside the upper quartile indicates that the majority of the data is strongly similar, i.e. the documents have a high degree of relatedness to each other. This idea is sustained not only by the positive average SCC scores, but also by the set of outliers above the upper whisker in Figure 1b. The average SCC score of 0.42 with σ = 0.05 also implies a strong correlation between the documents in the int_en subcorpus (Table 2). Likewise, the long whisker and the set of outliers outside the upper quartile in the χ² scores also indicate a high relatedness between the documents.

Regarding the int_it subcorpus, the SCC and χ² scores (Figures 1b and 1c) and the average of 101.08 common tokens per document with σ = 55.71 (Figure 1a and Table 2) suggest that the data is normally distributed (Figure 1b) and highly correlated.
[Figure 1 contains nine box plots of the average score per document for each subcorpus (int_en, int_es, int_it): (a) common tokens, (b) SCC over tokens, (c) χ² over tokens; (d) common lemmas, (e) SCC over lemmas, (f) χ² over lemmas; (g) common stems, (h) SCC over stems, (i) χ² over stems.]

Figure 1: INTELITERM: average scores between documents per subcorpus.
Although this subcorpus got lower average scores for all the DSMs when compared to the English subcorpus, Table 2 and Figures 1a, 1b and 1c show that its average scores and range of variation are quite similar to those of the English subcorpus. Therefore, we can conclude that the documents inside the Italian subcorpus are highly related to each other.

Of the three subcorpora, the int_es subcorpus is the biggest one, with 224 documents (Table 1). Nevertheless, its average scores per document are slightly different from the other box plots (see Figures 1a, 1b and 1c). The χ² standard deviation being practically equal to its average (38.21 and 40.92, respectively) and the SCC variability inside and outside the IQR indicate some inconsistency in the data. Moreover, Table 2 and Figure 1a reveal a lower NCT compared with the int_en and int_it subcorpora.

The int_en subcorpus has 163 common tokens per document on average with σ = 83, whereas the int_it and int_es subcorpora only have 101 and 31 common tokens per document on average with σ = 55 and σ = 23, respectively (Table 2, NCT column). This means that the int_it and int_es subcorpora are composed of documents with a lower level of relatedness when compared with the English one. This could happen because Italian and Spanish have a richer morphology than English: due to the bigger number of inflected forms per lemma, there is a larger number of tokens and consequently fewer common tokens per document in Spanish.
Another explanation could come from the fact that tourism and beauty services are more developed in Italy and Spain than in the UK, and therefore there is more variety in the vocabulary used as well as in the services offered. Indeed, Table 1 offers some evidence about the employed vocabulary. The English subcorpus has a lower number of types and a higher number of tokens (11.6k and 496.2k, respectively) when compared with the Italian (19.9k types and 386.2k tokens) and Spanish subcorpora (13.2k types and 207.3k tokens). The large difference in the average number of common tokens per document between Spanish and the other two languages could also be related to the marketing strategies used to advertise tourism and beauty services, which is somewhat hard to confirm. In addition, although our method is able to capture the lexical level of similarity between the documents, the semantic level is not taken into account, i.e. it does not consider synonyms as similar words, for example, which would result in slightly different similarity scores (again, an explanation difficult to confirm). To conclude, we can state from the statistical and theoretical evidence that the int_en and int_it subcorpora appear to assemble highly correlated documents. We cannot say the same for the int_es subcorpus: due to the scarceness of evidence, we can only not reject the idea that this subcorpus is composed of similar documents. Nevertheless, as we will see in the next section, the fact that int_es is composed of documents with low relatedness (according to our findings) will affect the ranking task.

5.2 Measuring DSMs Performance

The second part of this experiment aims at assessing how the DSMs perform the task of filtering out documents with a low level of relatedness. To do that, we injected different sets of out-of-domain documents, randomly selected from the Europarl corpus, into the original INTELITERM subcorpora. More precisely, we injected 5%, 10%, 15% and 20% of noise(9) into the various subcorpora. As we can see in Figure 2, the more noisy documents are injected, the lower the NCT. Then, the methodology described in Section 4 was applied to these twelve “new” subcorpora (int_en05, int_en10, ..., int_it15 and int_it20; see Figure 2). As a result, at this point we have the documents ranked in descending order according to their DSM scores.

[Figure 2 contains a box plot of the average number of common tokens per document for the twelve subcorpora with 5%, 10%, 15% and 20% of injected noise (int_en05 through int_it20).]

Figure 2: Average scores between documents when injecting 5%, 10%, 15% and 20% of noise into the various subcorpora.

In order to evaluate the DSMs’ precision, we analysed the first n positions in the ranking lists produced by the three DSMs (individually), where n is the number of original documents in a given INTELITERM subcorpus. Table 3 presents the precision values obtained by the DSMs when injecting different amounts of noise into the various original subcorpora.

    SubC      Noise   NCT    SCC    χ²
    int_en    5%      0.89   0.22   1.00
              10%     0.73   0.33   1.00
              15%     0.73   0.36   0.95
              20%     0.80   0.37   0.90
    int_es    5%      0.00   0.00   0.38
              10%     0.07   0.07   0.20
              15%     0.09   0.09   0.17
              20%     0.14   0.18   0.23
    int_it    5%      0.88   0.13   0.88
              10%     0.82   0.06   0.82
              15%     0.74   0.09   0.83
              20%     0.73   0.13   0.87

Table 3: DSMs’ precision when injecting different amounts of noise into the various subcorpora.

As expected, none of the DSMs got acceptable results for Spanish, being incapable of correctly identifying the noisy documents. However, we need to be aware that this happened due to the pre-existing low level of relatedness between the original documents in the int_es subcorpus (see Section 5.1 for more details). On the other hand, the DSMs show promising results for English and Italian.

(9) The number of documents corresponding to these percentages can be inferred from Table 1.
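The precision protocol just described, checking how many of the first n ranking positions are occupied by original documents rather than injected noise, can be sketched as follows (an illustrative reconstruction; names and the toy ranking are ours):

```python
def precision_at_n(ranking, original_ids):
    """Precision over the first n ranking positions, where n is the
    number of original (in-domain) documents.

    ranking: list of (doc_id, score) pairs, best first.
    original_ids: set of ids of the original documents.
    Returns the fraction of the top-n positions held by original documents.
    """
    n = len(original_ids)
    top = [doc_id for doc_id, _ in ranking[:n]]
    return sum(doc_id in original_ids for doc_id in top) / n

# toy example: two original documents and one injected noisy document;
# the noisy document slipped into the top-2 positions, so precision is 0.5
ranking = [("d1", 0.91), ("noise1", 0.48), ("d2", 0.40)]
print(precision_at_n(ranking, {"d1", "d2"}))  # 0.5
```

Under this protocol a perfect measure pushes every injected document below position n, giving precision 1.0, which is what χ² achieves for int_en at 5% and 10% noise in Table 3.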
By way of example, the χ² was capable of reaching 100% precision when injecting 5% and 10% of noise into the int_en subcorpus, and still 90% when injecting 20%. Although the NCT got lower precision in general when compared with the χ², it still reached 80% and 73% when injecting 20% of noise into the English and Italian subcorpora, respectively. From the evidence shown in Table 3, we can say that the NCT and the χ² are suitable for the task of filtering out documents with low relatedness with a high degree of precision. The same cannot be said of the SCC measure, at least for this specific task.

6 Conclusions and Future Work

In this paper we presented a simple methodology and studied various Distributional Similarity Measures (DSMs) for the purpose of measuring the relatedness between documents in specialised comparable corpora. As input for these DSMs, we used three different input features (lists of common tokens, lemmas and stems). In the end, we conclude that for the data in hand these features had similar performance. In fact, our findings show that instead of using common lemmas or stems, which require external libraries, processing power and time, a simple list of common tokens was enough to describe our data. Moreover, we showed that it is possible to assess and describe comparable corpora through statistical methods. The number of entities shared by their documents and the average scores obtained with the SCC and χ² measures turned out to be an important surgical toolbox with which to dissect and microscopically analyse comparable corpora.

Furthermore, these DSMs can be seen as a suitable tool for ranking documents by their similarity – a handy feature for those who manually or semi-automatically compile corpora mined from the Internet and want to retrieve the most similar documents and filter out those with a low level of relatedness. Our findings show promising results when filtering out noisy documents. Indeed, two of the measures got very high precision results, even when dealing with 20% of noise.

In the future, we intend not only to perform more experiments with these DSMs on other corpora and languages, but also to test other DSMs, such as Jaccard or Cosine, and compare their performance.

Acknowledgements

Hernani Costa is supported by the People Programme (Marie Curie Actions) of the European Union’s Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471. The research reported in this work has also been partially carried out in the framework of the Educational Innovation Project TRADICOR (PIE 13-054, 2014-2015); the R&D project INTELITERM (ref. no. FFI2012-38881, 2012-2015); the R&D Project for Excellence TERMITUR (ref. no. HUM2754, 2014-2017); and the LATEST project (ref. 327197-FP7-PEOPLE-2012-IEF).

References

Laurence Anthony. 2014. AntConc (Version 3.4.3) Macintosh OS X. Waseda University, Tokyo, Japan. Available from http://www.laurenceanthony.net.

Douglas Biber. 1988. Variation across Speech and Writing. Cambridge University Press, Cambridge, UK.

Gloria Corpas Pastor and Míriam Seghiri. 2009. Virtual Corpora as Documentation Resources: Translating Travel Insurance Documents (English-Spanish). In A. Beeby, P.R. Inés, and P. Sánchez-Gijón, editors, Corpus Use and Translating: Corpus Use for Learning to Translate and Learning Corpus Use to Translate, Benjamins Translation Library, chapter 5, pages 75–107. John Benjamins Publishing Company.

Gloria Corpas Pastor. 2001. Compilación de un corpus ad hoc para la enseñanza de la traducción inversa especializada. TRANS, Revista de Traductología, 5(1):155–184.

Hernani Costa, Hugo Gonçalo Oliveira, and Paulo Gomes. 2010. The Impact of Distributional Metrics in the Quality of Relational Triples. In 19th European Conf. on Artificial Intelligence, Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, ECAI’10, pages 23–29, Lisbon, Portugal, August.

Hernani Costa, Hugo Gonçalo Oliveira, and Paulo Gomes. 2011. Using the Web to Validate Lexico-Semantic Relations. In 15th Portuguese Conf. on Artificial Intelligence, volume 7026 of EPIA’11, pages 597–609, Lisbon, Portugal, October. Springer.

Hernani Costa, Hanna Béchara, Shiva Taslimipoor, Rohit Gupta, Constantin Orasan, Gloria Corpas Pastor, and Ruslan Mitkov. 2015. MiniExperts: An SVM Approach for Measuring Semantic Textual Similarity. In 9th Int. Workshop on Semantic Evaluation, SemEval’15, pages 96–101, Denver, Colorado, June. ACL.

Hernani Costa. 2010. Automatic Extraction and Validation of Lexical Ontologies from Text. Master’s thesis, University of Coimbra, Faculty of Sciences and Technology, Department of Informatics Engineering, Coimbra, Portugal, September.

Hernani Costa. 2015. Assessing Comparable Corpora through Distributional Similarity Measures. In EXPERT Scientific and Technological Workshop, pages 23–32, Malaga, Spain, June.

EAGLES. 1996. Preliminary Recommendations on Corpus Typology. Technical report, EAGLES Document EAG-TCWG-CTYP/P, May. http://www.ilc.cnr.it/EAGLES96/corpustyp/corpustyp.html.

Zellig Harris. 1970. Distributional Structure. In Papers in Structural and Transformational Linguistics, pages 775–794. D. Reidel Publishing Company, Dordrecht, Holland.

Oktay Ibrahimov, Ishwar Sethi, and Nevenka Dimitrova. 2002. The Performance Analysis of a Chi-square Similarity Measure for Topic Related Clustering of Noisy Transcripts. In 16th Int. Conf. on Pattern Recognition, volume 4, pages 285–288. IEEE Computer Society.

Adam Kilgarriff. 2001. Comparing Corpora. Int. Journal of Corpus Linguistics, 6(1):97–133.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In MT Summit.

Paul Rayson, Geoffrey Leech, and Mary Hodges. 1997. Social Differentiation in the Use of English Vocabulary: Some Analyses of the Conversational Component of the British National Corpus. Int. Journal of Corpus Linguistics, 2(1):133–152.

Gerard Salton and Christopher Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5):513–523.

Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In ACL SIGDAT-Workshop, pages 47–50, Dublin, Ireland.

Amit Singhal. 2001. Modern Information Retrieval: A Brief Overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35–42.