<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring the Relatedness between Documents in Comparable Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hernani Costa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gloria Corpas Pastor</string-name>
          <email>gcorpas@uma.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruslan Mitkov</string-name>
          <email>r.mitkov@wlv.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LEXYTRAD, University of Malaga</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RIILP, University of Wolverhampton</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>38</lpage>
      <abstract>
        <p>This paper investigates the use of textual distributional similarity measures in the context of comparable corpora. We address the issue of measuring the relatedness between documents by extracting, measuring and ranking their common content. For this purpose, we designed and applied a methodology that combines available natural language processing technology with statistical methods. Our findings show that a list of common entities and a simple, yet robust, set of distributional similarity measures is enough to describe and assess the degree of relatedness between the documents. Moreover, our method demonstrated high performance in the task of filtering out documents with a low level of relatedness. By way of example, one of the measures achieved 100%, 100%, 95% and 90% precision when 5%, 10%, 15% and 20% of noise was injected, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Comparable corpora<fn><label>1</label><p>I.e. corpora that include similar types of original texts in one or more languages using the same design criteria (cf.
        <xref ref-type="bibr" rid="ref10 ref4">EAGLES, 1996; Corpas Pastor, 2001</xref>
        ).</p></fn> can be considered an
important resource for several research areas
such as Natural Language Processing (NLP),
terminology, language teaching, and automatic
and assisted translation, amongst other related
areas. Nevertheless, an inherent problem for those
who deal with comparable corpora on a daily
basis is the uncertainty about the data they are
dealing with. Indeed, little work has been done
on semi-automatically or automatically characterising such
linguistic resources, and attempting a meaningful
description of their content is often a perilous
task
        <xref ref-type="bibr" rid="ref3">(Corpas Pastor and Seghiri, 2009)</xref>
        . Usually,
a corpus is given a short description such as
“casual speech transcripts” or “tourism specialised
comparable corpus”. Yet, such tags are of
little use to users seeking representative
and/or high-quality domain-specific corpora.
Apart from the usual description that comes
with the corpus, such as number of documents,
tokens, types, source(s), creation date, policies
of usage, etc., nothing is said about how similar
the documents are or how to retrieve the most
related ones. As a result, most of the resources
at our disposal are built and shared without deep
analysis of their content, and those who use them
blindly trust the name of the person or research group
behind their compilation, without
knowing anything about the relatedness
of the documents. Although some tasks require
documents with a high degree of relatedness
to each other, the literature is scarce on this
matter.
      </p>
      <p>Accordingly, this work explores this niche by
taking advantage of several textual Distributional
Similarity Measures (DSMs) presented in the
literature. Firstly, we selected a specialised
corpus about the tourism and beauty domain that was
manually compiled by researchers in the area of
translation and interpreting studies. Then, we
designed and applied a methodology that combines
available NLP technology with statistical methods
to assess how the documents correlate with each
other in the corpus. Our assumption is that the
amount of information contained in a document
can be evaluated by summing the amount of
information contained in its member words. For
this purpose, a list of common entities was used
as a unit of measurement capable of identifying
the amount of information shared between the
documents. Our hypothesis is that this approach
will allow us to: compute the relatedness between
documents; describe and characterise the corpus
itself; and rank the documents by their degree
of relatedness. In order to evaluate how the DSMs
perform the task of ranking documents based on
their similarity and filtering out the unrelated ones,
we introduced noisy documents, i.e.
out-of-domain documents, into the corpus in hand.</p>
      <p>The remainder of the paper is structured as
follows. Section 2 introduces some fundamental
concepts related to DSMs, i.e. it explains the
theoretical foundations, related work and the
DSMs exploited in this experiment. Then, Section
3 presents the corpora used in this work. After
applying the methodology described in Section
4, Section 5 presents and discusses the obtained
results in detail. Finally, Section 6 presents the
final remarks and highlights our future work.</p>
      <p><bold>Distributional Similarity Measures</bold></p>
      <p>
        Information Retrieval (IR)
        <xref ref-type="bibr" rid="ref18">(Singhal, 2001)</xref>
        is the
task of locating specific information within a
collection of documents or other natural language
resources according to some request. This field
is rich in statistical methods that use words
and their (co-)occurrence to retrieve documents
or sentences from large data sets. In simple
terms, these IR methods aim to find the most
frequently used words and treat the rate of usage
of each word in a given text as a quantitative
attribute. These words then serve as features
for a given statistical method. Following Harris’
distributional hypothesis
        <xref ref-type="bibr" rid="ref11">(Harris, 1970)</xref>
        , which
assumes that similar words tend to occur in similar
contexts, these statistical methods are suitable,
for instance, for finding similar sentences based on
the words they contain
        <xref ref-type="bibr" rid="ref7 ref9">(Costa et al., 2015)</xref>
        and
for automatically extracting or validating semantic entities
from corpora
        <xref ref-type="bibr" rid="ref5 ref5 ref6 ref8 ref8">(Costa et al., 2010; Costa, 2010;
Costa et al., 2011)</xref>
        . To this end, it is assumed
that the amount of information contained in a
document can be evaluated by summing the
amount of information contained in the document’s
words, and that the amount of information conveyed
by a word can be represented by means of the
weight assigned to it
        <xref ref-type="bibr" rid="ref16 ref2">(Salton and Buckley, 1988)</xref>
        .
With this in mind, we took advantage of two
IR measures commonly used in the literature,
Spearman’s Rank Correlation Coefficient (SCC)
and the Chi-Square (χ²), to compute the similarity
between documents written in the same language
(see sections 2.1 and 2.2). Both measures are
particularly useful for this task because they are
independent of text size (mostly because both
use a list of the common entities) and they are
language-independent.
      </p>
      <p>
        The SCC distributional measure has been
shown to be effective in determining similarity
between sentences, documents and even
corpora of varying sizes
        <xref ref-type="bibr" rid="ref13 ref7 ref7 ref9 ref9">(Kilgarriff, 2001; Costa
et al., 2015; Costa, 2015)</xref>
        . It is particularly useful,
for instance, for measuring the textual similarity
between documents because it is easy to compute
and is independent of text size, as it can directly
compare ranked lists for large and small texts.
      </p>
      <p>
        The χ² similarity measure has also shown
its robustness and high performance. By way
of example, χ² has been used to analyse the
conversation component of the British National
Corpus
        <xref ref-type="bibr" rid="ref15">(Rayson et al., 1997)</xref>
        , to compare both
documents and corpora
        <xref ref-type="bibr" rid="ref13 ref7 ref9">(Kilgarriff, 2001; Costa,
2015)</xref>
        , and to identify topic-related clusters in
imperfectly transcribed documents
        <xref ref-type="bibr" rid="ref12">(Ibrahimov et
al., 2002)</xref>
        . It is a simple statistical measure for
assessing whether a relationship between two
variables in a sample is due to chance or is
systematic.
      </p>
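      <p>
        Bearing this in mind, distributional similarity
measures in general, and SCC and χ² in particular,
have a wide range of applications
        <xref ref-type="bibr" rid="ref13 ref7 ref7 ref9 ref9">(Kilgarriff,
2001; Costa et al., 2015; Costa, 2015)</xref>
        . Indeed,
this work aims to show that these simple, yet
robust and high-performance measures make it possible to
describe the relatedness between documents in
specialised corpora and to rank them according to
their similarity.
      </p>
      <p><bold>Spearman’s Rank Correlation Coefficient (SCC)</bold></p>
      <p>In this work, the SCC is adopted and calculated as
in Kilgarriff (2001). Firstly, a list of the common
entities<fn><label>2</label><p>In this work, the term ‘entity’ refers to “single words”, which can be a token, a lemma or a stem.</p></fn> <italic>L</italic> between two documents
<inline-formula><tex-math>d_l</tex-math></inline-formula> and <inline-formula><tex-math>d_m</tex-math></inline-formula> is
compiled, where <inline-formula><tex-math>L_{d_l,d_m} \subseteq (d_l \cap d_m)</tex-math></inline-formula>. It is possible
to use the top <italic>n</italic> most common entities or all
common entities between two documents, where
<italic>n</italic> corresponds to the total number of common
entities considered, i.e. <inline-formula><tex-math>\{n \mid n \in \mathbb{N}_0,\ n \le |L|\}</tex-math></inline-formula>.
In this work we use all the common entities for
each document pair, i.e. <inline-formula><tex-math>n = |L|</tex-math></inline-formula>. Then, for each
document the list of common entities (e.g. <inline-formula><tex-math>L_{d_l}</tex-math></inline-formula> and
<inline-formula><tex-math>L_{d_m}</tex-math></inline-formula>) is ranked by frequency in ascending order
(<inline-formula><tex-math>RL_{d_l}</tex-math></inline-formula> and <inline-formula><tex-math>RL_{d_m}</tex-math></inline-formula>), where the entity with the lowest
frequency receives ranking position
1 and the entity with the highest frequency receives
ranking position <italic>n</italic>. Finally, for each
common entity <inline-formula><tex-math>\{e_1, \ldots, e_n\} \in L</tex-math></inline-formula>, the difference in
the rank orders for the entity in each document is
computed, and then normalised as a sum of the
squares of these differences, <inline-formula><tex-math>\sum_{i=1}^{n} s_i^2</tex-math></inline-formula>. The final
SCC equation is presented in expression 1, where
<inline-formula><tex-math>\{SCC \mid SCC \in \mathbb{R},\ -1 \le SCC \le 1\}</tex-math></inline-formula>.
        <disp-formula>
          <label>(1)</label>
          <tex-math>SCC(d_l, d_m) = 1 - \frac{6 \ast \sum_{i=1}^{n} s_i^2}{n^3 - n}</tex-math>
        </disp-formula>
      </p>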
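      <p>As an illustration only, the SCC computation described above can be sketched in Python as follows. The function names <monospace>rank_ascending</monospace> and <monospace>scc</monospace>, the representation of a document as a list of entities, and the alphabetical tie-breaking rule are assumptions of this sketch, not part of the original method description.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def rank_ascending(freqs):
    """Map each entity to its rank: 1 for the lowest frequency, n for the highest.

    Ties are broken alphabetically (an assumption of this sketch; the text
    does not say how ties are handled).
    """
    ordered = sorted(freqs, key=lambda e: (freqs[e], e))
    return {entity: pos for pos, entity in enumerate(ordered, start=1)}


def scc(doc_l, doc_m):
    """Spearman's rank correlation over the entities common to both documents.

    doc_l and doc_m are lists of entities (tokens, lemmas or stems).
    """
    freq_l, freq_m = Counter(doc_l), Counter(doc_m)
    common = set(freq_l) & set(freq_m)  # the list L of common entities
    n = len(common)
    if n < 2:
        raise ValueError("at least two common entities are needed")
    rank_l = rank_ascending({e: freq_l[e] for e in common})
    rank_m = rank_ascending({e: freq_m[e] for e in common})
    # Sum of the squared rank differences s_i^2 over all common entities.
    sum_sq = sum((rank_l[e] - rank_m[e]) ** 2 for e in common)
    return 1 - (6 * sum_sq) / (n ** 3 - n)
```
]]></preformat>
      <p>In this sketch, two documents whose common entities are ranked identically obtain a score of 1, and two documents whose rankings are exactly reversed obtain a score of -1.</p>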
    </sec>
    <sec id="sec-2">
      <title>Chi-Square (χ²)</title>
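      <p>The Chi-Square (χ²) measure also uses a list of
common entities (<italic>L</italic>). As with the SCC, it is
possible to use the top <italic>n</italic> most common entities
or all common entities between two documents,
and again we use all the common entities for
each document pair, i.e. <inline-formula><tex-math>n = |L|</tex-math></inline-formula>. The number
of occurrences of a common entity in <italic>L</italic> that
would be expected in each document is calculated
from the frequency lists. If the sizes of the
documents <inline-formula><tex-math>d_l</tex-math></inline-formula> and <inline-formula><tex-math>d_m</tex-math></inline-formula> are
<inline-formula><tex-math>N_l</tex-math></inline-formula> and <inline-formula><tex-math>N_m</tex-math></inline-formula> and the
entity <inline-formula><tex-math>e_i</tex-math></inline-formula> has the observed frequencies
<inline-formula><tex-math>O(e_i, d_l)</tex-math></inline-formula> and <inline-formula><tex-math>O(e_i, d_m)</tex-math></inline-formula>, then the expected values
are
        <disp-formula>
          <tex-math>e_{i,d_l} = \frac{N_l \ast (O(e_i, d_l) + O(e_i, d_m))}{N_l + N_m} \quad \text{and} \quad e_{i,d_m} = \frac{N_m \ast (O(e_i, d_l) + O(e_i, d_m))}{N_l + N_m}.</tex-math>
        </disp-formula>
Equation 2 presents the χ² formula, where <italic>O</italic> is the observed frequency
and <italic>E</italic> the expected frequency.
        <disp-formula>
          <label>(2)</label>
          <tex-math>\chi^2(d_l, d_m) = \sum \frac{(O - E)^2}{E}</tex-math>
        </disp-formula>
The resulting χ²
score should be interpreted as the interdocument
distance between two documents. It is also
important to mention that <inline-formula><tex-math>\{\chi^2 \mid \chi^2 \in \mathbb{R},\ 0 \le \chi^2 &lt; \infty\}</tex-math></inline-formula>,
which means that the more unrelated the
common entities in <italic>L</italic> are, the lower the χ² score
will be.</p>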
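      <p>The χ² computation can be sketched along the same lines. This is a minimal illustration under stated assumptions: the function name <monospace>chi_square</monospace> is hypothetical, the document size is taken to be the total number of entities in the document, and the sum runs over both documents for every common entity, as in the corpus-comparison formulation of Kilgarriff (2001).</p>
      <preformat><![CDATA[
```python
from collections import Counter


def chi_square(doc_l, doc_m):
    """Chi-square score over the entities common to two documents.

    doc_l and doc_m are lists of entities. Assumptions of this sketch:
    N_l and N_m are the full document lengths, and each common entity
    contributes one (O - E)^2 / E term per document.
    """
    freq_l, freq_m = Counter(doc_l), Counter(doc_m)
    size_l, size_m = len(doc_l), len(doc_m)
    common = set(freq_l) & set(freq_m)  # the list L of common entities
    score = 0.0
    for entity in common:
        total = freq_l[entity] + freq_m[entity]
        for observed, size in ((freq_l[entity], size_l), (freq_m[entity], size_m)):
            # Expected frequency: N * (O_l + O_m) / (N_l + N_m)
            expected = size * total / (size_l + size_m)
            score += (observed - expected) ** 2 / expected
    return score
```
]]></preformat>
      <p>With this formulation, two documents in which every common entity has exactly its expected frequency yield a score of 0.</p>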
    </sec>
    <sec id="sec-3">
      <title>Corpora</title>
      <p>
        INTELITERM<fn><label>3</label><p>http://www.lexytrad.es/proyectos.html</p></fn> is a specialised comparable
corpus composed of documents collected from the
Internet. It was manually compiled by researchers
with the purpose of building a representative
corpus
        <xref ref-type="bibr" rid="ref2">(Biber, 1988, p.246)</xref>
        for the Tourism and
Beauty domain. It contains documents in four
different languages (English, Spanish, Italian and
German). Some of the texts are translations of
each other (parallel), yet the majority are
original texts. The corpus is composed of
several subcorpora, divided by language and,
within each language, into translated and
original texts. For the purpose of this work, only
original documents in English, Spanish and Italian
were used, which from now on will be referred to as
int_en, int_es and int_it, respectively.
      </p>
      <p>
        In order to analyse how the DSMs perform
the task of ranking documents based on their
similarity and filtering out the unrelated ones,
it is necessary to introduce noisy documents,
i.e. out-of-domain documents, into the various
subcorpora. To do that, we chose the
well-known Europarl<fn><label>4</label><p>http://www.statmt.org/europarl/</p></fn> corpus
        <xref ref-type="bibr" rid="ref14">(Koehn, 2005)</xref>
        , a parallel
corpus composed of proceedings of the European
Parliament. As detailed in section 5.2,
we added different amounts of noise to the various
subcorpora, more precisely 5%, 10%, 15% and
20%. These noisy documents were randomly
selected from the “one per day” Europarl v.7 for
the three working languages: English, Spanish
and Italian (eur_en, eur_es and eur_it, respectively).
      </p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Statistical information about the INTELITERM subcorpora and the sets of noisy documents (20%) per working language.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Subcorpus</th><th>nDocs</th><th>types</th><th>tokens</th><th>types/tokens</th></tr>
          </thead>
          <tbody>
            <tr><td>int_en</td><td>151</td><td>11.6k</td><td>496.2k</td><td>0.023</td></tr>
            <tr><td>eur_en</td><td>30</td><td>3.4k</td><td>29.8k</td><td>0.116</td></tr>
            <tr><td>int_es</td><td>224</td><td>13.2k</td><td>207.3k</td><td>0.063</td></tr>
            <tr><td>eur_es</td><td>44</td><td>5.6k</td><td>43.5k</td><td>0.129</td></tr>
            <tr><td>int_it</td><td>150</td><td>19.9k</td><td>386.2k</td><td>0.052</td></tr>
            <tr><td>eur_it</td><td>30</td><td>4.7k</td><td>29.6k</td><td>0.159</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        All the statistical information about both the
INTELITERM subcorpora and the set of 20%
of noisy documents, randomly selected for each
working language, is presented in Table 1. In
detail, this table shows: the number of documents
(nDocs); the number of types (types); the number
of tokens (tokens); and the ratio of types to
tokens (types/tokens) per subcorpus. These values
were obtained using the AntConc 3.4.3
        <xref ref-type="bibr" rid="ref1">(Anthony,
2014)</xref>
        software, a corpus analysis toolkit for
concordancing and text analysis.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>
        This section describes the methodology employed
to calculate and rank documents based on
their similarity using Distributional Similarity
Measures (DSMs). All the tools, libraries and
frameworks used for this purpose are also
pointed out.
      </p>
      <list list-type="order">
        <list-item>
          <label>1)</label>
          <p><bold>Data Preprocessing:</bold> firstly, all the
INTELITERM documents were processed
with the OpenNLP<fn><label>5</label><p>https://opennlp.apache.org</p></fn> Sentence Detector and
Tokeniser. Then, the annotation process
was done with the TT4J<fn><label>6</label><p>http://reckart.github.io/tt4j/</p></fn> library, which is a
Java wrapper around the popular TreeTagger
            <xref ref-type="bibr" rid="ref17">(Schmid, 1995)</xref>
            , a tool specifically designed
to annotate text with part-of-speech and lemma
information. For stemming, we
used the Porter stemmer algorithm provided
by the Snowball<fn><label>7</label><p>http://snowball.tartarus.org</p></fn> library. A method to remove
punctuation and special characters within the
words was also implemented. Finally, in order
to get rid of the noise, a stopword list<fn><label>8</label><p>Freely available to download through the following URL: https://github.com/hpcosta/stopwords.</p></fn> was
compiled to filter out the most frequent words
in the corpus. Once a document has been processed
and its sentences have been tokenised, lemmatised
and stemmed, our system creates a new output
file with all this new information, i.e. a
new document containing the original, the
tokenised, the lemmatised and the stemmed
text. Using the stopword list mentioned above,
a Boolean vector describing whether each entity is a
stopword is also added to the document.
This way, the system is able to use only
the tokens, lemmas and stems that are not
stopwords.</p>
        </list-item>
        <list-item>
          <label>2)</label>
          <p><bold>Identifying the list of common entities
between documents:</bold> in order to identify
a list of common entities (from now on
we will use the acronym NCE, Number of Common Entities), a
co-occurrence matrix was built for each pair
of documents. Only entities that have at
least one occurrence in both documents are
considered. As required by the DSMs (see
section 2), their frequency in both documents
is also stored within this matrix, i.e.
<inline-formula><tex-math>L_{d_l,d_m} = \{e_i, (f(e_i, d_l), f(e_i, d_m));\ e_j, (f(e_j, d_l), f(e_j, d_m));\ \ldots;\ e_n, (f(e_n, d_l), f(e_n, d_m))\}</tex-math></inline-formula>,
where <italic>f</italic> represents the frequency of an entity
in a document. With the purpose of analysing
and comparing the performance of different
DSMs, three different lists were created to be
used as input features: the first one using the
Number of Common Tokens (NCT), another
using the Number of Common Lemmas
(NCL) and the third one using the Number of
Common Stems (NCS).</p>
        </list-item>
        <list-item>
          <label>3)</label>
          <p><bold>Computing the similarity between
documents:</bold> the similarity between
documents was calculated by applying
three different DSMs
(<inline-formula><tex-math>DSMs = \{DSM_{NCE}, DSM_{SCC}, DSM_{\chi^2}\}</tex-math></inline-formula>, where
NCE, SCC and χ² refer to Number of Common
Entities, Spearman’s Rank Correlation
Coefficient and Chi-Square, respectively),
each one calculated using three different input
features (NCT, NCL and NCS).</p>
        </list-item>
        <list-item>
          <label>4)</label>
          <p><bold>Computing the document final score:</bold> the
document final score <inline-formula><tex-math>DSM(d_l)</tex-math></inline-formula> is the mean of
the similarity scores of the document with all
the other documents in the collection, i.e.
            <disp-formula>
              <tex-math>DSM(d_l) = \frac{\sum_{i=1}^{n-1} DSM_i(d_l, d_i)}{n - 1},</tex-math>
            </disp-formula>
where <italic>n</italic>
corresponds to the total number of documents
in the collection and <inline-formula><tex-math>DSM_i(d_l, d_i)</tex-math></inline-formula> is the resulting
similarity score between the document <inline-formula><tex-math>d_l</tex-math></inline-formula> and
each of the other documents in the collection.</p>
        </list-item>
        <list-item>
          <label>5)</label>
          <p><bold>Ranking documents:</bold> finally, the documents
were ranked in descending order according
to their DSM scores (i.e. NCE, SCC or χ²).</p>
        </list-item>
      </list>
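      <p>Steps 2, 4 and 5 above can be sketched in Python as follows, using the Number of Common Entities as the similarity measure. This is only an illustrative sketch: the function names, the dictionary-based representation of the collection, and the assumption that the measure is symmetric (so each pair is scored once) are not taken from the original implementation.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations


def common_entities(doc_l, doc_m):
    """Step 2: return {entity: (freq in doc_l, freq in doc_m)} for shared entities."""
    freq_l, freq_m = Counter(doc_l), Counter(doc_m)
    return {e: (freq_l[e], freq_m[e]) for e in freq_l if e in freq_m}


def nce(doc_l, doc_m):
    """Number of Common Entities between two documents."""
    return len(common_entities(doc_l, doc_m))


def final_scores(docs, measure):
    """Step 4: mean similarity of each document with the other n - 1 documents.

    docs maps a document name to its list of entities and must contain at
    least two documents. The measure is assumed to be symmetric.
    """
    totals = {name: 0.0 for name in docs}
    for name_l, name_m in combinations(docs, 2):
        score = measure(docs[name_l], docs[name_m])
        totals[name_l] += score
        totals[name_m] += score
    return {name: total / (len(docs) - 1) for name, total in totals.items()}


def rank_documents(scores):
    """Step 5: document names sorted by final score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)
```
]]></preformat>
      <p>Any pairwise measure with the same two-argument signature, such as an SCC or χ² function, could be passed to <monospace>final_scores</monospace> in place of <monospace>nce</monospace>.</p>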
    </sec>
    <sec id="sec-5">
      <title>Results and Analysis</title>
      <p>This experiment is divided into two parts. In the
first part (section 5.1), we describe the corpus
in hand by applying three different Distributional
Similarity Measures (DSMs): the Number of
Common Entities (NCE), Spearman’s Rank
Correlation Coefficient (SCC) and the Chi-Square
(χ²). As input features for the DSMs, three
different lists of entities were used, i.e. the
Number of Common Tokens (NCT), the Number
of Common Lemmas (NCL) and the Number of
Common Stems (NCS). By way of example,
Table 2 shows the NCT between documents, the
SCC and the χ² scores and averages (av) along
with the associated standard deviations (σ) per
measure and subcorpus. Figure 1 presents the
resulting average scores per document in box plot
format for all the combinations of DSM and feature.
Each box plot displays the full range of variation
(from min to max), the likely range of variation
(the interquartile range or IQR), the median, and
the high maximums and low minimums (also
known as outliers). It is important to mention
that for the first part of this experiment (section
5.1) we did not use a sample, but instead the
entire INTELITERM subcorpora in their original
size and form, which means that all the results
obtained and observations made come from the
entire population, in this case the English (int_en),
Spanish (int_es) and Italian (int_it) subcorpora
(for more details about the subcorpora see section
3). In the second part of this experiment,
we used the same subcorpora, but an additional
percentage of documents was added to them in
order to test how the DSMs perform the task of
filtering out these noisy documents, i.e.
out-of-domain documents (see section 5.2). In detail, Figure
2 shows how the average scores decrease when
noisy documents are injected and Table 3 presents
how the DSMs performed when that noise was
injected.</p>
      <sec id="sec-5-1">
        <title>Describing the Corpus</title>
        <p>The first observation we can make from Figure
1 is that the distributions across the features
are quite similar (see for instance Figures 1a,
1d and 1g). This means that it is possible to
achieve acceptable results using only raw words
(i.e. tokens). Stems and lemmas require more
processing power and time to be used as features,
especially lemmas, due to their dependency on the
part-of-speech tagger and the time-consuming process
this implies. In general, we can say that the scores for
each subcorpus are symmetric (roughly the same
on each side when cut down the middle), which
suggests that the data is approximately normally distributed. There
are some exceptions that we will discuss in this
section. Another interesting observation is
the high Number of Common Tokens (NCT)
in English (int_en) when compared with Italian
and Spanish (int_it and int_es, respectively); see
Table 2 and Figure 1a. Later in this section, we
will try to explain this phenomenon.</p>
        <p>Although the NCT per document is on average
higher for the int_en subcorpus, its interquartile
range (IQR) is larger than for the other subcorpora
(see Table 2 and Figure 1a), which means that the
middle 50% of the data is more spread out and
thus the average NCT per document is more
variable. Moreover, the longer whiskers (the lines
extending vertically from the box) in Figure 1a
also indicate variability outside the upper and
lower quartiles. Therefore, we can say that int_en
contains a wide variety of documents and consequently
some of them are only roughly correlated with the
rest of the subcorpus. Nevertheless, the data is
skewed left and the longest whisker outside the
upper quartile indicates that the majority of the
data is strongly similar, i.e. the documents have
a high degree of relatedness to each other.
This idea is supported not only by the positive
average SCC scores, but also by the set of outliers
above the upper whisker in Figure 1b. The average
SCC score of 0.42 with σ=0.05 also implies a
strong correlation between the documents in the
int_en subcorpus (Table 2). Likewise, the longest
whisker and the set of outliers outside the upper
quartile in the χ² scores also indicate a high
relatedness between the documents.</p>
        <p>Regarding the int_it subcorpus, the SCC and the
χ² scores (Figures 1b and 1c) and the average
of 101.08 common tokens per document with
σ=55.71 (Figure 1a and Table 2) suggest that the
data is normally distributed (Figure 1b) and highly
correlated. Although this subcorpus got lower
average scores for all the DSMs when compared
to the English subcorpus, Table 2 and Figures 1a,
1b and 1c show that the average scores and the
range of variation are quite similar to those of the English
subcorpus. Therefore, we can conclude that the
documents inside the Italian subcorpus are highly
related to each other.</p>
        <fig>
          <label>Figure 1</label>
          <caption>
            <p>Box plots of the average score per document (y-axis) for the subcorpora int_en, int_es and int_it (x-axis). Recoverable panel titles: (a) Common Tokens; (b) Spearman’s rank correlation coefficient (tokens); (c) Chi-Square scores (tokens); (e) Spearman’s rank correlation coefficient (lemmas); (f) Chi-Square scores (lemmas); (h) Spearman’s rank correlation coefficient (stems); (i) Chi-Square scores (stems).</p>
          </caption>
        </fig>
        <p>Of the three subcorpora, the int_es
subcorpus is the biggest one, with 224 documents
(Table 1). Nevertheless, its average scores per
document are slightly different from those in the other
box plots (see Figures 1a, 1b and 1c). The χ²
standard deviation, which is practically equal to its average
(38.21 and 40.92, respectively), and the SCC
variability inside and outside the IQR indicate
some inconsistency in the data. Moreover, Table 2
and Figure 1a reveal a lower NCT compared with
the int_en and int_it subcorpora.</p>
        <p>The subcorpus int_en has 163 common tokens
per document on average with σ=83, and the
subcorpora int_it and int_es only have 101 and
31 common tokens per document on average with
σ=55 and σ=23, respectively (Table 2, NCT
column). This means that the int_it and int_es
subcorpora are composed of documents with a
lower level of relatedness when compared with
the English one. This could be because
Italian and Spanish have a richer morphology
than English. Due to the larger
number of inflected forms per lemma, there
is a larger number of distinct tokens and consequently
fewer common tokens per document in Spanish.
Another explanation could be
that tourism and beauty services are more
developed in Italy and Spain than in the UK and
therefore there is more variety in the vocabulary
used as well as in the services offered. Indeed,
Table 1 offers some evidence about the vocabulary
employed. The English subcorpus has a lower
number of types and a higher number of tokens
(11.6k and 496.2k, respectively) when compared
with the Italian (19.9k types and 386.2k tokens)
and Spanish subcorpora (13.2k types and 207.3k
tokens). The large difference in the average number of
common tokens per document between Spanish
and the other two languages may also be related
to the marketing strategies used to advertise
tourism and beauty services, which is
hard to confirm. Although our method is able
to capture the lexical level of similarity between
the documents, the semantic level is not taken
into account, i.e. it does not consider synonyms
as similar words, for example, and doing so
would result in slightly different similarity scores
(again, an explanation that is difficult to confirm).</p>
        <p>To conclude, we can state from the statistical
and theoretical evidence that the int_en and the
int_it subcorpora appear to assemble highly
correlated documents. We cannot say the same
for the int_es subcorpus. Due to the scarcity
of evidence, we can only refrain from rejecting the idea that
this subcorpus is composed of similar documents.
Nevertheless, as we will see in the next section,
the fact that int_es is composed of weakly related
documents (according to our findings) will affect
the ranking task.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Measuring DSMs Performance</title>
        <p>The second part of this experiment aims at
assessing how the DSMs perform the task of
filtering out documents with a low level of
relatedness. To do that, we injected different
sets of out-of-domain documents, randomly
selected from the Europarl corpus, into the original
INTELITERM subcorpora. More precisely, we
injected 5%, 10%, 15% and 20%<fn><label>9</label><p>The number of documents that correspond to these percentages can be inferred from Table 1.</p></fn> of noise into the various
subcorpora. As we can see in Figure 2, the more
noisy documents are injected, the lower the
NCT. Then, the methodology described in Section
4 was applied to these twelve new subcorpora
(int_en05, int_en10, ..., int_it15 and int_it20).</p>
        <fig>
          <label>Figure 2</label>
          <caption>
            <p>Average number of common tokens per document (y-axis) as noisy documents are injected.</p>
          </caption>
        </fig>
        <p>In order to evaluate the DSMs’ precision, we
analysed the first <italic>n</italic> positions in the ranking lists
produced by each of the three DSMs individually,
where <italic>n</italic> is the number of original documents
in a given INTELITERM subcorpus. Table 3
presents the precision values obtained by the
DSMs when different amounts of noise were injected
into the various original subcorpora.</p>
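        <p>The precision computation described above can be sketched as follows. This is a minimal illustration; the function name <monospace>precision_at_n</monospace> and the use of document identifiers are assumptions of this sketch, and <italic>n</italic> is taken, as stated in the text, to be the number of original documents in the subcorpus.</p>
        <preformat><![CDATA[
```python
def precision_at_n(ranking, original_docs):
    """Share of the first n ranked documents that are original documents.

    ranking is a list of document identifiers ordered from most to least
    related; original_docs holds the identifiers of the original
    (non-noisy) documents, and n = len(original_docs).
    """
    originals = set(original_docs)
    n = len(originals)
    if n == 0:
        raise ValueError("at least one original document is needed")
    top_n = ranking[:n]
    return sum(1 for doc in top_n if doc in originals) / n
```
]]></preformat>
        <p>Under this definition, a measure obtains a precision of 1 only when every noisy document is ranked below all the original documents.</p>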
        <p>As expected, none of the DSMs got acceptable
results for Spanish, being incapable of correctly
identifying noisy documents. However, we need to
be aware that this happened due to the pre-existing
low level of relatedness between the original
documents in the int_es subcorpus (see Section
5.1 for more details). On the other hand, the DSMs
show promising results for English and Italian. By
way of example, the χ² reached
100% precision when 5% and 10% of noise was injected into the
int_en subcorpus, and still 90% when 20% was
injected. Although the NCT in general got lower precision
than the χ², it still
reached 80% and 73% when 20% of
noise was injected into the English and the Italian subcorpora,
respectively. From the evidence shown in Table
3, we can say that the NCT and the χ² are suitable
for the task of filtering out weakly related documents
with a high degree of precision. The same cannot be
said of the SCC measure, at least for this specific
task.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>In this paper we presented a simple methodology
and studied various Distributional Similarity
Measures (DSMs) for the purpose of measuring
the relatedness between documents in specialised
comparable corpora. As input for these DSMs,
we used three different features (lists of
common tokens, lemmas and stems). In the
end, we conclude that for the data in hand these
features had similar performance. In fact, our
findings show that instead of using common
lemmas or stems, which require external libraries,
processing power and time, a simple list of
common tokens was enough to describe our
data. Moreover, we showed that it is possible to
assess and describe comparable corpora through
statistical methods. The number of entities shared
by their documents and the average scores obtained
with the SCC and the χ² measures proved to
be an important surgical toolbox to dissect and
microscopically analyse comparable corpora.</p>
      <p>Furthermore, these DSMs can be seen as
a suitable tool to rank documents by their
similarity. This is a handy feature for those who
manually or semi-automatically compile corpora
mined from the Internet and want to retrieve
the most similar documents and filter out those
with a low level of relatedness. Our findings
show promising results when filtering out noisy
documents. Indeed, two of the measures got very
high precision results, even when dealing with
20% of noise.</p>
      <p>In the future, we intend not only to perform
more experiments with these DSMs on other
corpora and languages, but also to test other
DSMs, such as Jaccard or Cosine, and compare their
performance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>Hernani Costa is supported by the People
Programme (Marie Curie Actions) of the
European Union’s Framework Programme
(FP7/2007-2013) under REA grant agreement
no. 317471. The research reported in this
work has also been partially carried out in the
framework of the Educational Innovation Project
TRADICOR (PIE 13-054, 2014-2015); the R&amp;D
project INTELITERM (ref. no. FFI2012-38881,
2012-2015); the R&amp;D Project for Excellence
TERMITUR (ref. no. HUM2754, 2014-2017); and
the LATEST project (ref.
327197-FP7-PEOPLE-2012-IEF).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Laurence</given-names>
            <surname>Anthony</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <source>AntConc (Version 3.4.3) Macintosh OS X</source>
          . Waseda University. Tokyo, Japan. Available from http://www.laurenceanthony.net.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Biber</surname>
          </string-name>
          .
          <year>1988</year>
          .
          <article-title>Variation across speech and writing</article-title>
          . Cambridge University Press, Cambridge, UK.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Gloria</given-names>
            <surname>Corpas Pastor</surname>
          </string-name>
          and
          <string-name>
            <given-names>Míriam</given-names>
            <surname>Seghiri</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Virtual Corpora as Documentation Resources: Translating Travel Insurance Documents (English-Spanish)</article-title>
          . In A. Beeby, P.R. Inés, and P. Sánchez-Gijón, editors,
          <source>Corpus Use and Translating: Corpus Use for Learning to Translate and Learning Corpus Use to Translate</source>
          , Benjamins Translation Library
          , chapter
          <volume>5</volume>
          , pages
          <fpage>75</fpage>
          -
          <lpage>107</lpage>
          . John Benjamins Publishing Company.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Gloria</given-names>
            <surname>Corpas Pastor</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Compilación de un corpus ad hoc para la enseñanza de la traducción inversa especializada</article-title>
          .
          <source>TRANS, Revista de Traductología</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>155</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Hernani</given-names>
            <surname>Costa</surname>
          </string-name>
          , Hugo Gonçalo Oliveira, and
          <string-name>
            <given-names>Paulo</given-names>
            <surname>Gomes</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The Impact of Distributional Metrics in the Quality of Relational Triples</article-title>
          .
          <source>In 19th European Conf. on Artificial Intelligence, Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, ECAI'10</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>29</lpage>
          , Lisbon, Portugal,
          August
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Hernani</given-names>
            <surname>Costa</surname>
          </string-name>
          , Hugo Gonçalo Oliveira, and
          <string-name>
            <given-names>Paulo</given-names>
            <surname>Gomes</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Using the Web to Validate Lexico-Semantic Relations</article-title>
          .
          <source>In 15th Portuguese Conf. on Artificial Intelligence</source>
          , volume
          <volume>7026</volume>
          of EPIA'11
          , pages
          <fpage>597</fpage>
          -
          <lpage>609</lpage>
          , Lisbon, Portugal, October. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Hernani</given-names>
            <surname>Costa</surname>
          </string-name>
          , Hanna Béchara, Shiva Taslimipoor, Rohit Gupta, Constantin Orasan, Gloria Corpas Pastor, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Mitkov</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>MiniExperts: An SVM approach for Measuring Semantic Textual Similarity</article-title>
          .
          <source>In 9th Int. Workshop on Semantic Evaluation, SemEval'15</source>
          , pages
          <fpage>96</fpage>
          -
          <lpage>101</lpage>
          , Denver, Colorado, June. ACL.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Hernani</given-names>
            <surname>Costa</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Automatic Extraction and Validation of Lexical Ontologies from text</article-title>
          .
          <source>Master's thesis</source>
          , University of Coimbra,
          Faculty of Sciences and Technology
          , Department of Informatics Engineering, Coimbra, Portugal, September.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Hernani</given-names>
            <surname>Costa</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Assessing Comparable Corpora through Distributional Similarity Measures</article-title>
          .
          <source>In EXPERT Scientific and Technological Workshop</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>32</lpage>
          , Malaga, Spain, June.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <collab>EAGLES</collab>
          .
          <year>1996</year>
          .
          <article-title>Preliminary Recommendations on Corpus Typology</article-title>
          .
          <source>Technical report</source>
          , EAGLES Document EAG-TCWG-CTYP/P., May. http://www.ilc.cnr.it/EAGLES96/corpustyp/corpustyp.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Zelig</given-names>
            <surname>Harris</surname>
          </string-name>
          .
          <year>1970</year>
          .
          <article-title>Distributional Structure</article-title>
          .
          <source>In Papers in Structural and Transformational Linguistics</source>
          , pages
          <fpage>775</fpage>
          -
          <lpage>794</lpage>
          . D. Reidel Publishing Company, Dordrecht, Holland.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Oktay</given-names>
            <surname>Ibrahimov</surname>
          </string-name>
          , Ishwar Sethi, and
          <string-name>
            <given-names>Nevenka</given-names>
            <surname>Dimitrova</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>The Performance Analysis of a Chi-square Similarity Measure for Topic Related Clustering of Noisy Transcripts</article-title>
          .
          <source>In 16th Int. Conf. on Pattern Recognition</source>
          , volume
          <volume>4</volume>
          , pages
          <fpage>285</fpage>
          -
          <lpage>288</lpage>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Adam</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Comparing Corpora</article-title>
          .
          <source>Int. Journal of Corpus Linguistics</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>97</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Koehn</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Europarl: A Parallel Corpus for Statistical Machine Translation</article-title>
          .
          <source>In MT Summit.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Paul</given-names>
            <surname>Rayson</surname>
          </string-name>
          , Geoffrey Leech, and
          <string-name>
            <given-names>Mary</given-names>
            <surname>Hodges</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Social Differentiation in the Use of English Vocabulary: Some Analyses of the Conversational Component of the British National Corpus</article-title>
          .
          <source>Int. Journal of Corpus Linguistics</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>133</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Gerard</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Buckley</surname>
          </string-name>
          .
          <year>1988</year>
          .
          <article-title>Term-Weighting Approaches in Automatic Text Retrieval</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>24</volume>
          (
          <issue>5</issue>
          ):
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Improvements In Part-of-Speech Tagging With an Application To German</article-title>
          .
          <source>In ACL SIGDAT-Workshop</source>
          , pages
          <fpage>47</fpage>
          -
          <lpage>50</lpage>
          , Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Amit</given-names>
            <surname>Singhal</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Modern Information Retrieval: A Brief Overview</article-title>
          .
          <source>Bulletin of the IEEE Computer Society Technical Committee on Data Engineering</source>
          ,
          <volume>24</volume>
          (
          <issue>4</issue>
          ):
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>