Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)

Measuring the Relatedness between Documents in Comparable Corpora

Hernani Costa(a), Gloria Corpas Pastor(a) and Ruslan Mitkov(b)
(a) LEXYTRAD, University of Malaga, Spain
(b) RIILP, University of Wolverhampton, UK
{hercos,gcorpas}@uma.es, r.mitkov@wlv.ac.uk

Abstract

This paper investigates the use of textual distributional similarity measures in the context of comparable corpora. We address the issue of measuring the relatedness between documents by extracting, measuring and ranking their common content. For this purpose, we designed and applied a methodology that combines available natural language processing technology with statistical methods. Our findings show that a list of common entities together with a simple, yet robust set of distributional similarity measures is enough to describe and assess the degree of relatedness between documents. Moreover, our method demonstrated high performance in the task of filtering out documents with a low level of relatedness: by way of example, one of the measures achieved 100%, 100%, 95% and 90% precision when 5%, 10%, 15% and 20% of noise was injected, respectively.

1 Introduction

Comparable corpora[1] can be considered an important resource for several research areas such as Natural Language Processing (NLP), terminology, language teaching, and automatic and assisted translation, amongst other related areas. Nevertheless, an inherent problem for those who deal with comparable corpora on a daily basis is the uncertainty about the data they are dealing with. Indeed, little work has been done on semi- or fully automatically characterising such linguistic resources, and attempting a meaningful description of their content is often a perilous task (Corpas Pastor and Seghiri, 2009). Usually, a corpus is given a short description such as "casual speech transcripts" or "tourism specialised comparable corpus". Yet, such tags are of little use to users seeking a representative and/or high-quality domain-specific corpus. Apart from the usual description that comes along with a corpus – number of documents, tokens, types, source(s), creation date, policies of usage, etc. – nothing is said about how similar the documents are or how to retrieve the most related ones. As a result, most of the resources at our disposal are built and shared without a deep analysis of their content, and those who use them blindly trust the name of the person or research group behind their compilation, without knowing anything about the relatedness of the documents. Although some tasks require documents with a high degree of relatedness to each other, the literature is scarce on this matter.

Accordingly, this work explores this niche by taking advantage of several textual Distributional Similarity Measures (DSMs) presented in the literature. Firstly, we selected a specialised corpus for the tourism and beauty domain that was manually compiled by researchers in the area of translation and interpreting studies. Then, we designed and applied a methodology that combines available NLP technology with statistical methods to assess how the documents in the corpus correlate with each other. Our assumption is that the amount of information contained in a document can be evaluated by summing the amount of information contained in its member words.

[1] I.e. corpora that include similar types of original texts in one or more languages, built using the same design criteria (cf. EAGLES, 1996; Corpas Pastor, 2001).
For this purpose, a list of common entities is used as a unit of measurement capable of identifying the amount of information shared between documents. Our hypothesis is that this approach will allow us to compute the relatedness between documents; to describe and characterise the corpus itself; and to rank the documents by their degree of relatedness. In order to evaluate how the DSMs perform the task of ranking documents based on their similarity and filtering out the unrelated ones, we introduced noisy documents, i.e. out-of-domain documents, into the corpus at hand.

The remainder of the paper is structured as follows. Section 2 introduces some fundamental concepts related to DSMs, i.e. it explains the theoretical foundations, related work and the DSMs exploited in this experiment. Then, Section 3 presents the corpora used in this work. After describing the methodology in Section 4, Section 5 presents and discusses the obtained results in detail. Finally, Section 6 presents the final remarks and highlights our future work.

2 Distributional Similarity Measures

Information Retrieval (IR) (Singhal, 2001) is the task of locating specific information within a collection of documents or other natural language resources according to some request. This field is rich in statistical methods that use words and their (co-)occurrence to retrieve documents or sentences from large data sets. In simple words, these IR methods aim to find the most frequently used words and treat the rate of usage of each word in a given text as a quantitative attribute. These words then serve as features for a given statistical method. Following Harris' distributional hypothesis (Harris, 1970), which assumes that similar words tend to occur in similar contexts, these statistical methods are suitable, for instance, for finding similar sentences based on the words they contain (Costa et al., 2015) and for automatically extracting or validating semantic entities from corpora (Costa et al., 2010; Costa, 2010; Costa et al., 2011). To this end, it is assumed that the amount of information contained in a document can be evaluated by summing the amount of information contained in the document's words, and that the amount of information conveyed by a word can be represented by means of the weight assigned to it (Salton and Buckley, 1988).

With this in mind, we took advantage of two IR measures commonly used in the literature, Spearman's Rank Correlation Coefficient (SCC) and Chi-Square (χ²), to compute the similarity between documents written in the same language (see Sections 2.1 and 2.2). Both measures are particularly useful for this task because they are independent of text size (mostly because both use a list of the common entities) and because they are language-independent.

The SCC distributional measure has been shown to be effective in determining similarity between sentences, documents and even corpora of varying sizes (Kilgarriff, 2001; Costa et al., 2015; Costa, 2015). It is particularly useful, for instance, for measuring the textual similarity between documents because it is easy to compute and independent of text size, as it can directly compare ranked lists for large and small texts. The χ² similarity measure has also shown robustness and high performance. By way of example, χ² has been used to analyse the conversation component of the British National Corpus (Rayson et al., 1997), to compare both documents and corpora (Kilgarriff, 2001; Costa, 2015), and to identify topic-related clusters in imperfectly transcribed documents (Ibrahimov et al., 2002). It is a simple statistical measure that makes it possible to assess whether the relationship between two variables in a sample is due to chance or is systematic.

Bearing this in mind, distributional similarity measures in general, and SCC and χ² in particular, have a wide range of applications (Kilgarriff, 2001; Costa et al., 2015; Costa, 2015). Indeed, this work aims to show that these simple, yet robust and high-performance measures make it possible to describe the relatedness between documents in specialised corpora and to rank them according to their similarity.

2.1 Spearman's Rank Correlation Coefficient (SCC)

In this work, the SCC is adopted and calculated as in Kilgarriff (2001). Firstly, a list of the common entities[2] L between two documents d_l and d_m is compiled, where L(d_l,d_m) ⊆ (d_l ∩ d_m). It is possible to use the top n most common entities or all common entities between two documents, where n corresponds to the total number of common entities considered |L|, i.e. {n | n ∈ ℕ0, n ≤ |L|} – in this work we use all the common entities for each document pair, i.e. n = |L|. Then, for each document the list of common entities (e.g. L_dl and L_dm) is ranked by frequency in ascending order (RL_dl and RL_dm), where the entity with the lowest frequency receives ranking position 1 and the entity with the highest frequency receives ranking position n. Finally, for each common entity {e_1, ..., e_n} ∈ L, the difference s_i in the rank orders of the entity in the two documents is computed, and the sum of the squares of these differences (Σ_{i=1..n} s_i²) is used in the final SCC equation, presented in Equation 1, where {SCC | SCC ∈ ℝ, −1 ≤ SCC ≤ 1}:

    SCC(d_l, d_m) = 1 − (6 · Σ_{i=1..n} s_i²) / (n³ − n)    (1)

[2] In this work, the term 'entity' refers to "single words", which can be a token, a lemma or a stem.

2.2 Chi-Square (χ²)

The Chi-Square (χ²) measure also uses a list of common entities (L). As with the SCC, it is possible to use the top n most common entities or all common entities between two documents; again, we use all the common entities for each document pair, i.e. n = |L|. The number of occurrences of a common entity in L that would be expected in each document is calculated from the frequency lists. If the sizes of documents d_l and d_m are N_l and N_m, and the entity e_i has observed frequencies O(e_i, d_l) and O(e_i, d_m), then the expected values are:

    E(e_i, d_l) = N_l · (O(e_i, d_l) + O(e_i, d_m)) / (N_l + N_m)
    E(e_i, d_m) = N_m · (O(e_i, d_l) + O(e_i, d_m)) / (N_l + N_m)

Equation 2 presents the χ² formula, where O is the observed frequency and E the expected frequency:

    χ²(d_l, d_m) = Σ (O − E)² / E    (2)

The resulting χ² score should be interpreted as the inter-document distance between two documents. It is also important to mention that {χ² | χ² ∈ ℝ, 0 ≤ χ² < ∞}, and that the more unrelated the common entities in L are, the lower the χ² score will be.

3 Corpora

INTELITERM[3] is a specialised comparable corpus composed of documents collected from the Internet. It was manually compiled by researchers with the purpose of building a representative corpus (Biber, 1988, p. 246) for the tourism and beauty domain. It contains documents in four different languages (English, Spanish, Italian and German). Some of the texts are translations of each other (parallel), yet the majority consists of original texts. The corpus comprises several subcorpora, divided by language, and each language is further divided into translated and original texts. For the purpose of this work, only the original documents in English, Spanish and Italian were used, which from now on will be referred to as int_en, int_es and int_it, respectively.

In order to analyse how the DSMs perform the task of ranking documents based on their similarity and filtering out the unrelated ones, it is necessary to introduce noisy documents, i.e. out-of-domain documents, into the various subcorpora. To do that, we chose the well-known Europarl[4] corpus (Koehn, 2005), a parallel corpus composed of proceedings of the European Parliament. As mentioned further in Section 5.2, we added different amounts of noise to the various subcorpora, more precisely 5%, 10%, 15% and 20%. These noisy documents were randomly selected from the "one per day" Europarl v.7 for the three working languages: English, Spanish and Italian (eur_en, eur_es and eur_it, respectively).

    SubC     nDocs   types   tokens   types/tokens
    int_en   151     11.6k   496.2k   0.023
    eur_en   30      3.4k    29.8k    0.116
    int_es   224     13.2k   207.3k   0.063
    eur_es   44      5.6k    43.5k    0.129
    int_it   150     19.9k   386.2k   0.052
    eur_it   30      4.7k    29.6k    0.159

Table 1: Statistical information per subcorpus.

All the statistical information about the INTELITERM subcorpora and the sets of 20% noisy documents, randomly selected for each working language, is presented in Table 1. In detail, this table shows: the number of documents (nDocs); the number of types (types); the number of tokens (tokens); and the ratio of types per tokens (types/tokens) per subcorpus. These values were obtained using the AntConc 3.4.3 software (Anthony, 2014), a corpus analysis toolkit for concordancing and text analysis.

[3] http://www.lexytrad.es/proyectos.html
[4] http://www.statmt.org/europarl/
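Before turning to the methodology, the two measures defined in Section 2 can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: documents are assumed to be represented as entity-to-frequency dictionaries, function names are ours, and ties in frequency are broken arbitrarily (the paper does not specify tie handling).

```python
def common_entities(d1, d2):
    """L: the entities occurring in both documents."""
    return [e for e in d1 if e in d2]

def scc(d1, d2):
    """Spearman's Rank Correlation Coefficient over the common-entity
    frequency ranks (Equation 1): rank 1 = lowest frequency, rank n = highest."""
    L = common_entities(d1, d2)
    n = len(L)
    if n < 2:  # the formula is undefined for fewer than two common entities
        return 0.0
    r1 = {e: i + 1 for i, e in enumerate(sorted(L, key=lambda e: d1[e]))}
    r2 = {e: i + 1 for i, e in enumerate(sorted(L, key=lambda e: d2[e]))}
    s2 = sum((r1[e] - r2[e]) ** 2 for e in L)
    return 1 - (6 * s2) / (n ** 3 - n)

def chi_square(d1, d2):
    """Chi-square over the common entities (Equation 2): observed vs.
    expected frequencies if both documents drew from one distribution."""
    L = common_entities(d1, d2)
    n1, n2 = sum(d1.values()), sum(d2.values())
    score = 0.0
    for e in L:
        joint = d1[e] + d2[e]
        e1 = n1 * joint / (n1 + n2)  # expected frequency in d1
        e2 = n2 * joint / (n1 + n2)  # expected frequency in d2
        score += (d1[e] - e1) ** 2 / e1 + (d2[e] - e2) ** 2 / e2
    return score
```

Two identical frequency lists give SCC = 1 and χ² = 0, while reversing the frequency ranking of the shared entities drives the SCC towards −1.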
4 Methodology

This section describes the methodology employed to calculate and rank documents based on their similarity using Distributional Similarity Measures (DSMs). All the tools, libraries and frameworks used for the purpose at hand are also pointed out.

1) Data preprocessing: firstly, all the INTELITERM documents were processed with the OpenNLP[5] Sentence Detector and Tokeniser. Then, the annotation process was carried out with the TT4J[6] library, a Java wrapper around the popular TreeTagger (Schmid, 1995) – a tool specifically designed to annotate text with part-of-speech and lemma information. Regarding stemming, we used the Porter stemmer algorithm provided by the Snowball[7] library. A method to remove punctuation and special characters within the words was also implemented. Finally, in order to get rid of noise, a stopword list[8] was compiled to filter out the most frequent words in the corpus. Once a document has been processed and its sentences tokenised, lemmatised and stemmed, our system creates a new output file with all this new information, i.e. a new document containing the original, the tokenised, the lemmatised and the stemmed text. Using the stopword list mentioned above, a Boolean vector indicating whether each entity is a stopword is also added to the document. This way, the system can use only the tokens, lemmas and stems that are not stopwords.

2) Identifying the list of common entities between documents: in order to identify a list of common entities (from now on we will use the acronym NCE), a co-occurrence matrix was built for each pair of documents. Only entities that occur at least once in both documents are considered. As required by the DSMs (see Section 2), their frequency in both documents is also stored within this matrix: L(d_l,d_m) = {e_i, (f(e_i, d_l), f(e_i, d_m)); e_j, (f(e_j, d_l), f(e_j, d_m)); ...; e_n, (f(e_n, d_l), f(e_n, d_m))}, where f represents the frequency of an entity in a document. With the purpose of analysing and comparing the performance of different DSMs, three different lists were created to be used as input features: the first using the Number of Common Tokens (NCT), another using the Number of Common Lemmas (NCL) and the third using the Number of Common Stems (NCS).

3) Computing the similarity between documents: the similarity between documents was calculated by applying three different DSMs (DSMs = {DSM_NCE, DSM_SCC, DSM_χ²}, where NCE, SCC and χ² refer to Number of Common Entities, Spearman's Rank Correlation Coefficient and Chi-Square, respectively), each one calculated using the three different input features (NCT, NCL and NCS).

4) Computing the document final score: the document final score DSM(d_l) is the mean of the similarity scores of the document with all the other documents in the collection, i.e. DSM(d_l) = (Σ_{i=1..n−1} DSM_i(d_l, d_i)) / (n − 1), where n corresponds to the total number of documents in the collection and DSM_i(d_l, d_i) is the resulting similarity score between document d_l and each other document in the collection.

5) Ranking documents: finally, the documents were ranked in descending order according to their DSM scores (i.e. NCE, SCC or χ²).

[5] https://opennlp.apache.org
[6] http://reckart.github.io/tt4j/
[7] http://snowball.tartarus.org
[8] Freely available at https://github.com/hpcosta/stopwords.
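Steps 4 and 5 above can be sketched as follows. This is an illustrative outline rather than the authors' system: documents are assumed to be entity-to-frequency dictionaries, the Number of Common Entities is used as the similarity function for brevity (any of the three DSMs could be plugged in), and all names are ours.

```python
def nce(d1, d2):
    """Number of Common Entities between two frequency dictionaries."""
    return len(set(d1) & set(d2))

def rank_documents(docs, sim=nce):
    """Step 4: each document's final score is the mean of its pairwise
    similarity with every other document in the collection.
    Step 5: rank the documents by that score in descending order."""
    scores = {}
    for name, doc in docs.items():
        others = [sim(doc, other) for n, other in docs.items() if n != name]
        scores[name] = sum(others) / len(others)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because an out-of-domain document shares few entities with the rest of the collection, its mean score is low and it sinks to the bottom of the ranking, which is what the filtering task in Section 5.2 exploits.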
5 Results and Analysis

This experiment is divided into two parts. In the first part (Section 5.1), we describe the corpus at hand by applying three different Distributional Similarity Measures (DSMs): the Number of Common Entities (NCE), the Spearman's Rank Correlation Coefficient (SCC) and the Chi-Square (χ²). As input features to the DSMs, three different lists of entities were used: the Number of Common Tokens (NCT), the Number of Common Lemmas (NCL) and the Number of Common Stems (NCS). By way of example, Table 2 shows the NCT between documents and the SCC and χ² scores as averages (av) along with the associated standard deviations (σ) per measure and subcorpus. Figure 1 presents the resulting average scores per document in box plot format for all DSM vs. feature combinations. Each box plot displays the full range of variation (from min to max), the likely range of variation (the interquartile range, or IQR), the median, and the high maximums and low minimums (also known as outliers). It is important to mention that for the first part of this experiment (Section 5.1) we did not use a sample, but instead the entire INTELITERM subcorpora in their original size and form, which means that all obtained results and observations come from the entire population, in this case the English (int_en), Spanish (int_es) and Italian (int_it) subcorpora (for more details about the subcorpora see Section 3). For the second part of the experiment, we used the same subcorpora, but an additional percentage of documents was added to them in order to test how the DSMs perform the task of filtering out these noisy documents, i.e. out-of-domain documents (see Section 5.2). In detail, Figure 2 shows how the average scores decrease when noisy documents are injected, and Table 3 presents how the DSMs performed when that noise was injected.

5.1 Describing the Corpus

The first observation we can make from Figure 1 is that the distributions of the different features are quite similar (see for instance Figures 1a, 1d and 1g). This means that it is possible to achieve acceptable results using only raw words (i.e. tokens). Stems and lemmas require more processing power and time to be used as features – especially lemmas, due to the part-of-speech tagger dependency and the time-consuming process implied. In general, we can say that the scores for each subcorpus are symmetric (roughly the same on each side when cut down the middle), which means that the data is normally distributed. There are some exceptions that we will discuss throughout this section. Another interesting observation concerns the high Number of Common Tokens (NCT) in English (int_en) when compared with Italian and Spanish (int_it and int_es, respectively); see Table 2 and Figure 1a. Later in this section, we will try to explain this phenomenon.

    SubC     Stats   NCT      SCC    χ²
    int_en   av      163.70   0.42   279.39
             σ       83.87    0.05   177.45
    int_es   av      31.97    0.41   40.92
             σ       23.48    0.07   38.21
    int_it   av      101.08   0.39   201.97
             σ       55.71    0.05   144.68

Table 2: Average and standard deviation of the scores between documents per subcorpus.

Although the NCT per document is on average higher for the int_en subcorpus, its interquartile range (IQR) is larger than for the other subcorpora (see Table 2 and Figure 1a), which means that the middle 50% of the data is more spread out and thus the average NCT per document is more variable. Moreover, the long whiskers (the lines extending vertically from the box) in Figure 1a also indicate variability outside the upper and lower quartiles. Therefore, we can say that int_en contains a wide variety of documents, and consequently some of them are only roughly correlated with the rest of the subcorpus. Nevertheless, the data is skewed left, and the longer whisker outside the upper quartile indicates that the majority of the data is strongly similar, i.e. the documents have a high degree of relatedness to each other. This idea is supported not only by the positive average SCC scores, but also by the set of outliers above the upper whisker in Figure 1b. The average SCC score of 0.42 with σ = 0.05 also implies a strong correlation between the documents in the int_en subcorpus (Table 2). Likewise, the long whisker and the set of outliers outside the upper quartile in the χ² scores also indicate a high relatedness between the documents.

Regarding the int_it subcorpus, the SCC and χ² scores (Figures 1b and 1c) and the average of 101.08 common tokens per document with σ = 55.71 (Figure 1a and Table 2) suggest that the data is normally distributed (Figure 1b) and highly correlated. Although this subcorpus obtained lower average scores for all the DSMs than the English subcorpus, Table 2 and Figures 1a, 1b and 1c show that the average scores and the range of variation are quite similar to those of the English subcorpus. Therefore, we can conclude that the documents inside the Italian subcorpus are highly related to each other.

[Figure 1: INTELITERM: average scores between documents per subcorpus. Box plots (a)–(c) show common tokens, SCC (tokens) and χ² (tokens); (d)–(f) the same measures over lemmas; (g)–(i) over stems, for int_en, int_es and int_it.]

Of the three subcorpora, int_es is the biggest, with 224 documents (Table 1). Nevertheless, its average scores per document differ noticeably from the other box plots (see Figures 1a, 1b and 1c). The fact that the χ² standard deviation is practically equal to its average (38.21 and 40.92, respectively), together with the SCC variability inside and outside the IQR, indicates some inconsistency in the data. Moreover, Table 2 and Figure 1a reveal a lower NCT compared with the int_en and int_it subcorpora.

The int_en subcorpus has 163 common tokens per document on average with σ = 83, while the int_it and int_es subcorpora only have 101 and 31 common tokens per document on average with σ = 55 and σ = 23, respectively (Table 2, NCT column). This means that the int_it and int_es subcorpora are composed of documents with a lower level of relatedness than the English one. One possible reason is that Italian and Spanish have a richer morphology than English: due to the larger number of inflected forms per lemma, there is a larger number of distinct tokens and consequently fewer common tokens per document in Spanish. Another explanation could be that tourism and beauty services are more developed in Italy and Spain than in the UK, and therefore there is more variety in the vocabulary used as well as in the services offered. Indeed, Table 1 offers some evidence about the vocabulary employed: the English subcorpus has a lower number of types and a higher number of tokens (11.6k and 496.2k, respectively) when compared with the Italian (19.9k types and 386.2k tokens) and Spanish subcorpora (13.2k types and 207.3k tokens). The large difference in the average number of common tokens per document between Spanish and the other two languages may also be related to the marketing strategies used to advertise tourism and beauty services, although this is hard to confirm. Moreover, while our method captures the lexical level of similarity between documents, the semantic level is not taken into account, i.e. it does not, for example, treat synonyms as similar words, which would result in slightly different similarity scores (again, an explanation that is difficult to confirm).

To conclude, we can state from the statistical and theoretical evidence that the int_en and int_it subcorpora appear to assemble highly correlated documents. We cannot say the same for the int_es subcorpus: due to the scarcity of evidence, we can only refrain from rejecting the idea that this subcorpus is composed of similar documents. Nevertheless, as we will see in the next section, the fact that int_es is composed of documents with a low level of relatedness (according to our findings) will affect the ranking task.

5.2 Measuring DSM Performance

The second part of this experiment aims at assessing how the DSMs perform the task of filtering out documents with a low level of relatedness. To do that, we injected different sets of out-of-domain documents, randomly selected from the Europarl corpus, into the original INTELITERM subcorpora. More precisely, we injected 5%, 10%, 15% and 20%[9] of noise into the various subcorpora. As we can see in Figure 2, the more noisy documents are injected, the lower the NCT. Then, the methodology described in Section 4 was applied to these twelve "new" subcorpora (int_en05, int_en10, ..., int_it15 and int_it20; see Figure 2). As a result, at this point we have the documents ranked in descending order according to their DSM scores.

[Figure 2: Average scores between documents when injecting 5%, 10%, 15% and 20% of noise into the various subcorpora (int_en05, ..., int_it20).]

In order to evaluate the DSMs' precision, we analysed the first n positions in the ranking lists produced by each of the three DSMs individually, where n is the number of original documents in the given INTELITERM subcorpus. Table 3 presents the precision values obtained by the DSMs when injecting different amounts of noise into the various original subcorpora.

    SubC     Noise   NCT    SCC    χ²
    int_en   5%      0.89   0.22   1.00
             10%     0.73   0.33   1.00
             15%     0.73   0.36   0.95
             20%     0.80   0.37   0.90
    int_es   5%      0.00   0.00   0.38
             10%     0.07   0.07   0.20
             15%     0.09   0.09   0.17
             20%     0.14   0.18   0.23
    int_it   5%      0.88   0.13   0.88
             10%     0.82   0.06   0.82
             15%     0.74   0.09   0.83
             20%     0.73   0.13   0.87

Table 3: DSMs' precision when injecting different amounts of noise into the various subcorpora.

As expected, none of the DSMs achieved acceptable results for Spanish, all being incapable of correctly identifying the noisy documents. However, we need to be aware that this happened due to the pre-existing low level of relatedness between the original documents in the int_es subcorpus (see Section 5.1 for more details). On the other hand, the DSMs show promising results for English and Italian. By way of example, χ² was capable of reaching 100% precision when 5% and 10% of noise was injected into the int_en subcorpus, and still 90% when 20% was injected.

[9] The number of documents corresponding to these percentages can be inferred from Table 1.
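The precision figure behind Table 3 can be reproduced in outline: after ranking a noisy subcorpus by final score, take the top n positions (n = number of original documents) and count how many of them are in-domain. A minimal sketch, with names of our own choosing:

```python
def precision_at_n(ranking, originals):
    """Fraction of the top-n ranked documents that belong to the original
    (in-domain) set, where n is the number of original documents.
    `ranking` is a list of (name, score) pairs in descending score order."""
    n = len(originals)
    top = [name for name, _ in ranking[:n]]
    return sum(1 for name in top if name in originals) / n
```

A perfect filter pushes every injected Europarl document below position n, yielding a precision of 1.0; each noisy document that climbs into the top n costs 1/n.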
Although the NCT obtained lower precision in general when compared with χ², it still reached 80% and 73% when 20% of noise was injected into the English and Italian subcorpora, respectively. From the evidence shown in Table 3, we can say that the NCT and χ² are suitable for the task of filtering out documents with a low level of relatedness with a high degree of precision. The same cannot be said of the SCC measure, at least for this specific task.

6 Conclusions and Future Work

In this paper we presented a simple methodology and studied various Distributional Similarity Measures (DSMs) for the purpose of measuring the relatedness between documents in specialised comparable corpora. As input for these DSMs, we used three different input features (lists of common tokens, lemmas and stems). We conclude that, for the data at hand, these features had similar performance. In fact, our findings show that instead of using common lemmas or stems, which require external libraries, processing power and time, a simple list of common tokens was enough to describe our data. Moreover, we showed that it is possible to assess and describe comparable corpora through statistical methods. The number of entities shared by the documents and the average scores obtained with the SCC and χ² measures proved to be an important surgical toolbox with which to dissect and microscopically analyse comparable corpora. Furthermore, these DSMs can be seen as a suitable tool for ranking documents by their similarity – a handy feature for those who manually or semi-automatically compile corpora mined from the Internet and want to retrieve the most similar documents and filter out those with a low level of relatedness. Our findings show promising results when filtering out noisy documents: indeed, two of the measures obtained very high precision, even when dealing with 20% of noise.

In the future, we intend not only to perform more experiments with these DSMs on other corpora and languages, but also to test other DSMs, such as Jaccard or Cosine, and compare their performance.

Acknowledgements

Hernani Costa is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no 317471. The research reported in this work has also been partially carried out in the framework of the Educational Innovation Project TRADICOR (PIE 13-054, 2014-2015); the R&D project INTELITERM (ref. no FFI2012-38881, 2012-2015); the R&D Project for Excellence TERMITUR (ref. no HUM2754, 2014-2017); and the LATEST project (ref. 327197-FP7-PEOPLE-2012-IEF).

References

Laurence Anthony. 2014. AntConc (Version 3.4.3) [Macintosh OS X]. Waseda University, Tokyo, Japan. Available from http://www.laurenceanthony.net.

Douglas Biber. 1988. Variation across Speech and Writing. Cambridge University Press, Cambridge, UK.

Gloria Corpas Pastor and Míriam Seghiri. 2009. Virtual Corpora as Documentation Resources: Translating Travel Insurance Documents (English-Spanish). In A. Beeby, P. R. Inés, and P. Sánchez-Gijón, editors, Corpus Use and Translating: Corpus Use for Learning to Translate and Learning Corpus Use to Translate, Benjamins Translation Library, chapter 5, pages 75–107. John Benjamins Publishing Company.

Gloria Corpas Pastor. 2001. Compilación de un corpus ad hoc para la enseñanza de la traducción inversa especializada. TRANS, Revista de Traductología, 5(1):155–184.

Hernani Costa, Hugo Gonçalo Oliveira, and Paulo Gomes. 2010. The Impact of Distributional Metrics in the Quality of Relational Triples. In 19th European Conf. on Artificial Intelligence, Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, ECAI'10, pages 23–29, Lisbon, Portugal, August.

Hernani Costa, Hugo Gonçalo Oliveira, and Paulo Gomes. 2011. Using the Web to Validate Lexico-Semantic Relations. In 15th Portuguese Conf. on Artificial Intelligence, volume 7026 of EPIA'11, pages 597–609, Lisbon, Portugal, October. Springer.

Hernani Costa, Hanna Béchara, Shiva Taslimipoor, Rohit Gupta, Constantin Orasan, Gloria Corpas Pastor, and Ruslan Mitkov. 2015. MiniExperts: An SVM Approach for Measuring Semantic Textual Similarity. In 9th Int. Workshop on Semantic Evaluation, SemEval'15, pages 96–101, Denver, Colorado, June. ACL.

Hernani Costa. 2010. Automatic Extraction and Validation of Lexical Ontologies from Text. Master's thesis, University of Coimbra, Faculty of Sciences and Technology, Department of Informatics Engineering, Coimbra, Portugal, September.

Hernani Costa. 2015. Assessing Comparable Corpora through Distributional Similarity Measures. In EXPERT Scientific and Technological Workshop, pages 23–32, Malaga, Spain, June.

EAGLES. 1996. Preliminary Recommendations on Corpus Typology. Technical report, EAGLES Document EAG-TCWG-CTYP/P, May. http://www.ilc.cnr.it/EAGLES96/corpustyp/corpustyp.html.

Zelig Harris. 1970. Distributional Structure. In Papers in Structural and Transformational Linguistics, pages 775–794. D. Reidel Publishing Company, Dordrecht, Holland.

Oktay Ibrahimov, Ishwar Sethi, and Nevenka Dimitrova. 2002. The Performance Analysis of a Chi-square Similarity Measure for Topic Related Clustering of Noisy Transcripts. In 16th Int. Conf. on Pattern Recognition, volume 4, pages 285–288. IEEE Computer Society.

Adam Kilgarriff. 2001. Comparing Corpora. Int. Journal of Corpus Linguistics, 6(1):97–133.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In MT Summit.

Paul Rayson, Geoffrey Leech, and Mary Hodges. 1997. Social Differentiation in the Use of English Vocabulary: Some Analyses of the Conversational Component of the British National Corpus. Int. Journal of Corpus Linguistics, 2(1):133–152.

Gerard Salton and Christopher Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5):513–523.

Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In ACL SIGDAT-Workshop, pages 47–50, Dublin, Ireland.

Amit Singhal. 2001. Modern Information Retrieval: A Brief Overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35–42.