Proceedings of the Workshop on Iberian Cross-Language Natural Language Processing Tasks (ICL 2011)


     Measuring Comparability of Multilingual Corpora Extracted
                        from Wikipedia ∗
       Midiendo la comparabilidad de corpus multilingües extraídos de la
                                 Wikipedia

              Pablo Gamallo Otero                                 Isaac González López
       Centro de Investigación en Tecnoloxías                         Cilenis S.L.
             da Información (CITIUS),                         Language Engineering Solutions
      Universidade de Santiago de Compostela                      Santiago de Compostela
                   Galiza, Spain                                       Galiza, Spain
              pablo.gamallo@usc.es                             isaacjgonzalez@cilenis.com

       Resumen: Los corpus comparables son muy útiles en variadas tareas del procesamiento del lenguaje tales como la extracción de léxicos bilingües. Con la mejora de la calidad de los corpus comparables, podemos mejorar la calidad de la extracción. Este artículo describe algunas estrategias para construir corpus comparables a partir de la Wikipedia, y propone una medida de comparabilidad. Fueron realizados algunos experimentos utilizando la Wikipedia portuguesa, española e inglesa.
       Palabras clave: Extracción de Información, Corpus Comparables, Léxicos Bilingües, Comparabilidad
       Abstract: Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments were performed on Portuguese, Spanish, and English Wikipedia.
       Keywords: Information Extraction, Comparable Corpora, Bilingual Lexicons, Comparability


∗ This work has been supported by Ministerio de Educación y Ciencia of Spain, within the project OntoPedia, ref: FF12010-14986.

1.    Introduction
    Wikipedia is a free, multilingual, and collaborative encyclopedia containing entries (called “articles”) for almost 300 languages (281 in July 2011). English is the best represented one, with about 3 million articles. However, Wikipedia is not a parallel corpus, as its articles are not translations from one language into another. Many works published in recent years focus on its use and exploitation for multilingual tasks in natural language processing: extraction of bilingual dictionaries (Yu and Tsujii, 2009; Tyers and Pieanaar, 2008), alignment and machine translation (Adafre and de Rijke, 2006; Tomás, Bataller, and Casacuberta, 2001), and multilingual information retrieval (Pottast, Stein, and Anderka, 2008). There also exists theoretical work analysing symmetries and asymmetries among the different multilingual versions of an entry/article in Wikipedia (Filatova, 2009).

    In addition, multilingual articles of Wikipedia have been used as a source to build comparable corpora (Gamallo and González, 2010). The EAGLES (Expert Advisory Group on Language Engineering Standards) Guidelines (see http://www.ilc.pi.cnr.it/EAGLES96/browse.html) define a “comparable corpus” as one which selects similar texts in more than one language or variety. One of the main advantages of comparable corpora is their versatility: they can be used in many linguistic tasks (Maia, 2003), such as bilingual lexicon extraction (Gamallo and Pichel, 2008; Saralegui, Vicente, and Gurrutxaga, 2008), information retrieval, and knowledge engineering. Besides, they can also be used as training corpora to improve statistical machine learning systems, in particular when parallel corpora are scarce for a given pair of languages. Another advantage concerns their availability. In contrast with parallel corpora, which require translated texts (not always available), comparable corpora are easily retrieved from the web. Among the different web sources of comparable corpora, Wikipedia is likely the largest repository of similar texts in many languages. We only require the appropriate computational tools to make them comparable.

    Taking into account the multilingual potential of Wikipedia, our main objective is to define a method to measure the similarity (or degree of comparability) of different comparable corpora built from Wikipedia. For this purpose, we first describe some strategies to extract monolingual corpora in Portuguese, Spanish, and English from Wikipedia, making use of categories (“Archaeology”, “Biology”, “Physics”, etc.) to make them comparable according to a specific topic. These strategies were described in detail in (Gamallo and González, 2010). Then, we propose a measure of comparability to verify whether the corpora are weakly or highly comparable. For many extraction tasks, such as bilingual lexicon extraction, using highly comparable corpora often leads to better results. Some works propose comparability measures between monolingual corpora (Li and Gaussier, 2010; Saralegui and Alegria, 2007), based on the use of existing bilingual dictionaries. However, instead of exploiting dictionaries to compute the comparability degree, we take advantage of the translation equivalents inserted in Wikipedia by means of interlanguage links.

    This paper is organized as follows. Section 2 introduces two strategies to build comparable corpora from Wikipedia. Next, in Section 3, we propose some comparability measures. Then, Section 4 describes some experiments performed in order to measure the comparability between different corpora built using the strategies defined in Section 2. The last section discusses future tasks that will be implemented in order to extend and improve our tools.

2.    Two Strategies to Build Wikipedia-Based Comparable Corpora
    The input of our strategies is CorpusPedia¹, a friendly and easy-to-use XML structure, generated from Wikipedia dump files. In CorpusPedia, all the internal links found in the text are put in a vocabulary list identified with the tag links. In the same way, all the categories (or topics) used to classify each article are inserted in the tag category. In addition, there is a tag called translations which codifies a list of interlanguage links (i.e., links to the same articles in other languages) found in each article. Categories and translations are very useful features to build comparable corpora. Given these features, we developed two strategies aimed at extracting corpora with different degrees of comparability.

¹ The software to build CorpusPedia, as well as CorpusPedia files for English, French, Spanish, Portuguese, and Galician, are freely available at http://gramatica.usc.es/pln/

Not-Aligned Corpus  This strategy extracts those articles in two languages that share the same topic, where the topic is represented by a category and its translation (for instance, the English-Spanish pair “Archaeology-Arqueología”). It results in a not-aligned comparable corpus, consisting of texts in two languages. We call it “not-aligned” because the version of an article in one language may not have a corresponding version in the other language.

Aligned Corpus  The goal is to extract pairs of bilingual articles related by interlanguage links if at least one of the two contains a required category. It results in a comparable corpus that is aligned article by article.

    In Section 4, we will measure the degree of comparability of corpora built by means of these two strategies. Before that, we will define how to measure comparability between Wikipedia-based corpora.

3.    Comparability Measures
    For a comparable corpus C of Wikipedia articles, constituted for instance by a Portuguese part $C_p$ and a Spanish part $C_s$, a comparability coefficient can be defined on the basis
of finding, for each Portuguese term $t_p$ in the vocabulary $C_p^v$ of $C_p$, its interlanguage link (or translation) in the vocabulary $C_s^v$ of $C_s$. The vocabulary of a Wikipedia corpus is the set of “internal links” found in that corpus. So, the two corpus parts, $C_p$ and $C_s$, tend to have a high degree of comparability if we find many internal links in $C_p^v$ that can be translated (by means of interlanguage links) into many internal links in $C_s^v$. Let $Trans_{bin}(t_p, C_s^v)$ be a binary function which returns 1 if the translation of the Portuguese term $t_p$ is found in the Spanish vocabulary $C_s^v$, and 0 otherwise. The binary Dice coefficient, $Dice_{bin}$, between two parts of a comparable corpus C is then defined as:

$$Dice_{bin}(C_p, C_s) = \frac{2 \sum_{t_p \in C_p^v} Trans_{bin}(t_p, C_s^v)}{|C_p^v| + |C_s^v|}$$

    We consider that it is not necessary to define the counterpart of the translation function, since the number of ambiguous terms is very low in Wikipedia, and most cases of ambiguity are solved with the so-called “disambiguation pages”.

    To avoid a bias towards common internal links, that is, towards those links occurring in most articles, we define a specific version of the tf-idf weight for each term. In particular, $tfidf(t_p)$ is the frequency of term $t_p$ in the Portuguese part of the comparable corpus, multiplied by its inverse article frequency in the whole Portuguese Wikipedia. By taking into account the tf-idf of terms, we can define a weighted measure of comparability. Let $Trans_{tfidf}(t_p, C_s^v)$ be a function which returns the smaller value (min) of the two tf-idf scores $tfidf(t_p)$ and $tfidf(t_s)$, where $t_s$ is the Spanish translation of $t_p$ in the Spanish part $C_s$. The weighted Dice coefficient, $Dice_{tfidf}$, between two parts of a comparable corpus C is then defined as follows:

$$Dice_{tfidf}(C_p, C_s) = \frac{2 \sum_{t_p \in C_p^v} Trans_{tfidf}(t_p, C_s^v)}{\sum_{t_p \in C_p^v} tfidf(t_p) + \sum_{t_s \in C_s^v} tfidf(t_s)}$$

    The experiments described in the next section will be performed with the two comparability measures defined here.

4.    Experiments and Results
    Taking CorpusPedia as input source, we performed several experiments to build different comparable corpora for three language pairs, namely Portuguese-Spanish, Portuguese-English, and Spanish-English. These corpora were built using the two strategies described in Section 2 and five domain-specific seed terms (in the three languages) considered as representative of five domain topics: “Archaeology”, “Linguistics”, “Physics”, “Biology”, and “Sport”.

    Table 1 shows the (binary and tf-idf) Dice scores obtained from measuring the comparability degree of 30 different comparable corpora. For each corpus, the table also shows the size (in Mb) of its two parts. In particular, the first column gives the two languages of the corpus (pt = Portuguese, sp = Spanish, en = English) and the type of strategy (aligned or not aligned) used to build it. The second and third columns show the two Dice scores. The fourth column shows the size of the two parts of the corpus, and the last column contains the two seed terms employed to generate the corpus. In Table 2, we show the Dice scores as well as the size of nine pairs of monolingual corpora randomly generated from Wikipedia.

    We can observe first that there are significant differences in terms of comparability between the Dice scores in Table 1 and those obtained from the randomly generated monolingual pairs in Table 2. It follows that corpora built by means of our strategies (not aligned and aligned) are actually comparable. Then, we should note that in the comparable corpora of Table 1, the Dice scores based on tf-idf are about 70 % higher than those based on the binary function. By contrast, in randomly generated corpora (Table 2), there are no significant differences between $Dice_{bin}$ and $Dice_{tfidf}$. This means that our tf-idf weighting raises the Dice similarity score when the two evaluated corpus parts are actually comparable.

    As expected, not-aligned corpora tend to be larger than the aligned ones. However, if we just compare the smallest parts of each corpus, the differences are not very important: the smallest parts of not-aligned corpora are only 15 % larger than those of aligned corpora. This is in accordance with the fact that aligned corpora are more balanced in terms of size, since no part is much larger than the other one. As far as corpus size is concerned, let us note that, on average, English parts are clearly larger than the Spanish ones, which are slightly larger than the Portuguese ones.

            Corpora                 Dice (bin)   Dice (tf-idf)   Size (in Mb)      Seed terms
            pt-sp (not aligned)        .068          .086        0.6Mb/3.4Mb       Arqueologia, Arqueología
            pt-en (not aligned)        .041          .067        0.6Mb/8.4Mb       Arqueologia, Archaeology
            sp-en (not aligned)        .090          .140        0.4Mb/8.4Mb       Arqueología, Archaeology
            pt-sp (aligned)            .179          .199        0.4Mb/0.2Mb       Arqueologia, Arqueología
            pt-en (aligned)            .127          .140        0.4Mb/1.1Mb       Arqueologia, Archaeology
            sp-en (aligned)            .181          .226        2.0Mb/2.9Mb       Arqueología, Archaeology

            pt-sp (not aligned)        .078          .129        0.8Mb/1.7Mb       Linguística, Lingüística
            pt-en (not aligned)        .054          .136        0.8Mb/5.1Mb       Linguística, Linguistics
            sp-en (not aligned)        .074          .170        1.7Mb/5.1Mb       Lingüística, Linguistics
            pt-sp (aligned)            .140          .214        0.6Mb/0.8Mb       Linguística, Lingüística
            pt-en (aligned)            .128          .194        0.5Mb/1.2Mb       Linguística, Linguistics
            sp-en (aligned)            .150          .257        0.9Mb/1.7Mb       Lingüística, Linguistics

            pt-sp (not aligned)        .200          .374        4.4Mb/4.8Mb       Física, Física
            pt-en (not aligned)        .123          .287        4.4Mb/12Mb        Física, Physics
            sp-en (not aligned)        .270          .403        4.8Mb/12Mb        Física, Physics
            pt-sp (aligned)            .237          .390        3.6Mb/4.7Mb       Física, Física
            pt-en (aligned)            .178          .348        3.8Mb/11Mb        Física, Physics
            sp-en (aligned)            .220          .387        3.4Mb/7.6Mb       Física, Physics

            pt-sp (not aligned)        .130          .227        2.4Mb/1.5Mb       Biologia, Biología
            pt-en (not aligned)        .102          .193        2.4Mb/9.4Mb       Biologia, Biology
            sp-en (not aligned)        .068          .129        1.5Mb/9.4Mb       Biología, Biology
            pt-sp (aligned)            .197          .328        1.6Mb/2.8Mb       Biologia, Biología
            pt-en (aligned)            .186          .308        1.8Mb/4.5Mb       Biologia, Biology
            sp-en (aligned)            .213          .294        0.9Mb/1.3Mb       Biología, Biology

            pt-sp (not aligned)        .083          .148        11Mb/35Mb         Desporto, Deporte
            pt-en (not aligned)        .026          .085        11Mb/333Mb        Desporto, Sport
            sp-en (not aligned)        .047          .136        35Mb/333Mb        Deporte, Sport
            pt-sp (aligned)            .175          .266        9.7Mb/15Mb        Desporto, Deporte
            pt-en (aligned)            .189          .334        11Mb/20Mb         Desporto, Sport
            sp-en (aligned)            .206          .290        20Mb/29Mb         Deporte, Sport

            pt-sp (not aligned)        .111          .192        3.8Mb/9.3Mb       Overall
            pt-en (not aligned)        .069          .153        3.8Mb/73Mb        Overall
            sp-en (not aligned)        .109          .195        9.3Mb/73Mb        Overall
            pt-sp (aligned)            .185          .279        3.2Mb/4.7Mb       Overall
            pt-en (aligned)            .161          .264        3.5Mb/7.6Mb       Overall
            sp-en (aligned)            .194          .290        6.2Mb/8.5Mb       Overall

Table 1: Dice similarity between several comparable corpora in Portuguese, Spanish, and
English.
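
The two Dice scores reported in the table above follow the definitions given in Section 3. The following Python block is a minimal sketch of those definitions, not the authors' implementation. The function names, the dictionary-based representation of interlanguage links, and the toy vocabularies are hypothetical. The paper does not give the exact form of the inverse article frequency, so the `log(N / af)` form used here is an assumption, as is returning 0 for a term without a translation in the other vocabulary.

```python
import math


def tfidf_weights(link_counts, article_freq, n_articles):
    """Weight each internal link by its frequency in the corpus part times an
    inverse article frequency. The log(N / af) form is an assumption."""
    return {
        term: tf * math.log(n_articles / article_freq[term])
        for term, tf in link_counts.items()
    }


def dice_bin(vocab_p, vocab_s, translations):
    """Binary Dice: twice the number of source links whose interlanguage link
    is in the target vocabulary, divided by the sum of both vocabulary sizes."""
    total = len(vocab_p) + len(vocab_s)
    if total == 0:
        return 0.0
    hits = sum(1 for tp in vocab_p if translations.get(tp) in vocab_s)
    return 2 * hits / total


def dice_tfidf(weights_p, weights_s, translations):
    """Weighted Dice: each translated link contributes the smaller of its two
    tf-idf weights; the denominator sums all weights of both parts."""
    total = sum(weights_p.values()) + sum(weights_s.values())
    if total == 0:
        return 0.0
    shared = 0.0
    for tp, wp in weights_p.items():
        ts = translations.get(tp)
        if ts in weights_s:  # untranslated or unmatched links contribute 0
            shared += min(wp, weights_s[ts])
    return 2 * shared / total


# Toy data (hypothetical): internal links of a Portuguese and a Spanish part.
translations = {"Arqueologia": "Arqueología", "Escavação": "Excavación", "Lisboa": "Lisboa"}
weights_pt = {"Arqueologia": 2.0, "Escavação": 1.0, "Lisboa": 3.0}
weights_sp = {"Arqueología": 1.5, "Excavación": 4.0, "Madrid": 1.0, "Museo": 0.5}

print(round(dice_bin(set(weights_pt), set(weights_sp), translations), 3))  # 0.571
print(round(dice_tfidf(weights_pt, weights_sp, translations), 3))          # 0.385
```

In the toy data, two of the three Portuguese links have their translation in the Spanish vocabulary, so the binary score is 2·2/(3+4). The weighted score is lower because the unmatched link "Lisboa" carries the largest weight in the Portuguese part.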
                             Corpora             Dice (bin)   Dice (tf-idf)   Size (in Mb)
                             pt-sp1 (random)        .012          .012        2.2Mb/0.9Mb
                             pt-en1 (random)        .003          .003        2.2Mb/0.4Mb
                             sp-en1 (random)        .003          .003        0.9Mb/0.4Mb
                             pt-sp2 (random)        .016          .014        1.5Mb/3.0Mb
                             pt-en2 (random)        .017          .014        1.5Mb/42Mb
                             sp-en2 (random)        .017          .015        3.0Mb/42Mb
                             pt-sp3 (random)        .008          .006        0.2Mb/0.5Mb
                             pt-en3 (random)        .001          .001        0.2Mb/1.4Mb
                             sp-en3 (random)        .005          .005        0.5Mb/1.4Mb

     Table 2: Dice similarity between randomly generated pairs of monolingual corpora.
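
The comparisons drawn from the two tables can be checked roughly with a few lines of arithmetic. The sketch below copies the six "Overall" rows of the first table and the nine rows of the second, then compares unweighted column means. The paper does not state how its percentages were averaged, so this averaging scheme and the helper names are assumptions, and the results are only expected to land near the reported figures.

```python
from statistics import mean

# (Dice bin, Dice tf-idf) pairs copied from the "Overall" rows of the first table.
overall = [
    (0.111, 0.192), (0.069, 0.153), (0.109, 0.195),  # not aligned
    (0.185, 0.279), (0.161, 0.264), (0.194, 0.290),  # aligned
]

# (Dice bin, Dice tf-idf) pairs copied from the randomly generated pairs.
random_pairs = [
    (0.012, 0.012), (0.003, 0.003), (0.003, 0.003),
    (0.016, 0.014), (0.017, 0.014), (0.017, 0.015),
    (0.008, 0.006), (0.001, 0.001), (0.005, 0.005),
]


def column_mean(rows, column):
    """Unweighted mean of one score column (0 = binary, 1 = tf-idf)."""
    return mean(row[column] for row in rows)


def relative_gain(new, old):
    """Relative change of `new` with respect to `old` (0.5 means +50 %)."""
    return (new - old) / old


gain_comparable = relative_gain(column_mean(overall, 1), column_mean(overall, 0))
gain_random = relative_gain(column_mean(random_pairs, 1), column_mean(random_pairs, 0))
ratio_bin = column_mean(overall, 0) / column_mean(random_pairs, 0)

print(f"tf-idf vs binary, comparable corpora: {gain_comparable:+.2f}")  # +0.66
print(f"tf-idf vs binary, random pairs:       {gain_random:+.2f}")      # -0.11
print(f"binary Dice, comparable / random:     {ratio_bin:.1f}")
```

Under this averaging, tf-idf raises the score of the comparable corpora by roughly two thirds, in line with the "about 70 %" reported in the text, while it slightly lowers the already very small scores of the random pairs.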




    In general, English articles tend to have more words than Spanish and Portuguese articles. As suggested by one of the reviewers of the article, one of the reasons for the difference in size in the case of aligned corpora is that Spanish and Portuguese entries seem to be summaries of the English ones. So, to increase comparability between an aligned pair of articles, the longer article could be shortened by removing those parts which are not present in the other language, thus obtaining a more comparable pair of articles.

    Finally, as expected, aligned corpora are significantly more comparable (i.e., higher Dice coefficient) than not-aligned corpora. On average, $Dice_{tfidf}$ increases by 80 % the comparability of aligned corpora with regard to not-aligned ones. So, considering that aligned corpora decrease only 15 % in size in relation to not-aligned corpora, we can conclude that the aligned strategy seems to be more appropriate to build comparable corpora from Wikipedia.

5.    Conclusions and Future Work
    The emergence of multilingual resources, such as Wikipedia, makes it possible to design new methods and strategies to compile corpora from the web, methods that are more efficient and powerful than the traditional ones. In particular, the semi-structured information underlying Wikipedia turns out to be very useful to build comparable corpora. In this article, we proposed two strategies to build comparable corpora from Wikipedia and a way to measure their degree of comparability. The experiments led us to conclude that corpora aligned article by article are more comparable than not-aligned corpora. Besides, they consist of two corpus parts that are balanced in terms of size. Finally, they are not much smaller than not-aligned corpora.

    In future work, we will focus on how to improve the strategies to build comparable corpora by extending coverage (more articles) without losing comparability. For this purpose, we will test and evaluate techniques to expand categories using a list of similar terms identified as hyponyms or co-hyponyms of the source category. In order to find hyponyms and co-hyponyms of a term, it will be required to build an ontology of categories using the semi-structured information of Wikipedia (Chernov et al., 2006; Ponzetto and Navigli, 2009; de Melo and Weikum, 2010). On the other hand, we will evaluate comparability in an indirect way. In particular, we will use the generated corpora on tasks requiring comparable corpora as input (e.g., bilingual lexicon extraction). The better the extracted lexicon, the more comparable the input corpus should be. Finally, we believe that our method for aligning pairs of articles could be useful for related tasks, such as the alignment of Wikipedia infoboxes in different languages (Adar, Skinner, and Weld, 2009).

References
Adafre, S.F. and M. de Rijke. 2006. Finding similar sentences across multiple languages in wikipedia. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 62–69.

Adar, Eytan, Michael Skinner, and Daniel S. Weld. 2009. Information arbitrage across multi-lingual wikipedia. In Second ACM International Conference on Web Search and Data Mining, WSDM.

Chernov, Sergey, Tereza Iofciu, Wolfgang Nejdl, and Xuan Zhou. 2006. Extracting semantic relationships between wikipedia categories. In SemWiki2006 - From Wiki to Semantics, Budva, Montenegro.

de Melo, Gerard and Gerhard Weikum. 2010. Menta: inducing multilingual taxonomies from wikipedia. In Proceedings of the 19th ACM international conference on Information and knowledge management, CIKM ’10, pages 1099–1108.

Filatova, Elena. 2009. Directions for Exploiting Asymmetries in Multilingual Wikipedia. In CLEAWS3, pages 30–37, Colorado.

Gamallo, Pablo and Isaac González. 2010. Wikipedia as a multilingual source of comparable corpora. In LREC 2010 Workshop on Building and Using Comparable Corpora, pages 19–26, Valletta, Malta.

Gamallo, Pablo and José Ramom Pichel. 2008. Learning Spanish-Galician Translation Equivalents Using a Comparable Corpus and a Bilingual Dictionary. LNCS, 4919:413–423.

Li, Bo and Eric Gaussier. 2010. Improving corpus comparability for bilingual lexicon extraction from comparable corpora. In 20th International Conference on Computational Linguistics (COLING 2010), pages 644–652.

Maia, Belinda. 2003. What Are Comparable Corpora. In Workshop on Multilingual Corpora: Linguistic Requirements and Technical Perspectives, pages 27–34, Lancaster, UK.

Ponzetto, Simone Paolo and Roberto Navigli. 2009. Large-scale taxonomy mapping for restructuring and integrating wikipedia. In Proceedings of the 21st international joint conference on Artificial intelligence, pages 2083–2088.

Pottast, M., B. Stein, and M. Anderka. 2008. A wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522–530.

Saralegui, X. and I. Alegria. 2007. Similitud entre documentos multilingües de carácter científico-técnico en un entorno Web. In Procesamiento del Lenguaje Natural, page 39.

Saralegui, X., I. San Vicente, and A. Gurrutxaga. 2008. Automatic generation of bilingual lexicons from comparable corpora in a popular science domain. In LREC 2008 Workshop on Building and Using Comparable Corpora.

Tomás, J., J. Bataller, and F. Casacuberta. 2001. Mining Wikipedia as a Parallel and Comparable Corpus. In Language Forum, volume 1, page 34.

Tyers, M.F. and J.A. Pieanaar. 2008. Extracting Bilingual Word Pairs from Wikipedia. In LREC 2008, SALTMIL Workshop, Marrakesh, Morocco.

Yu, Kun and Junichi Tsujii. 2009. Bilingual dictionary extraction from wikipedia. In Machine Translation Summit XII, Ottawa, Canada.