Proceedings of the Workshop on Iberian Cross-Language Natural Language Processing Tasks (ICL 2011)

Measuring Comparability of Multilingual Corpora Extracted from Wikipedia ∗

Midiendo la comparabilidad de corpus multilingües extraídos de la Wikipedia

Pablo Gamallo Otero
Centro de Investigación en Tecnoloxías da Información (CITIUS), Universidade de Santiago de Compostela, Galiza, Spain
pablo.gamallo@usc.es

Isaac González López
Cilenis S.L., Language Engineering Solutions, Santiago de Compostela, Galiza, Spain
isaacjgonzalez@cilenis.com

Resumen: Los corpus comparables son muy útiles en variadas tareas del procesamiento del lenguaje tales como la extracción de léxicos bilingües. Con la mejora de la calidad de los corpus comparables, podemos mejorar la calidad de la extracción. Este artículo describe algunas estrategias para construir corpus comparables a partir de la Wikipedia, y propone una medida de comparabilidad. Fueron realizados algunos experimentos utilizando la Wikipedia portuguesa, española e inglesa.

Palabras clave: Extracción de Información, Corpus Comparables, Léxicos Bilingües, Comparabilidad

Abstract: Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments were performed on Portuguese, Spanish, and English Wikipedia.

Keywords: Information Extraction, Comparable Corpora, Bilingual Lexicons, Comparability

∗ This work has been supported by Ministerio de Educación y Ciencia of Spain, within the project OntoPedia, ref: FF12010-14986.

1. Introduction

Wikipedia is a free, multilingual, and collaborative encyclopedia containing entries (called “articles”) for almost 300 languages (281 in July 2011). English is the best represented language, with about 3 million articles. However, Wikipedia is not a parallel corpus, as its articles are not translations from one language into another. Many works published in recent years have focused on its use and exploitation for multilingual tasks in natural language processing: extraction of bilingual dictionaries (Yu and Tsujii, 2009; Tyers and Pieanaar, 2008), alignment and machine translation (Adafre and de Rijke, 2006; Tomás, Bataller, and Casacuberta, 2001), and multilingual information retrieval (Pottast, Stein, and Anderka, 2008). There is also theoretical work analysing symmetries and asymmetries among the different multilingual versions of an entry/article in Wikipedia (Filatova, 2009).

In addition, multilingual articles of Wikipedia have been used as a source to build comparable corpora (Gamallo and González, 2010). The EAGLES (Expert Advisory Group on Language Engineering Standards) Guidelines (see http://www.ilc.pi.cnr.it/EAGLES96/browse.html) define a “comparable corpus” as one which selects similar texts in more than one language or variety. One of the main advantages of comparable corpora is their versatility: they can be used in many linguistic tasks (Maia, 2003), such as bilingual lexicon extraction (Gamallo and Pichel, 2008; Saralegui, Vicente, and Gurrutxaga, 2008), information retrieval, and knowledge engineering. Besides, they can also be used as training corpora to improve statistical machine learning systems, in particular when parallel corpora are scarce for a given pair of languages. Another advantage concerns their availability. In contrast with parallel corpora, which require translated texts that are not always available, comparable corpora are easily retrieved from the web. Among the different web sources of comparable corpora, Wikipedia is likely the largest repository of similar texts in many languages. We only require the appropriate computational tools to make them comparable.

Taking into account the multilingual potential of Wikipedia, our main objective is to define a method to measure the similarity (or degree of comparability) of different comparable corpora built from Wikipedia. For this purpose, we first describe some strategies to extract monolingual corpora in Portuguese, Spanish, and English from Wikipedia, making use of some categories (“Archaeology”, “Biology”, “Physics”, etc.) to make them comparable according to a specific topic. These strategies were described in detail in (Gamallo and González, 2010). Then, we propose a measure of comparability to verify whether the corpora are weakly or highly comparable. For many extraction tasks, such as bilingual lexicon extraction, using highly comparable corpora often leads to better results. Some works propose comparability measures between monolingual corpora (Li and Gaussier, 2010; Saralegui and Alegria, 2007) based on the use of existing bilingual dictionaries. However, instead of exploiting dictionaries to compute the comparability degree, we take advantage of the translation equivalents inserted in Wikipedia by means of interlanguage links.

This paper is organized as follows. Section 2 introduces two strategies to build comparable corpora from Wikipedia. Next, in Section 3, we propose some comparability measures. Then, Section 4 describes some experiments performed in order to measure the comparability between different corpora built using the strategies defined in Section 2. The last section discusses future tasks that will be implemented in order to extend and improve our tools.

2. Two Strategies to Build Wikipedia-Based Comparable Corpora

The input of our strategies is CorpusPedia¹, a friendly and easy-to-use XML structure generated from Wikipedia dump files. In CorpusPedia, all the internal links found in the text are put in a vocabulary list identified with the tag links. In the same way, all the categories (or topics) used to classify each article are inserted in the tag category. In addition, there is a tag called translations which encodes a list of interlanguage links (i.e., links to the same article in other languages) found in each article. Categories and translations are very useful features to build comparable corpora. Given these features, we developed two strategies aimed at extracting corpora with different degrees of comparability.

Not-Aligned Corpus: This strategy extracts those articles in two languages that have the same topic in common, where the topic is represented by a category and its translation (for instance, the English-Spanish pair “Archaeology-Arqueología”). It results in a not-aligned comparable corpus, consisting of texts in two languages. We call it “not-aligned” because the version of an article in one language may not have a corresponding version in the other language.

Aligned Corpus: The goal is to extract pairs of bilingual articles related by interlanguage links if at least one of the two contains a required category. It results in a comparable corpus that is aligned article by article.

In Section 4, we will measure the degree of comparability of corpora built by means of these two strategies. Before that, we will define how to measure comparability between Wikipedia-based corpora.

¹ The software to build CorpusPedia, as well as CorpusPedia files for English, French, Spanish, Portuguese, and Galician, are freely available at http://gramatica.usc.es/pln/

3. Comparability Measures

For a comparable corpus $C$ of Wikipedia articles, consisting for instance of a Portuguese part $C_p$ and a Spanish part $C_s$, a comparability coefficient can be defined on the basis of finding, for each Portuguese term $t_p$ in the vocabulary $C_p^v$ of $C_p$, its interlanguage link (or translation) in the vocabulary $C_s^v$ of $C_s$. The vocabulary of a Wikipedia corpus is the set of “internal links” found in that corpus. So, the two corpus parts, $C_p$ and $C_s$, tend to have a high degree of comparability if we find many internal links in $C_p^v$ that can be translated (by means of interlanguage links) into many internal links in $C_s^v$. Let $\mathrm{Trans}_{bin}(t_p, C_s^v)$ be a binary function which returns 1 if the translation of the Portuguese term $t_p$ is found in the Spanish vocabulary $C_s^v$, and 0 otherwise. The binary Dice coefficient, $\mathrm{Dice}_{bin}$, between two parts of a comparable corpus $C$ is then defined as:

$$\mathrm{Dice}_{bin}(C_p, C_s) = \frac{2 \sum_{t_p \in C_p^v} \mathrm{Trans}_{bin}(t_p, C_s^v)}{|C_p^v| + |C_s^v|}$$

We consider that it is not necessary to define the counterpart of the translation function (i.e., in the opposite direction), since the number of ambiguous terms is very low in Wikipedia, and most cases of ambiguity are solved with the so-called “disambiguation pages”.

To avoid a bias towards common internal links, that is, towards those links occurring in most articles, we define a specific version of the tf-idf weight for each term. In particular, $tfidf(t_p)$ is the frequency of term $t_p$ in the Portuguese part of the comparable corpus, multiplied by its inverse article frequency in the whole Portuguese Wikipedia. By taking into account the tf-idf of terms, we can define a weighted measure of comparability. Let $\mathrm{Trans}_{tfidf}(t_p, C_s^v)$ be a function which returns the smaller (min) of the two tf-idf scores $tfidf(t_p)$ and $tfidf(t_s)$, where $t_s$ is the Spanish translation of $t_p$ in the Spanish part $C_s$. The weighted Dice coefficient, $\mathrm{Dice}_{tfidf}$, between two parts of a comparable corpus $C$ is then defined as follows:

$$\mathrm{Dice}_{tfidf}(C_p, C_s) = \frac{2 \sum_{t_p \in C_p^v} \mathrm{Trans}_{tfidf}(t_p, C_s^v)}{\sum_{t_p \in C_p^v} tfidf(t_p) + \sum_{t_s \in C_s^v} tfidf(t_s)}$$

The experiments described in the next section will be performed with the two comparability measures defined here.

4. Experiments and Results

Taking CorpusPedia as input source, we performed several experiments to build different comparable corpora for three language pairs, namely Portuguese-Spanish, Portuguese-English, and Spanish-English. These corpora were built using the two strategies described in Section 2 and five domain-specific seed terms (in the three languages) considered as representative of five domain topics: “Archaeology”, “Linguistics”, “Physics”, “Biology”, and “Sport”.

Table 1 shows the (binary and tf-idf) Dice scores obtained from measuring the comparability degree of 30 different comparable corpora. For each corpus, the table also shows the size (in Mb) of its two parts. In particular, the first column gives the two languages of the corpus (pt = Portuguese, sp = Spanish, en = English) and the type of strategy (aligned or not aligned) used to build it. The second and third columns show the two Dice scores. The fourth column shows the size of the two parts of the corpus, and the last column contains the two seed terms employed to generate the corpus. In Table 2, we show the Dice scores as well as the size of nine pairs of monolingual corpora randomly generated from Wikipedia.

First, we can observe that the Dice scores in Table 1 are clearly higher than those obtained from the randomly generated monolingual pairs in Table 2. It follows that corpora built by means of our strategies (not aligned and aligned) are actually comparable. Then, we should note that in the comparable corpora of Table 1, the Dice scores based on tf-idf are about 70% higher than those based on the binary function. By contrast, in randomly generated corpora (Table 2), there are no significant differences between $\mathrm{Dice}_{bin}$ and $\mathrm{Dice}_{tfidf}$. This means that our tf-idf weighting makes the Dice similarity score higher if the two evaluated corpus parts are actually comparable.

As expected, not-aligned corpora tend to be larger than the aligned ones. However, if we just compare the smallest parts of each corpus, the differences are not very important: the smallest parts of not-aligned corpora are only 15% larger than those of aligned corpora. This is in accordance with the fact that aligned corpora are more balanced in terms of size, since no part is much larger than the other one. As far as corpus size is concerned, let us note that, on average, English parts are clearly larger than the Spanish ones, which are slightly larger than the Portuguese ones.
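
The two coefficients defined in Section 3 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used for the experiments: the function names, the dictionary that maps each term to its interlanguage link, and the logarithmic form of the inverse article frequency are all hypothetical choices, and a term with no translation in the other vocabulary is assumed to contribute 0 to the weighted sum.

```python
from math import log


def tfidf_weights(link_counts, article_freq, n_articles):
    """Weight each internal link by its frequency in one corpus part times an
    inverse article frequency over the whole Wikipedia of that language.
    The log(N / af) form is an assumption; the text does not fix it."""
    return {
        term: count * log(n_articles / article_freq[term])
        for term, count in link_counts.items()
    }


def dice_bin(vocab_p, vocab_s, translations):
    """Binary Dice: twice the number of terms in vocab_p whose translation is
    in vocab_s, divided by the sum of the two vocabulary sizes."""
    size = len(vocab_p) + len(vocab_s)
    if size == 0:
        return 0.0
    hits = sum(1 for term in vocab_p if translations.get(term) in vocab_s)
    return 2 * hits / size


def dice_tfidf(weights_p, weights_s, translations):
    """Weighted Dice: each translated term contributes the smaller of its two
    tf-idf weights; untranslated terms contribute 0 (assumption)."""
    total = sum(weights_p.values()) + sum(weights_s.values())
    if total == 0:
        return 0.0
    shared = 0.0
    for term_p, weight_p in weights_p.items():
        term_s = translations.get(term_p)
        if term_s in weights_s:
            shared += min(weight_p, weights_s[term_s])
    return 2 * shared / total


# Toy data (invented for illustration): tf-idf weights of the internal links
# of a Portuguese and a Spanish corpus part, plus interlanguage links pt -> sp.
weights_p = {"Arqueologia": 3.0, "Escavação": 1.0, "Cerâmica": 2.0, "Lisboa": 0.5}
weights_s = {"Arqueología": 2.0, "Excavación": 4.0, "Madrid": 0.5}
links = {
    "Arqueologia": "Arqueología",
    "Escavação": "Excavación",
    "Cerâmica": "Cerámica",
    "Lisboa": "Lisboa",
}
vocab_p, vocab_s = set(weights_p), set(weights_s)

print(f"{dice_bin(vocab_p, vocab_s, links):.3f}")        # 0.571 (= 4/7)
print(f"{dice_tfidf(weights_p, weights_s, links):.3f}")  # 0.462 (= 6/13)
```

In the toy data, two of the four Portuguese links have a translation in the Spanish vocabulary, so the binary score is 2·2/(4+3); the weighted score sums min(3.0, 2.0) and min(1.0, 4.0) and divides twice that sum by the total weight of both vocabularies.
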
In general, English articles tend to have more words than Spanish and Portuguese articles. As suggested by one of the reviewers of this article, one of the reasons for the difference in size in the case of aligned corpora is that Spanish and Portuguese entries seem to be summaries of the English ones. So, to increase comparability between an aligned pair of articles, the longer article could be shortened by removing those parts which are not present in the other language, thus obtaining a more comparable pair of articles.

Finally, as expected, aligned corpora are significantly more comparable (i.e., they have a higher Dice coefficient) than not-aligned corpora. On average, the $\mathrm{Dice}_{tfidf}$ score of aligned corpora is 80% higher than that of not-aligned ones. So, considering that aligned corpora are only 15% smaller than not-aligned corpora, we can conclude that the aligned strategy seems to be more appropriate to build comparable corpora from Wikipedia.

Corpora               Dice (bin)  Dice (tf-idf)  Size (in Mb)    Seed terms
pt-sp (not aligned)   .068        .086           0.6Mb/3.4Mb     Arqueologia, Arqueología
pt-en (not aligned)   .041        .067           0.6Mb/8.4Mb     Arqueologia, Archaeology
sp-en (not aligned)   .090        .140           0.4Mb/8.4Mb     Arqueología, Archaeology
pt-sp (aligned)       .179        .199           0.4Mb/0.2Mb     Arqueologia, Arqueología
pt-en (aligned)       .127        .140           0.4Mb/1.1Mb     Arqueologia, Archaeology
sp-en (aligned)       .181        .226           2.0Mb/2.9Mb     Arqueología, Archaeology
pt-sp (not aligned)   .078        .129           0.8Mb/1.7Mb     Linguística, Lingüística
pt-en (not aligned)   .054        .136           0.8Mb/5.1Mb     Linguística, Linguistics
sp-en (not aligned)   .074        .170           1.7Mb/5.1Mb     Lingüística, Linguistics
pt-sp (aligned)       .140        .214           0.6Mb/0.8Mb     Linguística, Lingüística
pt-en (aligned)       .128        .194           0.5Mb/1.2Mb     Linguística, Linguistics
sp-en (aligned)       .150        .257           0.9Mb/1.7Mb     Lingüística, Linguistics
pt-sp (not aligned)   .200        .374           4.4Mb/4.8Mb     Física, Física
pt-en (not aligned)   .123        .287           4.4Mb/12Mb      Física, Physics
sp-en (not aligned)   .270        .403           4.8Mb/12Mb      Física, Physics
pt-sp (aligned)       .237        .390           3.6Mb/4.7Mb     Física, Física
pt-en (aligned)       .178        .348           3.8Mb/11Mb      Física, Physics
sp-en (aligned)       .220        .387           3.4Mb/7.6Mb     Física, Physics
pt-sp (not aligned)   .130        .227           2.4Mb/1.5Mb     Biologia, Biología
pt-en (not aligned)   .102        .193           2.4Mb/9.4Mb     Biologia, Biology
sp-en (not aligned)   .068        .129           1.5Mb/9.4Mb     Biología, Biology
pt-sp (aligned)       .197        .328           1.6Mb/2.8Mb     Biologia, Biología
pt-en (aligned)       .186        .308           1.8Mb/4.5Mb     Biologia, Biology
sp-en (aligned)       .213        .294           0.9Mb/1.3Mb     Biología, Biology
pt-sp (not aligned)   .083        .148           11Mb/35Mb       Desporto, Deporte
pt-en (not aligned)   .026        .085           11Mb/333Mb      Desporto, Sport
sp-en (not aligned)   .047        .136           35Mb/333Mb      Deporte, Sport
pt-sp (aligned)       .175        .266           9.7Mb/15Mb      Desporto, Deporte
pt-en (aligned)       .189        .334           11Mb/20Mb       Desporto, Sport
sp-en (aligned)       .206        .290           20Mb/29Mb       Deporte, Sport
pt-sp (not aligned)   .111        .192           3.8Mb/9.3Mb     Overall
pt-en (not aligned)   .069        .153           3.8Mb/73Mb      Overall
sp-en (not aligned)   .109        .195           9.3Mb/73Mb      Overall
pt-sp (aligned)       .185        .279           3.2Mb/4.7Mb     Overall
pt-en (aligned)       .161        .264           3.5Mb/7.6Mb     Overall
sp-en (aligned)       .194        .290           6.2Mb/8.5Mb     Overall

Table 1: Dice similarity between several comparable corpora in Portuguese, Spanish, and English.

Corpora               Dice (bin)  Dice (tf-idf)  Size (in Mb)
pt-sp1 (random)       .012        .012           2.2Mb/0.9Mb
pt-en1 (random)       .003        .003           2.2Mb/0.4Mb
sp-en1 (random)       .003        .003           0.9Mb/0.4Mb
pt-sp2 (random)       .016        .014           1.5Mb/3.0Mb
pt-en2 (random)       .017        .014           1.5Mb/42Mb
sp-en2 (random)       .017        .015           3.0Mb/42Mb
pt-sp3 (random)       .008        .006           0.2Mb/0.5Mb
pt-en3 (random)       .001        .001           0.2Mb/1.4Mb
sp-en3 (random)       .005        .005           0.5Mb/1.4Mb

Table 2: Dice similarity between randomly generated pairs of monolingual corpora.

5. Conclusions and Future Work

The emergence of multilingual resources such as Wikipedia makes it possible to design new methods and strategies to compile corpora from the web, methods that are more efficient and powerful than the traditional ones. In particular, the semi-structured information underlying Wikipedia turns out to be very useful to build comparable corpora. In this article, we proposed two strategies to build comparable corpora from Wikipedia and a way to measure their degree of comparability. The experiments led us to conclude that corpora aligned article by article are more comparable than not-aligned corpora. Besides, they consist of two corpus parts that are balanced in terms of size. Finally, they are not much smaller than not-aligned corpora.

In future work, we will focus on how to improve the strategies to build comparable corpora by extending coverage (more articles) without losing comparability. For this purpose, we will test and evaluate techniques to expand categories using a list of similar terms identified as hyponyms or co-hyponyms of the source category. In order to find hyponyms and co-hyponyms of a term, it will be necessary to build an ontology of categories using the semi-structured information of Wikipedia (Chernov et al., 2006; Ponzetto and Navigli, 2009; de Melo and Weikum, 2010). On the other hand, we will evaluate comparability in an indirect way. In particular, we will use the generated corpora on tasks requiring comparable corpora as input (e.g., bilingual lexicon extraction). The better the extracted lexicon, the more comparable the input corpus should be. Finally, we believe that our method for aligning pairs of articles could be useful for related tasks, such as the alignment of Wikipedia infoboxes in different languages (Adar, Skinner, and Weld, 2009).

References

Adafre, S.F. and M. de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 62–69.

Adar, Eytan, Michael Skinner, and Daniel S. Weld. 2009. Information arbitrage across multi-lingual Wikipedia. In Second ACM International Conference on Web Search and Data Mining, WSDM.

Chernov, Sergey, Tereza Iofciu, Wolfgang Nejdl, and Xuan Zhou. 2006. Extracting semantic relationships between Wikipedia categories. In SemWiki2006 - From Wiki to Semantics, Budva, Montenegro.

de Melo, Gerard and Gerhard Weikum. 2010. Menta: inducing multilingual taxonomies from Wikipedia. In Proceedings of the 19th ACM international conference on Information and knowledge management, CIKM ’10, pages 1099–1108.

Filatova, Elena. 2009. Directions for Exploiting Asymmetries in Multilingual Wikipedia. In CLEAWS3, pages 30–37, Colorado.

Gamallo, Pablo and Isaac González. 2010. Wikipedia as a multilingual source of comparable corpora. In LREC 2010 Workshop on Building and Using Comparable Corpora, pages 19–26, Valletta, Malta.

Gamallo, Pablo and José Ramom Pichel. 2008. Learning Spanish-Galician Translation Equivalents Using a Comparable Corpus and a Bilingual Dictionary. LNCS, 4919:413–423.

Li, Bo and Eric Gaussier. 2010. Improving corpus comparability for bilingual lexicon extraction from comparable corpora. In 20th International Conference on Computational Linguistics (COLING 2010), pages 644–652.

Maia, Belinda. 2003. What Are Comparable Corpora. In Workshop on Multilingual Corpora: Linguistic Requirements and Technical Perspectives, pages 27–34, Lancaster, UK.

Ponzetto, Simone Paolo and Roberto Navigli. 2009. Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In Proceedings of the 21st international joint conference on Artificial intelligence, pages 2083–2088.

Pottast, M., B. Stein, and M. Anderka. 2008. A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522–530.

Saralegui, X. and I. Alegria. 2007. Similitud entre documentos multilíngües de carácter científico-técnico en un entorno Web. In Procesamiento del Lenguaje Natural, page 39.

Saralegui, X., I. San Vicente, and A. Gurrutxaga. 2008. Automatic generation of bilingual lexicons from comparable corpora in a popular science domain. In LREC 2008 Workshop on Building and Using Comparable Corpora.

Tomás, J., J. Bataller, and F. Casacuberta. 2001. Mining Wikipedia as a Parallel and Comparable Corpus. In Language Forum, volume 1, page 34.

Tyers, M.F. and J.A. Pieanaar. 2008. Extracting Bilingual Word Pairs from Wikipedia. In LREC 2008, SALTMIL Workshop, Marrakesh, Morocco.

Yu, Kun and Junichi Tsujii. 2009. Bilingual dictionary extraction from Wikipedia. In Machine Translation Summit XII, Ottawa, Canada.