
Dealing with Large Corpora for Ontology Population

Yuliya Korenchuk (1, 2)
yuliya.korenchuk@yahoo.fr
(1) LiLPa (Linguistique, Langues, Parole), EA 1339, Université de Strasbourg
(2) Rebuz SAS, Strasbourg

2015

201 204

1 Introduction

Multilingual ontology population from texts, i.e. the addition of new terms to an ontology, requires a suitable parallel or comparable corpus. In this paper, we aim to check whether the corpus selected for our project suits the ontology we want to populate. A corpus for ontology population should not only reflect a specific domain and contain a sufficient volume of data, as discussed in (Delpech et al., 2012), but also suit the initial ontology. Reusing an existing corpus is an efficient solution adopted in many projects (Cimiano, 2006; Bouamor, 2014; Pinnis, 2014). However, this option is less reliable in the case of a large multi-domain corpus and an ontology which might not cover all the domain concepts. The need for suitability between text corpora and ontology is expressed by (Aussenac-Gilles et al., 2006), who underlined the importance of the text type in the corpus, the ontology application, the validation criteria and the set up. The text layout can also play an important role: some projects aim to use extralinguistic information for ontology population (Kamel et al., 2013), while others concentrate on the comprehensiveness of the text (Faber et al., 2006).

In this case study, we set up an experiment checking whether a corpus is suitable for ontology population, based on the example of the large parallel (English, French and German) corpus PatTR¹ (Wäschle and Riezler, 2012) and the EcoLexicon² terminology knowledge base which we use in our project.

¹ http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/
² http://ecolexicon.ugr.es/en/index.htm

2 Resources

2.1 Corpus

The PatTR corpus is a large³ collection of parallel segments from patents organized by language pairs. These segments are classified into files according to their position in the patent structure (title, abstract, description or claims) (Wäschle and Riezler, 2012). All the language pairs have their metadata files, which contain essential information (the IPC⁴ code, the reference, etc.) for each segment. As the different domains are mixed, the metadata play a crucial role for our project.

2.2 Ontology

The terminological knowledge base EcoLexicon is developed by the LexiCon research group at the University of Granada. The resource is designed according to the principles of Frame-Based Terminology (Faber et al., 2005; Faber et al., 2006; Faber et al., 2009; Faber, 2011; Araúz et al., 2011). It contains 3,547 concepts and 19,712 terms (cf. Table 1) on the topic of the environment in seven languages, including English, German and French. The terms are connected by generic-specific, part-whole and non-hierarchical relations. The latter refer to the behaviour of the concepts in a domain-specific or a general semantic frame (Faber et al., 2009).

EcoLexicon was built using two types of resources: manually selected domain corpora (bottom-up approach) and a collection of domain thesauri, dictionaries and lexicons (top-down approach) (Faber et al., 2006). The multilingual corpora were built manually from reliable domain sources, taking into account multiple criteria (quantity, quality, simplicity and documentation). The domain-specific terminological resources were compared and evaluated in order to obtain a representative dataset.

[Table 1: number of EcoLexicon terms per language (columns: Language, FR, EN, DE); values missing from the source]

³ 22,998,357 segments for the EN-DE pair; 18,764,038 for EN-FR and 5,110,262 for FR-DE (PatTR web site)
⁴ International Patent Classification, http://www.wipo.int/classifications/ipc/en/

3 Main issues

The PatTR corpus presents two main challenges: its size and its domain diversity. In fact, we can hardly estimate the amount of data for each IPC category without analysing the metadata. Domain diversity can also be addressed through the metadata. However, a manual analysis is required: unless one is an IPC specialist, one needs to manually establish a list of categories potentially corresponding to the ontology domain. Since this step is guided by human intuition, we need to validate the choice of sub-corpora. Due to its size, the corpus is not designed to be read by a human user, so it is difficult to perform any manual check on the selected domain-specific sub-corpus. We address the validation by counting the concept occurrences in the selected sub-corpora and checking that these occurrences belong mainly to domain-specific concepts of the ontology.

4 Set up

We defined a set up based on three main steps: (i) manually matching IPC categories to select the sub-corpora, (ii) counting concept occurrences in the selected sub-corpora and (iii) performing a semi-automatic validation of the concept occurrences.

4.1 Manual selection of IPC categories

The main challenge is to select the IPC categories that are suitable for the EcoLexicon ontology population and enrichment. As the corpus is very large, we cannot use all the data to check the concept occurrences. Therefore, we started by looking up the domains defined in EcoLexicon and limited our interest to the domains enumerated in Table 2. Then we selected the IPC categories which might match the EcoLexicon ones. This manual matching is subjective and not transparent, so we need an automated validation.
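The selection of segments by IPC category could be sketched as follows. This is only an illustration under assumptions: the function names `expand_ipc` and `select_subcorpora` are hypothetical, and the metadata are assumed to be available as (segment identifier, IPC code) pairs, which may differ from the actual PatTR metadata layout.

```python
import re

def expand_ipc(notation):
    """Expand a grouped notation such as 'H01(G,M)' into ['H01G', 'H01M'].

    Plain subclass codes such as 'C02F' are returned unchanged.
    """
    match = re.fullmatch(r"([A-H]\d{2})\(([A-Z](?:,[A-Z])*)\)", notation)
    if match is None:
        return [notation]
    return [match.group(1) + letter for letter in match.group(2).split(",")]

def select_subcorpora(records, notations):
    """Group segment ids by the selected IPC subclass their code starts with.

    `records` is an iterable of (segment_id, ipc_code) pairs; this input
    shape is an assumption made for the example.
    """
    prefixes = [p for notation in notations for p in expand_ipc(notation)]
    selected = {}
    for segment_id, ipc_code in records:
        code = ipc_code.replace(" ", "").upper()
        for prefix in prefixes:
            if code.startswith(prefix):
                selected.setdefault(prefix, []).append(segment_id)
                break
    return selected

# Invented sample records, for illustration only.
records = [("s1", "C02F 1/00"), ("s2", "A61K 31/00"),
           ("s3", "H02M 7/48"), ("s4", "c02f 3/12")]
print(select_subcorpora(records, ["C02F", "B09C", "H02(J,M)"]))
```

Segments whose IPC code matches none of the selected subclasses (here `s2`) are simply left out of the sub-corpora.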

Table 2

IPC                                                 EcoLexicon
C02F Treatment of water, waste water, sewage,       3.2.5.1 Waste treatment and
or sludge                                           3.2.5.2 Water treatment
B09C Disposal of solid waste; reclamation of        3.2.5.4 Soil quality management
contaminated soil
H01(G,M) Basic electric elements,                   3.5 Energy engineering
C01G Inorganic chemistry,
H02(J,M) Generation, conversion, or distribution
of electric power,
C25(B,C,D,F) Electrolytic or electrophoretic
processes; apparatus therefor

We counted the occurrences of the concept labels to validate the selected sub-corpora. This approach has been used to evaluate the ontology coverage regarding a domain corpus (Oostdijk et al., 2010). To do so, we lemmatized the corpus with the TreeTagger tool (Schmid, 1994) and converted both the corpus and the concept labels to lowercase. This caused some problems, because some labels lost their domain specificity (for example, Be@en for beryllium became be and was found in nearly every English sentence). We therefore limited the labels to words longer than 2 characters.
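The counting step can be illustrated with a short sketch. It assumes the corpus has already been lemmatized (for instance with TreeTagger) and is available as a list of lemmas; the helper names, the handling of multi-word labels and the choice to count overlapping matches separately are assumptions of this example, not a description of the exact procedure used.

```python
from collections import Counter

def normalise_labels(labels, min_len=3):
    """Lowercase labels and drop those shorter than `min_len` characters.

    Each kept label is stored as a tuple of words so that multi-word
    labels such as 'surface water' can be matched as n-grams.
    """
    kept = set()
    for label in labels:
        label = label.lower().strip()
        if len(label) >= min_len:
            kept.add(tuple(label.split()))
    return kept

def count_occurrences(lemmas, labels):
    """Count how often each label occurs in a sequence of lemmas."""
    lemmas = [lemma.lower() for lemma in lemmas]
    max_n = max((len(label) for label in labels), default=0)
    counts = Counter()
    for i in range(len(lemmas)):
        for n in range(1, max_n + 1):
            gram = tuple(lemmas[i:i + n])
            # Near the end of the sequence the slice may be shorter than n.
            if len(gram) == n and gram in labels:
                counts[" ".join(gram)] += 1
    return counts

# Invented example: 'Be' is filtered out by the length limit.
labels = normalise_labels(["Be", "water", "Surface water", "waste"])
lemmas = "the surface water be treat with waste water".split()
print(count_occurrences(lemmas, labels))
```

In this toy input, the lemma be is no longer counted, while water is counted both on its own and as part of surface water.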

We calculated the percentage of concept occurrences in the total number of tokens in each domain sub-corpus. For example, the English sub-corpus for the C02F category has 1,339,946 occurrences for 7,806,687 tokens, so the concept occurrences represent 17% of the tokens (the highest rate in our data collection). The least covered sub-corpus is the French H02M one, with 1% of occurrences (55,803 occurrences for 4,359,434 tokens).
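The two percentages quoted above can be reproduced with a few lines of code. The function name `coverage` is hypothetical; the counts are the ones given in the text.

```python
def coverage(occurrences, tokens):
    """Return concept occurrences as a percentage of the tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return 100.0 * occurrences / tokens

# Figures quoted in the text for two sub-corpora.
subcorpora = {
    ("C02F", "EN"): (1_339_946, 7_806_687),
    ("H02M", "FR"): (55_803, 4_359_434),
}

# Rank the sub-corpora from the most to the least covered.
ranked = sorted(subcorpora.items(),
                key=lambda item: coverage(*item[1]),
                reverse=True)
for (ipc, lang), (occ, tok) in ranked:
    print(f"{ipc} {lang}: {coverage(occ, tok):.2f}%")
```

With these counts the ratios are approximately 17.16% and 1.28%, consistent with the rounded values of 17% and 1%.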

⁵ As the category titles are too complex, Table 2 gives the generic IPC descriptions (e.g. Basic electric elements is the title of the whole H01 category).

[Table 4: concept occurrences per IPC category and language (columns: IPC, Lang, Occurrences, %); values missing from the source]

Our hypothesis is that the sub-corpora containing more ontology concepts are more likely to be efficient for ontology population, so we will start the ontology population from the most covered sub-corpora.

The disparity in coverage among languages observed in Table 4 (17.16% maximum for English, 3.67% for German and 3.60% for French) can be explained by the difference in the number of EcoLexicon labels for these languages (cf. Table 1). As we use a parallel corpus, we will base the suitability analysis on the occurrence percentages for English and try to find the term translations for the other languages in the corpus.

4.3 Semi-automatic validation

The purpose of this step is to see which concepts appear in the corpus and to validate that their meaning in the corpus matches the one described in the ontology.

We noticed that part of the occurrences belong to quite general concepts that are close to the definition of transdisciplinary vocabulary (Tutin, 2007; Jacquey et al., 2013), such as method, device, process, which is due to the fact that the corpus contains segments from patents.

We want to be sure that the total occurrence count is not made up only of these concepts. To do so, we defined a set of five recurrent concepts and their labels in the three languages (cf. Table 3) in order to calculate their percentage in the total occurrence count.

The combination of the concept occurrence and the general concept percentages (cf. Table 4) gives a better idea of the best sub-corpora to be used in the next steps. The highest percentage of general concepts is 19% (C25F for English), which means that almost every fifth occurrence is a general concept. Without final results of the ontology population and enrichment, we cannot judge whether this proportion is too high. The maximal percentages of the general concepts for German and French are respectively 9.14% and 1.19% of the concept occurrences.
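The share of general concepts in the occurrence count can be computed as in the following sketch. The function name is hypothetical, the general labels are limited to the three examples named in the text (method, device, process) rather than the full set of five concepts, and the counts are invented for illustration.

```python
def general_share(counts, general_labels):
    """Return the percentage of occurrences that belong to general concepts.

    `counts` maps each matched label to its number of occurrences.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    general = sum(n for label, n in counts.items() if label in general_labels)
    return 100.0 * general / total

# Invented counts; only the labels come from the text.
counts = {"method": 12, "device": 5, "process": 2, "waste": 50, "biomass": 31}
print(general_share(counts, {"method", "device", "process"}))
```

A share close to 20%, as in this invented example, corresponds to roughly one general concept in every five occurrences.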

We also manually checked 5 random segments for 10 randomly selected terms, for example surface water, waste, biomass, etc., to be sure that they preserve their terminological meaning. This quick validation helped us to confirm that the selected sub-corpora can be used for future treatments.
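Drawing the sample for such a manual check could look like the sketch below. It assumes an index from each matched term to the list of segments containing it; the index, the function name and the fixed seed are assumptions of this example.

```python
import random

def sample_for_review(segments_by_term, n_terms=10, n_segments=5, seed=0):
    """Pick random terms and, for each, random segments to read manually.

    `segments_by_term` maps a term to the list of segments containing it.
    A fixed seed keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    terms = sorted(segments_by_term)
    chosen = rng.sample(terms, min(n_terms, len(terms)))
    return {
        term: rng.sample(segments_by_term[term],
                         min(n_segments, len(segments_by_term[term])))
        for term in chosen
    }

# Invented index with three terms, for illustration only.
index = {
    "surface water": [f"segment {i}" for i in range(20)],
    "waste": [f"segment {i}" for i in range(8)],
    "biomass": ["segment 1", "segment 2"],
}
for term, segments in sample_for_review(index).items():
    print(term, len(segments))
```

Terms with fewer segments than requested (here biomass) are returned with all of their segments.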

Regarding the meaning of the matched terms, the patent titles and abstracts preserve the terminological sense, while the claims part has a more rigid style and uses some specific expressions, like method as in claim X, product according to one of the claims X, a process along the line of claim, etc. At the same time, domain-specific terms contained in claims can still be used as such.

5 Conclusion

The described set up can save time while using a large corpus for the ontology population task. The combined use of metadata and occurrence counts shows the best sub-corpora that we should keep for further treatment. The semi-automatic validation of occurrences is a useful step which helps to ensure that we know the data used in the project.

Acknowledgments

The research project is supported by the CIFRE grant 2013/0744 delivered by the ANRT. We are grateful to the LexiCon research group of the University of Granada for the access to the EcoLexicon ontology.

References

[Araúz et al., 2011] Pilar Araúz, Arianne Reimerink, and Pamela Faber. 2011. Environmental knowledge in EcoLexicon. In Computational Linguistics-Applications Conference, number 14, pages 9–16.

[Aussenac-Gilles et al., 2006] Nathalie Aussenac-Gilles, Anne Condamines, and Florence Sèdes. 2006. Evolution et maintenance des ressources termino-ontologique: une question à approfondir. Information interaction intelligence, HS.

[Bouamor, 2014] Dhouha Bouamor. 2014. Constitution de ressources linguistiques multilingues à partir de corpus de textes parallèles et comparables. Ph.D. thesis, Université Paris Sud - Paris XI.

[Cimiano, 2006] Philipp Cimiano. 2006. Ontology Learning and Population from Text: Algorithms, Evaluation and Application. Springer US.

[Delpech et al., 2012] Estelle Delpech, Béatrice Daille, Emmanuel Morin, and Claire Lemaire. 2012. Extraction of domain-specific bilingual lexicon from comparable corpora : compositional translation and ranking. In COLING, volume 3.

[Faber et al., 2005] Pamela Faber, Carlos Márquez Linares, and Miguel Vega Expósito. 2005. Framing Terminology: A Process Oriented Approach. Meta: journal des traducteurs, 50(4):1492–1421.

[Faber et al., 2006] Pamela Faber, Silvia Montero Martínez, María Rosa Castro Prieto, José Senso Ruiz, Juan Antonio Prieto Velasco, Pilar León Araúz, Carlos Márquez Linares, and Miguel Vega Expósito. 2006. Process-oriented terminology management in the domain of Coastal Engineering. Terminology, 12(2):189–213.

[Faber et al., 2009] Pamela Faber, Pilar Leon, and Juan Antonio Prieto. 2009. Semantic Relations, Dynamicity, And Terminological Knowledge Bases. Current Issues in Language Studies, 1:1–23.

[Faber, 2011] Pamela Faber. 2011. The dynamics of specialized knowledge representation: Simulational reconstruction or the perception-action interface. Terminology, 17(1):9–29.

[Jacquey et al., 2013] Evelyne Jacquey, Agnès Tutin, Laurence Kister, Marie-paule Jacques, Sylvain Hatier, and Sandrine Ollinger. 2013. Filtrage terminologique par le lexique transdisciplinaire scientifique : une expérimentation en sciences humaines. In Terminologie et Intelligence Artificielle (TIA), Paris.

[Kamel et al., 2013] Mouna Kamel, Nathalie Aussenac-Gilles, Davide Buscaldi, and Catherine Comparot. 2013. A semi-automatic approach for building ontologies from a collection of structured web documents. In K-Cap'13 Proceedings of the seventh international conference on Knowledge capture, pages 139–140.

[Oostdijk et al., 2010] Nelleke Oostdijk, Suzan Verberne, and Cornelis Koster. 2010. Constructing a broad-coverage lexicon for text mining in the patent domain. In LREC, pages 2292–2299.

[Pinnis, 2014] Marcis Pinnis. 2014. Bootstrapping of a Multilingual Transliteration Dictionary for European Languages. In Proceedings of the Sixth International Conference Baltic HLT.

[Schmid, 1994] Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester.

[Tutin, 2007] Agnès Tutin. 2007. Autour du lexique et de la phraséologie des écrits scientifiques. Revue française de linguistique appliquée, XII:5–14.

[Wäschle and Riezler, 2012] Katharina Wäschle and Stefan Riezler. 2012. Structural and Topical Dimensions in Multi-Task Patent Translation. In The 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pages 818–828, Avignon, France.