=Paper=
{{Paper
|id=Vol-1495/paper_26
|storemode=property
|title=Dealing with Large Corpora for Ontology Population
|pdfUrl=https://ceur-ws.org/Vol-1495/paper_26.pdf
|volume=Vol-1495
|dblpUrl=https://dblp.org/rec/conf/tia/Yuliya15
}}
==Dealing with Large Corpora for Ontology Population==
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
Yuliya Korenchuk (1,2)
(1) LiLPa (Linguistique, Langues, Parole), EA 1339, Université de Strasbourg
(2) Rebuz SAS, Strasbourg
yuliya.korenchuk@yahoo.fr
1 Introduction
Multilingual ontology population from texts, i.e. the addition of new terms to an ontology, requires a suitable parallel or comparable corpus. In this paper, we aim to check whether the corpus selected for our project suits the ontology we want to populate. The corpus for ontology population should not only reflect a specific domain and have a sufficient volume of data, as discussed in (Delpech et al., 2012), but also suit the initial ontology. Using an existing corpus can be an efficient solution, and many projects have done so (Cimiano, 2006; Bouamor, 2014; Pinnis, 2014). However, this option is less reliable in the case of a large multi-domain corpus and an ontology which might not cover all the domain concepts. The need for suitability between text corpora and ontology is expressed by (Aussenac-Gilles et al., 2006), who underlined the importance of the text type in the corpus, the ontology application, the validation criteria and the setup. The text layout can also play an important role: some projects aim to use extralinguistic information for ontology population (Kamel et al., 2013), while others concentrate on the comprehensiveness of the text (Faber et al., 2006).
In this case study, we set up an experiment checking whether a corpus is suitable for ontology population, based on the example of the large parallel (English, French and German) corpus PatTR (Wäschle and Riezler, 2012; http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/) and the EcoLexicon terminology knowledge base (http://ecolexicon.ugr.es/en/index.htm), which we use in our project.

2 Resources
2.1 Corpus
The PatTR corpus is a large collection of parallel segments from patents organized by language pairs: 22,998,357 segments for the EN-DE pair, 18,764,038 for EN-FR and 5,110,262 for FR-DE (PatTR web site). These segments are classified into files according to their position in the patent structure (title, abstract, description or claims) (Wäschle and Riezler, 2012). Each language pair has metadata files which contain essential information for each segment (the IPC code, i.e. the International Patent Classification code, http://www.wipo.int/classifications/ipc/en/, the reference, etc.). As the different domains are mixed, the metadata play a crucial role for our project.

2.2 Ontology
The terminological knowledge base EcoLexicon is developed by the LexiCon research group at the University of Granada. The resource is designed according to the principles of Frame-Based Terminology (Faber et al., 2005; Faber et al., 2006; Faber et al., 2009; Faber, 2011; Araúz et al., 2011). It contains 3,547 concepts and 19,712 terms (cf. Table 1) on the topic of environment in seven languages, including English, German and French. The terms are connected by generic-specific, part-whole and non-hierarchical relations. The latter refer to the behaviour of the concepts in a domain-specific or a general semantic frame (Faber et al., 2009).

Language | Nb of terms
FR | 640
EN | 3079
DE | 3713
Table 1: Number of terms by language in EcoLexicon

EcoLexicon was built using two types of resources: manually selected domain corpora (bottom-up approach) and a collection of domain thesauri, dictionaries and lexicons (top-down approach) (Faber et al., 2006). The multilingual corpora were built manually from reliable domain sources, taking into account multiple criteria (quantity, quality, simplicity and documentation). The domain-specific terminological resources were compared and evaluated in order to obtain a representative dataset.

3 Main issues
The PatTR corpus presents two main challenges: its size and its domain diversity. We can hardly estimate the amount of data for each IPC category without analysing the metadata. Domain diversity can also be addressed through the metadata. However, a manual analysis is required: unless one is an IPC specialist, one needs to manually establish a list of categories potentially corresponding to the ontology domain. Since this intervention is guided by human intuition, we need to validate the choice of sub-corpora. Due to its size, the corpus is not designed to be read by a human user, so it is difficult to perform any manual check on the selected domain-specific sub-corpus. We address the validation by counting the concept occurrences in the selected sub-corpora and checking that these occurrences belong mainly to domain-specific concepts of the ontology.

4 Setup
We defined a setup based on three main steps: (i) manually matching IPC categories to select the sub-corpora, (ii) counting concept occurrences in the selected sub-corpora and (iii) performing a semi-automatic validation of the concept occurrences.

4.1 Manual selection of IPC categories
The main challenge is to select the IPC categories that are suitable for the EcoLexicon ontology population and enrichment. As the corpus is very large, we cannot take all the data to check the concept occurrences. Therefore we started by looking up the domains defined in EcoLexicon and limited our interest to the domains enumerated in Table 2. Then we selected the IPC categories which might suit the EcoLexicon ones. This manual correlation is subjective and not transparent, so we need an automated validation.

IPC | EcoLexicon
C02F Treatment of water, waste water, sewage, or sludge | 3.2.5.1 Waste treatment and 3.2.5.2 Water treatment
B09C Disposal of solid waste; reclamation of contaminated soil | 3.2.5.4 Soil quality management
H01(G,M) Basic electric elements, C01G Inorganic chemistry, H02(J,M) Generation, conversion, or distribution of electric power, C25(B,C,D,F) Electrolytic or electrophoretic processes; apparatus therefor | 3.5 Energy engineering
Table 2: Manual IPC categories selection
Note to Table 2: as the category titles are too complex, the table uses the generic IPC descriptions (e.g. Basic electric elements is the title of the whole H01 category).

4.2 Occurrences count
We counted the occurrences of the concept labels to validate the selected sub-corpora. This approach is also used to evaluate ontology coverage with respect to a domain corpus (Oostdijk et al., 2010). To do so, we lemmatized the corpus with the TreeTagger tool (Schmid, 1994) and transformed both the corpus and the concept labels to lowercase. This caused some problems, because some labels lost their domain specificity (for example, Be@en for beryllium became be and was found in nearly every English sentence). So we had to limit the labels to words longer than 2 characters.
We calculated the percentage of concept occurrences in the total number of tokens in the domain sub-corpus. For example, the English sub-corpus for the C02F category has 1,339,946 occurrences for 7,806,687 tokens, so the concept occurrences represent 17% of the tokens (the highest rate in our data collection). The least covered sub-corpus is the French H02M one with about 1% of occurrences (55,803 occurrences for 4,359,434 tokens).
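As an illustration of the selection by IPC category (Section 4.1) and of the occurrence count (Section 4.2), the two steps could be sketched as follows. This is a minimal sketch under stated assumptions, not the code used in the project: all function names are hypothetical, a segment is assumed to be an (IPC code, list of lemmas) pair obtained from the metadata and a prior lemmatization run, and only single-word labels are matched.

```python
# Illustrative sketch only; every name below is hypothetical.

# IPC prefixes taken from the manual selection in Table 2.
SELECTED_IPC = ("C02F", "B09C", "H01G", "H01M", "C01G",
                "H02J", "H02M", "C25B", "C25C", "C25D", "C25F")


def select_segments(segments, ipc_prefixes=SELECTED_IPC):
    """Keep the (ipc_code, lemmas) pairs whose IPC code starts with a selected prefix."""
    return [(ipc, lemmas) for ipc, lemmas in segments
            if ipc.startswith(ipc_prefixes)]


def usable_labels(labels, min_length=3):
    """Lowercase the labels and drop those of 2 characters or fewer (e.g. 'Be')."""
    return {label.lower() for label in labels if len(label) >= min_length}


def count_occurrences(lemmas, labels):
    """Return (occurrences, tokens) for one sub-corpus given as a list of lemmas."""
    lowered = [lemma.lower() for lemma in lemmas]
    hits = sum(1 for lemma in lowered if lemma in labels)
    return hits, len(lowered)


def coverage_percentage(occurrences, tokens):
    """Share of tokens that are concept-label occurrences, in percent."""
    return 100.0 * occurrences / tokens if tokens else 0.0
```

With the figures reported in Section 4.2, coverage_percentage(1339946, 7806687) gives about 17.16 and coverage_percentage(55803, 4359434) about 1.28, which correspond to the rounded 17% and 1%.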
Our hypothesis is that the sub-corpora containing more ontology concepts are more likely to be efficient for ontology population, so we will start the ontology population from the most covered sub-corpora.
The disparity in coverage among languages observed in Table 4 (17.16% maximum for English, 3.67% for German and 3.60% for French) can be explained by the difference in the number of EcoLexicon labels for these languages (cf. Table 1). As we use a parallel corpus, we will base the suitability analysis on the occurrence percentages for English and try to find the term translations for the other languages from the corpus.

4.3 Semi-automatic validation
The purpose of this step is to see which concepts appear in the corpus and to validate that their meaning in the corpus matches the one described in the ontology.
We noticed that part of the occurrences belongs to fairly general concepts that are close to the definition of transdisciplinary vocabulary (Tutin, 2007; Jacquey et al., 2013), such as method, device or process. This is due to the fact that the corpus contains segments from patents. We want to be sure that the total occurrence count is not made up only of these concepts. To do so, we defined a set of five recurrent concepts and their labels in the three languages (cf. Table 3) in order to calculate their percentage in the total occurrence count.

Concept | Labels
Method | method@en, méthode@fr, Methode@de
Process | process@en, processus@fr, Prozess@de
Treatment | treatment@en, traitement@fr, Verarbeitung@de, Behandlung@de
Device | device@en, outil@fr, Mechanismus@de
System | system@en, système@fr, System@de
Table 3: Manual concepts and labels selection

The combination of the concept occurrence and the general concept percentages (cf. Table 4) gives a better idea of the best sub-corpora to be used in the next steps. The highest percentage of general concepts is 19% (C25F for English), which means that almost every fifth occurrence belongs to a general concept. Without final results of the ontology population and enrichment, we cannot judge whether this proportion is too high. The maximal percentages of the general concepts for German and French are respectively 9.14% and 1.19% of the concept occurrences.

IPC | Lang | Occurrences % | General concepts %
C02F | en | 17.16 | 11.86
B09C | en | 16.17 | 13.40
C25C | en | 12.54 | 11.91
C25D | en | 11.66 | 14.88
C01G | en | 11.57 | 14.72
C25B | en | 11.18 | 13.43
C25F | en | 11.04 | 19.00
H01M | en | 10.32 | 10.73
H02J | en | 9.57 | 15.49
H01G | en | 8.15 | 12.12
H02M | en | 8.08 | 9.54
B09C | de | 3.67 | 6.66
B09C | fr | 3.60 | 0.99
C02F | de | 3.36 | 7.29
C25C | fr | 3.33 | 0.88
C25C | de | 3.12 | 2.66
C01G | fr | 3.10 | 0.46
C25B | fr | 2.93 | 0.79
H01M | fr | 2.69 | 0.55
C25D | fr | 2.63 | 0.98
C01G | de | 2.57 | 2.91
C25F | de | 2.55 | 6.75
C25D | de | 2.48 | 4.41
C25B | de | 2.25 | 5.31
C25F | fr | 2.18 | 1.09
H01G | de | 1.94 | 2.70
H01M | de | 1.86 | 4.17
H01G | fr | 1.79 | 1.13
H02J | de | 1.68 | 9.14
C02F | fr | 1.63 | 1.19
H02J | fr | 1.57 | 0.94
H02M | de | 1.39 | 4.03
H02M | fr | 1.28 | 0.45
Table 4: Concept occurrences and general concepts percentages

We also manually checked 5 random segments for each of 10 randomly selected terms (for example surface water, waste, biomass, etc.) to be sure that they preserve their terminological meaning. This quick validation helped us to confirm that the selected sub-corpora can be used for future treatments.
Regarding the meaning of the matched terms, the patent titles and abstracts preserve the terminological sense, while the claims part has a more rigid style and uses some specific expressions, like
method as in claim X, product according to one of the claims X, a process along the line of claim, etc. At the same time, domain-specific terms contained in claims can still be used as such.

5 Conclusion
The described setup can save time when using a large corpus for the ontology population task. The combined use of metadata and occurrence counts shows the best sub-corpora that we should keep for further treatment. The semi-automatic validation of occurrences is a useful step which helps to ensure that we know the data used in the project.

Acknowledgments
The research project is supported by the CIFRE grant 2013/0744 delivered by the ANRT. We are grateful to the LexiCon research group (http://lexicon.ugr.es/) of the University of Granada for the access to the EcoLexicon ontology.

References
[Araúz et al., 2011] Pilar Araúz, Arianne Reimerink, and Pamela Faber. 2011. Environmental knowledge in EcoLexicon. In Computational Linguistics-Applications Conference, number 14, pages 9–16.
[Aussenac-Gilles et al., 2006] Nathalie Aussenac-Gilles, Anne Condamines, and Florence Sèdes. 2006. Evolution et maintenance des ressources termino-ontologique: une question à approfondir. Information interaction intelligence, HS.
[Bouamor, 2014] Dhouha Bouamor. 2014. Constitution de ressources linguistiques multilingues à partir de corpus de textes parallèles et comparables. Ph.D. thesis, Université Paris Sud - Paris XI.
[Cimiano, 2006] Philipp Cimiano. 2006. Ontology Learning and Population from Text: Algorithms, Evaluation and Application. Springer US.
[Delpech et al., 2012] Estelle Delpech, Béatrice Daille, Emmanuel Morin, and Claire Lemaire. 2012. Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking. In COLING, volume 3.
[Faber et al., 2005] Pamela Faber, Carlos Márquez Linares, and Miguel Vega Expósito. 2005. Framing Terminology: A Process Oriented Approach. Meta: journal des traducteurs, 50(4):1492–1421.
[Faber et al., 2006] Pamela Faber, Silvia Montero Martínez, María Rosa Castro Prieto, José Senso Ruiz, Juan Antonio Prieto Velasco, Pilar León Arauz, Carlos Márquez Linares, and Miguel Vega Expósito. 2006. Process-oriented terminology management in the domain of Coastal Engineering. Terminology, 12(2):189–213.
[Faber et al., 2009] Pamela Faber, Pilar Leon, and Juan Antonio Prieto. 2009. Semantic Relations, Dynamicity, And Terminological Knowledge Bases. Current Issues in Language Studies, 1:1–23.
[Faber, 2011] Pamela Faber. 2011. The dynamics of specialized knowledge representation: Simulational reconstruction or the perception-action interface. Terminology, 17(1):9–29.
[Jacquey et al., 2013] Evelyne Jacquey, Agnès Tutin, Laurence Kister, Marie-Paule Jacques, Sylvain Hatier, and Sandrine Ollinger. 2013. Filtrage terminologique par le lexique transdisciplinaire scientifique : une expérimentation en sciences humaines. In Terminologie et Intelligence Artificielle (TIA), Paris.
[Kamel et al., 2013] Mouna Kamel, Nathalie Aussenac-Gilles, Davide Buscaldi, and Catherine Comparot. 2013. A semi-automatic approach for building ontologies from a collection of structured web documents. In K-Cap'13 Proceedings of the seventh international conference on Knowledge capture, pages 139–140.
[Oostdijk et al., 2010] Nelleke Oostdijk, Suzan Verberne, and Cornelis Koster. 2010. Constructing a broad-coverage lexicon for text mining in the patent domain. In LREC, pages 2292–2299.
[Pinnis, 2014] Marcis Pinnis. 2014. Bootstrapping of a Multilingual Transliteration Dictionary for European Languages. In Proceedings of the Sixth International Conference Baltic HLT.
[Schmid, 1994] Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester.
[Tutin, 2007] Agnès Tutin. 2007. Autour du lexique et de la phraséologie des écrits scientifiques. Revue française de linguistique appliquée, XII:5–14.
[Wäschle and Riezler, 2012] Katharina Wäschle and Stefan Riezler. 2012. Structural and Topical Dimensions in Multi-Task Patent Translation. In The 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pages 818–828, Avignon, France.
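
Supplementary sketch for Section 4.3. The general-concept check and the ordering of sub-corpora by coverage could be expressed as below. This is only an illustration under assumptions: the function names are hypothetical, the per-label counts are assumed to be available as a mapping from lowercased label to number of occurrences, and only the English labels of the five general concepts are listed.

```python
# Illustrative sketch only; every name below is hypothetical.

# English labels of the five general concepts selected manually in Section 4.3.
GENERAL_LABELS_EN = {"method", "process", "treatment", "device", "system"}


def general_share(label_counts, general_labels=GENERAL_LABELS_EN):
    """Percentage of concept occurrences that belong to general concepts."""
    total = sum(label_counts.values())
    if total == 0:
        return 0.0
    general = sum(count for label, count in label_counts.items()
                  if label in general_labels)
    return 100.0 * general / total


def rank_subcorpora(rows):
    """Sort (ipc, lang, occurrence_pct, general_pct) rows by coverage, highest first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

For instance, rank_subcorpora applied to the English rows for C25F (11.04, 19.00), C02F (17.16, 11.86) and B09C (16.17, 13.40) places C02F first, which is the sub-corpus the population would start from.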