=Paper= {{Paper |id=Vol-1495/paper_26 |storemode=property |title=Dealing with Large Corpora for Ontology Population |pdfUrl=https://ceur-ws.org/Vol-1495/paper_26.pdf |volume=Vol-1495 |dblpUrl=https://dblp.org/rec/conf/tia/Yuliya15 }} ==Dealing with Large Corpora for Ontology Population== https://ceur-ws.org/Vol-1495/paper_26.pdf
                  Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)

                Dealing with Large Corpora for Ontology Population


                                      Yuliya Korenchuk (1,2)
            (1) LiLPa (Linguistique, Langues, Parole), EA 1339, Université de Strasbourg
                                     (2) Rebuz SAS, Strasbourg
                              yuliya.korenchuk@yahoo.fr




1     Introduction

Multilingual ontology population from texts, i.e. the addition of new terms
to an ontology, requires a suitable parallel or comparable corpus. In this
paper, we aim to check whether the corpus selected for our project suits the
ontology we want to populate. The corpus for ontology population should not
only reflect a specific domain and have a sufficient volume of data, as
discussed in (Delpech et al., 2012), but also suit the initial ontology.
Using an existing corpus can be an efficient solution, and it is used in many
projects (Cimiano, 2006; Bouamor, 2014; Pinnis, 2014). However, this option is
less reliable in the case of a large multi-domain corpus and an ontology which
might not cover all the domain concepts. The need for suitability between
text corpora and ontology is expressed by (Aussenac-Gilles et al., 2006), who
underlined the importance of the text type in the corpus, the ontology
application, the validation criteria and the set up. The text layout can also
play an important role: some projects aim to use extralinguistic information
for ontology population (Kamel et al., 2013), while others concentrate on the
comprehensiveness of the text (Faber et al., 2006).
   In this case study, we set up an experiment checking whether a corpus is
suitable for ontology population, based on the example of the large parallel
(English, French and German) corpus PatTR [1] (Wäschle and Riezler, 2012) and
the EcoLexicon [2] terminology knowledge base which we use in our project.

    [1] http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/
    [2] http://ecolexicon.ugr.es/en/index.htm

2     Resources

2.1    Corpus

The PatTR corpus is a large [3] collection of parallel segments from patents
organized by language pairs. These segments are classified into files
according to their position in the patent structure (title, abstract,
description or claims) (Wäschle and Riezler, 2012). Each language pair has its
own metadata files, which contain essential information (the IPC [4] code, the
reference, etc.) for each segment. As the different domains are mixed, the
metadata play a crucial role for our project.

    [3] 22,998,357 segments for the EN-DE pair; 18,764,038 for EN-FR and
        5,110,262 for FR-DE (PatTR web site)
    [4] International Patent Classification,
        http://www.wipo.int/classifications/ipc/en/

2.2    Ontology

The terminological knowledge base EcoLexicon is developed by the LexiCon
research group at the University of Granada. The resource is designed
according to the principles of Frame Based Terminology (Faber et al., 2005;
Faber et al., 2006; Faber et al., 2009; Faber, 2011; Araúz et al., 2011). It
contains 3,547 concepts and 19,712 terms (cf. Table 1) on the topic of
environment in seven languages, including English, German and French. The
terms are connected by generic-specific, part-whole and non-hierarchical
relations. The latter refer to the behaviour of the concepts in a
domain-specific or a general semantic frame (Faber et al., 2009).

               Language     Nb of terms
                  FR            640
                  EN           3079
                  DE           3713

Table 1: Number of terms by language in EcoLexicon

   EcoLexicon was built using two types of resources: manually selected
domain corpora (bottom-up approach) and a collection of domain thesauri,
dictionaries and lexicons (top-down approach) (Faber et al., 2006). The
multilingual corpora were built manually from reliable domain sources, taking
into account multiple criteria (quantity, quality, simplicity and
documentation). The domain-specific terminological resources were compared
and evaluated in order to obtain a representative dataset.

3     Main issues

The PatTR corpus presents two main challenges: its size and its domain
diversity. Indeed, we can hardly estimate the amount of data for each IPC
category without analysing the metadata. Domain diversity can also be
addressed through the metadata. However, a manual analysis is required:
unless one is an IPC specialist, one needs to manually establish a list of
categories potentially corresponding to the ontology domain. Since this
intervention is guided by human intuition, we need to validate the choice of
sub-corpora. Due to its size, the corpus is not designed to be read by a
human user, so it is difficult to perform any manual check on the selected
domain-specific sub-corpus. We address the validation by counting the concept
occurrences in the selected sub-corpora and checking that these occurrences
belong mainly to domain-specific concepts of the ontology.

4     Set up

We defined a set up based on three main steps: (i) manually matching IPC
categories to select the sub-corpora, (ii) counting concept occurrences in
the selected sub-corpora and (iii) performing a semi-automatic validation of
the concept occurrences.

4.1    Manual selection of IPC categories

The main challenge is to select the IPC categories that are suitable for the
EcoLexicon ontology population and enrichment. As the corpus is very large,
we cannot take all the data to check the concept occurrences. Therefore we
started by looking up the domains defined in EcoLexicon and limited our
interest to the domains enumerated in Table 2. Then we selected the IPC
categories which might suit the EcoLexicon ones. As one can notice, this
manual matching is subjective and not transparent, so we need an automated
validation.

 IPC                                          EcoLexicon
 C02F Treatment of water, waste water,        3.2.5.1 Waste treatment and
 sewage, or sludge                            3.2.5.2 Water treatment
 B09C Disposal of solid waste;                3.2.5.4 Soil quality management
 reclamation of contaminated soil
 H01(G,M) Basic electric elements,            3.5 Energy engineering
 C01G Inorganic chemistry,
 H02(J,M) Generation, conversion, or
 distribution of electric power,
 C25(B,C,D,F) Electrolytic or
 electrophoretic processes; apparatus
 therefor

Table 2: Manual IPC categories selection [5]

    [5] As the category titles are too complex, we took in this table the
        generic IPC descriptions (e.g. Basic electric elements is the title
        of the whole H01 category)

4.2    Occurrences count

We counted the occurrences of the concept labels to validate the selected
sub-corpora. This approach is used to evaluate ontology coverage with respect
to a domain corpus (Oostdijk et al., 2010). To do so, we lemmatized the
corpus with the TreeTagger tool (Schmid, 1994) and transformed both the
corpus and the concept labels to lowercase. This caused some problems,
because some labels lost their domain specificity (for example, Be@en for
beryllium became be and was found in nearly every English sentence). So we
had to limit the labels to words longer than 2 characters.
   We calculated the percentage of the concept occurrences in the total
number of tokens in the domain sub-corpus. For example, the English
sub-corpus for the C02F category has 1,339,946 occurrences for 7,806,687
tokens, so the concept occurrences represent 17% of the tokens (the highest
rate in our data collection). The least covered sub-corpus is the French
H02M one with 1% of occurrences (55,803 occurrences for 4,359,434 tokens).

Our hypothesis is that the sub-corpora containing more ontology concepts are
more likely to be efficient for ontology population, so we will start the
ontology population from the most covered sub-corpora.
   The disparity in coverage among languages observed in Table 4 (17.16%
maximum for English, 3.67% for German and 3.60% for French) can be explained
by the difference in the number of EcoLexicon labels for these languages (cf.
Table 1). As we use a parallel corpus, we will base the suitability analysis
on the occurrence percentages for English and try to find the term
translations for the other languages in the corpus.

4.3    Semi-automatic validation

The purpose of this step is to see which concepts appear in the corpus and to
validate that their meaning in the corpus matches the one described in the
ontology.
   We noticed that a part of the occurrences belongs to quite general
concepts that are close to the definition of transdisciplinary vocabulary
(Tutin, 2007; Jacquey et al., 2013), such as method, device, process, which
is due to the fact that the corpus contains segments from patents. We want to
be sure that the total occurrences count is not made up only of these
concepts. To do so, we defined a set of five recurrent concepts and their
labels in the three languages (cf. Table 3) in order to calculate their
percentage in the total occurrences count.

 Concept     Labels
 Method      method@en, méthode@fr, Methode@de
 Process     process@en, processus@fr, Prozess@de
 Treatment   treatment@en, traitement@fr, Verarbeitung@de, Behandlung@de
 Device      device@en, outil@fr, Mechanismus@de
 System      system@en, système@fr, System@de

Table 3: Manual concepts and labels selection

   The combination of the concept occurrence and the general concept
percentages (cf. Table 4) gives a better idea of the best sub-corpora to be
used in the next steps. The highest percentage of general concepts is 19%
(C25F for English), which means that almost every fifth occurrence belongs to
a general concept. Without final results of the ontology population and
enrichment, we cannot judge if this proportion is too high. The maximal
percentages of the general concepts for German and French are respectively
9.14% and 1.19% of the concept occurrences.

 IPC     Lang    Occurrences %    General concepts %
 C02F    en      17.16            11.86
 B09C    en      16.17            13.40
 C25C    en      12.54            11.91
 C25D    en      11.66            14.88
 C01G    en      11.57            14.72
 C25B    en      11.18            13.43
 C25F    en      11.04            19.00
 H01M    en      10.32            10.73
 H02J    en       9.57            15.49
 H01G    en       8.15            12.12
 H02M    en       8.08             9.54
 B09C    de       3.67             6.66
 B09C    fr       3.60             0.99
 C02F    de       3.36             7.29
 C25C    fr       3.33             0.88
 C25C    de       3.12             2.66
 C01G    fr       3.10             0.46
 C25B    fr       2.93             0.79
 H01M    fr       2.69             0.55
 C25D    fr       2.63             0.98
 C01G    de       2.57             2.91
 C25F    de       2.55             6.75
 C25D    de       2.48             4.41
 C25B    de       2.25             5.31
 C25F    fr       2.18             1.09
 H01G    de       1.94             2.70
 H01M    de       1.86             4.17
 H01G    fr       1.79             1.13
 H02J    de       1.68             9.14
 C02F    fr       1.63             1.19
 H02J    fr       1.57             0.94
 H02M    de       1.39             4.03
 H02M    fr       1.28             0.45

Table 4: Concept occurrences and general concepts percentages

   We also manually checked 5 random segments for 10 randomly selected terms,
for example surface water, waste, biomass, etc., to be sure that they
preserve their terminological meaning. This quick validation helped us to
confirm that the selected sub-corpora can be used for future treatments.
   Regarding the meaning of the matched terms, the patent titles and
abstracts preserve the terminological sense, while the claims part has a more
rigid style and uses some specific expressions, like
method as in claim X, product accord to one of the claim X, a process along
the line of claim, etc. At the same time, domain-specific terms contained in
claims can still be used as such.

5     Conclusion

The described set up can save time while using a large corpus for the
ontology population task. The combined use of metadata and occurrences count
shows the best sub-corpora that we should keep for further treatment. The
semi-automatic validation of occurrences is a useful step which helps to
ensure that we know the data used in the project.

Acknowledgments

The research project is supported by the CIFRE grant 2013/0744 delivered by
the ANRT. We are grateful to the LexiCon research group [6] of the University
of Granada for the access to the EcoLexicon ontology.

    [6] http://lexicon.ugr.es/

References

[Araúz et al.2011] Pilar Araúz, Arianne Reimerink, and Pamela Faber. 2011.
    Environmental knowledge in EcoLexicon. In Computational
    Linguistics-Applications Conference, number 14, pages 9–16.
[Aussenac-Gilles et al.2006] Nathalie Aussenac-Gilles, Anne Condamines, and
    Florence Sèdes. 2006. Evolution et maintenance des ressources
    termino-ontologique: une question à approfondir. Information interaction
    intelligence, HS.
[Bouamor2014] Dhouha Bouamor. 2014. Constitution de ressources linguistiques
    multilingues à partir de corpus de textes parallèles et comparables.
    Ph.D. thesis, Université Paris Sud - Paris XI.
[Cimiano2006] Philipp Cimiano. 2006. Ontology Learning and Population from
    Text: Algorithms, Evaluation and Application. Springer US.
[Delpech et al.2012] Estelle Delpech, Béatrice Daille, Emmanuel Morin, and
    Claire Lemaire. 2012. Extraction of domain-specific bilingual lexicon
    from comparable corpora : compositional translation and ranking. In
    COLING, volume 3.
[Faber et al.2005] Pamela Faber, Carlos Márquez Linares, and Miguel Vega
    Expósito. 2005. Framing Terminology: A Process Oriented Approach. Meta:
    journal des traducteurs, 50(4):1492–1421.
[Faber et al.2006] Pamela Faber, Silvia Montero Martínez, María Rosa Castro
    Prieto, José Senso Ruiz, Juan Antonio Prieto Velasco, Pilar León Araúz,
    Carlos Márquez Linares, and Miguel Vega Expósito. 2006. Process-oriented
    terminology management in the domain of Coastal Engineering.
    Terminology, 12(2):189–213.
[Faber et al.2009] Pamela Faber, Pilar León, and Juan Antonio Prieto. 2009.
    Semantic Relations, Dynamicity, And Terminological Knowledge Bases.
    Current Issues in Language Studies, 1:1–23.
[Faber2011] Pamela Faber. 2011. The dynamics of specialized knowledge
    representation: Simulational reconstruction or the perception-action
    interface. Terminology, 17(1):9–29.
[Jacquey et al.2013] Evelyne Jacquey, Agnès Tutin, Laurence Kister,
    Marie-paule Jacques, Sylvain Hatier, and Sandrine Ollinger. 2013.
    Filtrage terminologique par le lexique transdisciplinaire scientifique :
    une expérimentation en sciences humaines. In Terminologie et
    Intelligence Artificielle (TIA), Paris.
[Kamel et al.2013] Mouna Kamel, Nathalie Aussenac-Gilles, Davide Buscaldi,
    and Catherine Comparot. 2013. A semi-automatic approach for building
    ontologies from a collection of structured web documents. In K-Cap’13
    Proceedings of the seventh international conference on Knowledge
    capture, pages 139–140.
[Oostdijk et al.2010] Nelleke Oostdijk, Suzan Verberne, and Cornelis Koster.
    2010. Constructing a broad-coverage lexicon for text mining in the patent
    domain. In LREC, pages 2292–2299.
[Pinnis2014] Marcis Pinnis. 2014. Bootstrapping of a Multilingual
    Transliteration Dictionary for European Languages. In Proceedings of the
    Sixth International Conference Baltic HLT.
[Schmid1994] Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using
    Decision Trees. In Proceedings of International Conference on New Methods
    in Language Processing, Manchester.
[Tutin2007] Agnès Tutin. 2007. Autour du lexique et de la phraséologie des
    écrits scientifiques. Revue française de linguistique appliquée,
    XII:5–14.
[Wäschle and Riezler2012] Katharina Wäschle and Stefan Riezler. 2012.
    Structural and Topical Dimensions in Multi-Task Patent Translation. In
    The 13th Conference of the European Chapter of the Association for
    Computational Linguistics (EACL 2012), pages 818–828, Avignon, France.
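As a complement to Section 4.3, the share of general concepts and the ranking of sub-corpora can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the general labels are the lowercased labels of Table 3, and the sample rows are a subset of the English rows of Table 4.

```python
# Lowercased labels of the five general concepts listed in Table 3.
GENERAL_LABELS = {
    "en": {"method", "process", "treatment", "device", "system"},
    "fr": {"méthode", "processus", "traitement", "outil", "système"},
    "de": {"methode", "prozess", "verarbeitung", "behandlung",
           "mechanismus", "system"},
}


def general_concept_share(label_counts, lang):
    """Percentage of concept occurrences that belong to general concepts.

    `label_counts` maps a lowercased label to its number of occurrences
    in one sub-corpus (hypothetical input format).
    """
    total = sum(label_counts.values())
    if total == 0:
        return 0.0
    general = sum(count for label, count in label_counts.items()
                  if label in GENERAL_LABELS[lang])
    return 100.0 * general / total


# (IPC category, occurrences %, general concepts %): a subset of the
# English rows of Table 4.
TABLE_4_EN_SAMPLE = [
    ("H02M", 8.08, 9.54),
    ("C25F", 11.04, 19.00),
    ("C02F", 17.16, 11.86),
    ("B09C", 16.17, 13.40),
]


def rank_sub_corpora(rows):
    """Order sub-corpora from the most to the least covered."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for ipc, coverage, general in rank_sub_corpora(TABLE_4_EN_SAMPLE):
        print(f"{ipc}: coverage {coverage:.2f}%, general concepts {general:.2f}%")
```

Ranking by coverage alone follows the hypothesis stated in Section 4.2; the general concept percentage is printed alongside so that a sub-corpus dominated by patent vocabulary can be spotted before it is selected.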