<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dealing with Large Corpora for Ontology Population</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuliya Korenchuk</string-name>
          <email>yuliya.korenchuk@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(1) LiLPa (Linguistique, Langues, Parole), EA 1339, Université de Strasbourg</institution>
          ,
          <addr-line>(2) Rebuz SAS, Strasbourg</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>201</fpage>
      <lpage>204</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Multilingual ontology population from texts, i.e.
the addition of new terms to an ontology, requires a
suitable parallel or comparable corpus. In this
paper, we aim to check whether the corpus selected
for our project suits the ontology we want to
populate. The corpus for ontology population should
not only reflect a specific domain and have a
sufficient volume of data, as discussed in (Delpech et
al., 2012), but also suit the initial ontology.
Using an existing corpus can be an efficient solution
and is common practice in many projects (Cimiano, 2006; Bouamor,
2014; Pinnis, 2014). However, this option is less
reliable in the case of a large multi-domain
corpus and an ontology which might not cover all
the domain concepts. The need for suitability
between text corpora and ontology is expressed by
Aussenac-Gilles et al. (2006), who underlined the
importance of text type in the corpus, the ontology
application, the validation criteria and the setup. The
text layout can also play an important role: some
projects aim to use extralinguistic information for
ontology population (Kamel et al., 2013), while
others concentrate on the comprehensiveness of
the text (Faber et al., 2006).</p>
      <p>In this case study, we set up an experiment
checking whether a corpus is suitable for ontology
population, based on the example of the large
parallel (English, French and German) corpus PatTR<sup>1</sup>
(Wäschle and Riezler, 2012) and the EcoLexicon<sup>2</sup>
terminology knowledge base which we use in our
project.</p>
      <p><sup>1</sup> http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/
<sup>2</sup> http://ecolexicon.ugr.es/en/index.htm</p>
    </sec>
    <sec id="sec-2">
      <title>2 Resources</title>
      <sec id="sec-2-1">
        <title>2.1 Corpus</title>
        <p>The PatTR corpus is a large<sup>3</sup> collection of
parallel segments from patents organized by language
pairs. These segments are classified into files
according to their position in the patent structure
(title, abstract, description or claims) (Wäschle and
Riezler, 2012). Each language pair has its own
metadata files, which contain essential information
(the IPC<sup>4</sup> code, the reference, etc.) for each
segment. As the different domains are mixed, the
metadata play a crucial role in our project.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Ontology</title>
        <p>The terminological knowledge base EcoLexicon
is developed by the LexiCon research group at
the University of Granada. The resource is
designed according to the principles of Frame-Based
Terminology (Faber et al., 2005; Faber et al.,
2006; Faber et al., 2009; Faber, 2011; Araúz
et al., 2011). It contains 3,547 concepts and
19,712 terms (cf. Table 1) on the topic of the
environment in seven languages, including English,
German and French. The terms are connected by
generic-specific, part-whole and non-hierarchical
relations. The latter refer to the behaviour of the
concepts in a domain-specific or a general
semantic frame (Faber et al., 2009).</p>
        <p>EcoLexicon was built using two types of
resources: manually selected domain corpora
(bottom-up approach) and a collection of domain
thesauri, dictionaries and lexicons (top-down
approach) (Faber et al., 2006). The multilingual
corpora were built manually from reliable
domain sources, taking into account multiple
criteria (quantity, quality, simplicity and
documentation). The domain-specific terminological
resources were compared and evaluated in order to
obtain a representative dataset.</p>
        <p><sup>3</sup> 22,998,357 segments for the EN-DE pair; 18,764,038 for
EN-FR and 5,110,262 for FR-DE (PatTR web site).
<sup>4</sup> International Patent Classification, http://www.wipo.int/classifications/ipc/en/</p>
        <sec id="sec-2-2-1">
          <title>Table 1</title>
          <p>[Columns: Language, FR, EN, DE. The cell values are not recoverable.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Main issues</title>
      <p>The PatTR corpus presents two main challenges:
its size and its domain diversity. We can
hardly estimate the amount of data for each IPC
category without analysing the metadata.
Domain diversity can also be addressed
through the metadata. However, a manual
analysis is required: unless one is an IPC
specialist, one needs to manually establish a list of
categories potentially corresponding to the ontology
domain. Since this step is guided by
human intuition, we need to validate the choice of
sub-corpora. Due to its size, the corpus is not designed
to be read by a human user, so it is difficult to
perform any manual check on the selected
domain-specific sub-corpus. We address the validation by
counting the concept occurrences in the selected
sub-corpora and checking that these occurrences
belong mainly to domain-specific concepts of the
ontology.
</p>
    </sec>
    <sec id="sec-4">
      <title>4 Setup</title>
      <p>We defined a setup based on three main steps:
(i) manually matching IPC categories to select the
sub-corpora, (ii) counting concept occurrences in
the selected sub-corpora and (iii) performing a
semi-automatic validation of the concept
occurrences.</p>
      <sec id="sec-3-1">
        <title>4.1 Manual selection of IPC categories</title>
        <p>The main challenge is to select the IPC categories
that are suitable for the EcoLexicon ontology
population and enrichment. As the corpus is very
large, we cannot use all the data to check the
concept occurrences. Therefore, we started by looking
up the domains defined in EcoLexicon and limited
our interest to the domains listed in Table 2.
Then we selected the IPC categories which might
match the EcoLexicon ones. This
manual matching is subjective and not
transparent, so we need an automated validation.</p>
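        <p>The selection of sub-corpora from the metadata can be sketched as follows. This is a minimal illustration and not the code used in the project: the function name <monospace>select_segment_ids</monospace>, the segment identifiers and the (segment id, IPC code) layout of the metadata are assumptions, since the format of the PatTR metadata files is not described here. Only the list of IPC subclasses comes from Table 2.</p>
        <preformat>
```python
# IPC subclasses listed in Table 2, with the grouped notation expanded
# (for example H01(G,M) becomes H01G and H01M).
SELECTED_IPC = (
    "C02F", "B09C", "H01G", "H01M", "C01G",
    "H02J", "H02M", "C25B", "C25C", "C25D", "C25F",
)


def select_segment_ids(metadata, prefixes=SELECTED_IPC):
    """Group segment identifiers by the selected IPC subclass they fall under.

    metadata: iterable of (segment_id, ipc_code) pairs; this layout is an
    assumption and does not describe the actual PatTR metadata files.
    """
    selected = {prefix: [] for prefix in prefixes}
    for segment_id, ipc_code in metadata:
        code = ipc_code.replace(" ", "").upper()
        for prefix in prefixes:
            if code.startswith(prefix):
                selected[prefix].append(segment_id)
                break
    return selected


# Hypothetical metadata rows, for illustration only.
metadata = [
    ("seg-1", "C02F 1/00"),
    ("seg-2", "A61K 9/00"),
    ("seg-3", "H02M 3/335"),
]
sub_corpora = select_segment_ids(metadata)
```
        </preformat>
        <p>Keeping one list of segments per IPC subclass makes it possible to compute the coverage of each sub-corpus separately in the following steps.</p>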
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>IPC</th><th>EcoLexicon</th></tr>
            </thead>
            <tbody>
              <tr><td>C02F Treatment of water, waste water, sewage, or sludge</td><td>3.2.5.1 Waste treatment and 3.2.5.2 Water treatment</td></tr>
              <tr><td>B09C Disposal of solid waste; reclamation of contaminated soil</td><td>3.2.5.4 Soil quality management</td></tr>
              <tr><td>H01(G,M) Basic electric elements, C01G Inorganic chemistry, H02(J,M) Generation, conversion, or distribution of electric power, C25(B,C,D,F) Electrolytic or electrophoretic processes; apparatus therefor</td><td>3.5 Energy engineering</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
We counted the occurrences of the concept labels
to validate the selected sub-corpora. This
approach has been used to evaluate ontology coverage
of a domain corpus (Oostdijk et al., 2010).
To do so, we lemmatized the corpus with
TreeTagger (Schmid, 1994) and converted both
the corpus and the concept labels to lowercase.
This caused some problems, because some labels
lost their domain specificity (for example, Be@en
for beryllium became be and was found in nearly
every English sentence). We therefore had to limit the
labels to words longer than 2 characters.</p>
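        <p>The counting step can be sketched as follows. This is a minimal illustration, not the code used in the project: it assumes that the corpus has already been lemmatized (for instance with TreeTagger) into lists of lemmas, and the names <monospace>normalize_labels</monospace> and <monospace>count_label_occurrences</monospace> are hypothetical. Multi-word labels are matched as contiguous lemma sequences, and the length filter is applied to the whole label string, which is only one possible reading of the rule.</p>
        <preformat>
```python
from collections import Counter


def normalize_labels(labels, min_length=3):
    """Lowercase the labels and drop those shorter than min_length characters."""
    normalized = set()
    for label in labels:
        label = label.strip().lower()
        if len(label) >= min_length:
            normalized.add(tuple(label.split()))
    return normalized


def count_label_occurrences(segments, labels):
    """Count how often each label occurs in a list of lemmatized segments.

    segments: iterable of lists of lemmas (one list per corpus segment).
    labels: set of label tuples as returned by normalize_labels.
    """
    max_len = max((len(label) for label in labels), default=0)
    counts = Counter()
    for lemmas in segments:
        lemmas = [lemma.lower() for lemma in lemmas]
        for start in range(len(lemmas)):
            for size in range(1, max_len + 1):
                candidate = tuple(lemmas[start:start + size])
                # Skip windows truncated by the end of the segment.
                if len(candidate) == size and candidate in labels:
                    counts[" ".join(candidate)] += 1
    return counts


# Toy data, for illustration only: "Be" is removed by the length filter.
labels = normalize_labels(["Be", "waste", "Surface water", "biomass"])
segments = [
    ["the", "waste", "be", "treat", "with", "surface", "water"],
    ["biomass", "and", "waste", "be", "mix"],
]
counts = count_label_occurrences(segments, labels)
total_occurrences = sum(counts.values())
```
        </preformat>
        <p>In this toy example the lemma be is ignored although it appears in both segments, which mirrors the Be@en case described above.</p>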
        <p>We calculated the percentage of concept
occurrences in the total number of tokens in each
domain sub-corpus. For example, the English
sub-corpus for the C02F category has 1,339,946
occurrences for 7,806,687 tokens, so the concept
occurrences represent 17% of the tokens (the highest
rate in our data collection). The least covered
sub-corpus is the French H02M one with 1% of
occurrences (55,803 occurrences for 4,359,434 tokens).</p>
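        <p>As a worked check of these figures, the coverage can be computed as the ratio of concept occurrences to tokens. The function name <monospace>coverage_percent</monospace> is hypothetical; the counts are the ones quoted in the text.</p>
        <preformat>
```python
def coverage_percent(occurrences, tokens):
    """Return concept occurrences as a percentage of the tokens in a sub-corpus."""
    if tokens == 0:
        raise ValueError("the sub-corpus contains no tokens")
    return 100.0 * occurrences / tokens


# Figures quoted in the text for two sub-corpora.
c02f_en = coverage_percent(1_339_946, 7_806_687)
h02m_fr = coverage_percent(55_803, 4_359_434)

print(f"C02F (EN): {c02f_en:.2f}%")
print(f"H02M (FR): {h02m_fr:.2f}%")
```
        </preformat>
        <p>Run as written, this prints 17.16% and 1.28%, which correspond to the rounded values of 17% and 1% given in the text.</p>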
        <sec id="sec-3-1-1">
          <p><sup>5</sup> As the category titles are too complex, we took in Table 2
the generic IPC descriptions (for example, Basic electric elements
is the title of the whole H01 category).</p>
          <p>[Table 4. Columns: IPC, Lang, Occurrences, %. The cell values are not recoverable.]</p>
          <p>Our hypothesis is that the sub-corpora containing
more ontology concepts are more likely to be
efficient for ontology population, so we will start the
ontology population from the most covered
sub-corpora.</p>
          <p>The disparity in the coverage among languages
observed in Table 4 (17.16% maximum for
English, 3.67% for German and 3.60% for French)
can be explained by the difference in the number
of EcoLexicon labels for these languages (cf.
Table 1). As we use a parallel corpus, we will base
the suitability analysis on the occurrence
percentages for English and try to find the term
translations for the other languages in the corpus.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.3 Semi-automatic validation</title>
        <p>The purpose of this step is to see which
concepts appear in the corpus and to validate that their
meaning in the corpus matches the one described
in the ontology.</p>
        <p>We noticed that some of the occurrences
belong to rather general concepts, such as
method, device or process, that are close to the
definition of transdisciplinary vocabulary
(Tutin, 2007; Jacquey et al., 2013). This is due to the fact
that the corpus contains segments from patents.</p>
        <p>We want to be sure that the total occurrence count
is not made up only of these concepts. To do so, we
defined a set of five recurrent concepts and their
labels in the three languages (cf. Table 3) in
order to calculate their percentage in the total
occurrence count.</p>
        <p>The combination of the concept occurrence and
the general concept percentages (cf. Table 4) gives
a better idea of the best sub-corpora to be used in
the next steps. The highest percentage of general
concepts is 19% (C25F for English), which means
that almost every fifth occurrence belongs to a general
concept. Without final results of the ontology
population and enrichment, we cannot judge whether this
proportion is too high. The maximal percentages
of the general concepts for German and French are
9.14% and 1.19% of the concept occurrences, respectively.</p>
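        <p>The share of general concepts in the occurrence count can be sketched as follows. This is an illustration under stated assumptions: the function name <monospace>general_concept_share</monospace> is hypothetical, the counts are invented for the example, and only three general concepts (method, device, process) are named in the text, so the set used here is incomplete.</p>
        <preformat>
```python
def general_concept_share(counts, general_labels):
    """Return the percentage of occurrences that belong to general concepts.

    counts: mapping from a concept label to its number of occurrences.
    general_labels: labels treated as general (transdisciplinary) concepts.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    general = sum(n for label, n in counts.items() if label in general_labels)
    return 100.0 * general / total


# Hypothetical counts for one English sub-corpus (not taken from the paper).
counts = {"method": 120, "device": 50, "process": 20, "waste": 500, "biomass": 310}
general_labels = {"method", "device", "process"}
share = general_concept_share(counts, general_labels)
```
        </preformat>
        <p>With these invented counts, 190 of 1,000 occurrences belong to general concepts, a share of 19%, which is the order of magnitude reported for the English C25F sub-corpus.</p>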
        <p>We also manually checked 5 random segments
for each of 10 randomly selected terms, for example
surface water, waste and biomass, to make sure that
they preserve their terminological meaning. This
quick validation helped us to confirm that the
selected sub-corpora can be used for further
processing.</p>
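        <p>Drawing the segments for such a manual check can be sketched as follows. The function name <monospace>sample_segments_for_terms</monospace> and the example segments are hypothetical. Terms are matched by simple substring search on the lowercased segment text, which is a simplification (waste would also match wastewater), and a fixed seed is used so that the sample can be reproduced.</p>
        <preformat>
```python
import random


def sample_segments_for_terms(segments, terms, n_terms=10, n_segments=5, seed=0):
    """Pick random terms and, for each, random segments that contain it."""
    rng = random.Random(seed)
    chosen_terms = rng.sample(sorted(terms), min(n_terms, len(terms)))
    sample = {}
    for term in chosen_terms:
        # Simplified matching: substring search on the lowercased text.
        matching = [s for s in segments if term in s.lower()]
        sample[term] = rng.sample(matching, min(n_segments, len(matching)))
    return sample


# Invented segments, for illustration only.
segments = [
    "A method for treating surface water and waste.",
    "Biomass is converted into energy.",
    "The waste is mixed with biomass.",
]
terms = {"surface water", "waste", "biomass"}
sample = sample_segments_for_terms(segments, terms, n_terms=2, n_segments=1)
```
        </preformat>
        <p>The defaults correspond to the 10 terms and 5 segments per term mentioned above; the call in the example uses smaller values because the toy data are tiny.</p>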
        <p>Regarding the meaning of the matched terms,
the patent titles and abstracts preserve the
terminological sense, while the claims part has a more
rigid style and uses some specific expressions, like
<italic>method as in claim X</italic>, <italic>product accord to one of the
claim X</italic>, <italic>a process along the line of claim</italic>, etc. At
the same time, domain-specific terms contained in
claims can still be used as such.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>The described setup can save time when using a
large corpus for the ontology population task. The
combined use of metadata and occurrence counts
shows the best sub-corpora that we should keep for
further processing. The semi-automatic validation
of occurrences is a useful step which helps to ensure
that we know the data used in the project.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research project is supported by the CIFRE
grant 2013/0744 delivered by the ANRT. We are
grateful to the LexiCon research group of the University
of Granada for the access to the EcoLexicon ontology.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[Araúz et al., 2011] Pilar Araúz, Arianne Reimerink, and Pamela Faber. 2011. Environmental knowledge in EcoLexicon. In Computational Linguistics-Applications Conference, number 14, pages 9–16.</p>
      <p>[Aussenac-Gilles et al., 2006] Nathalie Aussenac-Gilles, Anne Condamines, and Florence Sèdes. 2006. Evolution et maintenance des ressources termino-ontologique: une question à approfondir. Information interaction intelligence, HS.</p>
      <p>[Bouamor, 2014] Dhouha Bouamor. 2014. Constitution de ressources linguistiques multilingues à partir de corpus de textes parallèles et comparables. Ph.D. thesis, Université Paris Sud - Paris XI.</p>
      <p>[Cimiano, 2006] Philipp Cimiano. 2006. Ontology Learning and Population from Text: Algorithms, Evaluation and Application. Springer US.</p>
      <p>[Delpech et al., 2012] Estelle Delpech, Béatrice Daille, Emmanuel Morin, and Claire Lemaire. 2012. Extraction of domain-specific bilingual lexicon from comparable corpora : compositional translation and ranking. In COLING, volume 3.</p>
      <p>[Faber et al., 2005] Pamela Faber, Carlos Márquez Linares, and Miguel Vega Expósito. 2005. Framing Terminology: A Process Oriented Approach. Meta: journal des traducteurs, 50(4):1492–1421.</p>
      <p>[Faber et al., 2006] Pamela Faber, Silvia Montero Martínez, María Rosa Castro Prieto, José Senso Ruiz, Juan Antonio Prieto Velasco, Pilar León Araúz, Carlos Márquez Linares, and Miguel Vega Expósito. 2006. Process-oriented terminology management in the domain of Coastal Engineering. Terminology, 12(2):189–213.</p>
      <p>[Faber et al., 2009] Pamela Faber, Pilar Leon, and Juan Antonio Prieto. 2009. Semantic Relations, Dynamicity, And Terminological Knowledge Bases. Current Issues in Language Studies, 1:1–23.</p>
      <p>[Faber, 2011] Pamela Faber. 2011. The dynamics of specialized knowledge representation: Simulational reconstruction or the perception-action interface. Terminology, 17(1):9–29.</p>
      <p>[Jacquey et al., 2013] Evelyne Jacquey, Agnès Tutin, Laurence Kister, Marie-Paule Jacques, Sylvain Hatier, and Sandrine Ollinger. 2013. Filtrage terminologique par le lexique transdisciplinaire scientifique : une expérimentation en sciences humaines. In Terminologie et Intelligence Artificielle (TIA), Paris.</p>
      <p>[Kamel et al., 2013] Mouna Kamel, Nathalie Aussenac-Gilles, Davide Buscaldi, and Catherine Comparot. 2013. A semi-automatic approach for building ontologies from a collection of structured web documents. In K-Cap'13 Proceedings of the seventh international conference on Knowledge capture, pages 139–140.</p>
      <p>[Oostdijk et al., 2010] Nelleke Oostdijk, Suzan Verberne, and Cornelis Koster. 2010. Constructing a broad-coverage lexicon for text mining in the patent domain. In LREC, pages 2292–2299.</p>
      <p>[Pinnis, 2014] Marcis Pinnis. 2014. Bootstrapping of a Multilingual Transliteration Dictionary for European Languages. In Proceedings of the Sixth International Conference Baltic HLT.</p>
      <p>[Schmid, 1994] Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester.</p>
      <p>[Tutin, 2007] Agnès Tutin. 2007. Autour du lexique et de la phraséologie des écrits scientifiques. Revue française de linguistique appliquée, XII:5–14.</p>
      <p>[Wäschle and Riezler, 2012] Katharina Wäschle and Stefan Riezler. 2012. Structural and Topical Dimensions in Multi-Task Patent Translation. In The 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pages 818–828, Avignon, France.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>