<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Parallel corpus-based bilingual terminology extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xavier Gómez Guinovart</string-name>
          <aff>Universidade de Vigo</aff>
          <email>xgg@uvigo.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Simões</string-name>
          <aff>Universidade do Minho</aff>
          <email>ambs@di.uminho.pt</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <abstract>
        <p>This paper presents a bilingual terminology extraction method for parallel corpora, based on the occurrence of bilingual morphosyntactic patterns in probabilistic translation dictionaries. We discuss an experiment focused on two language pairs, English-Galician and English-Portuguese, and show results which experimentally confirm the high degree of accuracy of the proposed extraction technique.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As for the evaluation of the terminological quality of the extracted terms, and
given the lack of a comprehensive terminological database for Portuguese, a
comparison with a hand-crafted normalised list of terms was performed only for
Galician. Two such resources exist for Galician: bUSCatermos
(http://www.usc.es/buscatermos/Caracteristicas.htm), a database with 126,338
Galician terms from all fields, collected by the Servizo de Normalización
Lingüística at the University of Santiago de Compostela from a wide collection
of dictionaries and glossaries; and the Termoteca (http://sli.uvigo.es/termoteca/)
        <xref ref-type="bibr" rid="ref3">(Crespo et al., 2008)</xref>
        , a corpus-based terminological databank with 6,621 Galician terms gathered by
the TALG research group of the University of Vigo from the Galician Technical
Corpus (http://sli.uvigo.es/CTG/) and the CLUVI Corpus.
      </p>
      <p>The results of the system are used by a team of terminologists at the University of
Vigo as the basis for selecting English-Galician bilingual terms from the CLUVI
Corpus in order to extend Termoteca.</p>
    </sec>
    <sec id="sec-2">
      <title>Extraction algorithm and metrics</title>
      <p>
        The terminology extraction algorithm used in this study is based on NATools
probabilistic translation dictionaries
        <xref ref-type="bibr" rid="ref11">(Simões &amp; Almeida, 2003)</xref>
        and was
explained in detail in Simões &amp; Guinovart (2009). NATools dictionaries,
automatically extracted from sentence-aligned parallel corpora, map words from a source
language to a set of probable translations in a target language. Each of these
translations has a probabilistic measure of translatability. This information
enables the creation of an alignment matrix for any translation unit (figure 1) that
includes in each cell the mutual translation probability for each word
combination (from the source and target languages). These matrices can be used to extract
bilingual terminology using translation patterns
        <xref ref-type="bibr" rid="ref9">(Simões &amp; Almeida, 2008)</xref>
        that
specify how word order in the source language changes after translation takes
place. Translation patterns may include morphological restrictions (for one or
both languages) defining the morphological categories allowed for the words
matching the pattern. NATools relies on external morphological analyzers to
validate the morphological restrictions. We used jSpell
        <xref ref-type="bibr" rid="ref1">(Almeida &amp; Pinto, 1994)</xref>
        for Portuguese and FreeLing
        <xref ref-type="bibr" rid="ref2">(Atserias et al., 2006)</xref>
        for Galician.
      </p>
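      <p>The idea of an alignment matrix combined with a word-order pattern can be illustrated with a minimal Python sketch. It is not the NATools implementation: the dictionary entries and probabilities are invented toy values, the cell score is assumed to be the mean of the two directional probabilities (the exact combination used by NATools is not stated here), and the names alignment_matrix, match_swap and the threshold of 0.3 are hypothetical.</p>
      <preformat><![CDATA[
```python
# Toy probabilistic translation dictionaries (invented values, not NATools output).
EN_PT = {"alternative": {"alternativas": 0.4}, "sources": {"fontes": 0.8, "origens": 0.1}}
PT_EN = {"alternativas": {"alternative": 0.5}, "fontes": {"sources": 0.7}}


def alignment_matrix(src, tgt, s2t, t2s):
    """Return matrix[i][j] for (src[i], tgt[j]).

    Assumption: the cell score is the mean of the two directional probabilities.
    """
    return [
        [(s2t.get(s, {}).get(t, 0.0) + t2s.get(t, {}).get(s, 0.0)) / 2 for t in tgt]
        for s in src
    ]


def match_swap(matrix, i, j, threshold=0.3):
    """Check the word-order shape 'A B = B A' at source position i, target position j.

    The shape holds when src[i] aligns with tgt[j + 1] and src[i + 1] with tgt[j].
    Morphological restrictions are left out of this sketch.
    """
    return matrix[i][j + 1] >= threshold and matrix[i + 1][j] >= threshold


src = ["alternative", "sources"]
tgt = ["fontes", "alternativas"]
m = alignment_matrix(src, tgt, EN_PT, PT_EN)
swap_found = match_swap(m, 0, 0)
```
]]></preformat>
      <p>In this toy example the high cells lie on the anti-diagonal, which is the shape that a pattern of the form A B = B A looks for; a real pattern would additionally check the morphological category of each word.</p>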
      <p>
        Moreover, following many other works on term extraction based on
        <xref ref-type="bibr" rid="ref4">Dunning
(1993)</xref>
        , the system scores each term candidate with the log-likelihood measure,
using the Text::NSP Perl module (http://ngram.sourceforge.net/). For terms with
more than three constituents, the minimum value over their partial trigrams is used
        <xref ref-type="bibr" rid="ref8">(Patry &amp; Langlais, 2005)</xref>
        .
      </p>
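      <p>For reference, the log-likelihood ratio of a bigram can be computed from a 2x2 contingency table of observed counts. The Python sketch below follows the standard G-squared formulation, 2 times the sum of O ln(O/E) over the four cells; it is an illustration only, not the Text::NSP code, and the function and parameter names are our own. The extension to trigrams and the minimum over partial trigrams are not covered.</p>
      <preformat><![CDATA[
```python
import math


def log_likelihood(n11, n1p, np1, npp):
    """Log-likelihood ratio (G^2) for a bigram (w1, w2).

    n11: occurrences of the bigram w1 w2
    n1p: bigrams whose first word is w1
    np1: bigrams whose second word is w2
    npp: total number of bigrams in the corpus
    """
    observed = [
        [n11, n1p - n11],
        [np1 - n11, npp - n1p - np1 + n11],
    ]
    row = [n1p, npp - n1p]
    col = [np1, npp - np1]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with zero observations contributes nothing
                e = row[i] * col[j] / npp  # expected count under independence
                g2 += o * math.log(o / e)
    return 2 * g2


independent = log_likelihood(10, 100, 100, 1000)
weak = log_likelihood(20, 100, 100, 1000)
strong = log_likelihood(50, 100, 100, 1000)
```
]]></preformat>
      <p>When the observed bigram count equals the count expected under independence the score is zero, and it grows as the two words co-occur more often than chance would predict.</p>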
    </sec>
    <sec id="sec-3">
      <title>Experiments and results</title>
      <p>Our experiments focused on two language pairs, English-Galician and
English-Portuguese, and used two parallel corpora of very different sizes (table 1): the
Unesco Corpus, a collection of 30 issues of the Unesco Courier
(http://www.unesco.org/courier/) in four languages, and the JRC-Acquis, a collection
of parallel texts in 22 languages with the total body of European Union law
applicable in the EU Member States.</p>
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption><p>Alignment matrix for an English-Portuguese translation unit.</p></caption>
        <preformat>
Columns (English): 1 discussion, 2 about, 3 alternative, 4 sources, 5 of, 6 financing,
7 for, 8 the, 9 european, 10 radical, 11 alliance, 12 "."

                 1   2   3   4   5   6   7   8   9  10  11  12
discussão       44   0   0   0   0   0   0   0   0   0   0   0
sobre            0  11   0   0   0   0   0   0   0   0   0   0
fontes           0   0   0  74   0   0   0   0   0   0   0   0
de               0   3   0   0  27   0   6   3   0   0   0   0
financiamento    0   0   0   0   0  56   0   0   0   0   0   0
alternativas     0   0  23   0   0   0   0   0   0   0   0   0
para             0   0   0   0   0   0  28   0   0   0   0   0
a                0   1   0   0   1   0   4  33   0   0   0   0
aliança          0   0   0   0   0   0   0   0   0   0  65   0
radical          0   0   0   0   0   0   0   0   0  80   0   0
europeia         0   0   0   0   0   0   0   0  59   0   0   0
.                0   0   0   0   0   0   0   0   0   0   0  80
        </preformat>
      </fig>
      <p>Table 1. Columns: Corpus; Translation Units; Tokens (source/target); Forms (source/target).</p>
      <p>
        In addition, two literary corpora were used in the evaluation process for
the exclusion of bigrams and trigrams (table 2): the BiVir Corpus, a Galician
literary corpus containing works from the Virtual Library of Universal
Literature in Galician (http://www.bivir.com/), and the Compara
(http://www.linguateca.pt/COMPARA/), a human-edited parallel corpus whose sentence
alignment, lemmatization and POS tagging have been revised by human
annotators
        <xref ref-type="bibr" rid="ref5">(Frankenberg-Garcia &amp; Santos, 2003)</xref>
        .
      </p>
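      <sec id="sec-3-3">
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Corpus</th><th>BiVir</th><th>Compara</th></tr>
            </thead>
            <tbody>
              <tr><td>Tokens</td><td>1 008 125</td><td>1 714 523</td></tr>
              <tr><td>Bigrams</td><td>361 547</td><td>544 274</td></tr>
            </tbody>
          </table>
        </table-wrap>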
        <p>In order to evaluate the precision of the NATools-based term extraction
algorithm, four translation patterns were defined, as shown in figure 2.<fn id="fn-2"><label>2</label><p>The EN-GL patterns are similar but with FreeLing specific tag names. Some examples of
the extracted terms can be found in
          <xref ref-type="bibr" rid="ref10">Simões &amp; Guinovart (2009)</xref>
          .</p></fn></p>
        <fig id="fig-2">
          <label>Figure 2</label>
          <caption><p>EN-PT translation patterns.</p></caption>
          <preformat>
[R1] A B = B[CAT&lt;-/nc/] A[CAT&lt;-/(a_nc|adj)/];
[R2] A B = B[CAT&lt;-/nc/] "de"|"do"|"da"|"dos"|"das" A[CAT&lt;-/(a_nc|nc)/];
[R3] A "of"|"in"|"for" B = A[CAT&lt;-/nc/] "de"|"do"|"da"|"dos"|"das" B[CAT&lt;-/nc/];
[R4] A B C = C[CAT&lt;-/nc/] A[CAT&lt;-/(adj|a_nc)/] B[CAT&lt;-/(adj|a_nc)/];
          </preformat>
        </fig>
        <p>
          Different methods are used for filtering the results of term extraction:
identification of unlikely term candidates because of their similarity with a lexical
pattern, ranking of candidates by virtue of some score of lexical association, and
assessment of term specificity with respect to some kind of non-terminological
corpus of the language, among others
          <xref ref-type="bibr" rid="ref7">(Hong et al., 2001)</xref>
          .
        </p>
        <p>With the first filtering method, term candidates beginning or ending with any
word from a list of stop words are removed. This method,
however, does not apply to the results of NATools complemented with
bilingual syntactic patterns, since term candidates obtained by NATools already respect the
defined morphological restrictions.</p>
        <p>
          Another well-known method for filtering the results of extraction consists of
calculating the lexical association of candidates in the corpus using one of the
possible scores to test the strength of this association. The extractor in NATools
calculates the log-likelihood ratio score
          <xref ref-type="bibr" rid="ref4">(Dunning, 1993)</xref>
          . However, this score
does not carry any significance as a discriminatory factor when assessing the
outcome of our terminology extraction method, presumably because selection
based on a probabilistic translation dictionary derived from the
parallel corpus and filtered with patterns already ensures a fairly high minimum cohesion
between the components of the candidate terms
          <xref ref-type="bibr" rid="ref10">(Simões &amp; Guinovart, 2009)</xref>
          .
        </p>
        <p>Therefore, we decided to check the accuracy of the term extraction of NATools
with bilingual syntactic patterns using a non-terminological exclusion corpus
as a filter. The exclusion corpus determines the identification (and exclusion)
of unlikely term candidates. Literary corpora, unlike corpora of news articles,
for instance, usually contain very few terminological units. A literary corpus
therefore represents a very safe filter as an exclusion corpus for term extraction:
it identifies more false candidates as such than it wrongly rejects correct
candidates. We created lists
of word n-grams from the exclusion corpora BiVir and Compara, and applied
these lists as criteria for filtering and evaluation of NATools-based terminology
extraction.</p>
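        <p>The exclusion step can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: whitespace tokenisation, lowercasing, and exact n-gram matching; the sentences standing in for the literary corpus are invented, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_exclusion_set(sentences, sizes=(2, 3)):
    """Collect all bigrams and trigrams of an exclusion corpus."""
    excluded = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in sizes:
            excluded |= ngrams(tokens, n)
    return excluded


def filter_candidates(candidates, excluded):
    """Split term candidates into those kept and those found in the exclusion set."""
    kept, rejected = [], []
    for candidate in candidates:
        key = tuple(candidate.lower().split())
        (rejected if key in excluded else kept).append(candidate)
    return kept, rejected


# Invented stand-in for a literary corpus.
literary = ["ela abriu a porta de casa", "o home vello sorriu"]
candidates = ["acordo de paz", "home vello", "aeroporto internacional"]
kept, rejected = filter_candidates(candidates, build_exclusion_set(literary))
```
]]></preformat>
        <p>Only the candidate that also occurs as an n-gram in the literary text is rejected; the other two are kept as likely terms.</p>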
        <p>The evaluation results (table 3) point to a high precision of the NATools-based
extraction algorithm. As shown in the first column of the table, the 12,689
translation equivalences (TE) identified in the Unesco Corpus using NATools with the
EN-GL bilingual syntactic patterns depicted in figure 2 represent 7,250
candidate bilingual term pairs (term candidates or TC) (57% of TE) after eliminating
repeated TE. When filtering that list of TC with the list of word bigrams and
trigrams from the BiVir Corpus, we obtain a list of 6,949 Galician terms from
TC (corresponding to 96% of TC) which are not present in the exclusion
corpus, and a complementary list of 301 Galician term candidates (only 4% of TC)
identified as erroneous term candidates due to their presence in the exclusion
corpus. Thus, these scores show a precision of 96% in the NATools-based term
extraction from the Unesco Corpus.</p>
        <p>As for the experiments with the JRC-Acquis, the 717,293 TE identified with the
EN-PT bilingual syntactic patterns shown in figure 2 represent 72,952 TC (only
10.2% of TE) after eliminating repeated TE. The difference between the TC/TE
ratio of the Unesco Corpus and that of the JRC-Acquis (57% vs. 10.2%) stems
from the lexical density (percentage of different words in a text) of the two corpora.
When filtering that list of TC with the list of n-grams from the Compara, we
get a list of 63,744 Portuguese terms from TC (corresponding to 87.4% of TC)
which are not present in the exclusion corpus, and a complementary list of 9,208
Portuguese term candidates (12.6% of TC) identified as unlikely term candidates
because of their presence in the exclusion corpus. The difference in the precision
scores of term extraction between the Unesco Corpus and the JRC-Acquis (96%
vs. 87.4%) stems from the different sizes of the corpora (and of the exclusion corpora)
and also from their levels of lexical density and terminological specificity.</p>
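        <p>The percentages discussed above follow directly from the counts of translation equivalences, term candidates and excluded candidates. As a check on the arithmetic, the short Python sketch below recomputes them; the dictionary layout and the helper pct are our own, and only the counts come from the experiments.</p>
        <preformat><![CDATA[
```python
def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts reported for the two experiments.
results = {
    "Unesco (GL)": {"te": 12689, "tc": 7250, "excluded": 301},
    "JRC-Acquis (PT)": {"te": 717293, "tc": 72952, "excluded": 9208},
}

summary = {}
for name, r in results.items():
    kept = r["tc"] - r["excluded"]  # candidates absent from the exclusion corpus
    summary[name] = {
        "tc_over_te": pct(r["tc"], r["te"]),
        "kept": kept,
        "precision": pct(kept, r["tc"]),
    }
```
]]></preformat>
        <p>The recomputed values are 57.1% and 95.8% for the Unesco Corpus and 10.2% and 87.4% for the JRC-Acquis, which agree with the rounded figures given in the text.</p>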
      </sec>
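      <sec id="sec-3-6">
        <table-wrap id="tab-3">
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>Corpora</th><th>Unesco</th><th>JRC-Acquis</th></tr>
            </thead>
            <tbody>
              <tr><td>Language</td><td>GL</td><td>PT</td></tr>
              <tr><td>Trans. Equiv.</td><td>12 689</td><td>717 293</td></tr>
              <tr><td>Term Cand.</td><td>7 250 (57%)</td><td>72 952 (10.2%)</td></tr>
              <tr><td>Excluded TC</td><td>301 (4%)</td><td>9 208 (12.6%)</td></tr>
              <tr><td>Not-excl. TC</td><td>6 949 (96%)</td><td>63 744 (87.4%)</td></tr>
            </tbody>
          </table>
        </table-wrap>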
        <p>Finally, regarding the terminological quality of the extracted terms, we
compared the 6,949 Galician terms identified by this method and filtered by
the BiVir literary corpus with the gold standard list formed by
the 129,269 unique Galician terms found in bUSCatermos and the
Termoteca. The comparison shows that only 7.5% of the terms (521 terms)
selected in the corpus by our method are part of the gold standard list. In some
cases, the reason for this mismatch lies in the lack of lemmatisation in
extraction. For instance, the extractor identifies "alimentos naturais", but the gold
standard list contains the lemmatised version of the term, namely "alimento
natural". More frequently, however, the reason lies in the obvious fact that no term
listing contains all the terms in a language. Thus our results include many
genuine terms such as "acceso directo", "acción cidadá", "acción humanitaria",
"acordo de paz", "aeroporto internacional", "ministerio de defensa" or "abusos
sexuais", which are not included in the list of 129,269 terms of our gold standard.</p>
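        <p>The effect of missing lemmatisation on an exact-match comparison can be seen in a small Python sketch. The gold list here is a toy set of our own (the entry "acordo comercial" is invented for illustration); only the totals 521 and 6,949 come from the evaluation.</p>
        <preformat><![CDATA[
```python
# Toy gold standard with lemmatised entries (illustrative, not the real list).
gold = {"alimento natural", "acordo comercial"}
extracted = ["alimentos naturais", "acordo comercial", "acordo de paz"]

# Exact string matching: an inflected form does not match its lemmatised entry.
matched = [term for term in extracted if term in gold]
missed = [term for term in extracted if term not in gold]

# Share of extracted terms found in the gold standard, from the reported totals.
reported_share = round(100 * 521 / 6949, 1)
```
]]></preformat>
        <p>In this toy case "alimentos naturais" is counted as a miss although its lemma is in the gold list, and "acordo de paz" is missed simply because the list does not contain it; these are the two sources of mismatch described above.</p>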
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>Bilingual terminology extraction from parallel corpora based on probabilistic
translation dictionaries and complemented with bilingual syntactic patterns shows
high rates of accuracy. At the present stage of development of the term
extractor included in the NATools package, any word which is not recognized by the
morphological analyzer cannot be part of a term candidate, so some feasible
candidates may be ignored. To avoid this, the easiest solution would be
to treat any non-recognized word as a noun (obviously, a decision with
risks). As for the evaluation of term quality, we must point out the difficulty both
of acquiring an undisputed gold standard for a language and of interpreting the
evaluation results, given that no term listing contains all the terms in
a language.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Almeida</surname> <given-names>J. J.</given-names></string-name>
          &amp;
          <string-name><surname>Pinto</surname> <given-names>U.</given-names></string-name>
          (<year>1994</year>).
          <article-title>Jspell – um módulo para análise léxica genérica de linguagem natural</article-title>.
          In <source>Actas do X Encontro da Associação Portuguesa de Linguística</source>,
          p. <fpage>1</fpage>–<lpage>15</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Atserias</surname> <given-names>J.</given-names></string-name>,
          <string-name><surname>Casas</surname> <given-names>B.</given-names></string-name>,
          <string-name><surname>Comelles</surname> <given-names>E.</given-names></string-name>,
          <string-name><surname>González</surname> <given-names>M.</given-names></string-name>,
          <string-name><surname>Padró</surname> <given-names>L.</given-names></string-name>
          &amp;
          <string-name><surname>Padró</surname> <given-names>M.</given-names></string-name>
          (<year>2006</year>).
          <article-title>FreeLing 1.3: syntactic and semantic services in an open-source NLP library</article-title>.
          In <source>Proceedings of the 5th International Conference on Language Resources and Evaluation</source>,
          p. <fpage>48</fpage>–<lpage>55</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Crespo</surname> <given-names>A.</given-names></string-name>,
          <string-name><surname>Clemente</surname> <given-names>X. M. G.</given-names></string-name>,
          <string-name><surname>Guinovart</surname> <given-names>X. G.</given-names></string-name>
          &amp;
          <string-name><surname>López</surname> <given-names>S.</given-names></string-name>
          (<year>2008</year>).
          <article-title>XML-based extraction of terminological information from corpora</article-title>.
          In J. C. Ramalho, J. C. Lopes &amp; S. Abreu, Eds.,
          <source>XATA 2008: 6ª Conferência Nacional em XML, Aplicações e Tecnologias Aplicadas</source>,
          p. <fpage>28</fpage>–<lpage>39</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Dunning</surname> <given-names>T.</given-names></string-name>
          (<year>1993</year>).
          <article-title>Accurate methods for the statistics of surprise and coincidence</article-title>.
          <source>Computational Linguistics</source>,
          <volume>19</volume>(<issue>1</issue>),
          <fpage>61</fpage>–<lpage>74</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Frankenberg-Garcia</surname> <given-names>A.</given-names></string-name>
          &amp;
          <string-name><surname>Santos</surname> <given-names>D.</given-names></string-name>
          (<year>2003</year>).
          <article-title>Introducing COMPARA, the Portuguese-English parallel translation corpus</article-title>.
          In S. B. Federico Zanettin &amp; D. Stewart, Eds.,
          <source>Corpora in Translation Education</source>,
          p. <fpage>71</fpage>–<lpage>87</lpage>.
          Manchester: St. Jerome Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Guinovart</surname> <given-names>X. G.</given-names></string-name>
          &amp;
          <string-name><surname>Sacau</surname> <given-names>E.</given-names></string-name>
          (<year>2004</year>).
          <article-title>Parallel corpora for the Galician language: building and processing of the CLUVI (Linguistic Corpus of the University of Vigo)</article-title>.
          In <source>Proceedings of the 4th International Conference on Language Resources and Evaluation</source>,
          p. <fpage>1179</fpage>–<lpage>1182</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><surname>Hong</surname> <given-names>M.</given-names></string-name>,
          <string-name><surname>Fissaha</surname> <given-names>S.</given-names></string-name>
          &amp;
          <string-name><surname>Haller</surname> <given-names>J.</given-names></string-name>
          (<year>2001</year>).
          <article-title>Hybrid filtering for extraction of term candidates from German technical texts</article-title>.
          In <source>Proceedings of Terminologie et Intelligence Artificielle</source>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Patry</surname> <given-names>A.</given-names></string-name>
          &amp;
          <string-name><surname>Langlais</surname> <given-names>P.</given-names></string-name>
          (<year>2005</year>).
          <article-title>Corpus-based terminology extraction</article-title>.
          In <source>Proceedings of the 7th International Conference on Terminology and Knowledge Engineering</source>,
          p. <fpage>313</fpage>–<lpage>321</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Simões</surname> <given-names>A.</given-names></string-name>
          &amp;
          <string-name><surname>Almeida</surname> <given-names>J. J.</given-names></string-name>
          (<year>2008</year>).
          <article-title>Bilingual terminology extraction based on translation patterns</article-title>.
          <source>Procesamiento del Lenguaje Natural</source>,
          <volume>41</volume>,
          <fpage>281</fpage>–<lpage>288</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Simões</surname> <given-names>A.</given-names></string-name>
          &amp;
          <string-name><surname>Guinovart</surname> <given-names>X. G.</given-names></string-name>
          (<year>2009</year>).
          <article-title>Terminology extraction from English-Portuguese and English-Galician parallel corpora based on probabilistic translation dictionaries and bilingual syntactic patterns</article-title>.
          In A. Teixeira, M. S. Dias &amp; D. Braga, Eds.,
          <source>I Iberian SLTech 2009</source>,
          p. <fpage>13</fpage>–<lpage>16</lpage>,
          Porto Salvo, Portugal.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Simões</surname> <given-names>A. M.</given-names></string-name>
          &amp;
          <string-name><surname>Almeida</surname> <given-names>J. J.</given-names></string-name>
          (<year>2003</year>).
          <article-title>NATools – a statistical word aligner workbench</article-title>.
          <source>Procesamiento del Lenguaje Natural</source>,
          <volume>31</volume>,
          <fpage>217</fpage>–<lpage>224</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name><surname>Steinberger</surname> <given-names>R.</given-names></string-name>,
          <string-name><surname>Pouliquen</surname> <given-names>B.</given-names></string-name>,
          <string-name><surname>Widiger</surname> <given-names>A.</given-names></string-name>,
          <string-name><surname>Ignat</surname> <given-names>C.</given-names></string-name>,
          <string-name><surname>Erjavec</surname> <given-names>T.</given-names></string-name>,
          <string-name><surname>Tufis</surname> <given-names>D.</given-names></string-name>
          &amp;
          <string-name><surname>Varga</surname> <given-names>D.</given-names></string-name>
          (<year>2006</year>).
          <article-title>The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages</article-title>.
          In <source>Proceedings of the 5th International Conference on Language Resources and Evaluation</source>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>