<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine Translation for Entity Recognition across Languages in Biomedical Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Attardi</string-name>
          <email>attardi@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Buzzelli</string-name>
          <email>buzzelli@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Sartiano</string-name>
          <email>sartiano@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università di Pisa, Italy</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>We report on our experiments for the CLEF 2013 Entity Recognition Challenge. Our approach combines machine translation and named entity (NE) tagging techniques. The Silver Standard Corpus (SSC) is used to obtain a corresponding annotated corpus in the target language: the plain text of the SSC is translated, and a mapping is created between entities in the original and phrases in the translation, which are assigned the same Concept Unique Identifiers (CUIs) as in the original. This produces a Bronze Standard Corpus (BSC) in the target language. A dictionary of entities is also created, which associates with each pair (entity text, semantic group) the corresponding CUIs that appeared in the SSC. The BSC is used to train a model for a Named Entity tagger. The model is used to tag entities in sentences in the target language with the proper semantic group, and the entity dictionary is used to associate CUIs with each of them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The following resources were made available for the challenge:
1. a terminological resource (TR) produced from UMLS, containing English and
non-English concepts together with their CUIs.
2. a selection of corpora in English, i.e. patent texts, Medline titles and EMEA
documents, where the entity mentions have been annotated automatically with their
CUIs.
3. a selection of corpora in languages other than English, i.e. in DE, FR, SP,
and NL, that have to be annotated with entity mentions and their CUIs.</p>
      <p>The English corpora have been annotated automatically with the help of several
annotation solutions (for English) from the project partners and a harmonisation scheme to
generate a Silver Standard Corpus (CALBC approach). The English corpus thus
serves as additional input, but does not serve as a Gold Standard.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>We designed our system before we had access to the UMLS TR resources
provided by the organizers; we therefore relied only on the annotated English Silver
Standard Corpus (SSC) as a source of information. We discuss later how the TR
resources could be integrated into our approach.</p>
      <p>Our approach combines machine translation with named entity recognition (NER) techniques.</p>
      <p>The following is an overview of the approach:
1. we apply phrase-based statistical machine translation to the SSC in order to obtain
a corresponding annotated corpus in the target language. The plain text of the SSC
is translated and a mapping is created between entities in the original and phrases
in the translation, which are assigned the same CUIs as in the original. This
produces a Bronze Standard Corpus (BSC) in the target language. A dictionary of
entities is also created, which associates with each pair (entity text, semantic group)
all the corresponding CUIs that appeared in the SSC.
2. the BSC is used to train a model for a Named Entity tagger, whose output classes
are the possible semantic groups of entities.
3. the model built in step 2 is used for tagging entities in sentences in the target
language with the proper semantic group.
4. the annotated document is converted to XML format and enriched by adding CUIs
to each entity, looking up the pair (entity, group) in the dictionary of CUIs built in
step 1.</p>
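      <p>The entity dictionary of steps 1 and 4 can be pictured with the following minimal sketch. It assumes that the projected annotations are available as (entity text, semantic group, CUI) triples; the function names, the lower-casing of the entity text and the sample CUIs are illustrative assumptions, not the actual implementation.</p>
      <preformat>
```python
from collections import defaultdict


def build_entity_dictionary(annotations):
    """Map (entity text, semantic group) to the set of CUIs seen for that pair."""
    dictionary = defaultdict(set)
    for text, group, cui in annotations:
        # Lower-casing the key is an assumption made for this sketch.
        dictionary[(text.lower(), group)].add(cui)
    return dict(dictionary)


def lookup_cuis(dictionary, text, group):
    """Return the sorted CUIs for a tagged entity, or an empty list if unknown."""
    return sorted(dictionary.get((text.lower(), group), set()))


# Hypothetical annotations projected onto the translated corpus.
# The CUIs are placeholders, not real UMLS identifiers.
annotations = [
    ("medicamento", "CHEM", "C0000001"),
    ("medicamento", "CHEM", "C0000002"),
    ("dolores de cabeza", "DISO", "C0000003"),
]
entity_dict = build_entity_dictionary(annotations)
print(lookup_cuis(entity_dict, "Medicamento", "CHEM"))
```
      </preformat>
      <p>Keying the dictionary on the pair rather than on the text alone keeps apart CUIs that belong to different semantic groups of the same surface form.</p>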
      <p>One advantage of this approach is that it produces two artifacts, a NER model and an entity
dictionary, which can be readily applied to a document in the target language without
any further reference to the source corpora in the original language.</p>
      <sec id="sec-2-1">
        <title>Creating the Bronze Standard Corpus</title>
        <p>The Bronze Standard Corpus is obtained by translating the SSC and transferring
the entity annotations to the translation.</p>
        <p>
          For translating the original English SSC into the target language (Spanish), we use
Moses [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a statistical phrase-based machine translation system that can automatically
learn translation models for any language pair. Moses is trained on a
collection of parallel corpora. An efficient decoding algorithm finds the
highest-probability translation among an exponential number of possible translations.
        </p>
        <p>We exploited the word alignment information produced by Moses to determine the
correspondence between entities in the source and target sentences.</p>
        <p>
          Moses was trained on texts pertaining to the biomedical domain, obtained by
joining the EMEA corpus [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] with sentences from the Medline resource provided for the CLEF-ER
challenge.
        </p>
        <p>In our experiments we only dealt with English to Spanish translation, but the
approach can be applied to any language pair for which there exist suitable parallel
corpora.</p>
        <p>In order to identify in the target language the phrases that correspond to entities in
the original, we exploit the word alignment information obtained by invoking the
Moses decoder with the option:</p>
        <p>-alignment-output-file file</p>
        <p>Let us illustrate this step with the following example:</p>
        <p>“This medicine/CHEM relieves headaches/DISO”</p>
        <p>By invoking the Moses decoder, we obtain the following best translation:</p>
        <p>“Este medicamento alivia los dolores de cabeza”</p>
        <p>and the following word alignment:</p>
        <p>1-1 2-2 3-3 3-4 3-5 3-6</p>
        <p>Each pair of numbers in the alignment provides a correspondence between a token in
the original sentence, identified by the left number, and a token in its
translation, identified by the right number. Positions are counted from 0.</p>
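        <p>As a minimal sketch of how such an alignment line can be read, the following Python fragment parses the pairs of the example into a map from source positions to target positions. The function name is an illustrative assumption; only the alignment string and the two sentences come from the example above.</p>
        <preformat>
```python
from collections import defaultdict


def parse_alignment(line):
    """Parse 'src-tgt' pairs into a map from source index to sorted target indices."""
    mapping = defaultdict(list)
    for pair in line.split():
        src, tgt = pair.split("-")
        mapping[int(src)].append(int(tgt))
    return {src: sorted(tgts) for src, tgts in mapping.items()}


source = "This medicine relieves headaches".split()
target = "Este medicamento alivia los dolores de cabeza".split()
alignment = parse_alignment("1-1 2-2 3-3 3-4 3-5 3-6")

# Print each aligned source token next to the target tokens it maps to.
for src_index, tgt_indices in sorted(alignment.items()):
    phrase = " ".join(target[i] for i in tgt_indices)
    print(source[src_index], "->", phrase)
```
        </preformat>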
        <p>The word alignment allows us to map an entity in the source to its translation.</p>
        <p>For example, the entity “medicine”, located at position ‘1’ in the original sentence,
is mapped to the single word “medicamento”, at position ‘1’ in the translation.</p>
        <p>
          The case is less clear for the second entity: the word alignment indicates that the
source word “headaches”, located at position ‘3’, maps to the list of tokens [3, 4, 5, 6],
leading to the phrase “los dolores de cabeza” as a possible entity text in the target
language. This candidate text is cleaned up by removing articles and punctuation marks that
occur at the beginning or end of the entity text, in order to obtain more consistent
phrases.
        </p>
        <p>
          In the example, the article “los” gets dropped from the beginning, producing the final
entity “dolores de cabeza”. This step is performed by simply checking the part-of-speech
tags of tokens, obtained by using the Tanl POS tagger [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], trained for Spanish on the
Ancora corpus [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
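        <p>The projection and clean-up of an entity can be sketched as follows. This is only an illustration under stated assumptions: the POS tag names are placeholders rather than the tagset of the Tanl POS tagger, and taking the contiguous window between the first and last aligned target position is a simplification chosen for the sketch.</p>
        <preformat>
```python
def project_entity(src_span, alignment, target_tokens, target_pos):
    """Project a source entity span onto the target and trim articles/punctuation.

    src_span is an inclusive (start, end) pair of zero-based source positions;
    alignment maps each source position to a list of target positions.
    """
    positions = sorted({t for s in range(src_span[0], src_span[1] + 1)
                        for t in alignment.get(s, [])})
    if not positions:
        return None
    # Take the contiguous target window covered by the aligned positions.
    start, end = positions[0], positions[-1]
    # Placeholder tag names, not the actual Tanl tagset.
    droppable = {"ART", "PUNCT"}
    while end >= start and target_pos[start] in droppable:
        start += 1
    while end >= start and target_pos[end] in droppable:
        end -= 1
    if start > end:
        return None
    return " ".join(target_tokens[start:end + 1])


target = "Este medicamento alivia los dolores de cabeza".split()
pos = ["DET", "NOUN", "VERB", "ART", "NOUN", "PREP", "NOUN"]
alignment = {1: [1], 2: [2], 3: [3, 4, 5, 6]}

print(project_entity((1, 1), alignment, target, pos))
print(project_entity((3, 3), alignment, target, pos))
```
        </preformat>
        <p>Only the edges of the candidate are trimmed, so the preposition “de” inside the phrase is preserved.</p>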
      </sec>
      <sec id="sec-2-2">
        <title>Building the NER Training Set</title>
        <p>The training set for the NE tagger is obtained from the BSC, converting it into IOB
notation and adding POS tags to each token.</p>
        <p>The example in the previous section would be represented as follows:</p>
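        <p>As an illustration of the IOB encoding, the following sketch converts the translated example into rows holding the Form, PoS and NE attributes of each token. The POS tags are placeholders (the tags produced by the Tanl POS tagger are not reproduced here), and the function name and span format are assumptions made for the sketch.</p>
        <preformat>
```python
def to_iob(tokens, pos_tags, entities):
    """Return (form, pos, label) rows; entities are (start, end, group), end inclusive."""
    labels = ["O"] * len(tokens)
    for start, end, group in entities:
        labels[start] = "B-" + group
        for i in range(start + 1, end + 1):
            labels[i] = "I-" + group
    return list(zip(tokens, pos_tags, labels))


tokens = "Este medicamento alivia los dolores de cabeza".split()
# Placeholder POS tags; the tagset actually used differs.
pos_tags = ["DET", "NOUN", "VERB", "ART", "NOUN", "PREP", "NOUN"]
# Entity spans after projection and clean-up: medicamento, dolores de cabeza.
entities = [(1, 1, "CHEM"), (4, 6, "DISO")]

rows = to_iob(tokens, pos_tags, entities)
for form, pos, label in rows:
    print(form, pos, label, sep="\t")
```
        </preformat>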
      </sec>
      <sec id="sec-2-3">
        <title>NE Tagger</title>
        <p>
          For performing NE recognition, we used the Tanl Tagger [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a generic, customizable
statistical sequence labeller, suitable for many tasks of sequence labelling, such as
POS tagging or Named Entity recognition. The tagger implements a Conditional
Markov Model (CMM, aka MEMM) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for sequence labelling that combines features
of Hidden Markov Models (HMMs) and Maximum Entropy models.
        </p>
        <p>
          The Tagger can be configured to use alternative types of classifiers: Maximum
Entropy or linear Support Vector classifiers [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. By complementing the classifier with dynamic
programming, the Tanl Tagger can achieve levels of accuracy similar to those of an SVM at a
much higher speed.
        </p>
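        <p>One common way to combine a conditional classifier with dynamic programming is Viterbi decoding over the label sequence. The following is a minimal sketch of that idea, not the Tanl implementation: the function names, the label set and the toy probabilities are assumptions, and in a real CMM the conditional probabilities would come from the trained classifier.</p>
        <preformat>
```python
import math


def safe_log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(length, labels, cond_prob, start="START"):
    """Find the most probable label sequence under a conditional Markov model.

    cond_prob(i, prev_label, label) returns P(label | prev_label, observation i).
    """
    if length == 0:
        return []
    # best[label] = (log-probability of the best path ending in label, that path)
    best = {y: (safe_log(cond_prob(0, start, y)), [y]) for y in labels}
    for i in range(1, length):
        step = {}
        for y in labels:
            score, path = max(
                (best[p][0] + safe_log(cond_prob(i, p, y)), best[p][1])
                for p in labels
            )
            step[y] = (score, path + [y])
        best = step
    return max(best.values())[1]


# Made-up per-token preferences for the tokens "los dolores de cabeza".
SCORES = [
    {"O": 8, "B-DISO": 1, "I-DISO": 1},
    {"O": 2, "B-DISO": 6, "I-DISO": 2},
    {"O": 4, "B-DISO": 1, "I-DISO": 5},
    {"O": 2, "B-DISO": 2, "I-DISO": 6},
]


def toy_prob(i, prev, label):
    """Toy conditional distribution: I-DISO may only follow B-DISO or I-DISO."""
    weights = dict(SCORES[i])
    if prev not in ("B-DISO", "I-DISO"):
        weights["I-DISO"] = 0
    return weights[label] / sum(weights.values())


labels = ["O", "B-DISO", "I-DISO"]
print(viterbi(len(SCORES), labels, toy_prob))
```
        </preformat>
        <p>The decoder keeps one best path per label at each position, so its cost grows linearly with the sentence length and quadratically with the number of labels.</p>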
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>
          For training Moses we used a parallel corpus consisting of 247,655 sentences from the
English-Spanish version of Medline and 1,098,333 sentences from the EMEA corpus
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We tokenized the corpus with the Tanl Tokenizer [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and then split it into a training
part with 1,323,588 sentences and a development part with 11,000 sentences.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Moses training</title>
        <p>
          We trained Moses using a KenLM [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] language model built from the
text of the whole Spanish Wikipedia, extracted using the Tanl Wikipedia
Extractor tool [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Word alignment was performed with MGIZA, a multi-threaded version of GIZA++ [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
After training, we tuned the model on the development corpus
described above.
        </p>
        <p>The Moses decoder was run using this model with default settings except for a
beam size of 500.</p>
      </sec>
      <sec id="sec-3-3">
        <title>NER Training</title>
        <p>
          The Tanl NE tagger can be customized by specifying the set of features to use.
Features are divided into local and global features. Local features include attribute
features, extracted from attributes (e.g. Form, PoS, NE) of tokens in the vicinity of the
current token, and morphological features, i.e. binary features that fire when a token
matches a given regular expression. For CLEF-ER we did not use any global features,
i.e. properties holding at the document level. We tested several configurations, starting from
the one that proved best for Italian at Evalita 2011 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The submitted official run uses
the same morphological features and these attribute features:
        </p>
        <p>[Table: attribute features used in the official run]</p>
        <p>We submitted three runs for evaluation. The official submission
(EMEA_es_man.LR5.xml) identified 417,390 entities in the 140,552 units present in the Spanish test
corpus EMEA_es_man.xml.</p>
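        <p>The two kinds of local features described earlier in this section, attribute features taken from neighbouring tokens and morphological features based on regular expressions, can be illustrated with the sketch below. The patterns, the window size and the feature naming scheme are hypothetical; they are not the configuration of the submitted runs.</p>
        <preformat>
```python
import re

# Hypothetical morphological patterns; each one yields a binary feature.
MORPH_PATTERNS = {
    "capitalized": re.compile(r"^[A-ZÁÉÍÓÚÑ]"),
    "has_digit": re.compile(r"\d"),
    "all_caps": re.compile(r"^[A-ZÁÉÍÓÚÑ]+$"),
}


def local_features(rows, i, window=1):
    """Collect attribute and morphological features for the token at index i.

    rows is a list of dicts with 'FORM' and 'POS' keys, one per token.
    """
    features = []
    # Attribute features: attributes of tokens in the vicinity of token i.
    for offset in range(-window, window + 1):
        j = i + offset
        if j >= 0 and len(rows) > j:
            for attr in ("FORM", "POS"):
                features.append(f"{attr}[{offset}]={rows[j][attr]}")
    # Morphological features: present only when the pattern matches.
    for name, pattern in MORPH_PATTERNS.items():
        if pattern.search(rows[i]["FORM"]):
            features.append("morph:" + name)
    return features


rows = [
    {"FORM": "Este", "POS": "DET"},
    {"FORM": "medicamento", "POS": "NOUN"},
    {"FORM": "alivia", "POS": "VERB"},
]
print(local_features(rows, 0))
```
        </preformat>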
        <p>Some of the identified entities appear without the corresponding CUI, while others
appear with a very large number of CUIs. The latter typically occurs for entities
with a general meaning, which during translation get associated with several different
original English words in the entity dictionary.</p>
        <p>After the submission, we were able to address this problem by cleaning up
the entity dictionary as follows:</p>
        <preformat>for each pair (e, cl) in the dictionary
    for each c in cl
        retrieve the set te of entity texts associated
            with c in the UMLS for the target language
        if e is not present in te, remove c from cl</preformat>
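        <p>The clean-up procedure can be rendered in Python roughly as follows. This is a sketch under assumptions: the UMLS terms for the target language are taken to be already loaded into a map from CUI to a set of strings, matching is exact, and the sample entries and CUIs are placeholders.</p>
        <preformat>
```python
def clean_entity_dictionary(dictionary, umls_terms):
    """Keep only the CUIs whose target-language UMLS terms include the entity text.

    dictionary maps (entity text, semantic group) to a set of CUIs;
    umls_terms maps a CUI to the set of its terms in the target language.
    """
    cleaned = {}
    for (entity, group), cuis in dictionary.items():
        cleaned[(entity, group)] = {
            cui for cui in cuis if entity in umls_terms.get(cui, set())
        }
    return cleaned


# Hypothetical dictionary entry with one spurious CUI (placeholder identifiers).
dictionary = {
    ("medicamento", "CHEM"): {"C0000001", "C0000002"},
    ("dolores de cabeza", "DISO"): {"C0000003"},
}
umls_terms = {
    "C0000001": {"medicamento", "fármaco"},
    "C0000002": {"comprimido"},
    "C0000003": {"dolores de cabeza", "cefalea"},
}

cleaned = clean_entity_dictionary(dictionary, umls_terms)
print(cleaned[("medicamento", "CHEM")])
```
        </preformat>
        <p>The sketch builds a new dictionary instead of deleting entries in place, which avoids modifying a set while iterating over it.</p>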
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We reported on our experiments in the CLEF-ER Challenge 2013 for English to Spanish.
By exploiting phrase-based machine translation and NE tagging, we were able to build
a system that can operate standalone on any text in the target language, effectively
transferring the knowledge on entities from one language to another.</p>
      <p>The accuracy of the solution can be further refined by making use of data from the
UMLS, as we started doing in our latest experiments.</p>
      <p>Acknowledgements. Partial support for this work was provided by project RIS (POR
RIS of the Regione Toscana, CUP n° 6408.30122011.026000160).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Attardi</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Dei Rossi</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Simi</surname>, <given-names>M.</given-names></string-name>:
          <article-title>The Tanl Pipeline</article-title>.
          <source>In: Proc. of Workshop on Web Services and Processing Pipelines in HLT, Malta</source>
          (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>McCallum</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Freitag</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Pereira</surname>, <given-names>F.</given-names></string-name>:
          <article-title>Maximum Entropy Markov Models for Information Extraction and Segmentation</article-title>.
          <source>In: Proc. ICML 2000</source>,
          <fpage>591</fpage>-<lpage>598</lpage>
          (<year>2001</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Attardi</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Berardi</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Dei Rossi</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Simi</surname>, <given-names>M.</given-names></string-name>:
          <article-title>The Tanl Tagger for Named Entity Recognition on Transcribed Broadcast News at Evalita 2011</article-title>.
          In: Magnini, B., et al. (eds.)
          <source>Proc. of Evalita 2011</source>,
          LNCS 7689, pp. <fpage>116</fpage>-<lpage>125</lpage>
          (<year>2013</year>). ISBN 978-3-642-35827-2
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Civit Torruella</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Martí Antonín</surname>, <given-names>M. A.</given-names></string-name>:
          <article-title>Design Principles for a Spanish Treebank</article-title>.
          <source>In: Proc. of the First Workshop on Treebanks and Linguistic Theories (TLT)</source>
          (<year>2002</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Tiedemann</surname>, <given-names>J.</given-names></string-name>:
          <article-title>News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces</article-title>.
          In: Nicolov, N., Bontcheva, K., Angelova, G., Mitkov, R. (eds.)
          <source>Recent Advances in Natural Language Processing</source>.
          Volume V. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria,
          <fpage>237</fpage>-<lpage>248</lpage>
          (<year>2009</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Yu</surname>, <given-names>H. F.</given-names></string-name>,
          <string-name><surname>Hsieh</surname>, <given-names>C. J.</given-names></string-name>,
          <string-name><surname>Chang</surname>, <given-names>K. W.</given-names></string-name>,
          <string-name><surname>Lin</surname>, <given-names>C. J.</given-names></string-name>:
          <article-title>Large linear classification when data cannot fit in memory</article-title>.
          <source>ACM Trans. on Knowledge Discovery from Data</source>,
          <volume>5</volume>:23:<fpage>1</fpage>-<lpage>23</lpage>
          (<year>2012</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Koehn</surname>, <given-names>P.</given-names></string-name>, et al.:
          <article-title>Moses: Open source toolkit for statistical machine translation</article-title>.
          <source>In: Proc. of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions</source>.
          ACL (<year>2007</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <article-title>Tanl Wikipedia Extractor</article-title>.
          http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Heafield</surname>, <given-names>K.</given-names></string-name>:
          <article-title>KenLM: Faster and smaller language model queries</article-title>.
          <source>In: Proc. of the Sixth Workshop on Statistical Machine Translation</source>.
          ACL (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Och</surname>, <given-names>F. J.</given-names></string-name>,
          <string-name><surname>Ney</surname>, <given-names>H.</given-names></string-name>:
          <article-title>GIZA++: Training of statistical translation models</article-title>
          (<year>2000</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>