<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual semantic resources and parallel corpora in the biomedical domain: the CLEF-ER challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dietrich Rebholz-Schuhmann</string-name>
          <email>rebholz@ifi.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Clematide</string-name>
          <email>clematide@ifi.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Rinaldi</string-name>
          <email>rinaldi@ifi.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Senay Kafkas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik M. van Mulligen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chinh Bui</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Hellrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Lewin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Milward</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Poprat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Jimeno-Yepes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Udo Hahn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan A. Kors</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computational Linguistics, University of Zurich</institution>
          ,
          <addr-line>CH</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Multilingual terminological resources can be drawn from parallel corpora in the languages of interest, possibly exploiting machine translation solutions for term identification. This is the main objective of the CLEF-ER challenge, which involves parallel corpora in English and other languages. The challenge organisers have gathered and normalized documents from the biomedical domain: titles from scientific articles, drug labels from the European Medicines Agency, and patent texts from the European Patent Office. The parallel units have been identified, marked up and formatted for future use. The three different corpora show comparable sizes. In preparation for the CLEF-ER challenge, the documents have been annotated with terminologies in English and non-English languages (de, fr, es, and nl) and the pre-existing terminological resource has been optimized for the entity recognition task in CLEF-ER. Finally, a silver standard corpus for entity annotations and their identifiers has been produced on the English documents for the evaluation of challenge contributions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>Biomedical IT solutions require terminological resources (TRs) to achieve
interoperability of modules and data. Increasingly such IT solutions require
multilingual TRs, since they are used in different countries to capture and encode
patient-related information in the home language. To this end, the biomedical
terminologies have to be produced in different languages and entities have to
be identifiable across languages, i.e. the concepts should carry the same
identifier, to enable their reuse and to improve the exchange of data across cultural
and language borders. However, existing biomedical multilingual TRs are too
limited in their size and their development has to be supported by automatic
means that analyse available document resources for the acquisition of novel
biomedical terms.</p>
      <p>The production of multilingual TRs is time-consuming and requires novel
approaches to produce them at a large scale. One way to improve the
development is to exploit multilingual parallel documents ("corpora") and
to use automatic annotation and alignment methods to identify relevant terms.
Such multilingual documents are available from the European Patent Office<sup>1</sup>,
EMEA<sup>2</sup> and the Medline<sup>3</sup> distribution. Although they cover different topics, all
of them contain potentially novel terms.</p>
      <p>The documents from the available corpora can be annotated with automatic
means for the identification of entities in the different languages and
subsequently mined for novel terms. When different annotations have been provided,
computational methods have to be applied to align the annotations in a single
corpus (called "silver standard corpus", SSC). With the SSC in place,
we achieve two goals: (1) we can evaluate the results from other annotation
solutions against the SSC, and (2) we can attribute the concept unique identifiers
(CUIs) from the English annotated documents to the non-English documents.
No gold standard corpus (GSC) is available for the assessment of the correct
annotation of concept identifiers to the terms in the parallel corpora.</p>
      <p>After the term candidates have been collected, including their CUIs, they
have to be validated and integrated into the existing TR, i.e. the novel terms have
to be aligned with the existing state-of-the-art TRs.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>The main terminological resources in the biomedical domain are maintained and
partially produced by the National Library of Medicine (NLM). In addition, the
NLM provides sophisticated resources for the processing of natural language text
such as MetaMap, a text categorizer that attributes CUIs to text passages. The
following terminological resources, amongst others, play an important role in
the biomedical domain: The Medical Subject Headings (MeSH) have been
produced to categorize the abstracts in PubMed in such a way that information
retrieval is performed at high efficiency. The headings are assigned manually to
text documents, but computer programmes support the assignment by
also attributing headings by automatic means (see MetaMap above). The Systematized
Nomenclature of Human and Veterinary Medicine (SNOMED-CT) resource is
a medical terminology for the encoding of diseases. It forms a standard that is
well established throughout health organisations in different countries and has
been included in the distribution of the Unified Medical Language System (UMLS)
under specific licensing requirements. The Medical Dictionary for Regulatory
Activities (MedDRA) is a terminological resource that enables encoding of adverse
drug events for regulatory affairs.</p>
      <p><sup>1</sup> http://www.epo.org/</p>
      <p><sup>2</sup> http://opus.lingfil.uu.se/EMEA.php</p>
      <p><sup>3</sup> http://www.ncbi.nlm.nih.gov/pubmed/</p>
      <p>The analysis of clinical and biomedical documents requires tools and
resources for efficient processing. Advanced technologies have been developed in
recent years that enable semantic annotation, e.g. entity recognition for a large
number of biomedical entities, as well as the efficient and precise parsing of
the literature. The research has focused on the development of solutions for the
bioinformatics research community, but increasingly solutions from the research
domain are used in the clinical environment as well.</p>
      <p>Clinical data is best described by the standardized reporting of clinical
parameters, for example the measurements of physiological and patho-physiological
parameters from a sample of blood, and of phenotypic parameters given in the
natural language of the country. It remains a challenge to read the notes of
the clinical doctor, but efficient solutions have been developed to transform the
patient record into a standardized representation for further exploitation.</p>
      <p>A number of challenges have been introduced for the assessment of tools
and solutions that process biomedical text and normalize identified facts and
entities. In the domain of molecular biology, the main focus was on the
correct identification of entities such as genes, proteins, drugs and diseases, and
after that, the extraction of molecular interactions, gene regulatory events and
gene-disease associations. The series of BioCreAtIvE challenges engaged the
research community in several challenges with different characteristics. As an
alternative, the BioNLP Shared Task tackled similar problems, and from early
on supported the exploitation of ontological resources for the challenges. All
these challenges provide manually annotated corpora as a gold standard corpus
to measure the performance of the annotation solutions. As an alternative to this
general paradigm, the CALBC challenge made use of a very large corpus
that has been generated with automatic means from existing annotation
solutions, and it thus represented a "silver-standard" corpus.</p>
      <p>The "Conference and Labs of the Evaluation Forum" (CLEF) has been
formed to tackle challenges and their evaluation as a joint effort. It is
organized on a yearly basis and has repeatedly proposed new challenges
for the research community, including the analysis of patents for improved
information retrieval (CLEF-IP), the analysis of medical records (CLEFeHealth)
and even the analysis of multilingual and multimodal data.</p>
      <p>The CLEF-ER challenge tackles a combination of different tasks: (1) entity
mention annotation, (2) entity normalisation, and (3) multilingual analysis
in the sense that participants could use a resource in English to annotate the
non-English parallel document. Furthermore, the challenge is not tuned
to a single corpus, but includes patent texts and scientific medical texts alike,
provides a reference terminological resource, and requires participants to process a
large-scale corpus, larger than typically available in challenges that make use of a gold
standard corpus.</p>
    </sec>
    <sec id="sec-3">
      <title>Material and Method</title>
      <p>The CLEF-ER challenge focuses on the languages English (en), Spanish (es),
French (fr), German (de), and Dutch (nl). This selection has been motivated
by the availability of resources, i.e. terminologies and documents alike, in the
different languages. It was an important requirement that documents in a
non-English language must have a parallel document in English; at the same
time, it was not required that a pair of documents in English and a non-English
language be accompanied by yet another document in a third language.</p>
      <sec id="sec-3-1">
        <title>Terminologies</title>
        <p>A number of resources have been prepared for the CLEF-ER challenge:
terminological resources and documents. The terminological resource is based on the
UMLS and makes use of the contained terms, the standardized file formats and
the licensing server of the National Center for Biotechnology Information (NCBI) for access
to the terminological resource.</p>
        <p>In principle, the full set of UMLS terms could be relevant for the annotation of
the multilingual documents, but a number of constraints have to be considered.
First, not all terms are relevant for the documents that have been prepared for
the CLEF-ER challenge. Second, the overhead for the processing of the full set
of terms may be high and may distract from the challenge tasks; in other words,
it is advantageous for challenge participants to handle only a reduced set of
the terms relevant for the challenge. Third, the evaluation of the results can be
improved by reducing the term set, since excluding terms and categories
reduces ambiguities, i.e. fewer semantic categories (called "semantic types" and
"semantic groups") have to be distinguished if a term is polysemous with regard
to the semantic categories.</p>
        <p>The TR is required for the term normalization: the contained English and
non-English concepts are provided with their CUIs. The CUIs form the key result
of the annotation task in the CLEF-ER challenge, since the challenge
participants have to annotate the text with the CUIs, and the challenge
organizers evaluate the annotations against a silver standard corpus (SSC). This
subselection of the UMLS terminological resource is called the MANTRA
terminological resource (MTR). The terminology is delivered in the OBO file format,
which has been proposed by the Open Biomedical Ontology (OBO) Foundry
and is maintained by the National Center for Biomedical Ontology. Its aim is
to create a common format for controlled vocabularies.</p>
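        <p>As a minimal sketch of how such an OBO file could be turned into a term-to-CUI lookup, the following Python code reads [Term] stanzas with the standard OBO tags id, name and synonym. The sample stanza, the CUI C0000001 and the way non-English terms are stored as synonyms are assumptions made for illustration; the actual layout of the MTR file may differ.</p>

```python
def parse_obo_terms(text):
    """Parse the [Term] stanzas of an OBO document into a list of dicts."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {"synonyms": []} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "synonym":
                # OBO synonyms are quoted, e.g. synonym: "text" EXACT []
                current["synonyms"].append(value.split('"')[1])
            elif tag in ("id", "name"):
                current[tag] = value
    return terms


def build_lookup(terms):
    """Map every lower-cased name or synonym to the identifier of its term."""
    lookup = {}
    for term in terms:
        for label in [term["name"]] + term["synonyms"]:
            lookup[label.lower()] = term["id"]
    return lookup


# Hypothetical sample; identifiers and terms are invented for illustration.
SAMPLE_OBO = '''format-version: 1.2

[Term]
id: C0000001
name: example disorder
synonym: "Beispielerkrankung" EXACT []
synonym: "maladie exemple" EXACT []

[Typedef]
id: part_of
'''

terms = parse_obo_terms(SAMPLE_OBO)
lookup = build_lookup(terms)
print(lookup["maladie exemple"])
```

        <p>With such a lookup, a term found in a non-English unit resolves to the same identifier as its English counterpart, which is the property the normalization task relies on.</p>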
        <p>The UMLS licence agreement requires that users validate their licenses when
accessing the TR. This task is performed by querying the license server of
the NLM with the right credentials, for example using RESTful services through
a server<sup>4</sup> that validates the username and the password of the licensee. The
terminological resource is accessible through the download site at the Erasmus
University Medical Center Rotterdam.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Selection of parallel corpora</title>
        <p>The corpora have been selected from different resources
to serve two objectives: (1) enabling extraction of novel terms for the
multilingual TR as part of the MANTRA project work and the outcomes from the
CLEF-ER challenge, and (2) offering parallel corpora to the CLEF-ER
challenge participants as an input to solve term recognition tasks with and without
machine-translation solutions.</p>
        <p>A number of requirements have to be met to fulfill the given objectives. The
corpora have to cover the domain knowledge that is under investigation. This is
the case for the scientific literature, but also for documents that deal with drug
labels. In addition, the MANTRA project partners selected patent documents
from the European Patent Office, since this distribution covers a
significant number of documents that deal with biomedical domain knowledge but
differ in their language from the above-mentioned types of documents.</p>
        <p>The diversity between the document repositories is high with regard to
the amount of available content, the type of the documents and the languages
that are supported in the repository. From the patent corpus, mainly the claim
section is available as a parallel document, and from Medline only the titles of
the scientific articles can be aligned across languages. The EMEA repository
allows full documents to be identified in parallel. For the patent claims we could
produce parallel documents for English, German and French, whereas EMEA
and Medline cover all selected languages. The EMEA corpus has already been
exploited to train and test statistical machine translation solutions.</p>
        <p>Note that Medline abstracts have always been translated from the
non-English language into English and as a consequence the parallel units, i.e. the
titles, are restricted to language pairs between English and the non-English
language. In more detail, all non-English units, i.e. patent claim sections, Medline
titles, or EMEA documents, have a parallel unit in English. Also, every English
unit has a parallel non-English unit, but only a smaller portion of English units
has a parallel non-English unit in two, three or four languages.</p>
        <p><sup>4</sup> https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser?licenseCode=...</p>
      </sec>
      <sec id="sec-3-3">
        <title>Optimizing the TR for the challenge</title>
        <p>The MANTRA project partners have annotated the documents with their
in-house annotation solutions, e.g. OntoGene, Peregrine, and UIMA-based annotation
solutions from the JulieLab and from Averbis. All annotation solutions apply the
provided UMLS terminological resource. After this phase, the English annotated
corpora have been harmonized and evaluated. This included the assessment of
the number of annotations that resulted from the different semantic categories
in the UMLS. Based on the distribution of the annotations, the MANTRA project
partners decided to reduce the number of categories in the UMLS
terminological resource and to extract those categories from UMLS that have
higher relevance to the multilingual terminological resource and, in
addition, show good coverage in the annotated corpora. The final solution is called the
MTR and covers the semantic groups: anatomy, chemicals and drugs, devices,
disorders, geographic areas, living beings, objects, phenomena and physiology
(ANAT, CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS). All terms
that are categorized with other semantic groups have then been removed to
produce the final terminological resources for the CLEF-ER challenge.</p>
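        <p>The reduction to the nine semantic groups can be sketched as a simple filter. The record layout (a dict with the keys cui and groups) and the sample CUIs are hypothetical. The sketch also assumes that a term carrying both a retained and an excluded group is kept with its retained group only, which the text does not state explicitly.</p>

```python
# The nine semantic groups retained in the MTR, as listed in the text.
MTR_GROUPS = frozenset(
    ["ANAT", "CHEM", "DEVI", "DISO", "GEOG", "LIVB", "OBJC", "PHEN", "PHYS"]
)


def filter_to_mtr(terms, allowed=MTR_GROUPS):
    """Keep terms with at least one allowed group and drop the other group labels."""
    kept = []
    for term in terms:
        groups = sorted(allowed.intersection(term["groups"]))
        if groups:
            kept.append({"cui": term["cui"], "groups": groups})
    return kept


# Hypothetical records; the CUIs are invented for illustration.
sample_terms = [
    {"cui": "C0000001", "groups": ["DISO"]},
    {"cui": "C0000002", "groups": ["GENE"]},
    {"cui": "C0000003", "groups": ["CONC", "CHEM"]},
]
mtr_terms = filter_to_mtr(sample_terms)
print([term["cui"] for term in mtr_terms])
```

        <p>Under the stricter reading, in which any term with an excluded group is removed entirely, the condition would instead test that all groups of the term are in the allowed set.</p>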
      </sec>
      <sec id="sec-3-4">
        <title>Preparation of the silver standard corpus</title>
        <p>The annotated corpora in English have been processed to produce the silver
standard corpus as described in.</p>
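        <p>The harmonization procedure itself is not spelled out in this section. Purely as an illustration, one common way to merge the output of several annotation systems is agreement voting: an annotation enters the silver standard if at least a given number of systems produced it. The exact-match criterion on (start, end, CUI), the threshold of two votes and the sample data below are all assumptions and not the documented CLEF-ER method.</p>

```python
from collections import Counter


def build_silver_standard(annotation_sets, min_votes=2):
    """Keep annotations proposed by at least min_votes systems (exact match)."""
    votes = Counter()
    for annotations in annotation_sets:
        # set() ensures that each system votes at most once per annotation.
        votes.update(set(annotations))
    return sorted(a for a, n in votes.items() if n >= min_votes)


# Hypothetical annotations as (start offset, end offset, CUI) tuples.
system_a = [(0, 7, "C0000001"), (12, 20, "C0000002")]
system_b = [(0, 7, "C0000001")]
system_c = [(0, 7, "C0000001"), (12, 20, "C0000002"), (25, 30, "C0000003")]

silver = build_silver_standard([system_a, system_b, system_c])
print(silver)
```

        <p>Raising min_votes trades recall for precision in the resulting corpus; relaxed boundary matching would be a further design choice that this sketch leaves out.</p>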
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Resource and evaluation</title>
      <p>The following resources have been made available to the challenge participants:
terminologies and corpora.</p>
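      <table-wrap>
        <table>
          <thead>
            <tr>
              <th rowspan="2">Language</th>
              <th colspan="3">Units</th>
              <th colspan="3">Words</th>
            </tr>
            <tr>
              <th>EMEA</th>
              <th>Medline</th>
              <th>Patent</th>
              <th>EMEA</th>
              <th>Medline</th>
              <th>Patent</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>en</td><td>364,005</td><td>1,593,546</td><td>120,638</td><td>5,120,067</td><td>15,775,814</td><td>6,034,104</td></tr>
            <tr><td>de</td><td>364,005</td><td>719,232</td><td>120,637</td><td>4,571,203</td><td>5,996,504</td><td>5,194,032</td></tr>
            <tr><td>fr</td><td>373,152</td><td>572,176</td><td>120,636</td><td>5,515,157</td><td>6,023,945</td><td>6,689,812</td></tr>
            <tr><td>es</td><td>366,769</td><td>247,655</td><td /><td>5,897,467</td><td>2,573,056</td><td /></tr>
            <tr><td>nl</td><td>360,418</td><td>54,483</td><td /><td>5,130,890</td><td>435,390</td><td /></tr>
          </tbody>
        </table>
      </table-wrap>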
      <sec id="sec-4-1">
        <title>Statistics across the corpora</title>
        <p>The corpora differ in their quality and also in the type of units that have been
identified from the documents, i.e. Medline titles in contrast to paragraphs in
EMEA documents and claim sections in the patents. On the other hand, all
corpora are sufficiently large to support the needs of the CLEF-ER challenge: the corpora
pose a challenge to the participants since the annotation of the corpora requires
automatic means, and also a challenge to the MANTRA project partners, since
the harmonization of the annotations should deliver a valuable resource to the
public for future exploitation (as in the CALBC challenge).</p>
        <p>The corpora also differ in their sizes (see tbl. 2). The Medline titles form the
biggest corpus regarding the number of units and the number of words (as well
as the number of characters). The EMEA corpus appears to be larger than the
patent corpus when considering the number of units, but is of similar size when
measuring the size based on words and characters. Since the EMEA corpus is
available in all languages, it is well suited for the CLEF-ER challenge.</p>
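        <p>The difference between counting units and counting words can be made concrete with the English figures from the corpus statistics. The short calculation below derives the average number of words per unit; the variable and function names are illustrative only.</p>

```python
# English unit and word counts taken from the corpus statistics above.
UNITS = {"EMEA": 364005, "Medline": 1593546, "Patent": 120638}
WORDS = {"EMEA": 5120067, "Medline": 15775814, "Patent": 6034104}


def words_per_unit(units, words):
    """Average number of words per parallel unit for each corpus."""
    return {name: words[name] / units[name] for name in units}


ratios = words_per_unit(UNITS, WORDS)
for name in sorted(ratios):
    print(name, round(ratios[name], 1))
```

        <p>A patent claim section holds roughly 50 words on average, an EMEA unit about 14 and a Medline title about 10, which explains why the patent corpus has far fewer units than the EMEA corpus but a comparable word count.</p>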
        <p>The units from the patent corpus are available in English, French and German
as stated before, whereas the other two corpora are provided in all languages.
In addition, the annotation of the patent corpus has shown that the language in
the corpus has a high diversity and the terms from the MTR are often used in
a non-specific way with regard to the biomedical purpose of the MTR, which
makes the patent corpus less suitable for the entity recognition task.</p>
        <p>By contrast, the pairwise units from the Medline titles show high
heterogeneity and at the same time, the use of the terminology in Medline is most specific
with regards to the purpose of the MeSH, MedDRA, and SNOMED-CT terms
in the MTR.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Evaluation</title>
        <p>The partners have contributed sets of annotated documents where each corpus is
either in English or a non-English language (de, fr, es, nl). All identified mentions
of an entity have been annotated with a CUI. The following analyses have been
performed to determine which corpora will be included in the challenges, and
what languages should be covered in the challenges.</p>
        <p>One assumption is that the corpora most suitable for the challenges are
those that comply best with the terminological resources and, ultimately, with the
domain knowledge overall. The following parameters can be used to measure
"compliance" between the terminological resources and/or the domain
knowledge and the corpora.</p>
        <list list-type="order">
          <list-item><p>The number of annotations that can be identified using the prepared
terminological resource in the English corpus (L1 for language 1, see fig. 1).</p></list-item>
          <list-item><p>The number of annotations that can be identified using the prepared
terminological resource in the non-English language in the parallel corpus (L2 for
language two, i.e. the non-English language, see fig. 1).</p></list-item>
          <list-item><p>The previous parameter, but now counting only those English annotations
for which a non-English translation is available for the same CUI in the
non-English language (L2, see fig. 2).</p></list-item>
        </list>
        <p>Counting all English annotations as reference (see tbl. 1, parameter 2) gives
an analysis that is less generous to the annotation solutions ("pessimistic" or
"real world" evaluation) than counting only those English annotations that
comply with the third parameter ("optimistic" or "idealistic" evaluation,
parameter 3). The latter evaluates the performance under the condition that for each
English term there exists a non-English translation in the TR, i.e. it ignores a
number of English annotations where no translation of the term can be found
in L2 anyway.</p>
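        <p>The difference between the two evaluation modes can be illustrated with a small sketch. It assumes that performance is measured as recall over the set of CUIs in one parallel unit, which is a simplification chosen here and not a metric stated in the text; the function names and the sample CUIs are hypothetical.</p>

```python
def recall(reference, predicted):
    """Fraction of reference CUIs that also occur among the predicted CUIs."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference.intersection(predicted)) / len(reference)


def evaluate_unit(english_cuis, l2_cuis, cuis_with_l2_term):
    """Return (pessimistic, optimistic) recall for one parallel unit."""
    # Pessimistic: every English annotation counts as reference.
    pessimistic = recall(english_cuis, l2_cuis)
    # Optimistic: only CUIs that have a term in the L2 part of the TR count.
    translatable = set(english_cuis).intersection(cuis_with_l2_term)
    optimistic = recall(translatable, l2_cuis)
    return pessimistic, optimistic


# Hypothetical data: four English annotations, two found in the German unit,
# and three of the four CUIs have a German term in the TR.
english = ["C0000001", "C0000002", "C0000003", "C0000004"]
german = ["C0000001", "C0000002"]
has_german_term = {"C0000001", "C0000002", "C0000003"}

scores = evaluate_unit(english, german, has_german_term)
print(scores)
```

        <p>In this example the pessimistic score is 2 of 4 and the optimistic score is 2 of 3, because the fourth English annotation has no German term in the TR and is therefore left out of the optimistic reference.</p>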
        <p>A number of open questions have been resolved through the analysis of the
results that were given by the annotation of the corpora in English and the
non-English languages by the project partners. The open questions were concerned
with the selection of the languages for the challenge, the selection of corpora and
of the semantic groups in the TR.</p>
        <p>First, the languages have been limited to German, Spanish, French and Dutch
apart from English. Second, the corpora have been limited to: Medline, patents,
and EMEA. The sizes of these three corpora are quite similar and the
diversity between the corpora should contribute to the diversity in the challenge.</p>
        <p>Finally, the terminological resource is limited to the semantic groups: ANAT,
CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS. The other groups
have been excluded, since they did not show enough coverage (GENE, ORGA,
OCCU), or were too unspecific, i.e. not specific enough for the biomedical
domain (CONC), resulting in very heterogeneous annotation results across the
different corpora and languages. The TR has thus been reduced to
the sets of terms that are required for the challenge.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The CLEF-ER challenge organized by the MANTRA project partners is the
first of its kind, tackling a number of challenges in a large-scale annotation task
that allows multilingual approaches. The main objective is the identification of
biomedical entities (and concepts) in multilingual documents, where the
documents are available as part of parallel corpora involving the English language
and at least one non-English language.</p>
      <p>The corpora stem from the scientific literature (Medline abstracts), drug
labels (EMEA documents) and from patent texts (European Patent Office).
A reference terminology has been provided from UMLS and has been optimized
for the challenge participants. The English corpora have been annotated with
the terms from the MTR using the annotation solutions of the project partners,
leading to a silver standard corpus, which has been distributed to the challenge
participants. These annotated corpora give the challenge participants different
opportunities to contribute to the challenge.</p>
      <p>Altogether, the CLEF-ER challenge will work towards solutions that
identify biomedical terms in multilingual documents of different kinds, where the
proposed solutions have to cope with large amounts of terms and large data
resources. The overall outcome will help to establish the semantic web in
healthcare and to allow interoperability of IT solutions across country borders and
languages.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This work was funded by the European Commission STREP grant number
296410 ("Mantra", FP7-ICT-2011-4.1).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>