
Multilingual semantic resources and parallel corpora in the biomedical domain: the CLEF-ER challenge

Dietrich Rebholz-Schuhmann

rebholz@ifi.uzh.ch

Simon Clematide

clematide@ifi.uzh.ch

Fabio Rinaldi

rinaldi@ifi.uzh.ch

Senay Kafkas

Erik M. van Mulligen

Chinh Bui

Johannes Hellrich

Ian Lewin

David Milward

Michael Poprat

Antonio Jimeno-Yepes

Udo Hahn

Jan A. Kors

(1) Department of Computational Linguistics, University of Zurich, CH (rebholz, clematide

Multilingual terminological resources can be drawn from parallel corpora in the languages of interest, possibly exploiting machine translation solutions for term identification. This is the main objective of the CLEF-ER challenge, which involves parallel corpora in English and other languages. The challenge organisers have gathered and normalized documents from the biomedical domain: titles from scientific articles, drug labels from the European Medicines Agency, and patent texts from the European Patent Office. The parallel units have been identified, marked up and formatted for future use. The three different corpora show comparable sizes. In preparation for the CLEF-ER challenge, the documents have been annotated with terminologies in English and non-English languages (de, fr, es, and nl), and the pre-existing terminological resource has been optimized for the entity recognition task in CLEF-ER. Finally, a silver standard corpus for entity annotations and their identifiers has been produced on the English documents for the evaluation of challenge contributions.

Motivation

Biomedical IT solutions require terminological resources (TRs) to achieve interoperability of modules and data. Increasingly, such IT solutions require multilingual TRs, since they are used in different countries to capture and encode patient-related information in the home language. To this end, the biomedical terminologies have to be produced in different languages, and entities have to be identifiable across languages, i.e. the concepts should carry the same identifier, to enable their reuse and to improve the exchange of data across cultural and language borders. However, existing biomedical multilingual TRs are too limited in size, and their development has to be supported by automatic means that analyse available document resources for the acquisition of novel biomedical terms.

The production of multilingual TRs is time-consuming and requires novel approaches to produce them at a large scale. One way to improve the development is to exploit multilingual parallel documents ("corpora") and to use automatic annotation and alignment methods to identify relevant terms. Such multilingual documents are available from the European Patent Office [1], EMEA [2] and the Medline [3] distribution. Although they cover different topics, all of them contain potentially novel terms.

The documents from the available corpora can be annotated by automatic means to identify entities in the different languages, and subsequently mined for novel terms. When different annotations have been provided, computational methods have to be applied to align the annotations in a single corpus (called "silver standard corpus", SSC). With the SSC in place, we achieve two goals: (1) we can evaluate the results from other annotation solutions against the SSC, and (2) we can attribute the concept unique identifiers (CUIs) from the English annotated documents to the non-English documents. No gold standard corpus (GSC) is available for assessing whether concept identifiers have been correctly assigned to the terms in the parallel corpora.
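The two uses of the SSC described above can be illustrated with a minimal sketch. It assumes that annotations are harmonized by simple agreement voting between annotation systems, which is only one possible alignment method and not necessarily the one used for CLEF-ER. The tuple layout, the function names, the unit identifiers and the CUIs are all made up for illustration.

```python
from collections import Counter


def build_silver_standard(annotation_sets, min_votes=2):
    """Keep annotations proposed by at least `min_votes` systems.

    Each annotation is a tuple (unit_id, start, end, cui).
    This voting scheme is an assumption, not the documented CLEF-ER method.
    """
    votes = Counter()
    for annotations in annotation_sets:
        # set() ensures a system cannot vote twice for the same annotation
        votes.update(set(annotations))
    return {ann for ann, count in votes.items() if count >= min_votes}


def cuis_per_unit(annotations):
    """Group the CUIs of a set of annotations by unit identifier."""
    grouped = {}
    for unit_id, _start, _end, cui in annotations:
        grouped.setdefault(unit_id, set()).add(cui)
    return grouped


def transfer_cuis(english_cuis, parallel_units):
    """Attribute the CUIs of each English unit to its parallel units.

    `parallel_units` maps an English unit id to a list of non-English unit ids.
    """
    return {
        target: set(cuis)
        for en_id, cuis in english_cuis.items()
        for target in parallel_units.get(en_id, [])
    }


# Hypothetical output of three annotation systems on one English unit.
system_a = [("u1", 0, 7, "C0000001"), ("u1", 12, 20, "C0000002")]
system_b = [("u1", 0, 7, "C0000001")]
system_c = [("u1", 0, 7, "C0000001"), ("u1", 12, 20, "C0000002"),
            ("u1", 25, 30, "C0000003")]

ssc = build_silver_standard([system_a, system_b, system_c], min_votes=2)
transferred = transfer_cuis(cuis_per_unit(ssc), {"u1": ["u1-de"]})
```

In this toy example the annotation proposed by only one system is dropped, and the German unit `u1-de` inherits the two remaining CUIs from its English counterpart.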

After the term candidates have been collected (including their CUIs), they have to be validated and integrated into the existing TR, i.e. the novel terms have to be aligned with the existing state-of-the-art TRs.

Background

The main terminological resources in the biomedical domain are maintained and partially produced by the National Library of Medicine (NLM). In addition, the NLM provides sophisticated resources for the processing of natural language text such as MetaMap, a text categorizer that attributes CUIs to text passages. The following terminological resources, amongst others, play an important role in the biomedical domain. The Medical Subject Headings (MeSH) have been produced to categorize the abstracts in PubMed in such a way that information retrieval is performed at high efficiency. The headings are assigned manually to text documents, but computer programmes support the assignment by also attributing headings automatically (see MetaMap above). The Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) is a medical terminology for the encoding of diseases. It forms a standard that is well established throughout health organisations in different countries and has been included in the distribution of the Unified Medical Language System (UMLS) under specific licensing requirements. The Medical Dictionary for Regulatory Activities (MedDRA) is a terminological resource that enables the encoding of adverse drug events for regulatory affairs.

[1] http://www.epo.org/
[2] http://opus.lingfil.uu.se/EMEA.php
[3] http://www.ncbi.nlm.nih.gov/pubmed/

The analysis of clinical and biomedical documents requires tools and resources for efficient processing. Advanced technologies have been developed in recent years that enable semantic annotation, e.g. the recognition of a large number of biomedical entities, as well as the efficient and precise parsing of the literature. The research has focused on the development of solutions for the bioinformatics research community, but increasingly solutions from the research domain are used in the clinical environment as well.

Clinical data is best described by the standardized reporting of clinical parameters, for example the measurements of physiological and patho-physiological parameters from a sample of blood, and of phenotypic parameters given in the natural language of the country. It remains a challenge to read the notes of the clinical doctor, but efficient solutions have been developed to transform the patient record into a standardized representation for further exploitation.

A number of challenges have been introduced for the assessment of tools and solutions that process biomedical text and normalize identified facts and entities. In the domain of molecular biology, the main focus has been on the correct identification of entities such as genes, proteins, drugs and diseases, and after that, on the extraction of molecular interactions, gene regulatory events and gene-disease associations. The series of BioCreAtIve challenges engaged the research community in several challenges with different characteristics. As an alternative, the BioNLP Shared Task tackled similar problems and from early on supported the exploitation of ontological resources for the challenges. All these challenges provide manually annotated corpora as a gold standard corpus to measure the performance of the annotation solutions. As an alternative to this general paradigm, the CALBC challenge made use of a very large corpus that was generated by automatic means from existing annotation solutions, and which thus represented a "silver standard" corpus.

The "Conference and Labs of the Evaluation Forum" (CLEF) has been formed to tackle challenges and their evaluation as a joint effort. It is organized on a yearly basis and has repeatedly proposed new challenges for the research community, including the analysis of patents for improved information retrieval (CLEF-IP), the analysis of medical records (CLEFeHealth) and even the analysis of multilingual and multimodal data.

The CLEF-ER challenge tackles a combination of different tasks: (1) entity mention annotation, (2) entity normalisation, and (3) multilingual analysis, in the sense that participants could use a resource in English to annotate the non-English parallel document. Furthermore, the challenge is not tuned to a single corpus, but includes patent texts and scientific medical texts alike, provides a reference terminological resource, and requires processing a large-scale corpus, larger than those typically available in challenges that make use of a gold standard corpus.

Material and Method

The CLEF-ER challenge focuses on the languages English (en), Spanish (es), French (fr), German (de), and Dutch (nl). This selection has been motivated by the availability of resources, i.e. terminologies and documents alike, in the different languages. It was an important requirement that every document in a non-English language must have a parallel document in English; at the same time, it was not required that a pair of documents in English and a non-English language be accompanied by yet another document in a third language.

Terminologies

A number of resources have been prepared for the CLEF-ER challenge: terminological resources and documents. The terminological resource is based on the UMLS and makes use of the contained terms, the standardized file formats and the licensing server of the National Center for Biotechnology Information (NCBI) for access to the terminological resource.

In principle, the full set of UMLS terms could be relevant for the annotation of the multilingual documents, but a number of constraints have to be considered. First, not all terms are relevant for the documents that have been prepared for the CLEF-ER challenge. Second, the overhead for processing the full set of terms may be high and may distract from the challenge tasks; in other words, it is advantageous for challenge participants to handle only a reduced set of the terms relevant for the challenge. Third, the evaluation of the results can be improved by reducing the term set, since excluding terms and categories reduces ambiguity, i.e. fewer semantic categories (called "semantic types" and "semantic groups") have to be distinguished if a term is polysemous with regard to the semantic categories.

The TR is required for the term normalization: the contained English and non-English concepts are provided with their CUIs. The CUIs form the key result of the annotation task in the CLEF-ER challenge, since the challenge participants have to annotate the text with the CUIs, and the challenge organizers evaluate the annotations against a silver standard corpus (SSC). This subselection of the UMLS terminological resource is called the MANTRA terminological resource (MTR). The terminology is delivered in the OBO file format, which has been proposed by the Open Biomedical Ontologies (OBO) Foundry and is maintained by the National Center for Biomedical Ontology. Its aim is to create a common format for controlled vocabularies.
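As an illustration of how a terminology delivered in the OBO format can be read, the following minimal sketch extracts the identifier, the name and the synonyms from `[Term]` stanzas. It handles only a small subset of the OBO syntax. The sample stanza, the CUI and the way non-English terms are stored as synonyms are assumptions made for this example, not a description of the actual MTR files.

```python
import re

# Matches the quoted text at the start of a synonym value,
# e.g. "some term" EXACT []
SYNONYM_RE = re.compile(r'^"((?:[^"\\]|\\.)*)"')


def parse_obo_terms(text):
    """Return a list of dicts with 'id', 'name' and 'synonyms' per [Term] stanza."""
    terms = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = None
            if line == "[Term]":
                current = {"id": None, "name": None, "synonyms": []}
                terms.append(current)
            continue
        # Skip header lines, blank lines and comment lines.
        if current is None or not line or line.startswith("!"):
            continue
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current["id"] = value
        elif tag == "name":
            current["name"] = value
        elif tag == "synonym":
            match = SYNONYM_RE.match(value)
            if match:
                current["synonyms"].append(match.group(1))
    return terms


# Hypothetical content; real MTR stanzas may use different tags and values.
SAMPLE = '''format-version: 1.2

[Term]
id: C0000001
name: example disorder
synonym: "Beispielerkrankung" EXACT []
synonym: "maladie exemple" EXACT []

[Typedef]
id: hypothetical_relation
'''

terms = parse_obo_terms(SAMPLE)
```

For production use, an established OBO parsing library is likely a better choice than this hand-written reader, since the full format includes escapes, trailing comments and further stanza types that are ignored here.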

The UMLS licence agreement requires that users validate their licenses when accessing the TR. This task is performed by querying the license server of the NLM with the right credentials, for example using RESTful services through a server [4] that validates the username and the password of the licensee. The terminological resource is accessible through the download site at the Erasmus University Medical Center Rotterdam.

Selection of parallel corpora

The corpora have been selected from different resources to serve two objectives: (1) enabling the extraction of novel terms for the multilingual TR as part of the MANTRA project work and the outcomes of the CLEF-ER challenge, and (2) offering parallel corpora to the CLEF-ER challenge participants as input for solving term recognition tasks with and without machine translation solutions.

A number of requirements have to be met to fulfill the given objectives. The corpora have to cover the domain knowledge that is under investigation. This is the case for the scientific literature, but also for documents that deal with drug labels. In addition, the MANTRA project partners selected patent documents from the European Patent Office, since this distribution comprises a significant number of documents that deal with biomedical domain knowledge but differ in their language from the above-mentioned types of documents.

The diversity between the document repositories is high with regard to the amount of available content, the type of the documents and the languages that are supported in the repository. From the patent corpus, mainly the claim section is available as a parallel document, and from Medline only the titles of the scientific articles can be aligned across languages. The EMEA repository allows full documents to be identified in parallel. For the patent claims we could produce parallel documents for English, German and French, whereas EMEA and Medline cover all selected languages. The EMEA corpus has already been exploited to train and test statistical machine translation solutions.

Note that Medline abstracts have always been translated from the non-English language into English, and as a consequence the parallel units, i.e. the titles, are restricted to language pairs between English and the non-English language. In more detail, all non-English units, i.e. patent claim sections, Medline titles, or EMEA documents, have a parallel unit in English. Also, every English unit has a parallel non-English unit, but only a smaller portion of English units has a parallel non-English unit in two, three or four languages.

[4] https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser?licenseCode=...

Optimizing the TR for the challenge

The MANTRA project partners have annotated the documents with their in-house annotation solutions, e.g. OntoGene, Peregrine, and UIMA-based annotation solutions from the JulieLab and from Averbis. All annotation solutions apply the provided UMLS terminological resource. After this phase, the English annotated corpora have been harmonized and evaluated. This included an assessment of the number of annotations that resulted from the different semantic categories in the UMLS. Based on the distribution of the annotations, the MANTRA project partners decided to reduce the number of categories in the UMLS terminological resource and to extract those categories from UMLS that have higher relevance to the multilingual terminological resource and, in addition, show good coverage in the annotated corpora. The final solution is called the MTR and covers the semantic groups anatomy, chemicals and drugs, devices, disorders, geographic areas, living beings, objects, phenomena and physiology (ANAT, CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS). All terms that are categorized with other semantic groups have then been removed to produce the final terminological resource for the CLEF-ER challenge.
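The reduction step described above can be sketched as a count of annotations per semantic group followed by a filter on the selected groups. This is a minimal illustration only: the record layout (dictionaries with `cui`, `term` and `group` keys), the function names and the example terms and CUIs are hypothetical, while the list of retained groups is taken from the text.

```python
from collections import Counter

# Semantic groups retained in the MTR, as listed in the text.
MTR_GROUPS = frozenset(
    {"ANAT", "CHEM", "DEVI", "DISO", "GEOG", "LIVB", "OBJC", "PHEN", "PHYS"}
)


def annotations_per_group(annotations):
    """Count how many annotations fall into each semantic group."""
    return Counter(annotation["group"] for annotation in annotations)


def restrict_to_groups(terms, allowed_groups=MTR_GROUPS):
    """Keep only the terms whose semantic group is in `allowed_groups`."""
    return [term for term in terms if term["group"] in allowed_groups]


# Hypothetical terminology entries with made-up CUIs.
umls_terms = [
    {"cui": "C0000001", "term": "example disorder", "group": "DISO"},
    {"cui": "C0000002", "term": "example gene", "group": "GENE"},
    {"cui": "C0000003", "term": "example drug", "group": "CHEM"},
    {"cui": "C0000004", "term": "example idea", "group": "CONC"},
]

coverage = annotations_per_group(umls_terms)
mtr_terms = restrict_to_groups(umls_terms)
```

Here the GENE and CONC entries are dropped, mirroring the removal of all terms outside the nine selected semantic groups.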

Preparation of the silver standard corpus

The annotated corpora in English have been processed to produce the silver standard corpus as described in.

Resource and evaluation

The following resources have been made available to the challenge participants: terminologies and corpora.

Units   EMEA        Medline      Patent
en      364,005     1,593,546    120,638
de      364,005       719,232    120,637
fr      373,152       572,176    120,636
es      366,769       247,655    -
nl      360,418        54,483    -

Words   EMEA        Medline      Patent
en      5,120,067   15,775,814   6,034,104
de      4,571,203    5,996,504   5,194,032
fr      5,515,157    6,023,945   6,689,812
es      5,897,467    2,573,056   -
nl      5,130,890      435,390   -

Statistics across the corpora

The corpora differ in their quality and also in the type of units that have been identified from the documents, i.e. Medline titles in contrast to paragraphs in EMEA documents and claim sections in the patents. On the other hand, all corpora are sufficiently large to support the needs of CLEF-ER: the corpora pose a challenge to the participants, since the annotation of the corpora requires automatic means, and also a challenge to the MANTRA project partners, since the harmonization of the annotations should deliver a valuable resource to the public for future exploitation (as in the CALBC challenge).

The corpora also differ in their sizes (see tbl. 2). The Medline titles form the biggest corpus regarding the number of units and the number of words (as well as the number of characters). The EMEA corpus appears to be larger than the patent corpus when considering the number of units, but is of similar size when measured in words and characters. Since the EMEA corpus is available in all languages, it is well suited for the CLEF-ER challenge.

The units from the patent corpus are available in English, French and German as stated before, whereas the other two corpora are provided in all languages. In addition, the annotation of the patent corpus has shown that the language in the corpus has a high diversity and the terms from the MTR are often used in a non-specific way with regard to the biomedical purpose of the MTR, which makes the patent corpus less suitable for the entity recognition task.

By contrast, the pairwise units from the Medline titles show high heterogeneity and, at the same time, the use of the terminology in Medline is most specific with regard to the purpose of the MeSH, MedDRA, and SNOMED-CT terms in the MTR.

Evaluation

The partners have contributed sets of annotated documents where each corpus is either in English or a non-English language (de, fr, es, nl). All identified mentions of an entity have been annotated with a CUI. The following analyses have been performed to determine which corpora will be included in the challenges, and what languages should be covered in the challenges.

One assumption is that the corpora most suitable for the challenges are those that comply best with the terminological resources and, ultimately, with the domain knowledge overall. The following parameters can be used to measure "compliance" between the terminological resources and/or the domain knowledge and the corpora.

1. The number of annotations that can be identified using the prepared terminological resource in the English corpus (L1 for language 1, see fig. 1).
2. The number of annotations that can be identified using the prepared terminological resource in the non-English language in the parallel corpus (L2 for language two, i.e. the non-English language, see fig. 1).
3. The previous parameter, but now counting only those English annotations for which a non-English translation is available for the same CUI in the non-English language (L2, see fig. 2).

Counting all English annotations as reference (see tbl. 1, parameter 2) gives an analysis that is less generous to the annotation solutions ("pessimistic" or "real world" evaluation) than counting only those English annotations that comply with the third parameter ("optimistic" or "idealistic" evaluation, parameter 3). The latter evaluates the performance under the condition that for each English term there exists a non-English translation in the TR, i.e. it ignores a number of English annotations for which no translation of the term can be found in L2 anyway.
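The difference between the two evaluation modes can be made concrete with a small sketch. It assumes that compliance is measured as the share of English reference CUIs that are also found in the parallel non-English unit. This is an interpretation for illustration, not necessarily the exact measure used in the challenge, and the function name, data layout and CUIs are hypothetical.

```python
def recall_against_english(reference, predicted, translatable=None):
    """Share of English reference CUIs also found in the non-English units.

    reference:    dict unit_id -> set of CUIs annotated in the English unit
    predicted:    dict unit_id -> set of CUIs annotated in the parallel L2 unit
    translatable: optional set of CUIs that have an L2 term in the TR; if
                  given, other reference CUIs are ignored ("optimistic" mode)
    """
    matched = 0
    total = 0
    for unit_id, ref_cuis in reference.items():
        if translatable is not None:
            ref_cuis = ref_cuis & translatable
        total += len(ref_cuis)
        matched += len(ref_cuis & predicted.get(unit_id, set()))
    return matched / total if total else 0.0


# Made-up CUIs; C0000003 has no L2 term in the hypothetical TR.
reference = {"u1": {"C0000001", "C0000002", "C0000003"}, "u2": {"C0000004"}}
predicted = {"u1": {"C0000001", "C0000002"}, "u2": set()}
translatable = {"C0000001", "C0000002", "C0000004"}

pessimistic = recall_against_english(reference, predicted)
optimistic = recall_against_english(reference, predicted, translatable)
```

In this toy example the pessimistic score is 2 out of 4 reference CUIs, whereas the optimistic score is 2 out of 3, because the CUI without an L2 term is excluded from the reference.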

A number of open questions have been resolved by analysing the results of the project partners' annotations of the corpora in English and the non-English languages. The open questions concerned the selection of the languages for the challenge, the selection of corpora and the selection of the semantic groups in the TR.

First, the languages have been limited to German, Spanish, French and Dutch apart from English. Second, the corpora have been limited to Medline, patents, and EMEA. The sizes of these three corpora are quite similar, and the diversity between the corpora should contribute to the diversity in the challenge.

Finally, the terminological resource is limited to the semantic groups ANAT, CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS. The other groups have been excluded, since they either did not show enough coverage (GENE, ORGA, OCCU) or were not specific enough for the biomedical domain (CONC), resulting in very heterogeneous annotation results across the different corpora and languages. The TR has thus been reduced to the sets of terms that are required for the challenge.

Conclusions

The CLEF-ER challenge organized by the MANTRA project partners is the first of its kind, tackling a number of challenges in a large-scale annotation task that allows multilingual approaches. The main objective is the identification of biomedical entities (and concepts) in multilingual documents, where the documents are available as part of parallel corpora involving the English language and at least one non-English language.

The corpora stem from the scientific literature (Medline titles), drug labels (EMEA documents) and patent texts (European Patent Office). A reference terminology has been provided from UMLS and has been optimized for the challenge participants. The English corpora have been annotated with the terms from the MTR using the annotation solutions of the project partners, leading to a silver standard corpus, which has been distributed to the challenge participants. These annotated corpora give the challenge participants different opportunities to contribute to the challenge.

Altogether, the CLEF-ER challenge will work towards solutions that identify biomedical terms in multilingual documents of different kinds, where the proposed solutions have to cope with large amounts of terms and large data resources. The overall outcome will help to establish the semantic web in healthcare and to allow interoperability of IT solutions across country borders and languages.

Acknowledgement

This work was funded by the European Commission STREP grant number 296410 ("Mantra", FP7-ICT-2011-4.1).