<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic tagging and normalization of French medical entities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viviana Cotik</string-name>
          <email>vcotik@dc.uba.ar</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horacio Rodríguez</string-name>
          <email>horacio@lsi.upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Vivaldi</string-name>
          <email>jorge.vivaldi@upf.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de Buenos Aires</institution>
          ,
          <addr-line>Buenos Aires</addr-line>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>UPC, Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <addr-line>UPF, Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present two tools for addressing Task 2 of CLEF eHealth 2016. The first one is a semantic tagger that aims to detect relevant entities in French medical documents, tagging them with their appropriate semantic class and normalizing them with the Semantic Group codes defined in the UMLS. It is based on a distant learning approach that uses several SVM classifiers whose outputs are combined into a single result. The second tool is based on a symbolic procedure that obtains the English translation of each medical term and looks for normalization information in publicly accessible resources.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>SNOMED-CT</kwd>
        <kwd>UMLS</kwd>
        <kwd>DBPEDIA</kwd>
        <kwd>BioPortal</kwd>
        <kwd>Wikipedia</kwd>
        <kwd>semantic tagger</kwd>
        <kwd>binary classifiers</kwd>
        <kwd>distant learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        We developed a semantic tagger for the medical domain [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that operates on English
Wikipedia pages (WP, http://en.wikipedia.org) previously selected as belonging to the domain using a
distant learning approach. Our aim here is to explore whether the approach can
be applied to another language (French), another genre (scientific documents), and
another tagset, and to normalize the semantic tags to the Unified Medical
Language System (UMLS). We performed these experiments within the framework
of the CLEF 2016 eHealth contest, https://sites.google.com/site/clefehealth2016/ (see details in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), more specifically in Task 2,
Multilingual Information Extraction, as described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (https://sites.google.com/site/clefehealth2016/task-2).
      </p>
      <p>Semantic Tagging is the task of assigning to some linguistic units of a text
a unique tag from a semantic tagset. It can be divided into two subtasks:
detection and tagging. The first one is similar to term detection and Named Entity
Recognition, while the latter is closely related to Named Entity Classification.</p>
      <p>The key elements of the Semantic Tagging task are: (i) the document, or
document genre, to be processed, (ii) the linguistic units to be tagged, and (iii) the
tagset. All these elements play a crucial role in the success of the task. In this
concrete task our constraints are the following: (i) the documents belong to the medical domain,
mainly scientific articles indexed in MEDLINE and some drug monographs
published by the European Medicines Agency (EMEA), (ii) the linguistic units to
be tagged are the terminological strings found in the source documents, and (iii) the
tagset is a subset of the top UMLS categories. These resources are also used
for the normalization of the medical entities as defined in phase II.</p>
      <p>Our approach consists of learning a binary classifier for each of the
categories, whose results are combined using a simple voting schema. The cases to
be classified are the mentions in the document corresponding to TCs that refer to
any of the concepts in the tagset. No co-reference resolution is attempted, so
co-referring mentions may be tagged differently. For the normalization of the
entities found we used the resources available through BioPortal (SPARQL endpoint at http://bioportal.bioontology.org/).</p>
      <p>After this introduction, the article is organized as follows: in
section 2 we sketch the state of the art of Semantic Tagging approaches. Section 3
presents the methodology followed in the current task. The experimental
framework is described in section 4. Results are shown and discussed in section 5.
Finally, section 6 presents our conclusions and proposals for further work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>English is, by far, the best supported language for biomedical resources and
tools. The National Library of Medicine (NLM, http://www.nlm.nih.gov/) maintains the Unified
Medical Language System (UMLS, http://www.nlm.nih.gov/research/umls/), which groups an important set of resources to
facilitate the development of computer systems that "understand" the meaning
of the language of biomedicine and health. It is worth noting that only a small
fraction of such resources exists for other languages.</p>
      <p>A relevant aspect of information extraction is the recognition and
identification of biomedical entities (such as diseases, genes, proteins, etc.). Several Named
Entity Recognition techniques have been proposed to recognize such entities
based on their morphology and context. NER can be used to recognize both
previously known names and new names, but it cannot be directly used to relate
these names to specific biomedical entities found in external databases. For this
identification task a dictionary approach is necessary. A problem is that existing
dictionaries are often incomplete and different variations may be found in the
literature; therefore it is necessary to minimize this issue as much as possible.</p>
      <p>
        The 2015 edition of the CLEF eHealth contest contained two tasks, focusing on
information extraction and information retrieval. The topic of one of them was
Clinical Named Entity Recognition in medical texts written in French (Task 1b)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Seven teams participated in this task. Two types of biomedical documents
were used: scientific articles indexed in the MEDLINE database, and full-text
drug monographs published by the European Medicines Agency (EMEA). The
best system obtained an F-measure of 0.756 for plain entity recognition, 0.711 for
normalized entity recognition, and 0.872 for entity normalization [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
For our participation in CLEF e-Health 2016 Task 2 we submitted two runs
to the Plain Entity Recognition task and one run to the Normalization task. A
description of the approaches followed in these three runs is presented below.
In the first run we basically follow the approach of our previous system, presented
at CLEF e-Health 2015 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in turn based on a semantic tagging system aiming
to detect and classify medical entities in English WP pages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A multilingual
extension (to Arabic, French, and Spanish) of this latter system can be found in
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We sketch next the approach we follow, highlighting the differences between
the current and the previous system. Details of the latter can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Figure 1 presents an overall view of the system.
      </p>
      <p>The core idea of the system is that, for a term t known to belong to the
semantic category c (one of the 10 UMLS categories we deal with), not only can the
occurrences of t in the training material be considered positive examples
for learning, but also the occurrences of t in its associated WP page, if it exists.
This hypothesis is important because, for some semantic categories, the training
material does not contain enough terms for accurate learning.</p>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we generate training instances by automatically labelling each
instance of a seed term with its designated semantic class. When we create
feature vectors for the classifier, the seeds themselves are hidden and only
contextual features are used to represent each training instance. Proceeding in this
way, the classifier is forced to generalize with limited overfitting.
      </p>
      <p>
        We created a suite of binary contextual classifiers, one for each semantic
class. The classifiers are learned using, as in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Support Vector Machine models
built with the Weka toolkit [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Each classifier makes a weighted decision as to whether
or not a term belongs to its semantic class.
      </p>
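      <p>The suite of binary classifiers and their combination can be sketched as follows. This is an illustrative pure-Python stand-in: the actual system trains SVM models with the Weka toolkit, whereas here a simple bag-of-context-words score plays the role of each classifier's weighted decision.</p>

```python
# Illustrative stand-in for the suite of binary classifiers: one scorer per
# semantic class, combined by taking the strongest weighted decision.
# (The actual system used SVM models trained with Weka.)

CLASSES = ["DISO", "PROC", "ANAT", "CHEM", "LIVB",
           "PHYS", "PHEN", "DEVI", "OBJC", "GEOG"]

def train_suite(contexts, labels):
    """Remember, for each class, the context words seen with its mentions.

    contexts: list of context-word lists (the seed term itself is hidden);
    labels: the semantic class of each mention.
    """
    vocab = {c: {} for c in CLASSES}
    for words, label in zip(contexts, labels):
        for w in words:
            vocab[label][w] = vocab[label].get(w, 0) + 1
    return vocab

def decision(class_vocab, words):
    """Weighted decision of one binary classifier: overlap with its vocabulary."""
    return sum(class_vocab.get(w, 0) for w in words)

def tag(vocab, words):
    """Combine the binary decisions: the class with the strongest score wins."""
    scores = {c: decision(vocab[c], words) for c in CLASSES}
    return max(scores, key=scores.get)
```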
      <p>For every file of the training corpus, each tagged term is considered a
positive example for the tagged class and a negative example for the rest of the
classes. Features are the words occurring in the local context of mentions. The
context size and the POS of the context words are parametrizable.</p>
      <p>Examples for learning correspond to the mentions of the seed terms in the
corresponding WP pages. Let t1, t2, ..., tn be the seed terms for the semantic class
c, i.e. ti ∈ STc. For each ti we obtain its WP page and extract all the
mentions of seed terms occurring in the page. Positive examples correspond to
mentions of seed terms belonging to semantic class c, while negative examples
correspond to seed terms from other semantic classes. Frequently, a positive
example occurs within the text of the page, and often many other positive and
negative examples occur as well. Features are simply the words occurring in the local
context of mentions.</p>
      <p>
        For processing the French documents, in both the learning and test
phases, we have used the Freeling toolbox (http://nlp.lsi.upc.edu/freeling/) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Term candidates (TCs) are selected according to morpho-syntactic criteria.
For filtering we have used the following regular expression over POS sequences: NA*(PNA*)+.
Additionally, in order to take into account the peculiarities of the term selection of the
CLEF organizers, we also decompose each complex term into its components (see
section 4.1 for more details and examples).</p>
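      <p>The morpho-syntactic filter can be sketched as a match of the POS-code string against the regular expression above (N = noun, A = adjective, P = preposition); note that, read literally, the pattern requires at least one prepositional group. The helper below is a hypothetical illustration, not the actual implementation.</p>

```python
import re

# POS codes: N = noun, A = adjective, P = preposition.
# The paper's filter for term candidates, taken literally.
TERM_PATTERN = re.compile(r"NA*(PNA*)+")

def is_term_candidate(pos_tags):
    """Check whether a token sequence's POS codes match NA*(PNA*)+."""
    return bool(TERM_PATTERN.fullmatch("".join(pos_tags)))
```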
      <p>The learning process was performed using, for each semantic category,
the most likely relevant documents, including EMEA and MEDLINE training
documents and WP pages obtained as described above. From the WP pages,
besides those with purity less than 1, short pages and pages consisting mainly
of itemized material or non-textual fragments were also removed.</p>
      <p>For each example, the feature vector captures a context window of n words
to its left and right (in the experiments reported here n was set to 3), without
surpassing sentence limits.</p>
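      <p>The window extraction can be sketched as follows; context_features is a hypothetical helper assuming the sentence is already tokenized and the mention occupies a single position i.</p>

```python
def context_features(tokens, i, n=3):
    """Collect up to n words to the left and right of the mention at index i,
    without surpassing sentence limits (tokens is a single sentence).
    The mention itself is hidden from the features."""
    left = tokens[max(0, i - n):i]
    right = tokens[i + 1:i + 1 + n]
    return left + right
```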
      <sec id="sec-2-1">
        <title>Run2: Knowledge-based approach</title>
        <p>A careful analysis of the results of our CLEF e-Health 2015 participation revealed
that some apparently easy-to-detect terms were not detected or were classified
incorrectly. For instance, French terms occurring in the French DBPedia (http://wiki.dbpedia.org/) or
translated English terms occurring in the English (Princeton) WN or in the English DBPedia
were not detected. We thus decided to combine, in our run 2, the results of run 1
with two other systems, one based on the performance of a state-of-the-art term
extractor, YATE, tuned to work in the medical domain, and the other based
on an external knowledge source, DBPedia. Although domain independent,
DBPedia has good coverage of classified medical terminology and offers good
interlingual capabilities.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Extracting Term Candidates using YATE and wikiYATE</title>
        <p>YATE [11] basically operates on the taxonomic structure of the nominal part of WN.
Given a domain d, here the medical domain, YATE obtains the so-called Domain
Borders: synsets that are likely to belong to d, both themselves and their descendants.
These Domain Borders are later used for extracting from a document the set
of mentions corresponding to terms belonging to d, i.e. those TCs whose synsets
are placed below a Domain Border. The right part of Figure 2 shows this process.</p>
        <p>YATE uses the English WN; therefore, applying this tool requires a translation of the
French TCs. As the terms we are interested in are those represented
in the French WP, we used the interwiki links between the French and English WP
(using DBPedia to get these links). This process results in the extraction
(tagging) of the medical terms occurring in the test documents. YATE is a term
extractor, not a semantic tagger, so it is not able to classify the extracted terms,
but, indirectly, these terms can be used as seed terms for learning the classifiers.</p>
        <p>
          wikiYATE [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is a similar term extractor, but it is multilingual and uses WP
as its knowledge source for detecting terms. Both term extractors have been used
in this task.
        </p>
        <p>DBPedia-based approach. Some useful information in WP is represented
in infoboxes and, thus, has been automatically mapped into the corresponding
DBPedia RDF triples. We take advantage of several interesting properties of DBPedia:</p>
        <p>- There exist DBPedia datasets for English (http://dbpedia.org/sparql) and
French (http://fr.dbpedia.org/sparql).
- Entities (resources) in the two datasets are frequently linked through sameAs
properties.
- Entities in the datasets are frequently mapped to one or more linguistic
referents, words and phrases, through label properties, sometimes in several
languages. So an entity in the English DBPedia can be labelled with French
words or phrases.</p>
        <p>As shown in Figure 3, iterating over French and English terms and resources
in French and English DBPedia datasets through the label and sameAs
properties, we are able to collect from an initial French TC, t, the set of English
resources likely corresponding to translations of t.</p>
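        <p>The label/sameAs iteration can be sketched with the two SPARQL queries below. This is a hypothetical illustration that only builds the query strings; executing them against the public endpoints (http://fr.dbpedia.org/sparql and http://dbpedia.org/sparql) and iterating until no new resources appear is left out.</p>

```python
def label_query(term, lang="fr"):
    """Find DBpedia resources whose rdfs:label matches the (French) term."""
    return 'SELECT ?r WHERE { ?r rdfs:label "%s"@%s }' % (term, lang)

def sameas_query(resource):
    """Follow owl:sameAs links from a French resource towards the English dataset."""
    return ('SELECT ?en WHERE { <%s> owl:sameAs ?en . '
            'FILTER(STRSTARTS(STR(?en), "http://dbpedia.org/")) }' % resource)
```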
        <p>In order to classify t into one of the 10 semantic categories, we
proceed in the following way:</p>
        <p>During training, for each semantic category c we collect all the French terms
t occurring in the training dataset and tagged with c. We filter out from these
sets the terms not occurring in the French WP. We then collect, as described above,
the set of English DBPedia resources associated with them. For each such
resource r we obtain the set of classes to which r belongs (we reduce our search
to DBPedia and YAGO classes) using the type property. From each class we
recursively collect the set of super-classes using the subClassOf property. Some of
the super-classes belong unambiguously to one semantic category. For instance,
http://dbpedia.org/class/yago/AliphaticCompound114601294 occurs as a super-class
of 6 terms, all classified as CHEM. Others, such as
http://dbpedia.org/ontology/Eukaryote, are ambiguous: this super-class occurs
4 times as PHYS, 8 times as LIVB, and 2 times as DISO.</p>
        <p>For each semantic category we collect the set of unambiguous super-classes.
For ambiguous cases we proceed as follows: if the total number of terms covered
by the super-class is higher than a threshold THR1, and the ratio between the
most frequent option and the second is higher than a threshold THR2, we assign the
super-class to the set corresponding to the first option. THR1 and THR2 have
been manually set to 30 and 4. For instance, for
http://dbpedia.org/class/yago/Location100027167, occurring 10 times as ANAT,
13 times as CHEM, and 76 times as GEOG, the two conditions hold (76+13+10
&gt; 30 and 76/13 &gt; 4), so the super-class is included in the set of super-classes
corresponding to the semantic category GEOG.</p>
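        <p>The threshold test just described can be sketched as follows, reusing the paper's values THR1 = 30 and THR2 = 4; the function name and the dict-based interface are illustrative assumptions.</p>

```python
THR1, THR2 = 30, 4  # thresholds manually set in the paper

def assign_superclass(counts):
    """Assign a super-class to a semantic category.

    counts: dict mapping semantic category -> number of covered terms.
    Returns the majority category when it is unambiguous or when both
    thresholds hold; otherwise None (the super-class stays unassigned).
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]  # unambiguous super-class
    total = sum(counts.values())
    (best, n1), (_, n2) = ranked[0], ranked[1]
    if total > THR1 and n2 > 0 and n1 / n2 > THR2:
        return best
    return None
```

With the paper's Location example ({ANAT: 10, CHEM: 13, GEOG: 76}) both conditions hold and GEOG is returned; with the Eukaryote counts the total stays below THR1 and the super-class remains ambiguous.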
        <p>In this way a number of super-classes have been associated with each semantic
category. Table 1 shows the size of each set and an example of its members.</p>
        <p>Once these sets have been collected (during the training phase), the process at test
time is quite straightforward (see the left part of Figure 2). For each TC in the test
dataset, following the approach described above, we obtain the set of English
DBPedia resources. The DBPedia and YAGO classes of these resources are then
obtained, and from them the set of super-classes. This set is intersected with
the sets of super-classes associated with each semantic category. The sizes of all
the intersections are computed, and the category associated with the largest intersection is
returned as the category of the term (ties are solved in favour of the most frequent
semantic category in the training dataset). Finally, the results of run 1 (a semantic
category), the DBPedia-based approach (a semantic category), and the YATE-based approach (a Boolean) are
combined to obtain the final result.</p>
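        <p>The test-phase assignment can be sketched as follows; classify_term is a hypothetical helper where category_sets maps each semantic category to the set of super-classes collected during training, and fallback stands for the most frequent category in the training dataset.</p>

```python
def classify_term(superclasses, category_sets, fallback):
    """Pick the category whose super-class set has the largest intersection
    with the term's own super-classes; ties (and empty intersections) are
    resolved with the most frequent training category (fallback)."""
    sizes = {c: len(superclasses & s) for c, s in category_sets.items()}
    best = max(sizes.values())
    if best == 0:
        return fallback
    winners = [c for c, n in sizes.items() if n == best]
    return winners[0] if len(winners) == 1 else fallback
```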
      </sec>
      <sec id="sec-2-3">
        <title>Normalization</title>
        <p>
          For normalization we have used the BioPortal SPARQL endpoint. The a priori
obvious way of accessing UMLS and obtaining the CUI of a term t would consist of
getting a UMLS entity labelled with t and then getting the CUI of this entity.
This simple procedure does not work because UMLS is not directly labelled
with terms. We have instead used an indirect access to UMLS through other
ontologies; for this process we have employed SNOMED CT, MeSH, and RCD.
The process consists of translating French into English, using the approach
described above, and then accessing any of the intermediate ontologies and,
from them, UMLS. Figure 4 presents one of the SPARQL templates used.
This template is instantiated into a real query simply by replacing the placeholders
'**ont**' by the name of one of our three ontologies and '**term**' by the
name of the English TC. As can be seen in the figure, this template uses the
prefLabel (preferred label) link. We have also built templates using the altLabel
(alternate label) link, and others using approximate string matching, to cover
decreasingly confident matchings.</p>
        <p>Participants of CLEF2016 are requested to perform named entity recognition and
normalization on a dataset of scientific article titles and full-text drug inserts. For
performing these tasks we designed the working frameworks described
in the following subsections.</p>
        <p>Our basic working framework for entity recognition was the same as in CLEF2015.
But taking into account the results obtained in that contest (see [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]) we
performed some experimentation on that material in order to decide whether or not to include
WP pages in our learning framework. We tested several configurations of
feature selection as well as different numbers of WP pages for use in the training
stage. This framework automatically selects WP pages for each class;
we manually checked a number of these pages in order to correct some issues with
the automatic selection phase. The results clearly showed that there was no
improvement from adding WP material in the training phase. We therefore decided
to train our model using only the training material provided by the CLEF2016
organization and provide those results as run 1.
        </p>
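        <p>The placeholder instantiation described in section 3.3 can be sketched as follows. The query body is an illustrative guess, since the actual template of Figure 4 is not reproduced here; only the '**ont**'/'**term**' replacement mechanism and the use of the prefLabel link are taken from the text.</p>

```python
# Hypothetical reconstruction of a normalization query template: the real
# template of Figure 4 is not reproduced in this paper, so the query body
# below is only an illustrative guess using the prefLabel link.
TEMPLATE = (
    "SELECT ?cui FROM <http://bioportal.bioontology.org/ontologies/**ont**> "
    'WHERE { ?c skos:prefLabel "**term**"@en . ?c umls:cui ?cui . }'
)

def instantiate(template, ontology, term):
    """Fill the '**ont**' and '**term**' placeholders with concrete values."""
    return template.replace("**ont**", ontology).replace("**term**", term)
```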
        <p>As mentioned in section 3.1, the learning phase used the distant
learning paradigm. For each mention of a TC the vector of features is built
and the nine learned binary classifiers are applied to it (in our run 1 the
extraction of GEOG entities was performed by a geographic NER, so only nine
classifiers were learnt). For building these
classifiers all the documents of the training corpus were linguistically processed
using the Freeling suite (see [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for details). The vector of features was built
using the lemmas of the context words within a window of 3 tokens around the
TC (excluding determiners and punctuation signs). We used the lemmas as
features but defined several ways to select them: (i) Mode 0: any
context word within the window, (ii) Mode 1: only nouns and adjectives, and
(iii) Mode 2: only verbs, nouns and adjectives.
        </p>
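        <p>The three lemma-selection modes can be sketched as follows; the (lemma, POS) pair representation and the coarse POS labels are illustrative assumptions.</p>

```python
# Allowed POS tags per mode; None means "any context word".
MODE_POS = {0: None, 1: {"NOUN", "ADJ"}, 2: {"VERB", "NOUN", "ADJ"}}

def select_lemmas(context, mode):
    """context: (lemma, POS) pairs already restricted to the window,
    with determiners and punctuation excluded upstream."""
    allowed = MODE_POS[mode]
    return [lemma for lemma, pos in context if allowed is None or pos in allowed]
```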
        <p>The Quaero corpus treats nested terms as different terms. Given
this fact, when a TC is poly-lexical, all the possible combinations of its
components are taken into account. Table 2 shows some examples.</p>
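        <p>Assuming "all the possible combinations of components" means every contiguous sub-span of the poly-lexical term (consistent with examples such as tumeurs digestives also yielding tumeurs and digestives), the decomposition can be sketched as:</p>

```python
def decompose(tc_tokens):
    """All contiguous sub-spans of a poly-lexical term candidate, mirroring
    the Quaero policy of annotating nested terms as separate terms."""
    n = len(tc_tokens)
    return [" ".join(tc_tokens[i:j])
            for i in range(n) for j in range(i + 1, n + 1)]
```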
        <p>We also decided to include a second run that, starting from the run 1 results,
improves them by applying some symbolic processing, as shown in Figure 2 and
described in the following paragraphs.</p>
        <p>
          - Term extraction and analysis. For preparing run 2 we create a single
document that includes all the documents of the test corpus. We analyse this material
with wikiYATE, a term extraction tool that uses WP for obtaining the TCs
of a given text (see the description in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]). This tool ranks the TCs according to a
termhood value; we create a set of strings composed of: (i) those TCs above
a given threshold, (ii) those TCs not found in WP, and (iii) the list of all
the adjectives that are part of the chosen TCs. We look in DBpedia
for the English translation of these units and, if available, process them
using YATE [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], a medical term extraction tool that uses the
Multilingual Central Repository (MCR, http://adimen.si.ehu.es/web/MCR/) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for analysing the TCs. In addition
to giving a termhood value for each TC, this tool provides some basic
class information that we mapped to UMLS classes.
- DBpedia exploration as described in section 3.2.
        </p>
        <p>Finally, the results of both analyses were combined into a single result that
was used to improve the results of run 1.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Entity normalization</title>
        <p>The process of entity normalization was performed independently from the
process of entity recognition. This is why we submitted runs for the task of plain
entity recognition and not for the task of normalized entity recognition. See section 3.3
for details on our approach.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        Table 3 depicts the global results as reported by the organization of CLEF2016
(phase I, the entity recognition task). The material officially delivered included two
runs. Unfortunately, for run 2 we incorrectly submitted the same material as
in run 1. After detecting this issue the organisation kindly accepted to evaluate
our actual run as an unofficial result; for this reason the run 2 results shown
in Table 3 are tagged with an "*". Additionally, we performed an after-challenge
improvement of our system (noted as "/3*" in the table), introducing a simple
voting mechanism that unifies the tags corresponding to multiple mentions of
the same TC when enough evidence for one of the choices exists.
      </p>
      <p>Table 3 (entity recognition; the exact-match and inexact-match figures coincide):
EMEA: TP 517, FP 558, FN 558, Prec. 0.4809, Recall 0.4809, F1 0.4809;
MEDLINE: TP 673, FP 745, FN 748, Prec. 0.4746, Recall 0.4736, F1 0.4741.</p>
      <p>
        The results obtained for Phase I (entity recognition) are poor (although
better than those obtained in CLEF2015, see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) and far from the results
obtained in our previous experiments on French Wikipedia pages (using a very
similar methodology to classify medical WP pages, over six classes, we
obtained accuracies of 74.76% for exact match; see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
      </p>
      <p>Table 5 example: Indications, radiotherapie, tumeurs digestives, tumeurs, digestives.</p>
      <p>
        As already shown in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the terminological density of the QUAERO corpus is
very high. At the same time, this density results from a tagging methodology
that nests several terms within a single poly-lexical term. An example of this situation
is shown in Table 5. Undoubtedly, the tagging is correct, but it is not clear that
this concrete sentence actually contains 5 terms instead of just 3 (Indications,
radiotherapie and tumeurs digestives), as most term extractors would report.
      </p>
      <p>Another minor issue is that the text seems to include some kind of extra
segmentation (see for example: l' enfant or d ' activation plaquettaire induite par l '
heparine, among many others). The words by themselves are not important, but
such segmentation may cause errors in the POS tagging stage, and this may
be a real problem for TC delimitation ("l" and "d" will become nouns instead
of a determiner and a preposition respectively). The generation of the final
stand-off annotation also becomes a bit more complicated.</p>
      <p>Table 6 shows a detailed analysis of the run 1 results. On the one hand, there
are two classes (PHEN and GEOG) that do not produce any correct result
and another class that only detects one valid term (OBJC). In these cases, the
corresponding classifiers have an extremely low accuracy, probably due to the lack
of training examples, so acquiring additional examples for these cases should
result in some improvement. On the other hand, there are some classes (DISO,
PROC and ANAT) where the number of examples is much higher and which therefore
show better results.</p>
      <p>Table 7 shows the same analysis for run 2. There is an improvement in the
performance for all the classes, showing that: (i) the symbolic analysis partially
compensates for the inaccuracies of the machine learning system and (ii) the combination
of methods improves the global efficiency.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and further work</title>
      <p>The organizers of CLEF eHealth 2016 divided Task 2 into two phases:
entity recognition and entity normalization on French medical text of the Quaero
corpus. Our approach resulted in two different systems, one for each phase.</p>
      <p>
        For the first task, we have presented a system that automatically detects and
tags medical terms in medical documents using a tagset derived from the UMLS
taxonomy. The results of the system for entity recognition, as discussed in the previous
section, are poor, far from those obtained by our previous system (performing on
English medical WP pages and confirmed for other languages, including French,
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) but much better than those obtained in our participation in CLEF2015. The
improvement was especially high in our run 2, which includes some symbolic
processing for improving the results. The working framework allowed us to
experiment with several design parameters such as the number of terms used for training,
context width, feature definition, etc. Undoubtedly, this is at the base of the
improvement obtained for run 1. The use of some symbolic processing on the
results of run 1 allowed us to obtain some additional improvement.
      </p>
      <p>Table 6 (rows: right class; columns: class proposed by the classifiers):
right class  DISO PHEN PROC PHYS ANAT LIVB CHEM DEVI OBJC GEOG
DISO          451   26  180   56  185   92  184   19   24   16
PHEN           21    0    5    1    6    2    3    0    0    0
PROC          136   10  215   31   62   48  107   14    9    8
PHYS           13    1   17   11   11    1   18    2    0    0
ANAT           59    2   32   12   68    4   19    2    2    0
LIVB           51    3   35   16   26  131   16    2    3    3
CHEM           31    7   37   15   18   20  131    8   10    1
DEVI           15    0    7    4    4    3    1    4    0    0
OBJC            2    0    3    4    0    1    9    0    1    1
GEOG            7    0    1    1    6    4    3    1    0    0
Precision   57.38 0.00 40.41 7.28 17.62 42.81 26.68 7.69 2.04 0.00</p>
      <p>It is interesting to observe that in all cases the improvement is higher for the
inexact match than for the exact match. This fact may reveal some issues in the
TC delimitation but also in the offset calculation. The latter issue is magnified
by the tokenization of the training corpus, which complicates the linguistic analysis
and the offset calculation.</p>
      <p>The second task was solved using a totally different system. It is based on
obtaining the normalization information from public resources after obtaining
the English translation of each medical term. The results were a bit below those of the
other participants. Again, tokenization is an issue that affects the performance
of the system.</p>
      <p>Several research lines will be followed in the near future:
- The integration of entity recognition and normalization into a single task
may bring mutual benefits.
- Enlarging the use of BioPortal, looking in its ontologies for the
recognition and classification tasks, seems a promising direction.
- A combination and/or specialization of the resources for learning more
accurate classifiers. The application of the DBPedia-based approach to all
the semantic classes merits a deeper investigation.
- A careful combination of learning from the training dataset and from
additional material, such as WP, should be explored experimentally.
- The features currently used for learning the classifiers are rather crude and
need some revision. We foresee experimenting with weighting the
features, separating the features according to their position relative to the TC,
and adding new features such as start/end characters, typed features, etc.
- Moving from the semantic tagging of medical entities to the semantic tagging of
relations between such entities is a highly exciting objective, in line with
recent challenges in the medical domain (and beyond).
- Improving the selection of medical entities by using POS pattern learning,
adapting our term extractor to the tagging policy for medical entities in the
Quaero corpus, and improving the adaptation of Freeling to French medical texts.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the TUNER project (Spanish Ministerio
de Economía y Competitividad, TIN2015-65308-C5-5-R).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Vivaldi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Rodr guez, H.:
          <article-title>Medical entities tagging using distant learning</article-title>
          .
          <source>In: CICLing</source>
          <year>2015</year>
          ,
          Part II,
          <source>LNCS</source>
          . Volume
          <volume>9042</volume>
          . (
          <year>2015</year>
          )
          <fpage>631</fpage>
          –
          <lpage>642</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF eHealth evaluation lab 2016</article-title>
          . In: LNCS, Springer (
          <year>September 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grouin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavergne</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rey</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Clinical information extraction at the CLEF eHealth evaluation lab 2016</article-title>
          .
          <source>In: CLEF Evaluation Labs</source>
          and Workshop: Online Working Notes, CEUR-WS (
          <year>September 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grouin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>CLEF eHealth evaluation lab 2015 task 1b: clinical named entity recognition</article-title>
          .
          <source>In: CLEF 2015 Online Working Notes</source>
          , CEUR-WS (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanlen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grouin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF eHealth evaluation lab 2015</article-title>
          .
          <article-title>CLEF 2015 - 6th Conference and Labs of the Evaluation Forum</article-title>
          .
          <source>LNCS</source>
          , Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cotik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vivaldi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Rodríguez, H.:
          <article-title>Semantic tagging of French medical entities using distant learning</article-title>
          .
          <source>In: Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cotik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vivaldi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Rodríguez, H.:
          <article-title>Arabic medical entities tagging using distant learning in a multilingual framework</article-title>
          . Submitted
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Riloff, E.:
          <article-title>Inducing domain-specific semantic class taggers from (almost) nothing</article-title>
          . In:
          <article-title>Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</article-title>
          , Uppsala, Sweden (
          <year>2010</year>
          )
          <fpage>275</fpage>
          –
          <lpage>285</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The WEKA data mining software: An update</article-title>
          .
          <source>In: SIGKDD Explorations</source>
          .
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanilovsky</surname>
          </string-name>
          , E.:
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>
          . In Calzolari, N.,
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dogan</surname>
            ,
            <given-names>M.U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Odijk</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piperidis</surname>
          </string-name>
          , S., eds.:
          <source>Proceedings of the 8th International Conference on Language Resources and Evaluation</source>
          ,
          European Language Resources Association
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vivaldi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Rodríguez, H.:
          <article-title>Medical term extraction using EWN ontology</article-title>
          .
          <source>In: Proceedings of Terminology and Knowledge Engineering</source>
          . (
          <year>2002</year>
          )
          <fpage>137</fpage>
          –
          <lpage>142</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Vivaldi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Rodríguez, H.:
          <article-title>Using Wikipedia for term extraction in the biomedical domain: first experience</article-title>
          .
          <source>In: Procesamiento del Lenguaje Natural</source>
          . Volume
          <volume>45</volume>
          . (
          <year>2010</year>
          )
          <fpage>251</fpage>
          –
          <lpage>254</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laparra</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigau</surname>
          </string-name>
          , G.:
          <article-title>Multilingual central repository version 3.0: upgrading a very large lexical knowledge base</article-title>
          .
          <source>In: Proceedings of the Sixth International Global WordNet Conference (GWC'12)</source>
          .
          Matsue, Japan (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>