=Paper= {{Paper |id=Vol-1391/93-CR |storemode=property |title=Semantic Tagging of French Medical Entities Using Distant Learning |pdfUrl=https://ceur-ws.org/Vol-1391/93-CR.pdf |volume=Vol-1391 |dblpUrl=https://dblp.org/rec/conf/clef/CotikVR15 }} ==Semantic Tagging of French Medical Entities Using Distant Learning== https://ceur-ws.org/Vol-1391/93-CR.pdf
      Semantic tagging of French medical entities
                using distant learning

             Viviana Cotik1 , Horacio Rodrı́guez2 , and Jorge Vivaldi3
               1
                 Universidad de Buenos Aires, Buenos Aires, Argentina,
                                   vcotik@dc.uba.ar,
          2
             University of Catalonia, Barcelona, Spain, horacio@lsi.upc.edu
           3
             Universitat Pompeu Fabra, Roc Boronat 132, Barcelona, Spain,
                                jorge.vivaldi@upf.edu



        Abstract. In this paper we present a semantic tagger aiming to detect
        relevant entities in French medical documents and tagging them with
        their appropriate semantic class. These experiments has been carried out
        in the framework of CLEF2015 eHealth contest that proposes a tagset
        of ten classes from UMLS taxonomy. The system presented uses a set
        of binary classifiers, and a combination mechanisms for combining the
        results of the classifiers. Learning the classifiers is performed using two
        widely used knowledge source, one domain restricted and the other is a
        domain independent resource.

        Keywords: Machine Learning, SNOMED CT, UMLS, Wikipedia, se-
        mantic tagger, binary classifiers, distant learning


1     Introduction
Recently, we [1] developed a semantic tagger for the medical domain performing
on pages of the English Wikipedia4 (WP ) previously selected as belonging to
the medical domain, using a distant learning approach. Our aim in this paper
is exploring whether the approach can be applied to other language (French),
other genre (EMEA and Medline documents) and other tagset. We performed
these experiments within the framework of CLEF2015 eHealth contest [2], more
specifically in Task 1b, Clinical Named ENtity Recognition [3].
    Semantic Tagging (ST ) can be defined as the task of assigning to some lin-
guistic units of a text a unique tag from a semantic tagset. It can be divided
in two subtasks: detection and tagging. The first one is similar to term detec-
tion and Named Entity Recognition (NER), while the latter is closely related to
Named Entity Classification (NEC ).
    Other Natural Language Processing (NLP ) tasks related to Semantic Tagging
are Word Sense Disambiguation (WSD), aiming to tag each word in a document
with its correct sense from a senses repository, and Entity Linking (EL), aiming
to map mentions in a document to entries in a Knowledge Base.
    The key elements of Semantic Tagging task are:
4
    http://en.wikipedia.org
  i) the document, or document genre, to be processed. In this paper we focus
     on the medical domain and the genre of documents are those included in
     CLEF2015 contest, namely EMEA and Medline documents in French (see
     [4] for a description of the corpus).
 ii) the linguistic units to be tagged. There are two commonly followed ap-
     proaches, those that tag the entities occurring in the text, i.e. Entity Link-
     ing, as [5], and those that tag mentions of these entities, as [6]. Frequently,
     entities are represented by co-reference chains of mentions. Consider the
     following example (from the article ”Asthma” in Wikipedia). “Asthma is
     thought to be caused by . . . Its diagnosis is usually based on . . . The dis-
     ease is clinically classified . . . ”. In these sentences there is an entity (asthma)
     referred three times, and, thus, forms a co-reference chain of three mentions.
     In the first approach, the entity (the whole set of three mentions) will be
     tagged as a disease, in the second one, which we follow in this work, each
     mention is detected and tagged independently, so only the first and last
     mentions are tagged as diseases. In this work, units to be tagged are termi-
     nological strings found in the source documents.
iii) the tagset. A crucial point is its granularity (or size). The spectrum of
     tagset sizes is immense. In one extreme of the spectrum, fine-grained tagsets
     can consist of thousands (as is the case of WSD systems that use Word-
     Net5 synsets as tags), or even millions (as is the case of wikifiers that use
     Wikipedia titles as tags). In the other extreme we can found coarse-grained
     tagsets. In the medical domain, for instance, in the i2b2/VA challenge [7]
     the tagset consisted on three tags: Medical Problem, Treatment, and Medi-
     cal Test. In the Semeval-2013 task 9 [8] focusing on drug-drug interaction
     (DDI ), the tagset included drug, brand, group (group of drug names), and
     drug-n (active substance not approved for human use). Besides these task
     specific tagsets, subsets of Category sets in the most widely used medical re-
     sources (MeSH R , SNOMED-CT6 , UMLS R ) are frequently used as tagsets.
     In this research we used a subset of the top UMLS categories, namely,
     Anatomy, Chemical and Drugs, Devices, Disorders, Geographic Areas, Liv-
     ing Beings, Objects, Phenomena, Physiology, and Procedures.
   Our approach consists of learning a binary classifier for each of the cate-
gories7 , whose results are combined using a simple voting schema. The cases to
be classified are, according the contest instructions, the mentions in the docu-
ment corresponding to term candidates, to refer to any of the concepts in the
tagset. No co-reference resolution is attempted and, so, co-referring mentions
could be tagged differently.
   Most of the approaches to Semantic Tagging for small-sized tagsets, as our,
use supervised Machine Learning (ML) techniques. The main problem found
when applying these techniques is the lack of enough annotated corpora for
5
  http://wordnet.princeton.edu/
6
  http://ihtsdo.org/snomed-ct/
7
  In fact only 9 classifiers are learned, for the Geographic Areas category a conventional
  NERC is used.
learning. In our system we overcome this problem following a distant learning ap-
proach. Distant learning is a paradigm for relation extraction, initially proposed
by [9], which uses supervised learning but with supervision not provided by man-
ual annotation but obtained from the occurrence of positive training instances
in a knowledge source or reference corpus. In [1] SNOMED CT, Wikipedia, and
DBPEDIA8 have been used as knowledge sources while in the research reported
here only the first one has been used.
    After this introduction, the organization of the article is as follows: In section
2 we sketch the state of the art of Semantic Tagging approaches. Section 3
presents the methodology followed in our previous work while Section 4 discusses
the modifications performed for dealing with the current task. The experimental
framework is described in section 5. Results are shown and discussed in section
6. Finally section 7 presents our conclusions and further work proposals.

2    Related Work
English is, by far, the most supported language for biomedical resources and
tools. The National Library of Medicine9 (NLM R ) maintains the Unified Med-
ical Language System10 (UMLS R ) that groups an important set of resources
to facilitate the development of computer systems to “understand” the meaning
of the language of biomedicine and health. It is worthnoting that only a small
fraction of such resources exist for other languages.
    A relevant aspect of information extraction is the recognition and identifi-
cation of biomedical entities (like disease, genes, proteins . . . ). Several Named
Entity Recognition techniques have been proposed to recognize such entities
based on their morphology and context. NER can be used to recognize previ-
ously known names and also new names, but cannot be directly used to relate
these names to specific biomedical entities found in external databases. For this
identification task, a dictionary approach is necessary. A problem is that existing
dictionaries often are incomplete and different variations may be found in the
literature; therefore it is necessary to minimize this issue as much as possible.
    There is a number of tools that take profit of the UMLS resources. Some the
more relevant are:
 – Metamap [10] is a pipeline that provides a mapping among concepts found
   in biomedical research English texts and those found in the UMLS Metathe-
   saurus R . For obtaining such link the input text undergoes a lexical/syntactic
   analysis and a number of mapping strategies. Metamap is highly configurable
   (it has data, output and processing options) and is being widely used since
   1994 by many researchers for indexing biomedical literature.
 – Whatizit [11] is also a pipeline for identifying biomedical entities. It includes
   a number of processes where each one is specialized in a type of task (chem-
   ical entities, diseases, drugs. . . ). Each module processes and annotates text
8
   http://wiki.dbpedia.org/
9
   http://www.nlm.nih.gov/
10
   http://www.nlm.nih.gov/research/umls/
   connecting to a publicly available specific databases (e.g. UniProtKb/Swiss-
   Prot, gene ontology, DrugBank. . . .).
 – Semantrix 11 is a private company that has developed the Ontotext Seman-
   tic Biomedical Tagger. It is an information extraction system designed to
   process biomedical texts using a number of biomedical databases.
    Keeping on the medical domain, an important source of information are the
proceedings of the 2010 i2b2/VA challenge on concepts, assertions, and rela-
tions in clinical text 12 [7]. The challenge included three sub-tasks, the first one,
Concept Extraction, namely patient medical problems, treatments, and medical
tests, corresponding to Semantic Tagging13 . Almost all the participants followed
a ML supervised approach. Regarding the first task, the one related to our sys-
tem, final results (evaluated using F1 metric) range from 0.788 to 0.852 for exact
matching and from 0.884 to 0.924 for the lenient inexact matching.
    A more recent and also interesting source of information is the DDI Ex-
traction 2013 (task 9 of Semeval-2013) [8]. Focusing on a narrower domain,
Drug-Drug interaction, the shared task included two challenges: i) Recognition
and Classification of Pharmacological substances, and ii) Extraction of Drug-
Drug interactions. The former is clearly a case of Semantic Tagging, in this case
reduced to looking for mentions of drugs within biomedical texts, but with a
finer granularity of the tagset. Regarding the first task, the overall results (us-
ing F1) range from 0.492 to 0.8. As DDI corpus was compiled from two very
different sources, DrugBank definitions and Medline abstracts, the results are
quite different depending on the source of the documents, for DrugBank, the
results range from 0.508 to 0.827, while for Medline, clearly more challenging,
the results range from 0.37 to 0.53.

3     Methodology followed in our previous work
3.1   Outline
As we mentioned above, the system presented here is heavily based on [1]. In this
Section we sketch the previous system (see details in the reference). The system
proposes a machine learning solution to a tagging task. Therefore, it requires two
main steps: training and annotation (see Figure 1). The main drawback of this
type of solutions is the dependency on annotated documents, which usually are
hard to obtain. Our main target was to train classifiers minimizing the impact
of this issue and keeping good results. For such a purpose we use, within the
distant learning paradigm, as learning examples, a set of seed words obtained
with a minimal human supervision. We used as semantic classes the top level
categories of the SNOMED CT hierarchy. More specifically its six more frequent
classes.
11
   http://semantrix.com.au
12
   Other i2b2/VA contests deal with other relevant medical text processing problems
   as co-reference detection or identification of medications, doses, forms of adminis-
   tration, etc.
13
   The other two tasks were Assertion classification and Relation classification.
                           Fig. 1. Train and testing pipelines


   We obtain an instance-based classifier (upper section in Figure 1) for each
semantic class using seed words extracted from three widely used knowledge
sources (section 3.2). The only form of human supervision is, as described below,
the assignment of about two hundred Wikipedia categories to their appropriate
SNOMED CT semantic class. Later (lower section in Figure 1) such models are
used to classify new instances.


3.2     Features extraction

To obtain the seed terms needed for learning the classifiers, we proceed in three
ways, using two different general purpose knowledge sources, Wikipedia and
DBPEDIA, and one, SNOMED CT, specific for the medical domain (see [12]
and [13] for analysis of these and other resources). From these sources, only
Wikipedia has been used in the work presented here.
   Wikipedia, although being a general purpose resource, densely covers the
medical domain; it contains terminological units from multiple medical thesauri
and ontologies, such as Classification of Diseases and Related Health Problems
(ICD-9, ICD-10), Medical Subject Headings (MeSH), and Gray’s Anatomy, etc.
We describe here the main characteristics of the method followed to obtain the
seed terms from Wikipedia, for the other sources [1] should be consulted.
   First we got the set of the most reliable Wikipedia categories14 . This resulted
on a set of 237 Wikipedia categories. We manually assigned to such categories
a unique SNOMED CT class from the set of 6 most frequent ones. For each of
these categories we obtained the full set of associated pages. For each page, we
14
     See [14] for details about the way of obtaining such categories from Wikipedia re-
     sources
calculate a purity factor, i.e. a score (ranging in [0,1]), of the appropriateness
of such page to a given SNOMED CT class15 . For such classes only the pages
having a purity of 1 are kept.
    The seed terms have been obtained with low human supervision. As can be
noticed by the way of collecting the seed terms, above, terms have associated
Wikipedia pages. The results, so, are sets of Wikipedia pages to be used for
learning the classifiers.
    Following [15], we generate training instances by automatically labelling each
instance of a seed term with its designated semantic class. When we create fea-
ture vectors for the classifier, the seeds themselves are hidden and only contextual
features are used to represent each training instance. Proceeding in this way the
classifier is forced to generalize with limited overfitting.


3.3   ML machinery

We created a suite of binary contextual classifiers, one for each semantic class.
The classifiers are learned using, as in [15], Support Vector Machine (SVM)
models using Weka toolkit [16]. Each classifier makes a weighted decision as to
whether a term belongs or not to its semantic class.
    Examples for learning correspond to the mentions of the seed terms in the
corresponding Wikipedia pages. Let x1 , x2 , . . . , xn the seed terms for the seman-
tic class t and knowledge source k, i.e. xi ∈ Rtk . Note that in this work only
the source k = wp is used. For each xi we obtain its Wikipedia page and we
extract all the mentions of seed terms occurring in the page. Positive examples
correspond to mentions of seed terms corresponding to semantic class t while
negative examples correspond to seed terms from other semantic classes. Fre-
quently, a positive example occurs within the text of the page but often many
other positive and negative examples occur as well. Features are simply words
occurring in the local context of mentions.
    The above mentioned procedure applies for regular Wikipedia pages but our
mechanism foresee also to use training corpus provided by the organizers. In this
case the occurrence of a given tagged term is a positive example for the class
that has been tagged but negative for the remaining classes.
    For processing the full corpus we use an in-house general purpose sentence
segmenter and POS tagger to identify non empty words in each sentence and
create feature vectors that represent each constituent in the sentence. For each
example, the feature vector captures a context window of n words to its left and
right16 without surpassing sentence limits.
    For evaluation we used Wikipedia categories - SNOMED CT classes map-
pings as gold standard. We considered for each semantic class t a gold standard
15
   A purity 1 means that all the Wikipedia categories attached to the page are mapped
   (directly or indirectly) into the same SNOMED CT class, lower values of the purity
   may mean that the assignment of Wikipedia categories to SNOMED CT classes is
   not unique or not exists.
16
   In the experiments reported here n was set to 3.
set including all the Wikipedia pages with purity 1, i.e. those pages unambigu-
ously mapped to t. The accuracy of the corresponding classifier is measured
against this gold standard set.

4      Current Methodology
Although our aim is applying the previous approach to the current setting there
significant differences that have to be faced:
1) The tagset is greater and comes from a different source (UMLS instead of
   SNOMED CT ).
2) The language is French instead of English.
3) The genre of documents (EMEA and Medline) is very different from Wikipedia
   pages.
So, we performed the following changes over our previous system:
     First, the way of collecting seedterms described in section 3.2 is modified as
follows: We manually mapped the UMLS tagset into the set of SNOMED CT
top categories (to the full 19 categories set, not to the 6 most frequent categories
as in the previous system) and, further to English Wikipedia categories. We
filtered out the English Wikipedia categories lacking French counterpart. For
some UMLS categories as LIVB the mapping was not biunivocal, for other
cases second level SN categories needed to be considered.
     After filtering out categories not containing French interwiki links, for some
of the UMLS classes a rather small set of Wikipedia classes remained, so we
decided to extend the set by considering the French entity mentions occurring in
the training set, we collected in this way 73 additional categories. We selected for
each UMLS class the set of mentions tagged with the corresponding tag in the
training collection existing as page or category in the French Wikipedia. In the
case of pages we obtained the corresponding Wikipedia categories. Once collected
a set of candidate French Wikipedia categories, we discarded those not having
English counterpart and we manually revised the resulting set for accepting or
rejecting each candidate and for assigning it to the correct UMLS tag. Then,
for each UMLS tag we iterated over all their English Wikipedia categories for
collecting all the pages having a purity 1 and having a French counterpart. In
this way we obtained the initial sets of French seed words for each UMLS class.
     Further processing, described in section 3.3, is basically the same. The only
difference is that for French we have used for processing documents, in learning
and test phases, the Freeling toolbox17 has been used.

5      Experimental framework
First, we proceed to collect the seed terms for each semantic class t 18 and each
knowledge source k. In our experiments we focussed on the Wikipedia-based ap-
17
     http://nlp.lsi.upc.edu/freeling/
18
     As said previously only 9
proach. The results for obtaining French terms starting from English Wikipedia
categories are shown in Table 1 and Table 2.
    Table 2 shows the number of French Wikipedia pages (i.e. terms) according
their UMLS class with independence of their purity figure. For obtaining such
pages we started using English data as shown in Table 1. This Table shows
the global figures of the extraction process from the very beginning to the fig-
ure which represents the total number of French Wikipedia pages available for
training.
    Table 1 shows the global figures of the extraction process. In all the cases we
found for the method based Wikipedia the number of terms (row 2), the French
extension from the training data (row 3), the length of the initial categories set
(row 4). Rows 5 and 6 show the number of pages from the English and French
Wikipedia. The reason to discard some Wikipedia articles are: i) only pages with
a length greater that 100 words are accepted, ii) some pages has been discarded
due to difficulties in extracting useful plain text (pages consisting mainly of
itemized lists, formulas, links, and so) and iii) Only Wikipedia pages with a
purity 1 have been selected. In Table 2 we show the number of accepted terms
splitted according the semantic class to which they belong.


                     Table 1. Terms effectively used for training

                                                        WP only
                      initial WP categories                 237
                      additional categories                  73
                      Total WP categories                   310
                      Total WP pages (EN)                16,972
                      Total available WP pages (FR)       3,564




6      Results
As mentioned above the learning phase has been followed using Wikipedia cat-
egories / UMLS classes mappings as golden standard and Wikipedia pages as
input documents. For each seed term we obtained its corresponding Wikipedia
page and, after cleaning, POS tagging, and sentence segmenting, we extracted
all the mentions. For carrying out this linguistic process we used the Freeling
suite (see [17] for details). For each mention the vector of features is built and
the 9 learned binary classifiers are applied to it19 . If none of the classifiers clearly
classify the instance as belonging to the corresponding semantic class no answer
is returned. If only one of the classifiers classifies positively the instance, the
19
     As quoted above the GEOG tag have been extracted using a conventional NERC
     based on using French DBPEDIA as a gazetteer
  Table 2. Terms available to be used for training according to its SNOMED class

                          UMLS class          WP only
                          Disorder                876
                          Procedures              624
                          Physiology              485
                          Anatomy               1,587
                          Live Beings              27
                          Chemicals and Drugs     392
                          Phenomena                 0
                          Devices                   0
                          Objects                   0
                          Total                 2,402



corresponding UMLS tag is returned. Otherwise a combination step has to be
carried out.
   For combining the results of the binary classifiers two methods have been
implemented:
 – Best result. As results of binary classifiers are scored, this method simply
   returns the class of the best scored individual result. It takes into account
   two threshold values: i) minimum class score and ii) minimum delta to the
   next better class score.
 – Meta-classifier. A SVM multiclass classifier is trained using as features the
   results of the basic binary classifiers together the context data already used
   in the basic classifiers. The resulting class is returned.
For the experiments presented in this paper, only the first combination method
has been tested. Several number of tests has been done by i) changing the num-
ber of WP articles including in training and ii) changing the threshold values
mentioned above.
    Table 3 depicts the global results as reported by the organization of CLEF2015.
Unfortunately the material officially delivered included some severe issues re-
garding offset calculation. This is the main reason of the poor results reported.
After detecting such issues the organisation of the contest proposed to fix the
issue and resubmit a run. So we plan, once fixed the bug, to incorporate to the
paper in the final release a new table showing our final results.
    The results shown in Table 3 are really poor and far from the results obtained
from our previous version performing on English Wikipedia pages, where we
obtained accuracies of 87.4 for Wikipedia-based and Snomed-based approaches
and 94.3 for DBpedia-based one. They require some explanation.
    They can be justified from one side with some issues in our program to
produce the results in a stand-off format as required by the organization. Table 5
shows very clearly that the terminological density of our Wikipedia corpus is
several times lower that the training corpora provided by the organization.
       Table 3. Results as reported by the organization of the CLEF2015’s


–              entities exact match                 entities inexact match
run    TP FP FN Precision Recall          F1 TP FP FN Precision Recall          F1
EMEA     0 2260 1067         0      0      0 83 2177 938      0,0367 0,0813 0,0506
MEDLINE 82 2895 888     0,0275 0,0845 0,0416 354 2623 672     0,1189 0,345 0,1769


                         Table 4. Results locally calculated

            run       TP Tagged Right selected Semantically right
            EMEA     2260  1090 421 (38,6%)          156 (6,9%)
            MEDLINE 2895    640 286 (44,7%)          118 (3,9%)
            MEDLINE* 2895  1008 450 (44,6%)          189 (6,3%)


            Table 5. Terminological density in WP and CLEF corpora

                           #Terms #Sentences             Density
                                                 [terms/sentence]
                    CLEF       4669         1692             2,76
                    WP          874         3158             0,50


              Table 6. Medical entities as tagged in file 4176905.txt

      Full sentence Modifications des protéines sériques et du liquide synovial
                    au cours de la polyarthrite rhumatoide .
      Entities      protéines sériques, protéines, sériques,
                    liquide synovial, liquide, synovial,
                    polyarthrite rhumatoide, polyarthrite, rhumatoide




    From the other side, such density is obtained by a tagging that embed sev-
eral terms in a single sentence. An example of this situation is shown in file
4176905.txt. The sentence and terms tagged are shown in Table 6. There is no
doubt that the tagging is correct but it is not clear that such concrete sentence
contains 9 terms instead of 3 as most term extractors will do. This fact partially
explain the low number of strings tagged by our system (see in Table 4 columns
TP versus Tagged)
    Obviously, the fact that in the current experiments learning is done from
Wikipedia and test is performed over very different genres of documents, EMEA
and Medline, while in our previous system the genre of training and test docu-
ments was the same is a drawback. The different coverage of French and English
Wikipedia and the lower accuracy of Freeling when performing on French texts
are important factors, too.
    Another minor issue is that text seems to include some kind of segmentation
(see for example: l’ enfant or d ’ activation plaquettaire induite par l ’ héparine
among many others). The words by themselves are not important but such seg-
mentation may cause errors in the POS tagging stage and this fact may be a
real problem.
    Nevertheless, Table 4 20 shows the results using only the strings as comparison
element (that is, without taking into consideration the offset values). The column
”Right selected” refers only to the number of strings correctly selected while the
column ”Semantically right” refers to the UMLS tag assigned to such strings.
Obviously a string may be correctly selected but the class assigned to it may be
wrong. The analysis of this table confirms that the results are poor.
    Leaving aside GEOG, that is detected using a specific mechanism, the best
performing classes are ANAT and LIVB with a precision greater 50% probably
due to the fact that they are the two classes more frequent in our training corpus.
    In order to improve the results, we perform some tests using both WP and
CLEF in the training stage. The results obtained are shown in Table 7. It shows
an improvement in the performance but also shows a problem in the string
selection. Examining the results in more details it reveals that if we consider
only right selected strings the precision is about 50%. Table 8 gives a more
detailed view about the results; it shows for a given true class it indicates which
has been the estimated classes. The best result has been obtained for the class
DISO that reaches a precision higher than 70%..


     Table 7. Results locally calculated using both WP and CLEF as training corpus

                run      TP Tagged Right selected Semantically right
                EMEA    2260  1088 394 (36,2%)          205 (9,1%)
                MEDLINE 2895  1356 542 (66,7%)          281 (9,4%)




7      Conclusions and further work

We have presented a system that automatically detects and tags medical terms
in general medical documents. The tagset used is derived from UMLS taxonomy.
The results of the system, as discussed in previous section are poor and far from
the obtained in our previous system, performing on English Wikipedia pages.
20
     This table shows two results for MEDLINE documents. The first is the result actually
     delivered for the contest. In this result a number of documents has been lost. The
     issue has been corrected and its result is indicated with an ’*’. Please note that the
     figures correspond with those resulted for EMEA documents.
                             Table 8. Error analysis

   Right                    Class proposed by the classifiers
   class     DISO PHEN PROC PHYS ANAT LIVB CHEM DEVI OBJC GEOG
   DISO        238    2    49     14     30    26      43     4 1   0
   PHEN          2    0     3      0      1     0        2    0 1   0
   PROC         34    0    61      3      3    10        9    0 0   0
   PHYS         14    1     9      5      1     4      15     1 3   0
   ANAT         12    2    10      6     28     5      15     1 1   0
   LIVB         15    0    14      4      5    71        0    0 0   0
   CHEM         15    0    11      7      9    13      76     1 1   0
   DEVI          2    0     1      2      1     0        3    2 0   0
   OBJC          1    0     1      0      0     0        0    0 5   0
   GEOG          0    1     1      0      1     0        4    0 0   0
   Precision 71.47 0.00 38.13 12.20 35.44 55.04 45.51 22.22 41.67 0.0



An initial error analysis has detected a program issue in the way of computing
off-sets of the detected mentions, also the extremely high difference in the den-
sity of mentions in the corpus used for learning (French Wikipedia pages) and
for testing (French EMEA and Medline documents) seems to point to a high
disagreement between training and test. A third issue is related to limitations
in the performance of Freeling, specially in the basic tokenisation task.
    The framework developed allows to perform additional experimenting chang-
ing several design parameters like the number of terms used for training, context
width, features definition, etc. Some tests will be performed to optimize such pa-
rameters.
    Several lines of research and a pending work will be followed in the next
future (beyond fixing the issues reported above).

 – As our results are based on one of the three knowledge sources used in our
   previous work, an obvious way of possible improvement is the use of the
   other two resources (SNOMED CT and DBPEDIA)
 – A combination and/or the specialization of the resources for learning more
   accurate classifiers. The application of the DBPEDIA based approach, clearly
   the most productive one, to all the classes merits a deeper investigation.
 – A careful combination of learning from the learning dataset and from addi-
   tional material should be experimented
 – Table 8 shows that three of the classes produced no results at all and another
   one only detected one term. In these cases the corresponding classifiers have a
   extremely low accuracy, probably due to few training examples. So acquiring
   additional examples for these cases could result on improvements.
 – Moving from semantic tagging of medical entities to semantic tagging of
   relations between such entities is a highly exciting objective, in the line of
   recent challenges in the medical domain (and beyond).
 – Improving the selection of medical entities by using POS pattern learning,
   adapting our term extractor to the tagging policy of medical entities in
   Quaero corpus and improving adaptation of Freeling to French medical texts.


8    Acknowledgements

This work was partially supported by the SKATER project (Spanish Ministe-
rio de Economı́a y Competitividad, TIN2012-38584-C06-01 and TIN2012-38584-
C06-05).


References
 1. Vivaldi, J., Rodrı́guez, H.: Medical entities tagging using distant learning. In:
    CICLing 2015, Part II, LNCS. Volume 9042. (2015) 631–642
 2. Goeuriot, L., Kelly, L., Hanna Suominen, L.H., Névéol, A., Grouin, C., Palotti, J.,
    Zuccon, G.: Overview of the CLEF eHealth evaluation lab 2015. clef 2015 - 6th
    conference and labs of the evaluation forum. Lecture Notes in Computer Science
    (LNCS), Springer (2015)
 3. Névéol, A., Grouin, C., Tannier, X., Hamon, T., Kelly, L., Goeuriot, L., Zweigen-
    baum, P.: CLEF eHealth evaluation lab 2015 task 1b: clinical named entity recog-
    nition. In: CLEF 2015 Online Working Notes, CEUR-WS (2015)
 4. Névéol, A., Grouin, C., Leixa, J., Rosset, S., Zweigenbaum, P.: The Quaero french
    medical corpus: A ressource for medical entity recognition and normalization. In:
    Proceedings of the Fourth Workshop on Building and Evaluating Ressources for
    Health and Biomedical Text Processing - BioTxtM2014. (2014) 29–30
 5. Ling, X., Singh, S., Weld, D.S.: Design challenges for entity linking. TACL 3
    (2015) 315–328
 6. Gattani, A., Lamba, D.S., Garera, N., Tiwari, M., Chai, X., Das, S., Subrama-
    niam, S., Rajaraman, A., Harinarayan, V., Doan, A.: Entity extraction, linking,
    classification, and tagging for social media: A wikipedia-based approach. PVLDB
    6(11) (2013) 1126–1137
 7. Özlem Uzuner, South, B.R., Shen, S., DuVall, S.L.: 2010 i2b2/VA challenge on
    concepts, assertions, and relations in clinical text. In: Journal of the American
    Medical Informatics Association. Volume 18. (2011) 552–556
 8. Segura-Bedmar, I., Martı́nez, P., Zazo, M.H.: Lessons learnt from the DDI
    extraction-2013 shared task. In: Journal of Biomedical Informatics, Elsevier, ISSN:
    1532-0464. (January 2014)
 9. Mintz, M., Bills, S., Snow, R., , Jurafsky, D.: Distant supervision for relation
    extraction without labeled data. In: Proceedings of the ACL. (2009) 1003–1011
10. Aronson, A.R., Lang, F.M.: An overview of Metamap: historical perspective and
    recent advances. In: JAMIA. Volume 17. (November 2010) 229–236
11. Rebholz-Schuhmann, D., Arregui, M., Gaudan, S., Kirsch, H., Jimeno, A.: Text
    processing through web services: calling Whatizit. In: Bioinformatics Applications
    Note. Volume 4. (November 2008) 296–298
12. He, J., de Rijke, M., Sevenster, M., van Ommering, R., Qian, Y. In: Generat-
    ing Links to Background Knowledge: A Case Study Using Narrative Radiology
    Reports, Glasgow, Scotland, UK. (October 2011)
13. Yeganova, L., Kim, W., Comeau, D., Wilbur, W.J.: Finding biomedical categories
    in medline R . In: Journal of Biomedical Semantics. (2012)
14. Vivaldi, J., Rodrı́guez, H.: Using Wikipedia for term extraction in the biomedical
    domain: first experience. In: Procesamiento del Lenguaje Natural. Volume 45.
    (2010) 251–254
15. Huang, R., Riloff., E.:         Inducing domain-specific semantic class taggers
    from(almost) nothing. In: Proceedings of the 48th Annual Meeting of the As-
    sociation for Computational Linguistics, Uppsala, Sweden (2010) 275–285
16. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The
    WEKA data mining software: An update. In: SIGKDD Explorations. (2009)
17. Padró, L., Stanilovsky, E.: Freeling 3.0: Towards wider multilinguality. In Cal-
    zolari, N., Choukri, K., Declerck, T., Doğan, M.U., Maegaard, B., Mariani, J.,
    Odijk, J., Piperidis, S., eds.: Proceedings of the 8th international conference on
    Language Resources and Evaluation, European Language Resources Association
    (ELRA) (2012)