
SIBM at CLEF e-Health Evaluation Lab 2015

Lina F. Soualmia

0 1

Chloé Cabot

Badisse Dahamna

Stéfan J. Darmoni

0 French National Institute for Health, INSERM, LIMICS UMR-1142, France
1 Normandie Univ., SIBM - TIBS - LITIS EA 4108, Rouen University and Hospital, France

In this paper, we report on our participation in the clinical named entity recognition task of the CLEF eHealth 2015 evaluation initiative, i.e. the fully automatic identification of clinically relevant entities in medical text in French. We address the task by using several biomedical knowledge organization systems (KOS) containing terms and their variants, either already available in French or partially translated by us in the context of existing projects. The extraction method is available online in the form of a web-based service that queries the KOS to extract clinical concepts from Electronic Health Records. It is also available via a user-friendly interface developed for clinicians. Our system did not obtain good results in inexact matching against the gold standard. However, this first participation allowed us to analyze our system and method and will help us improve them.

Information extraction; Bagging; Lexical semantics; Natural Language Processing; Information storage and retrieval; Vocabulary, controlled; Systematized Nomenclature of Medicine; Medical Subject Headings; International Classification of Diseases; Unified Medical Language System

With the increasing development of Electronic Health Records (EHRs) in hospitals and healthcare institutions [ 1 ], the amount of clinical documents in electronic format, such as discharge summaries, is also growing [ 2 ]. The retrieval of such documents is important in clinical and research tasks such as cohort studies or decision support in personalized medicine, i.e. medicine tailored to each patient by considering the genomic and clinical context of the individual. Indeed, these clinical documents are not only important to clinicians in daily use but also valuable to researchers and administrators. EHRs generate large amounts of data that offer new opportunities to gain insight into clinical care. In particular, EHR repositories make it possible to compose patient cohorts for the study of clinical hypotheses that are hard to test experimentally, such as individual variability in drug responses. However, composing those cohorts requires efficient and user-friendly information retrieval systems. To improve the performance of these systems, it is necessary to develop an automatic indexing system that outputs a representative index of each EHR. This index should consist of clinically relevant terms even though discharge summaries are written in free text.

Since 1995, the department of BioMedical Informatics of the Rouen University Hospital (SIBM; URL: www.cismef.org) has been working on developing tools to access health knowledge (information retrieval and automatic indexing) in French [ 3-8 ]. SIBM is a multidisciplinary team composed of physicians, medical informaticians, computer scientists, R&D engineers, librarians, postdoctoral and PhD students (n=21). SIBM is part of the Computer Science, Information Processing, and Systems Laboratory (LITIS-EA 4108), in Rouen, Normandy, France. More recently, SIBM has been working on the evaluation of health information systems and on information retrieval and indexing in EHRs [ 9-10 ]. In this context, a user-friendly tool and web-based service, ECMT (Extracting Concepts with Multiple Terminologies), has been developed. It has been included in several projects funded by the French national research agency [ 11-12 ]. To evaluate the precision of ECMT, SIBM participated for the first time in the CLEF eHealth evaluation initiative [ 13 ]. The main motivation for participating was to improve the functionalities of the tool. We selected the clinical named entity recognition task [ 14 ], which aims to fully automatically identify clinically relevant entities in medical texts in French. ECMT uses natural language processing (NLP) and patterns, and exploits several biomedical knowledge organization systems (KOS).

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 introduces our extraction approach and tool and describes our experimental setup. Section 4 reports on our results, error analysis and reflections. Finally, Section 5 gives concluding remarks and outlines future work.

Related Work

Information extraction is the extraction of pre-defined types of information from text [ 15 ]. There are four primary methods available to implement an information extraction system: Natural Language Processing (NLP), pattern matching, rules, and machine learning. The primary disadvantage of machine learning used for information extraction is that it requires a labeled dataset for training [ 16 ]. As most clinical data are stored in free text, the primary means of performing information extraction is natural language processing [ 17 ]. Several NLP systems have shown promising results in extracting information from medical narratives [ 18-21 ]. In [ 22 ], Turchin et al. used regular expressions (a meta-language which describes string search patterns) to extract numeric data from free text. The use of rules and pattern matching exploits basic patterns over a variety of structures, such as text strings, part-of-speech tags, semantic pairs, and dictionary entries [ 23 ]. Patterns are easily recognized by humans and can be expressed directly using special-purpose representation languages such as regular expressions. Regular expressions are effective when the structure of the text and the tokens are consistent, but tend to be one-off methods tailored to the extraction task. Regular expressions have been used to extract blood pressure values from progress notes [ 22 ]. NLP has been useful for extracting medical information such as principal diagnosis [ 20 ] and medication use [ 24 ] from clinical narratives.

Tools built over ontologies or controlled vocabularies such as the Systematized NOmenclature of MEDicine-Clinical Terms (SNOMED-CT) or the International Classification of Diseases-10 (ICD-10) have enabled researchers to automate the capture of information in clinical narratives [ 20 ]. Other tools have been developed. For example, Aronson et al. [ 25 ] developed the Medical Text Indexer. It is based on matching document terms with UMLS terms [ 26 ] using MetaMap, comparing the phrases of the document with the phrases of the concepts using the trigram method, and extracting MeSH terms from the k-nearest neighbors (kNN) of the document to be indexed. The indexing method of Névéol et al. [ 5 ] combines a linguistic method and kNN. The EAGL method [ 27 ] combines the vector space model (VSM) and a regular expression pattern matcher. BioAnnotator [ 28 ] uses a parser to identify noun phrases in a document and then matches them to UMLS concepts using a rule engine. AMTEx (automatic MeSH term extraction) [ 29 ] applies the C/NC value method, which extracts compound terms from the text by combining statistical and linguistic information and ranks the terms according to their C/NC value. Only terms belonging to MeSH are kept. Jonquet et al. [ 30 ] applied the Mgrep tool for extracting concepts using 200 biomedical ontologies and computed a score for each generated annotation according to its origin (preferred term, non-preferred term, synonym term, etc.). BioDI [ 31 ] reduces the limitation of partial matching by filtering MeSH concepts, which are extracted using VSM. MaxMatcher+ [ 32 ] exploits the BM25 weight for ranking the concepts extracted using MaxMatcher [ 33 ], which annotates documents with only the most significant words in the UMLS Metathesaurus.

All these methods are based on the use of one or several biomedical KOS, which link health concepts and give their associated terms, as well as their definition and code. Such a system may take the form of a terminology, thesaurus, controlled vocabulary, nomenclature, classification, taxonomy, ontology, etc. Indeed, KOS facilitate the indexing, coding and annotation of different kinds of documents. In the health domain, a great number of bio-terminological resources have been developed for different purposes (the content and structure depending on the purpose to be served). This proliferation of resources has made finding the correct concept increasingly complex when using multiple terminologies simultaneously. For example, the ICD-10 was designed for coding medical reports, the MeSH Thesaurus for document indexing, the ATC Classification for coding drugs, the SNOMED-CT for semantic interoperability among EHRs, and the MedDRA for adverse drug events. However, few of these resources are available for languages other than English [ 34 ]. SIBM developed and maintains a Health Terminologies and Ontologies Portal (HeTOP) [ 35 ] that contains 55 KOS in several languages. ECMT relies on the information system of HeTOP.

Material & Method

Extracting Concepts with Multiple terminologies : ECMT

ECMT is developed to extract, as accurately as possible, a list of candidate health concepts from input texts, using the 55 KOS included in HeTOP. The extraction is performed at the phrase level of the text. SOAP and REST web services return an XML response which contains, for each concept: the offsets of the first and the final word of the text span that led to the concept in the final list, the identifier and its semantic type if the health concept is included in the UMLS Metathesaurus, and the medical specialty of the concept. The latter is based on manual semantic links between general medical specialties (e.g. dermatology, oncology, etc.) and the KOS included in HeTOP. ECMT relies on bag-of-words matching and on pattern matching designed for discharge summaries, procedure reports or laboratory results, which contain symbolic data (presence or absence), numerical data and units of measurement. The bag-of-words method was developed mainly for information retrieval and has been adapted for indexing, i.e. only the largest set of words that maps a concept label is extracted, even if its subsets map other concepts. This is considered more precise and less noisy. The input text is normalized and each phrase is processed separately to extract the concepts.
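The bag-of-words principle described above can be illustrated with a minimal sketch. It is not ECMT's implementation: the toy lexicon, its identifiers, the normalization steps and the function names are assumptions made for illustration. Only the principle comes from the text: keep the largest word set that maps a concept label and drop concepts whose word set is a strict subset of it.

```python
import re
import unicodedata

# Hypothetical toy lexicon: concept label -> identifier (not real HeTOP content).
TOY_LEXICON = {
    "dispositifs intra-utérins": "KOS_A:1",
    "dispositifs": "KOS_A:2",
    "contraception": "KOS_A:3",
}


def normalize(text):
    """Lowercase, strip accents and split the text into word tokens."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return re.findall(r"[a-z0-9]+", no_accents)


def extract_concepts(phrase, lexicon=TOY_LEXICON, refined=True):
    """Return identifiers of concepts whose label words all occur in the phrase.

    With refined=True, a concept is dropped when its bag of words is a strict
    subset of the bag of another matched concept (the largest match wins).
    """
    phrase_bag = set(normalize(phrase))
    matched = {}
    for label, concept_id in lexicon.items():
        bag = frozenset(normalize(label))
        if bag and bag <= phrase_bag:
            matched[concept_id] = bag
    if not refined:
        return sorted(matched)
    return sorted(
        concept_id
        for concept_id, bag in matched.items()
        if not any(bag < other for other in matched.values())
    )


if __name__ == "__main__":
    phrase = "La contraception par les dispositifs intra utérins"
    print(extract_concepts(phrase))                 # largest matches only
    print(extract_concepts(phrase, refined=False))  # all matches
```

In this sketch the hypothetical `refined` flag plays a role comparable to the "r" option of ECMT: when it is disabled, the shorter concept "dispositifs" is also kept.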

ECMT also has a user-friendly interface (Fig. 1) accessible after authentication (http://ecmt.chu-rouen.fr/). Several options are available to index the text:

• "c" : categorizing. If "c=true" the specialties of each extracted concept and their UMLS semantic type are given as output (default value: "true").

• "r" : refined. If "r=true" the search is stopped when a concept that matches a maximum of words is extracted (default value: "true"). For example, for "cardiopathie hypertensive", if "r=true" only the concept "hypertension artérielle" is returned; if "r=false", the method returns "hypertension artérielle" and "maladie cardiaque" (the latter is returned because "cardiopathie" is a synonym of the concept "maladie cardiaque").

• "sn": semantic network. If "sn=true" the concepts that are directly related (aligned [ 36 ]) to the concepts of the text are also returned by ECMT (default value: "false").

• "e": exclusions. It is a string containing the identifiers of concepts to exclude (a specialty, a semantic type, a broader or a narrower term, etc.). For example, "e=CIS_MT_8,UML_ST_T060, MSH_D_C,T_DESC_PHARMA_RACINE" returns only concepts that are neither "chirurgies" (CIS_MT_8), nor "procedures de diagnostic" (UML_ST_T060), nor MeSH "Maladies" (MSH_D_C), nor "racines de spécialités pharmaceutiques" (T_DESC_PHARMA_RACINE). The default value is "": all the categories are returned and the user can filter them after the extraction. When a parent concept is given, all its descendants are also excluded from the output.

• "fi" : filters. It is a string containing the identifiers of concepts to keep in the output (same as "e").

• "a" : ancestors. If "a=true" ECMT returns also the ancestors of each concept (default value: "false").

• "d" : descendants. If "d=true" ECMT returns the descendants of each concept (default value: "false").

• "at" : alternative terms. If "at=true" the synonyms of the concepts are also returned in the output (default value: "true").
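As an illustration of how these options could be passed to the web-based service, the following sketch builds a query string carrying the option names and default values listed above. The endpoint path, the name of the `text` parameter and the empty default for "fi" are assumptions: the text gives only the host and the option names, not the actual REST route.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Hypothetical endpoint: only the host appears in the text, the path is invented.
ECMT_ENDPOINT = "http://ecmt.chu-rouen.fr/rest/extract"


@dataclass
class EcmtOptions:
    """Option names and defaults as listed above ("fi" default is assumed)."""
    c: bool = True    # categorizing
    r: bool = True    # refined
    sn: bool = False  # semantic network
    e: str = ""       # exclusions (comma-separated identifiers)
    fi: str = ""      # filters (comma-separated identifiers)
    a: bool = False   # ancestors
    d: bool = False   # descendants
    at: bool = True   # alternative terms


def build_query(text, options=None):
    """Return a URL with the text and the options encoded as query parameters."""
    options = options or EcmtOptions()
    params = {"text": text}  # "text" is a hypothetical parameter name
    for name, value in vars(options).items():
        if isinstance(value, bool):
            params[name] = "true" if value else "false"
        elif value:  # skip empty exclusion and filter strings
            params[name] = value
    return ECMT_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query("cardiopathie hypertensive", EcmtOptions(r=False)))
```

Keeping the options in a small data class makes it easy to compare runs that differ by a single option, for instance "r=true" against "r=false".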

The answer of the web-based service is an XML file which serializes the output of the annotation of the text. It is composed of the following tags:
• <cis-sentences> : the set of phrases that correspond to the input.
• <timemillis> : processing time in ms.
• <cis-sentence> : a phrase.
• <idsentence> : identifier of the phrase.
• <position> : beginning position of the phrase in the text.
• <start> : beginning position of indexing.
• <end> : end position of indexing.
• <idterm> : concept identifier in the original KOS.
• <offset> : set of the positions of the terms composing the concept.
• <ter> : acronym of the KOS of the concept.
• <umlscui> : UMLS concept identifier.
• <matchterms> : set of labels that allowed the concept to be retrieved.
• <cis:term> : preferred label of the concept.
• <cis:label> : label.
• <lang> : label language.
• <cis:altterms> : list of alternative labels of the concept.
• <cis:altterm> : alternative label of the concept (synonym).
• <cis:categorization> : list of specialties or semantic types.
• <cis:category> : a specialty or a semantic type.
• <cis:descendants> : list of all descendants of the concept.
• <cis:descendant> : a descendant of the concept.
• <cis:ancestors> : list of all ancestors of the concept.
• <cis:ancestor> : an ancestor of the concept.
• <cis:relateds> : list of all concepts semantically related to the concept.
• <cis:related> : a concept semantically related to the concept.
• <relationLabel> : label of the relation.
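To make the role of these tags concrete, the sketch below parses a small hand-written response with Python's standard `xml.etree.ElementTree` module. The sample document is hypothetical: the nesting of the tags, the `<concept>` wrapper element, the placeholder identifiers and the character offsets are assumptions, since the text lists the tag names but not the exact structure of the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: tag names come from the list above, but the nesting,
# the <concept> wrapper and all values except the CUI are invented.
SAMPLE_RESPONSE = """\
<cis-sentences>
  <timemillis>89</timemillis>
  <cis-sentence>
    <idsentence>1</idsentence>
    <position>0</position>
    <concept>
      <start>25</start>
      <end>50</end>
      <idterm>PLACEHOLDER_1</idterm>
      <ter>MSH</ter>
      <umlscui>C0021900</umlscui>
    </concept>
    <concept>
      <start>3</start>
      <end>16</end>
      <idterm>PLACEHOLDER_2</idterm>
      <ter>CCAM</ter>
    </concept>
  </cis-sentence>
</cis-sentences>
"""


def parse_concepts(xml_text):
    """Return one dictionary per extracted concept found in the response."""
    root = ET.fromstring(xml_text)
    concepts = []
    for sentence in root.iter("cis-sentence"):
        sentence_id = sentence.findtext("idsentence")
        for concept in sentence.iter("concept"):
            concepts.append({
                "sentence": sentence_id,
                "start": int(concept.findtext("start")),
                "end": int(concept.findtext("end")),
                "kos": concept.findtext("ter"),
                # None when the KOS concept has no UMLS identifier.
                "cui": concept.findtext("umlscui"),
            })
    return concepts


if __name__ == "__main__":
    for concept in parse_concepts(SAMPLE_RESPONSE):
        print(concept)
```

Returning `None` for a missing `<umlscui>` keeps concepts without a UMLS identifier visible, so that they can be filtered explicitly later.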

Fig. 2 gives an example of processing the phrase “La contraception par les dispositifs intra utérins”. ECMT extracts the MeSH terms “dispositifs contraceptifs” (CUI C0009886) and “dispositifs intra-utérins” (CUI C0021900), and the ATC term “contraceptifs intra-utérins” (CUI C3653534). The user can also visualize the alternative terms and categories (Fig. 3). The information retrieval system of HeTOP, and thus of ECMT, operates on more than 55 terminologies in French and English, partially or totally translated into French and aligned with semantic relations. For the latest version of ECMT (v3), the relational database management system (RDBMS) has been replaced by the distributed cache Infinispan to allow fast processing of the inputs (the example of Fig. 2 is processed in 89 ms). The main objectives are to optimize response times and to decouple the search engine from a proprietary RDBMS. The NoSQL solution Infinispan allows data to be distributed and queried from several web servers. The configuration with Hibernate Search combined with Apache Lucene for full-text indexing was retained. It allows ECMT to process 70,000 electronic health records per day using the 55 KOS.

At the time of the CLEF eHealth task 1b challenge [ 14 ], seven KOS had been migrated to Infinispan and were available for ECMT: the Medical Subject Headings (MeSH), the Anatomical Therapeutic Chemical classification (ATC), the Classification Commune des Actes Médicaux (CCAM), the Classification Internationale des Maladies - 10ème révision (CIM-10), MedlinePlus, the Systematized Nomenclature of MEDicine International, and Pharmacology. Table 1 contains their metrics. Each concept of these KOS has a Concept Unique Identifier (CUI) when it is available in the UMLS. This is the case, for example, for the CIM-10 but not for the CCAM.

The data set is the QUAERO French Medical Corpus, which was developed in 2013 as a resource for named entity recognition and normalization [ 37 ]. It was created by Névéol et al. in the wake of the 2013 CLEF-ER challenge, with the purpose of creating a gold standard set of normalized entities for French biomedical text. A selection of the MEDLINE titles and EMEA documents used in the 2013 CLEF-ER challenge was annotated by humans and is used in this challenge. Annotations are provided in the BRAT standoff format (http://brat.nlplab.org/standoff.html) and the annotation process was guided by concepts in the UMLS. Ten types of clinical entities, which are UMLS Semantic Groups, were annotated: Anatomy, Chemical and Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, Procedures. The annotations were made in a comprehensive fashion, so that nested entities were marked, and entities could be mapped to more than one UMLS concept.

In particular:
(i) If a mention can refer to more than one Semantic Group, all the relevant Semantic Groups should be annotated. For instance, the mention “récidive” (recurrence) in the phrase “prévention des récidives” (recurrence prevention) should be annotated with the category “DISORDER” (CUI C2825055) and the category “PHENOMENON” (CUI C0034897);
(ii) If a mention can refer to more than one UMLS concept within the same Semantic Group, all the relevant concepts should be annotated. For instance, the mention “maniaques” (obsessive) in the phrase “patients maniaques” (obsessive patients) should be annotated with CUIs C0564408 and C0338831 (category “DISORDER”);
(iii) Entities whose span overlaps with that of another entity should still be annotated. For instance, in the phrase “infarctus du myocarde” (myocardial infarction), the mention “myocarde” (myocardium) should be annotated with category “ANATOMY” (CUI C0027061) and the mention “infarctus du myocarde” should be annotated with category “DISORDER” (CUI C0027051).

Results & Discussion

For each run (MEDLINE and EMEA), the web-based service of ECMT was used. Before submitting our runs, we tested ECMT with the default options (described in Section 3.1) and with the 7 available KOS for extracting entities and normalized entities. For the purposes of the task and the evaluation, the ECMT output is converted into the BRAT format. Fig. 4 shows the annotation file obtained for the phrase of Fig. 2, “La contraception par les dispositifs intra utérins”.
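The conversion step can be sketched as follows. This is an illustrative sketch rather than the converter actually used: it assumes that each entity is written as a BRAT text-bound annotation line (`T` identifier, type, character offsets, mention) and that the CUI is attached through an `AnnotatorNotes` line, and it takes the mentions, categories and CUIs of the “infarctus du myocarde” example given earlier in this section. The function name and the input tuple layout are hypothetical.

```python
def to_brat(text, entities):
    """Serialize entities as BRAT standoff lines.

    `entities` is a list of (mention, category, cui) tuples; `cui` may be None.
    Only the first occurrence of each mention in `text` is located.
    """
    lines = []
    for index, (mention, category, cui) in enumerate(entities, start=1):
        start = text.find(mention)
        if start < 0:
            continue  # mention not present in the text, nothing to annotate
        end = start + len(mention)
        # Text-bound annotation: ID <TAB> TYPE START END <TAB> MENTION
        lines.append(f"T{index}\t{category} {start} {end}\t{mention}")
        if cui:
            # Note line carrying the normalization (assumed convention).
            lines.append(f"#{index}\tAnnotatorNotes T{index}\t{cui}")
    return lines


if __name__ == "__main__":
    phrase = "infarctus du myocarde"
    entities = [
        ("infarctus du myocarde", "DISORDER", "C0027051"),
        ("myocarde", "ANATOMY", "C0027061"),
    ]
    print("\n".join(to_brat(phrase, entities)))
```

Because the offsets are character positions in the original text, any character that is dropped or rewritten before matching has to be accounted for when the offsets are computed.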

The results obtained for the challenge (exact match precision, recall and F-score) are presented in tables 3, 5, 7 and 9 below (MEDLINE and EMEA) and are reported in [ 13 ]. We also present inexact performance scores in tables 4, 6, 8 and 10.

TP 680    FP 2297    FN 4412

TP 596    FP 1990    FN 3542
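For reference, precision, recall and F-score follow from the true positive (TP), false positive (FP) and false negative (FN) counts in the usual way: precision = TP / (TP + FP), recall = TP / (TP + FN), and the F-score is their harmonic mean. The short sketch below applies these standard definitions to the two sets of counts listed above. Which table each set of counts belongs to cannot be recovered from the text, so the labels in the code are placeholders.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and balanced F-score from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Counts taken from the text; the labels are placeholders.
    counts = {
        "first set": (680, 2297, 4412),
        "second set": (596, 1990, 3542),
    }
    for label, (tp, fp, fn) in counts.items():
        p, r, f = precision_recall_f1(tp, fp, fn)
        print(f"{label}: precision={p:.4f} recall={r:.4f} F1={f:.4f}")
```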

The results obtained for the challenge are not satisfactory at all, particularly for the EMEA corpus. The poor results obtained for the MEDLINE corpus can be explained by the duplicates existing in the KOS (Tab. 11), which decrease the precision, and by concepts that are extracted from KOS not included in the UMLS: for these, no CUI and no semantic group are available in the output, which produces noise. Also, the poor exact match results, compared to the inexact match results, could be explained by slight differences in the terms used. The gold standard uses UMLS labels while ECMT outputs the preferred labels of the original KOS. This leads to minor differences between the CLEF and ECMT outputs, such as “douleur” in the CLEF output vs. “douleurs” in the ECMT output. Finally, as no specific processing was done to extract overlapping entities as described for the task [ 14 ], several nested entities are missed. For example, in Fig. 4 only the concept “C0021900” is in common with the gold standard (Fig. 5). Other entities are extracted by ECMT but are not in the gold standard. As they are more precise, these concepts should not be considered as noise.

Tab. 11. Total number of distinct terms in French (preferred terms, concept labels, synonyms, etc.) of the KOS used in the task.

KOS            Terms
ATC            11,322
CCAM           25,609
CIM-10         107,790
MedlinePlus    877
MeSH           288,016
Pharma         34,172
SNOMED-Int.    151,407

The results obtained for the EMEA corpus are null (Tab. 5, Tab. 6, Tab. 9, Tab. 10). This can be explained by the presence of specific characters in the text. Fig. 6 and Fig. 7 give an example of the processing of an EMEA document excerpt, “Dans quel cas Tysabri est-il utilisé ? Tysabri est utilisé dans le traitement des adultes atteints de sclérose en plaques”: all the rest of the phrase after the character “?” is ignored. Also, some characters such as “:”, “µ” or newlines cause offsets to be shifted, due to specific ECMT processes, which decreases exact match results, especially in EMEA documents, which contain many of those characters.

After the submission of the ECMT runs, the migration of the 55 KOS of HeTOP into Infinispan was completed. A set of 32 KOS in French is available (Tab. 12). We tested the whole training dataset (832 vs. 400 in the first test) using the initial 7 KOS used in the challenge and also the 32 KOS.

The results obtained are reported in tables 13, 14, 15 and 16 hereafter. They are no longer null but are still not satisfactory. Including several KOS increases the precision and decreases the recall in exact matching.

TP 80 86 7 225 21 89 2 9 42 86 647

FP 411 258 32 725 12 205 25 51 118 486 2323

Perspectives for future work

SIBM participated for the first time in an evaluation challenge. The clinical named entity recognition task of the CLEF eHealth 2015 evaluation initiative [ 13 ] allowed us to evaluate ECMT in a very specific context (indexing MEDLINE titles and EMEA documents in French). ECMT is developed to index Electronic Health Records via a web-based service and also via a user-friendly interface. The current version of ECMT (v3) is optimized to process around 70,000 EHRs per day. ECMT was not trained with the training sets of the challenge, and it used the default options and 7 KOS (vs. 55 today). For this kind of challenge, clinical named entity recognition, it would in our view be more interesting to have a dataset of clinical documents in French instead of MEDLINE titles or EMEA documents with special characters.

The main conclusion of this work and of the results obtained is that, before submitting our runs, we should have studied the training sets and identified, for example, the special characters that are ignored by ECMT (mainly in the EMEA corpus). We should also have identified the set of KOS that gives the best results, and tested combinations of the options against the default values. For instance, to manage overlapping entities, the value of “r” should be r=false, so that ECMT does not recognize only the concept that maps the largest bag of words. For normalized entities, the value of the parameter “sn” should be sn=true, to exploit all the existing mappings until a UMLS concept that belongs to the semantic groups of the task is recognized. We expect to carry out this parameter tuning in the near future. We plan to participate in other similar challenges, with better preparation.

1. Jha, DesRoches, Kralovec, Joshi. A progress report on electronic health records in US hospital. Health affairs 2010, 29(10): 1951-57.

2. Schuemie, Sen, Jong, Van Soest, Sturkenboom, Kors. Automating classification of free-text electronic health records for epidemiological studies. Pharmacoepidemiology and drug safety 2012, 21(6): 651-8.

3. Darmoni, Thirion, Leroy, Douyère, Lacoste, Godard, Rigolle, Brisou, Videau, Goupy, Piot, Quéré, Ouazir, Abdulrab. A search tool based on 'encapsulated' MeSH thesaurus to retrieve quality health resources on the internet. Medical Informatics and the Internet in Medicine 2001, 26(3): 165-178.

4. Soualmia, Darmoni. Combining different standards and different approaches for health information retrieval in a quality-controlled gateway. International Journal of Medical Informatics 2005, 74(2-4): 141-50.

5. Névéol, Rogozan, Darmoni. Automatic indexing of online health resources for a French quality controlled gateway. Information Processing & Management 2006, 42(3): 695-709.

6. Soualmia, Sakji, Letord, Rollin, Massari, Darmoni. Improving information retrieval with multiple health terminologies in a quality-controlled gateway. BMC Health Information Science and Systems 2013, 1: 8.

7. Griffon, Schuers, Soualmia, Grosjean, Kerdelhué, Kergourlay, Dahamna, Darmoni SJ. A Search Engine to Access PubMed Monolingual Subsets: Proof of Concept - Evaluation in French. Journal of Medical Internet Research 2014, 16(12): e271.

8. Chebil, Soualmia, Omri, Darmoni SJ. Indexing biomedical documents with a possibilistic network. Journal of the Association for Information Science and Technology 2015, in press.

9. Cabot, Grosjean, Lelong, Lefebvre, Lecroq, Soualmia, Darmoni SJ. Omic Data Modelling for Information Retrieval. Proceedings of the 2nd International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO, 2014, pp. 415-424.

10. Lelong, Merabti, Grosjean, et al. Moteur de recherche sémantique au sein du dossier du patient informatisé : langage de requêtes spécifique. In proceedings of 15èmes Journées Francophones d'Informatique Médicale, 2014, CEUR Workshop Proceedings Vol. 1323.

11. Dupuch, Segond, Bittar, Dini, Soualmia, Darmoni, Gicquel, Metzger. Separate the grain from the chaff: make the best use of language and knowledge technologies to model textual medical data extracted from electronic health records. In proceedings of the 6th Language & Technology Conference, 2013.

12. Thiessard, Mougin, Diallo, Jouhet, Cossin, Garcelon, Campillo, Jouini, Grosjean, Massari, Griffon, Dupuch, Tayalati, Dugas, Balvet, Grabar, Pereira, Frandji, Darmoni, Cuggia. RAVEL: Retrieval And Visualization in ELectronic health records. In Studies in Health Technologies and Informatics, 2012, 180: 194-8.

13. Goeuriot, Kelly, Suominen, Hanlen, Névéol, Grouin, Palotti, Zuccon. Overview of the CLEF eHealth Evaluation Lab 2015. CLEF 2015 - 6th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, September 2015.

14. Névéol, Grouin, Tannier, Hamon, Kelly, Goeuriot, Zweigenbaum P. CLEF eHealth Evaluation Lab 2015 Task 1b: Clinical Named Entity Recognition. In CLEF 2015 Online Working Notes. CEUR-WS.

15. DeJong G. An overview of the FRUMP system. Strategies for natural language processing. 1982: 149-176 (Chapter 5).

16. Zweigenbaum P, Lavergne, Grabar, Hamon, Rosset, Grouin. Combining an expert-based medical entity recognizer to a machine-learning system: methods and a case study. Biomedical Informatics Insights, 2013, 6(Suppl 1): 51-62.

17. Hayes, Carbonell J. Natural Language Understanding. Encyclopedia of Artificial Intelligence 1987: 660-677.

18. Tange, H.J, de Hasman, PF, Schouten. Medical narratives in electronic medical records. International Journal of Medical Informatics, 1997, 46: 7-29.

19. Taira, R. K., Soderland. A statistical natural language processor for medical reports. Proceedings of the American Medical Informatics Association Symposium, 1999: 970-4.

20. Zeng, Qing, Goryachev, Weiss, Sordo, Murphy, Ross. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making, 2006, 6: 30.

21. Voorham, Denig. Computerized extraction of information on the quality of diabetes care from free text in electronic patient records of general practitioners. Journal of the American Medical Informatics Association, 2007, 14(3): 349-54.

22. Turchin, Kolatkar, Grant, Makhni, Pendergrass, Einbinder. Using regular expressions to abstract blood pressure and treatment intensification information from the text of physician notes. Journal of the American Medical Informatics Association, 2006, 13: 691-5.

23. Pakhomov, Buntrock, Duffy. High throughput modularized NLP system for clinical text. In proceedings of the Association for Computational Linguistics 2005, 25-8.

24. Xu, Stenner, Doan, Johnson, Waitman, Denny JC. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association 2010, 17: 19-24.

25. Aronson, Mork, Gay, Humphrey, Rogers. The NLM indexing initiative's medical text indexer. Medical Health Informatics, 2004, 11(1): 268-272.

26. Bodenreider O. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research 2004, 32(4): 267-270.

27. Ruch. Automatic assignment of biomedical categories: Toward a generic approach. Bioinformatics 2006, 22(6): 658-664.

28. Mukherjea, Subramaniam, Chanda, Sankararaman, Kothari, Batra, Bhardwaj, Srivastava. Enhancing a biomedical information extraction system with dictionary mining and context disambiguation. IBM Journal of Research and Development 2004, 48(5-6): 693-701.

29. Hliaoutakis, Zervanou, Petrakis EGM. The AMTEx approach in the medical document indexing and retrieval application. Data and Knowledge Engineering 2009, 68(3): 380-392.

30. Jonquet, Lependu, Falconer, Coulet, Noy, Musen, Shah. NCBO resource index: Ontology-based search and mining of biomedical resources. Journal of Web Semantics 2011, 9(3): 316-324.

31. Chebil, Soualmia, Darmoni SJ. BioDI: a new approach to improve biomedical documents indexing. Proceedings of the 24th International Conference on Database and Expert Systems Applications 2013: 78-87.

32. Dinh, Tamine L. Towards a context sensitive approach to searching information based on domain specific knowledge sources. Web Semantics: Science, Services and Agents on the World Wide Web 2012, 12-13: 41-52.

33. Zhou, Zhang, Hu. MaxMatcher: Biological concept extraction using approximate dictionary lookup. In Pacific Rim International Conferences on Artificial Intelligence 2006: 145-149.

34. Névéol, Grosjean, Darmoni, Zweigenbaum. Language Resources for French in the Biomedical Domain. Language and Resource Evaluation Conference, 2014: 2146-2151.

35. Grosjean, Merabti, Dahamna, Kergourlay, Thirion, Soualmia, Darmoni. Health Multi-Terminology Portal: a semantics added-value for patient safety. Studies in Health Technology and Informatics 2011, Vol. 166: 129-138.

36. Merabti, Soualmia, Grosjean, Joubert, Darmoni. Aligning Biomedical Terminologies in French: Towards Semantic Interoperability in Medical Applications. Chapter in Medical Informatics, 2012: 41-68. InTech Publishing.

37. Névéol, Grouin, Leixa, Rosset, Zweigenbaum P. The QUAERO French Medical Corpus: A Ressource for Medical Entity Recognition and Normalization. Fourth Workshop on Building and Evaluating Ressources for Health and Biomedical Text Processing - BioTxtM 2014: 24-30.