<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">SIBM at CLEF eHealth Evaluation Lab 2017: Multilingual Information Extraction with CIM-IND</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Chloé</forename><surname>Cabot</surname></persName>
							<email>chloe.cabot@chu-rouen.fr</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">SIBM</orgName>
								<orgName type="department" key="dep2">TIBS -LITIS EA 4108</orgName>
								<orgName type="institution" key="instit1">Normandie Univ</orgName>
								<orgName type="institution" key="instit2">Rouen University and Hospital</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lina</forename><forename type="middle">F</forename><surname>Soualmia</surname></persName>
							<email>lina.soualmia@chu-rouen.fr</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">SIBM</orgName>
								<orgName type="department" key="dep2">TIBS -LITIS EA 4108</orgName>
								<orgName type="institution" key="instit1">Normandie Univ</orgName>
								<orgName type="institution" key="instit2">Rouen University and Hospital</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">French National Institute for Health</orgName>
								<orgName type="institution" key="instit2">INSERM</orgName>
								<orgName type="institution" key="instit3">LIMICS UMR 1142</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stéfan</forename><forename type="middle">J</forename><surname>Darmoni</surname></persName>
							<email>stefan.darmoni@chu-rouen.fr</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">SIBM</orgName>
								<orgName type="department" key="dep2">TIBS -LITIS EA 4108</orgName>
								<orgName type="institution" key="instit1">Normandie Univ</orgName>
								<orgName type="institution" key="instit2">Rouen University and Hospital</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">French National Institute for Health</orgName>
								<orgName type="institution" key="instit2">INSERM</orgName>
								<orgName type="institution" key="instit3">LIMICS UMR 1142</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">SIBM at CLEF eHealth Evaluation Lab 2017: Multilingual Information Extraction with CIM-IND</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3AE8585ED3C4E6AF4404EC0F01809057</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Information extraction</term>
					<term>Entity recognition</term>
					<term>Lexical semantics</term>
					<term>Natural Language Processing</term>
					<term>International Classification of Diseases</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents SIBM's participation in Task 1 (Multilingual Information Extraction - ICD10 coding) of the CLEF eHealth 2017 evaluation initiative, which focuses on named entity recognition in French and English death certificates. We addressed the identification of relevant clinical entities within the International Classification of Diseases version 10 (ICD10) in the CépiDC and CDC datasets with our CIM-IND system. CIM-IND is a multilingual system designed to recognize named entities in French and English texts using a dictionary-based approach combined with natural language processing and fuzzy matching methods. The evaluation was performed for two cases: (i) for all ICD10 codes, the main evaluation for the task, and (ii) for ICD10 codes addressing a particular type of death, called external causes or violent deaths. On the English test set, our system obtained F-scores of 0.81 for all ICD10 codes and 0.4066 for external causes. On the French aligned test set, our system obtained F-scores of 0.8038 for all ICD10 codes and 0.5011 for external causes. On the French raw test set, our system obtained F-scores of 0.7636 for all ICD10 codes and 0.4897 for external causes. These scores were substantially higher than the average score of the systems that participated in the challenge.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Since the amount of digital medical documents has expanded widely in the last twenty years, information retrieval from such heterogeneous documents has become a significant challenge for a large variety of tasks in clinical and biomedical research as well as personalized medicine. Named entity recognition (NER) is a basic sub-task of information extraction that aims to extract and classify entity names from text. The NER problem has been studied widely in the last decade in the biomedical field as well as in others such as social media <ref type="bibr" target="#b0">[1]</ref> or speech data <ref type="bibr" target="#b1">[2]</ref>. As the use of NER services has expanded, state-of-the-art algorithms have improved on formal medical text for English <ref type="bibr" target="#b2">[3]</ref>. However, NER algorithms struggle to adapt to free text because they are designed for formal text and rely on features present in well-formed text such as biomedical articles. Free text in medical notes contains spelling errors and incorrect use of punctuation, grammar and capitalization <ref type="bibr" target="#b3">[4]</ref>. In other languages, free text can also present incorrect use of diacritical marks. In medical reports, text usually consists of short or incomplete sentences, similar to note-taking, with substantial use of ambiguous abbreviations. Clinical records are usually created in a rush without any proofreading. Consequently, a large number of spelling errors occur. These errors are related not only to the complexity of the language but also to characteristics of the medical domain. <ref type="bibr">Siklósi et al.</ref> found that the most frequent types of errors are unintentional mistyping, grammatical errors, sentence fragments, and non-standardized abbreviations <ref type="bibr" target="#b4">[5]</ref>. 
In fact, as opposed to formal text, abbreviations are rarely defined in medical reports. Despite the efforts made in NER, even in the biomedical domain, information extraction in clinical notes still faces several challenges <ref type="bibr" target="#b5">[6]</ref>.</p><p>Since 1995, the department of BioMedical Informatics of the Rouen University Hospital (SIBM, URL: www.cismef.org) has been working on developing tools to access health knowledge (information retrieval and automatic indexing) in French <ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b7">[8]</ref><ref type="bibr" target="#b8">[9]</ref><ref type="bibr" target="#b9">[10]</ref>. More recently, our team has worked on the evaluation of health information systems and on information retrieval and indexing in Electronic Health Records (EHRs) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>. In this context, a multilingual system called CIM-IND has been developed. CIM-IND is designed to recognize named entities in French and English texts using a dictionary-based approach combined with natural language processing and fuzzy matching methods. The main objective of this system is to deal accurately and efficiently with the informal and noisy nature of free text in medical reports. To assess the performance of CIM-IND, our team participated in the CLEF eHealth 2016 Task 2 <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, which aimed to identify clinically relevant entities in French death certificates fully automatically, and obtained average results <ref type="bibr" target="#b14">[15]</ref>. While death certificates are standardized documents filled in by physicians to report the death of a patient, they usually present spelling or typing errors, abbreviations, and, in French, non-diacritized text or a mix of cases and diacritized text. 
The main motivation for participating is to improve the functionalities of the tool, to determine the progress achieved since last year's participation, and to assess our ability to address the issues detected then. As Task 1 (Multilingual Information Extraction - ICD10 coding) of the CLEF eHealth 2017 evaluation initiative involved assigning codes from the International Classification of Diseases, version 10 (ICD10) to both French and English death certificates <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>, we were also able to test our multilingual approach.</p><p>The rest of the paper is organized as follows. In Section 2 we introduce our extraction approach and the tools used in this task, and we describe our experimental setup. Section 3 reports on our results. Section 4 presents some error analyses and reflections, offers concluding remarks and outlines future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Material and methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Test datasets</head><p>French CépiDC datasets. Since 1968, the CépiDC, a laboratory of the French National Institute for Health and Medical Research (Inserm), has been responsible for producing the annual national statistics on medical causes of death in association with the French National Institute for Statistics and Economic Studies (Insee), for disseminating the data, and for conducting studies and research on the medical causes of death. These statistics are built from information in death certificates. The CépiDC team handles a database containing more than 18,000,000 death records <ref type="bibr" target="#b17">[18]</ref>. The task is an information extraction task that consists of extracting ICD10 codes from the raw text of death certificates, line by line. Two datasets are provided for the task: the "aligned dataset" and the "raw dataset". As the structure of the files provided by these two sets differs, some minor adjustments were necessary to process them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Aligned dataset</head><p>The dataset includes 31,690 death certificates processed by CépiDC in 2014, totalling 91,962 lines. The annotations in the CépiDC corpus consist of ICD10 codes and were assigned per text line. The dataset is supplied in one CSV-formatted file. Each row contains twelve information fields associated with a raw line of text from an original death certificate as follows:</p><list><item>DocID: death certificate ID</item><item>YearCoded: year the death certificate was processed by CépiDC</item><item>Gender: gender of the deceased</item><item>Age: age at the time of death, rounded to the nearest five-year age group</item><item>LocationOfDeath: location of death</item><item>LineID: line number within the death certificate</item><item>RawText: raw text entered in the death certificate</item><item>IntType: type of time interval the patient had been suffering from the coded cause, according to the following categories: minutes, hours, days, months, years</item><item>IntValue: length of time the patient had been suffering from the coded cause</item><item>CauseRank: rank of the ICD10 code</item><item>StandardText: dictionary entry or excerpt of the raw text that supports the selection of an ICD10 code (if any)</item><item>ICD10: ICD10 code associated with the certificate corresponding to the DocID and LineID</item></list><p>The output comprises the nine input fields plus two text fields (CauseRank and StandardText) used to report evidence text supporting the ICD10 code supplied in the twelfth, final field.</p></div>
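<div xmlns="http://www.tei-c.org/ns/1.0"><p>The twelve-field row layout described above can be illustrated with a minimal Python sketch. It is not part of the task distribution or of CIM-IND: the function name read_aligned_rows is hypothetical, the semicolon delimiter is an assumption based on the sample annotation lines shown for the English output, and whether the real files carry a header row is not stated in this paper.</p>
```python
import csv
import io

# Field names in the order given in the description of the aligned dataset.
FIELDS = [
    "DocID", "YearCoded", "Gender", "Age", "LocationOfDeath", "LineID",
    "RawText", "IntType", "IntValue", "CauseRank", "StandardText", "ICD10",
]


def read_aligned_rows(handle, delimiter=";"):
    """Yield one dict per twelve-field row (hypothetical helper).

    The delimiter is an assumption; rows with another field count are skipped.
    """
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) == len(FIELDS):
            yield dict(zip(FIELDS, row))


# Sample line taken from the English annotation example in this paper.
sample = io.StringIO(
    "13496;2015;;;;6;Senile dementia of Alzheimer's type ASHD;;;;senile dementia;F03\n"
)
rows = list(read_aligned_rows(sample))
print(rows[0]["RawText"], rows[0]["StandardText"], rows[0]["ICD10"])
```
<p>In this sketch the seventh field is exposed as RawText, the eleventh as StandardText and the last as ICD10, which matches the field positions described for the annotation files.</p></div>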
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Raw dataset</head><p>The data from 31,683 death certificates is distributed over three CSV-formatted files. The first file includes the following fields: DocID, YearCoded, LineID, RawText, IntType, IntValue. The second file includes the following fields: DocID, YearCoded, Gender, PrimCauseCode, Age, LocationOfDeath. The third file includes the following fields: DocID, YearCoded, LineID.</p><p>English CDC dataset. The data from 6,665 death certificates is distributed over three CSV-formatted files. The first file includes the following fields: DocID, YearCoded, LineID, RawText, IntType, IntValue. The second file includes the following fields: DocID, YearCoded, Gender, PrimCauseCode, Age, LocationOfDeath. The third file includes the following fields: DocID, YearCoded, LineID.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Dictionaries</head><p>The French CépiDC corpus includes six versions of a manually curated ICD10 dictionary developed at CépiDC, corresponding to the years 2006-2010, 2011, 2012, 2013, 2014 and 2015. The English CDC corpus includes a manually curated ICD10 dictionary developed by the CDC providing 170,285 entries. These resources were used to build spelling dictionaries. Moreover, the training sets were used to complete these dictionaries.</p><p>Spelling dictionaries. For each language, the dictionary versions were merged if necessary. Each ICD term was split into words and duplicates were removed. The two resulting lists of unique words provided a spelling dictionary for each language.</p><p>Additional dictionaries. An additional dictionary was then computed from each training set by extracting ICD10 code and term combinations. The number of times an ICD10 code was used in the training corpus was also determined. For ambiguous terms, i.e. terms that corresponded to more than one ICD10 code, the most frequently used code was kept. Each additional dictionary was merged with the dictionaries provided in the corresponding corpus. If a term was present in both the additional dictionary and a corpus dictionary but the corresponding codes were different, the code from the additional dictionary was removed to avoid introducing ambiguity between dictionary versions. This processing helped to complete the provided dictionaries, especially with some missing abbreviations.</p></div>
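<div xmlns="http://www.tei-c.org/ns/1.0"><p>The dictionary construction described above can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the CIM-IND code: all function names are hypothetical, terms are lowercased as a simplifying assumption, and the rule for ambiguous terms is read as keeping the ICD10 code most frequently associated with the term in the training set.</p>
```python
from collections import Counter


def build_spelling_dictionary(icd_terms):
    """Split every ICD10 term into lowercase words and keep each word once."""
    words = set()
    for term in icd_terms:
        words.update(term.lower().split())
    return sorted(words)


def build_additional_dictionary(training_pairs):
    """Map each training term to its most frequently used ICD10 code."""
    counts = {}
    for term, code in training_pairs:
        counts.setdefault(term.lower(), Counter())[code] += 1
    return {term: codes.most_common(1)[0][0] for term, codes in counts.items()}


def merge_dictionaries(corpus_dict, additional_dict):
    """Add training-set terms; on a conflict the corpus dictionary code is kept."""
    merged = dict(corpus_dict)
    for term, code in additional_dict.items():
        if term not in merged:
            merged[term] = code
    return merged


extra = build_additional_dictionary(
    [("ashd", "I251"), ("ashd", "I251"), ("ashd", "I259"), ("pneumonia", "J180")]
)
print(merge_dictionaries({"pneumonia": "J189"}, extra))
```
<p>In the usage example, the abbreviation "ashd" is added from the training pairs, while the conflicting training code for "pneumonia" is discarded in favour of the corpus dictionary entry.</p></div>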
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Extracting ICD10 concepts from death certificates with CIM-IND</head><p>CIM-IND is designed to match the input text against the ICD10 terms of the relevant version of the ICD10. The extraction is performed at the phrase level of the text using natural language processing techniques. The system is built using Python and Python/C extensions and provides a response in CSV format for each identified concept with: (i) the entry text, (ii) the offsets of the first and final words of the health concept, (iii) the ICD10 identifier and (iv) the ICD10 term. CIM-IND performs three main steps to identify ICD10 terms: normalization, candidate selection and candidate ranking.</p><p>Normalization. Several pre-processing steps are performed, including stop word filtering (using the default NLTK stop word lists for both French and English <ref type="bibr" target="#b18">[19]</ref>) and elision filtering (removing abbreviated articles that are contracted with terms). Words are matched case-insensitively. Diacritics in French texts are preserved and Unicode is used for matching. Finally, spell checking is performed with the Enchant library using the manually built dictionary.</p><p>Candidate selection. A method based on the phonetic encoding algorithm Double Metaphone (DM) <ref type="bibr" target="#b19">[20]</ref> is used to perform a first approximate term search.</p><p>The DM phonetic encoding algorithm is the second generation of the Metaphone algorithm. It is designed primarily to encode American English names while taking into account the fact that such words can have more than one acceptable pronunciation. Double Metaphone can compute a primary and a secondary encoding for a given word or name to indicate both the most likely pronunciation and an optional alternative pronunciation (hence the "double" in the name). 
DM tries to account for myriad irregularities in English as well as Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other languages. Though powerful, DM has limitations and drawbacks. It was designed for searching lists of proper names rather than large amounts of text, and it may not match grossly misspelled words whose phonetic structure is seriously altered. Despite its limitations, the DM algorithm, which is free to use and open source, remains a flexible and powerful phonetic encoding system, especially in a multilingual approach. First, CIM-IND computes the DM encoding of each word in the normalized phrase. Then, ICD10 term candidates with matching DM encodings are retrieved. This step quickly provides a list of relevant ICD10 term candidates, so that the time-consuming analyses of the final step are performed on a reduced set of terms. To this end, our system relies on a database that stores the pre-computed DM encoding of each word available in each ICD10 version dictionary.</p><p>Candidate ranking. Finally, a Weighted Distance Score (WDS) algorithm has been developed to rank the list of candidate terms. The WDS algorithm returns a similarity score scaled from 0 to 100 for each candidate, 100 representing a perfect match. The term with the highest score is retained as the matching ICD10 term. As a phrase can contain one or several ICD10 terms, two cases are considered. First, if the length of the candidate sequence s1 is similar to the length of the processed line s2 (i.e., only one ICD10 term is expected), two scores are computed: (i) a base score (BS) and (ii) a set score (SeS). The BS is computed by determining the Levenshtein distance between the sequences s1 and s2, scaled from 0 to 100. The SeS finds all alphanumeric tokens in each string and treats them as a set. Then two strings are constructed by concatenating, on the one hand, the sorted intersection and, on the other hand, the sorted remainder. 
Then, the distances between these strings are computed, which accounts for unordered partial matches.</p><p>Otherwise, if one of the sequences is 1.5 times longer than the other, two partial scores are computed: (i) a partial base score (PBS) and (ii) a partial set score (PSeS). The PBS returns the distance of the most similar substring as a number between 0 and 100. First, each block representing a sequence of matching characters in a string is determined. Then, the best partial match is the one aligning with at least one of those blocks. The PSeS computes the PBS for each string built from the sorted intersection and the sorted remainder of s1 and s2. To ensure that only full results can return a perfect match, partial scores are scaled based on the lengths of s1 and s2. All set scores are scaled by 0.95. Finally, the WDS is determined as the highest of these scores.</p><p>Figure <ref type="figure">1</ref> gives an example of processing French texts with CIM-IND. The seventh field contains the text to annotate, the eleventh the ICD10 dictionary entry matching the text and the last field the corresponding ICD10 code. Similarly, Figure <ref type="figure">2</ref> gives an example of processing English texts with CIM-IND.</p><p>For example, in Figure <ref type="figure">1</ref>, lines 1-2 contain the misspelled word "glisement" (for French "glissement") and lines 3-4 contain the misspelled word "héúorragie" (for French "hémorragie"). The first error is handled correctly by the DM algorithm, which produces the same encoding for both the misspelled and the correct word. However, the second error is not handled properly: as the misspelling profoundly alters the phonetics of the word, the DM algorithm produces a different encoding than for the correct word. This highlights the importance of spell checking the normalized text before the DM processing, to correct grossly misspelled words and so secure a proper list of candidates.</p><p>Regarding execution time, CIM-IND processes a line in 50 to 300 ms depending on its length.</p></div>
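<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ranking step for the similar-length case can be illustrated with a minimal Python sketch. It is not the CIM-IND implementation: the function names are hypothetical, the standard-library difflib.SequenceMatcher ratio is used as a stand-in for the scaled Levenshtein distance, and the partial scores (PBS and PSeS) as well as the phonetic candidate selection are left out.</p>
```python
import re
from difflib import SequenceMatcher


def base_score(s1: str, s2: str) -> float:
    """Similarity on a 0-100 scale (stand-in for a scaled Levenshtein distance)."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def set_score(s1: str, s2: str) -> float:
    """Order-insensitive score built from the sorted intersection and remainders."""
    tokens1 = set(re.findall(r"\w+", s1.lower()))
    tokens2 = set(re.findall(r"\w+", s2.lower()))
    common = " ".join(sorted(tokens1.intersection(tokens2)))
    full1 = (common + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    full2 = (common + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    return max(
        base_score(common, full1),
        base_score(common, full2),
        base_score(full1, full2),
    )


def weighted_distance_score(candidate: str, line: str) -> float:
    """Highest of the base score and the set score scaled by 0.95."""
    return max(
        base_score(candidate.lower(), line.lower()),
        0.95 * set_score(candidate, line),
    )


print(weighted_distance_score("senile dementia", "senile dementia"))
print(weighted_distance_score("dementia senile", "senile dementia"))
print(weighted_distance_score("glissement", "glisement"))
```
<p>Scaling the set score by 0.95 means that a candidate whose words appear in a different order than in the line can score at most 95, so only an exact, ordered match reaches 100.</p></div>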
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">French CépiDC datasets</head><p>CIM-IND was run on both French test sets and one run was submitted for each of these datasets. Table <ref type="table" target="#tab_2">1</ref> shows the results obtained on the raw dataset together with the average and median performance scores of the runs of all task participants. Table <ref type="table" target="#tab_3">2</ref> shows the results obtained on the aligned dataset.</p><p>On the raw dataset, CIM-IND achieved a precision of 0.8568 and a recall of 0.6886 (F1 = 0.7636) for all ICD10 codes. Regarding only ICD10 codes corresponding to external causes (meaning violent deaths), CIM-IND achieved a substantially lower performance with a precision of 0.567 and a recall of 0.431 (F1 = 0.4897).</p><p>On the aligned dataset, CIM-IND achieved a precision of 0.8346 and a recall of 0.7751 (F1 = 0.8038) for all ICD10 codes. Regarding only ICD10 codes corresponding to external causes, CIM-IND again achieved a lower performance with a precision of 0.5343 and a recall of 0.4717 (F1 = 0.5011).</p><p>Since the main difference between these two datasets was related to formatting, similar results were expected. However, the recall obtained on the aligned dataset is notably higher than on the raw dataset. It should also be noted that performance is considerably lower for ICD10 codes related to external causes on both test sets. Overall, our performance results are considerably better than the average and median scores of all submitted runs.</p></div>
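<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a reading aid, the F1 values reported above can be related to the precision and recall through the usual harmonic mean, F1 = 2PR / (P + R). The short Python sketch below is not the official evaluation script of the task; the function name f_measure is hypothetical.</p>
```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for all ICD10 codes on the two French test sets.
print(round(f_measure(0.8568, 0.6886), 4))  # raw dataset
print(round(f_measure(0.8346, 0.7751), 4))  # aligned dataset
```
<p>Because the published precision and recall are themselves rounded to four decimals, the recomputed values can differ from the reported F1 in the last digit; for the raw dataset, for instance, the recomputation gives approximately 0.7635 against a reported 0.7636.</p></div>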
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">English CDC dataset</head><p>One run was submitted for the English CDC set. Table <ref type="table" target="#tab_4">3</ref> shows the results obtained on this dataset together with the average and median performance scores of the runs of all task participants. CIM-IND achieved a precision of 0.8393 and a recall of 0.7827 (F1 = 0.81) for all ICD10 codes. Regarding only ICD10 codes corresponding to external causes, CIM-IND achieved a lower performance with a precision of 0.4261 and a recall of 0.3889 (F1 = 0.4066).</p><p>Regarding all ICD10 codes, these results are slightly better than the results obtained with the French raw dataset and remarkably similar to those obtained with the aligned dataset. Again, there is a marked performance drop for ICD10 codes related to external causes. In this case, results are lower than those obtained on both French datasets, for both precision and recall. Overall, in both evaluations, our results are higher than the average and median scores of all submitted runs. However, some aspects of our results should be investigated. Although CIM-IND achieved satisfactory results, we noticed that some errors due to disambiguation, misspellings and inconsistencies remain. In particular, significant misspellings occurring in words that are not part of the spelling dictionary result in an incorrect DM encoding, and so in an improper list of candidate terms.</p><p>In English text, our results could be slightly improved with a more complete terminology or a larger training set to cover some missing terms, especially abbreviations. Moreover, the performance drop for ICD10 codes related to external causes should be investigated and seems to affect all submitted runs. External causes present a specific context and often a specific terminology related to accidents, violent deaths or treatment-induced overdoses. They occur more rarely in the training sets. 
In fact, only 2,440 lines in the French training set (110,869 lines) and 313 lines in the English training set (39,333 lines) appear to be related to external causes (ICD10 codes V01 to Y98). This can explain the reduced performance to some extent. Also, in some cases, the ICD10 codes associated with a given line use the context provided in other lines of the same death certificate. CIM-IND processes each line independently and was therefore unable to annotate such lines properly.</p><p>The main conclusion of this work and of the results obtained is that improvements are still needed, first in the processing of the given terminologies and of disambiguation-related issues, and second in the recognition and processing of spelling errors. We plan to investigate these two aspects further and to participate in other challenges in the future to keep track of our developments.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Annotation file in CSV containing ICD10 concepts extracted with CIM-IND in French</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Fig. 2.</head><label>2</label><figDesc>Annotation file in CSV containing ICD10 concepts extracted with CIM-IND in English</figDesc><table><row><cell>13496;2015;;;;6;Senile dementia of Alzheimer's type ASHD;;;;senile dementia;F03</cell></row><row><cell>13496;2015;;;;6;Senile dementia of Alzheimer's type ASHD;;;;alzheimer;G309</cell></row><row><cell>13496;2015;;;;6;Senile dementia of Alzheimer's type ASHD;;;;ashd;I251</cell></row><row><cell>16915;2015;;;;2;HEALTHCAREASSOCIATED PNEUMONIA;;;;healthcare-associated pneumonia;J189</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 .</head><label>1</label><figDesc>ICD10 coding performance on the French CépiDC raw test dataset</figDesc><table><row><cell></cell><cell cols="3">All causes</cell><cell cols="3">External causes</cell></row><row><cell></cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell></row><row><cell>SIBM-run1</cell><cell>0.8568</cell><cell>0.6886</cell><cell>0.7636</cell><cell>0.5670</cell><cell>0.4310</cell><cell>0.4897</cell></row><row><cell>average</cell><cell>0.4747</cell><cell>0.3583</cell><cell>0.4059</cell><cell>0.3668</cell><cell>0.2474</cell><cell>0.2921</cell></row><row><cell>median</cell><cell>0.5411</cell><cell>0.4136</cell><cell>0.5080</cell><cell>0.4431</cell><cell>0.2834</cell><cell>0.3764</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 .</head><label>2</label><figDesc>ICD10 coding performance on the French CépiDC aligned test dataset</figDesc><table><row><cell></cell><cell cols="3">All causes</cell><cell cols="3">External causes</cell></row><row><cell></cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell></row><row><cell>SIBM-run1</cell><cell>0.8346</cell><cell>0.7751</cell><cell>0.8038</cell><cell>0.5343</cell><cell>0.4717</cell><cell>0.5011</cell></row><row><cell>average</cell><cell>0.6479</cell><cell>0.5555</cell><cell>0.5933</cell><cell>0.5051</cell><cell>0.3109</cell><cell>0.3663</cell></row><row><cell>median</cell><cell>0.6288</cell><cell>0.5396</cell><cell>0.5484</cell><cell>0.5080</cell><cell>0.3330</cell><cell>0.4056</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The development of CIM-IND started last year and the system was evaluated in the corresponding CLEF eHealth 2016 task, on one French corpus only. In 2016, CIM-IND obtained an F1 score of 0.6795, which was slightly below the average results <ref type="bibr" target="#b14">[15]</ref>. Since then, various improvements have been developed, especially concerning the ranking of ICD10 term candidates and CIM-IND's ability to deal with free text inconsistencies. This year's results have demonstrated these improvements with a 12% increase in F1 score on the French raw dataset and an 18% increase in F1 score on the French aligned dataset. Moreover, this year's challenge demonstrated that CIM-IND performed broadly as well in both English and French, achieving above-average results in both languages.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 .</head><label>3</label><figDesc>ICD10 coding performance on the English CDC test dataset</figDesc><table><row><cell></cell><cell cols="3">All causes</cell><cell cols="3">External causes</cell></row><row><cell></cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell></row><row><cell>SIBM-run1</cell><cell>0.8393</cell><cell>0.7827</cell><cell>0.8100</cell><cell>0.4261</cell><cell>0.3889</cell><cell>0.4066</cell></row><row><cell>average</cell><cell>0.6549</cell><cell>0.5586</cell><cell>0.6017</cell><cell>0.3986</cell><cell>0.2749</cell><cell>0.2549</cell></row><row><cell>median</cell><cell>0.6459</cell><cell>0.5267</cell><cell>0.5892</cell><cell>0.2791</cell><cell>0.2619</cell><cell>0.2740</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Analysis of named entity recognition and linking for tweets</title>
		<author>
			<persName><forename type="first">L</forename><surname>Derczynski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rizzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gorrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Petrak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="32" to="49" />
			<date type="published" when="2015-03">March 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Feature-enriched word embeddings for named entity recognition in open-domain conversations</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bigot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Khan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="6055" to="6059" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">12 years on - Is the NLM medical text indexer still useful and relevant?</title>
		<author>
			<persName><forename type="first">J</forename><surname>Mork</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aronson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Demner-Fushman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of biomedical semantics</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">8</biblScope>
			<date type="published" when="2017-02">February 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Automated misspelling detection and correction in clinical free-text records</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">H</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Topaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">R</forename><surname>Goss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of biomedical informatics</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="188" to="195" />
			<date type="published" when="2015-06">June 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Context-aware correction of spelling errors in Hungarian medical documents</title>
		<author>
			<persName><forename type="first">B</forename><surname>Siklósi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Novák</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Prószéky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="219" to="233" />
			<date type="published" when="2016-01">January 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Challenges of Medical Text and Image Processing: Machine Learning Approaches</title>
		<author>
			<persName><forename type="first">E</forename><surname>Menasalvas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gonzalo-Martin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning for Health Informatics</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="221" to="242" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A search tool based on &apos;encapsulated&apos; MeSH thesaurus to retrieve quality health resources on the internet</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Darmoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Leroy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Douyère</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lacoste</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Godard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Rigolle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brisou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Videau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Goupy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Piot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Quéré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ouazir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Abdulrab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Medical informatics and the Internet in medicine</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="165" to="178" />
			<date type="published" when="2001-07">July 2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Automatic indexing of online health resources for a French quality controlled gateway</title>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rogozan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Darmoni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="695" to="709" />
			<date type="published" when="2006-05">May 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Improving information retrieval with multiple health terminologies in a quality-controlled gateway</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Soualmia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sakji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Letord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rollin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Massari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Darmoni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Health Information Science and Systems</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">8</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Indexing biomedical documents with a possibilistic network</title>
		<author>
			<persName><forename type="first">W</forename><surname>Chebil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Soualmia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Omri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Darmoni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">JASIST</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="928" to="941" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Retrieving Clinical and Omic Data from Electronic Health Records</title>
		<author>
			<persName><forename type="first">C</forename><surname>Cabot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lelong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Grosjean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Soualmia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Darmoni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Stud Health Technol Inform</title>
		<imprint>
			<biblScope unit="volume">221</biblScope>
			<biblScope unit="page">115</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Semantic Search Engine to Query into Electronic Health Records with a Multiple-Layer Query Language</title>
		<author>
			<persName><forename type="first">R</forename><surname>Lelong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cabot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Soualmia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd SIGIR workshop on Medical Information Retrieval (MedIR)</title>
				<meeting>the 2nd SIGIR workshop on Medical Information Retrieval (MedIR)</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Overview of the CLEF eHealth Evaluation Lab 2016</title>
		<author>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Suominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Palotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zuccon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016-09">September 2016</date>
			<biblScope unit="page" from="255" to="266" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Clinical information extraction at the CLEF eHealth evaluation lab 2016</title>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CLEF 2016 Evaluation Labs and Workshop: Online Working Notes</title>
				<meeting>CLEF 2016 Evaluation Labs and Workshop: Online Working Notes</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">SIBM at CLEF eHealth Evaluation Lab 2016: Extracting Concepts in French Medical Texts with ECMT and CIMIND</title>
		<author>
			<persName><forename type="first">C</forename><surname>Cabot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Soualmia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dahamna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Darmoni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR-WS Working Notes of the Conference and Labs of the Evaluation Forum CLEF</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="47" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">CLEF 2017 eHealth Evaluation Lab Overview</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Suominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Robert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Spijker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Palotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zuccon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF - 8th Conference and Labs of the Evaluation Forum</title>
		<title level="s">Lecture Notes in Computer Science LNCS</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017-09">September 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">CLEF eHealth 2017 Multilingual Information Extraction task overview: ICD10 coding of death certificates in English and French</title>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">N</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">B</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Grouin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavergne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Robert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rondet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zweigenbaum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF Evaluation Labs and Workshop Online Working Notes</title>
				<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2017-09">September 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Certification et codification des causes médicales de décès</title>
		<author>
			<persName><forename type="first">G</forename><surname>Pavillon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Laurent</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bulletin épidémiologique hebdomadaire</title>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Natural Language Processing with Python</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bird</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Loper</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>O&apos;Reilly Media, Inc</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">The double metaphone search algorithm</title>
		<author>
			<persName><forename type="first">L</forename><surname>Philips</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">C/C++ Users Journal</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="38" to="43" />
			<date type="published" when="2000-06">June 2000</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
