<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-lingual ICD-10 coding using a hybrid rule-based and supervised classification approach at CLEF eHealth 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes">
          <string-name>Jurica Ševa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Madeleine Kittner</string-name>
          <email>kittner@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roland Roller</string-name>
          <email>roland.roller@dfki.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulf Leser</string-name>
          <email>leser@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deutsches Forschungszentrum für Künstliche Intelligenz, Language Technology</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Humboldt Universität zu Berlin</institution>
          ,
          <addr-line>Knowledge management in Bioinformatics, Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present our research efforts and obtained results within the CLEF eHealth challenge 2017, Track 1. The task involves the recognition and mapping of ICD-10 codes to English and French death certificates. Our approach proposes a two tier, two stage process. First, we use a rule-based system, based on handcrafted rules and the use of Apache Solr, to perform ICD-10 code Named Entity Recognition (NER). This step produces a set of possible candidates extracted from the input text. Next, we use tf-idf weighted character n-gram classification models to normalize and rank a previously generated ICD-10 candidate set. Classification models used are generated and follow the hierarchical structure of the given ICD-10 dictionaries, by creating individual classification models for the first two hierarchical levels (chapters and blocks). Finally, the top candidate, generated from the overlap between the list of possible ICD-10 code candidates (input list) and ranked list of final ICD-10 candidates (output list), is taken as the final ICD-10 code. Although the ICD-10 candidate NER is language-dependent, the normalization and ranking of candidates utilizes a language independent approach.</p>
      </abstract>
      <kwd-group>
        <kwd>ICD-10 codes</kwd>
        <kwd>Multilingual Candidates Ranking</kwd>
        <kwd>Language-independent Information Extraction</kwd>
        <kwd>Language-independent Information Retrieval</kwd>
        <kwd>Hierarchical Document Classification</kwd>
        <kwd>Named Entity Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years we have witnessed significant advances in automated natural language
processing research. This progress was partly stimulated by the growing availability of gold
standard corpora, which form the foundation of such research. Research
in biomedical text mining (BTM) has been less fortunate, especially in the
domain of automatic analysis of electronic health (eHealth) records, primarily due
to privacy issues and concerns linked with such documents. The CLEF eHealth competition
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], through various organized tasks [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ], circumvents these restrictions by providing
gold standard data sets/corpora. Its main focus is on creating automatic pipelines for
extracting valuable information from eHealth documents.
      </p>
      <p>
        The CLEF eHealth 2017 Task 1 (https://sites.google.com/site/clefehealth2017/task-1) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is an extension of the CLEF eHealth
2016 Task 2 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The goal was to develop a multilingual approach for the
extraction of ICD-10 codes from written text. In particular, participants were asked to
assign codes from the International Classification of Diseases, version 10 (ICD-10,
http://www.who.int/classifications/icd/en/), to French and English death certificates. Participants were
encouraged to explore multilingual approaches/models rather than language-dependent models. For both
languages, customized dictionaries of ICD-10 codes and related annotations were
provided by the organizers, although the use of other resources was not excluded. The task had to be
performed fully automatically.
      </p>
      <p>
        In 2016, the CLEF eHealth ICD-10 coding task covered French death
certificates only. Participating teams used different rule-based and machine-learning
approaches. Ho-Dac et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], for instance, used a CRF with various features, combined
with a rule-based system, in order to identify more complex entities. Other participants
used machine-learning approaches such as labeled LDA, SVM, and Naive Bayes [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
or treated the task as an information retrieval problem using tf-idf models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Van Mulligen
et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the best performing team, extended the terminology with code-term
combinations annotated in the training corpus and used a rule-based approach for indexing.
Additionally, they post-processed initial annotations using precision scores derived
from the training data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>We approached this year's task as a two-stage process, combining NER and
document classification to generate the final ICD-10 code. In particular, dictionary-based
indexing through Apache Solr (http://lucene.apache.org/solr/) was used for Named Entity Recognition, and
document classification for candidate normalization/ranking. Indexing is based on exact
and fuzzy dictionary lookup, thus providing potential candidates for a term sequence.
The focus of this step is to increase Recall (R) by providing a list
of potential candidates. Candidate normalization and ranking, through trained
classification models, is then applied to rank the list of potential candidates. The focus of this
step is to increase Precision (P).</p>
      <p>
        Similar to our approach, Zweigenbaum and Lavergne [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] also divided the task into
two steps in 2016: i) generating candidate ICD-10 codes and ii) re-ranking the candidates.
While their approach uses tf-idf models for both parts, we use a rule-based system to
generate candidate ICD-10 codes. In the second part of our pipeline, models
are trained following the ICD-10 hierarchy and thus include information about dictionary
chapters and blocks.
      </p>
      <p>In the following we describe our system and its evaluation on training and test data.
Compared to all participating systems, our results are well above average on the
French test data, but only average on the English test data.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>Corpora</title>
        <p>In this section we describe the corpora, the terminologies used, candidate generation by
indexing, and candidate ranking using classification.</p>
        <p>The French data set is the CépiDC Causes of Death corpus. The corpus contains free
text descriptions of causes of death as reported in the standardized causes of death
forms. Documents are manually annotated with ICD-10 codes by medical experts. Each
document can contain several lines while each line can contain multiple causes and
therefore multiple ICD-10 code annotations. Additionally, year of coding, patient age,
gender, location of death, and time the patient had been suffering from the coded cause
are provided for each document. The English corpus is structured similarly but is provided
in a different format; its origin is not stated by the challenge organizers.</p>
        <p>Both corpora mostly contain short phrases of only a few words rather than well-formed
sentences, which is common for medical text and a challenge for any NER or Named Entity
Normalization (NEN) task. The majority of death certificate lines (about
60%) consist of two to four tokens in the English corpus and two to five tokens in the
French corpus. Consequently, as there is almost no context available, the applicability of
machine-learned models is limited.</p>
        <p>The French training set contains 65,843 death certificates from 2006 to 2012 with
264,334 annotated ICD-10 codes. The French test set contains 31,682 documents from
2014 and 2015. The English set is much smaller, consisting of 13,329 death certificates
from 2015 with 38,908 annotated ICD-10 codes for training, and 6,665 documents for
testing.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Terminologies</title>
        <p>The organizers provided custom terminologies for both languages. For French, six
dictionaries are available, related to different years of coding (2006-2015), each providing
ICD-10 codes and related terms. Roughly 15% of the terms collected in all dictionaries
link to multiple ICD-10 codes, with no correlation to the year of coding; clearly,
depending on the context, different ICD-10 codes have been applied. In the provided
English terminology, on the other hand, each unique term almost always links to a unique
ICD-10 code. For supervised classification we used the hierarchy within the ICD-10
terminology as provided for French and English (see
http://www.who.int/classifications/icd/icdonlineversions/en/ and
http://apps.who.int/classifications/icd10/browse/2016/en). The terminology consists of
22 chapters which are divided into blocks and further into classes and subclasses. For
instance, Chapter VI: Diseases of the nervous system contains the block Inflammatory
diseases of the central nervous system, which includes ICD-10 codes G00-G09. The
class G00: Bacterial meningitis, not elsewhere classified within this block can be
further divided into ICD-10 codes like G00.2: Streptococcal meningitis. In Section 2.4 we
explain how this hierarchy is used to train classifiers for ranking candidate terms.</p>
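        <p>To make the two-level hierarchy concrete, the following is a minimal sketch (our
illustration, not the authors' code) of resolving an ICD-10 code to its chapter and block.
The mapping below holds a single illustrative block from Chapter VI; a full mapping would
be built from the WHO terminology linked above.</p>

```python
# Illustrative block ranges for Chapter VI only; the full mapping would be
# built from the WHO ICD-10 terminology.
BLOCKS = {
    ("G00", "G09"): ("VI", "Inflammatory diseases of the central nervous system"),
}

def resolve(code):
    """Map an ICD-10 code such as 'G00.2' to its (chapter, block) pair."""
    category = code.split(".")[0]          # 'G00.2' -> category 'G00'
    for (start, end), loc in BLOCKS.items():
        # lexicographic comparison is safe within a single letter prefix
        if category >= start and end >= category:
            return loc
    return None

print(resolve("G00.2"))
```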
      </sec>
      <sec id="sec-2-3">
        <title>Candidate generation</title>
        <p>To align ICD-10 codes to death certificates, our system applies two methods:
1. ICD-10 code recognition focusing on high R measure values and
2. candidate normalization and ranking to improve P measure values.</p>
        <p>ICD-10 candidates are generated, from the input text, based on dictionary look-up
and fuzzy search. For both languages, customized dictionaries provided by the
organizers are used. Preprocessing of documents and dictionaries has been applied to increase
the probability to match the correct concepts. It includes
– conversion to lower case characters;
– removal of punctuation and
– conversion of special characters.</p>
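        <p>The preprocessing steps above can be sketched as follows (our illustration; the exact
normalization rules, e.g. how special characters are converted, are assumptions based on the
listed steps):</p>

```python
import string
import unicodedata

def preprocess(text):
    """Sketch of the listed preprocessing: lower-casing, folding of special
    (accented) characters to their ASCII base, and punctuation removal."""
    text = text.lower()
    # decompose accented characters and drop the combining marks, e.g. 'é' -> 'e'
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(preprocess("Méningite à streptocoques, aiguë"))
```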
        <p>NER follows a stepwise matching strategy. All possible n-grams (n ≤ 5) of an input
text are compared to the dictionary by exact match. If no exact match is found,
fuzzy matching is applied using Apache Solr. We allow an edit distance of 1 for each
token longer than five characters. Multi-token terms are queried using an AND-query.
Solr results are ranked such that the first result contains most of the search tokens, and
only the top 10 Solr results are exported to the candidate list. Overlapping sequences are
removed from the candidate list by keeping only the longest matching sequences, which
slightly decreases the number of candidates. The resulting list of candidates has high
recall but low precision. The following step aims at increasing precision while
keeping a similar level of recall.</p>
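        <p>The fuzzy fallback query can be sketched as below. The field name and query string
are our assumptions, not the authors' Solr configuration; what the sketch shows is the rule
stated above: an edit distance of 1 (Solr's <code>~1</code> operator) for tokens longer than
five characters, with all tokens combined by AND.</p>

```python
def build_fuzzy_query(term, field="label"):
    """Build a Solr query string for one multi-token term: tokens longer than
    five characters get fuzzy matching (edit distance 1), shorter tokens must
    match exactly, and all clauses are AND-joined."""
    clauses = []
    for token in term.split():
        if len(token) > 5:
            clauses.append(f"{field}:{token}~1")  # fuzzy, edit distance 1
        else:
            clauses.append(f"{field}:{token}")    # short token: exact only
    return " AND ".join(clauses)

print(build_fuzzy_query("acute myocardial infarction"))
```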
      </sec>
      <sec id="sec-2-4">
        <title>Candidate normalization and ranking</title>
        <p>This step normalizes and ranks the output of Section 2.3 to a
single ICD-10 code. For this we used supervised document classification. Unlike the NER
process, here we developed a language-independent approach. The following
classification models were considered during model selection
and optimization: Decision Tree Classifier, Random Forest Classifier, Stochastic
Gradient Descent Classifier and Linear Support Vector Classifier. The classification models
are based on the content of the first two hierarchical levels of the ICD-10 dictionaries
(chapters and blocks) for French and English. Altogether, this pipeline uses 23 different
classification models:
1. a single general classification model which classifies the input text into one of the 22
ICD-10 chapters, and
2. 22 chapter classification models which classify and rank the input text into blocks
belonging to the respective ICD-10 chapter.</p>
        <p>The normalization and ranking process, shown in Figure 1, is performed in two
stages, reflecting the (shallow) hierarchical structure of the ICD-10
dictionaries used to train the classification models mentioned above. For each input text,
the process performs the following steps:
1. The input text is assigned a chapter classification score, ChapterCS_i, for each
of the 22 ICD-10 chapters, Chapter_i;
2. The input text is classified against each block label, Block_j, in the respective
chapter model and assigned a classification score, BlockCS_j;
3. A ranking score, RS_x, is calculated as the product of ChapterCS_i and BlockCS_j
for each pair (Chapter_i, Block_j), i.e. for each possible ICD-10 candidate label, L_x;
4. The list of ICD-10 codes, LS_ranked, is sorted in descending order of
ranking score, RS_x, giving pairs (L_x, RS_x);
5. The overlap list, LS_overlap, between the ICD-10 candidate list received as output
from Section 2.3 and the ranked list LS_ranked is calculated;
6. The top-ranked ICD-10 candidate from LS_overlap is selected as the final ICD-10
code for the input text.</p>
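        <p>The six steps above can be sketched as follows. The score dictionaries stand in for
the chapter and block classifier outputs; their names and shapes are our illustration, not
the authors' data structures.</p>

```python
def rank_candidates(chapter_scores, block_scores, candidates):
    """Rank candidate labels following steps 3-6 above."""
    ranked = sorted(
        ((label, chapter_scores[chap] * bs)    # step 3: RS_x = ChapterCS_i * BlockCS_j
         for (chap, label), bs in block_scores.items()),
        key=lambda pair: pair[1],
        reverse=True,                          # step 4: LS_ranked, descending by RS_x
    )
    overlap = [(label, rs) for label, rs in ranked if label in candidates]  # step 5
    return overlap[0][0] if overlap else None  # step 6: top of LS_overlap

chapter_scores = {"VI": 0.7, "IX": 0.3}                          # step 1 output (illustrative)
block_scores = {("VI", "G00-G09"): 0.9, ("IX", "I20-I25"): 0.8}  # step 2 output (illustrative)
print(rank_candidates(chapter_scores, block_scores, {"I20-I25"}))
```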
        <p>Based on the type of text and the amount of characters available in the training
data for each chapter or block label, character-level n-gram features (with n between 2
and 5) were used for building the classification models. Extracted features were
weighted with the tf-idf scheme, producing a more distinctive feature set.
The tf-idf values were then normalized with the L2 norm, and feature selection
based on the chi2 test, keeping the top 10% of possible features, was performed. For
each of the 23 classification models, model selection and hyper-parameter optimization
with randomized search and 10-fold cross validation was performed, reducing the risk
of overfitting.</p>
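        <p>The feature pipeline described above maps naturally onto scikit-learn (our
illustration; the paper does not name its toolkit, and the hyper-parameter grid below is a
placeholder): character 2-5-grams, L2-normalized tf-idf, chi2 selection of the top 10% of
features, and one of the candidate classifiers tuned by randomized search.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# One chapter/block model: char 2-5-gram tf-idf (L2 norm), chi2 top-10%
# feature selection, linear SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 5), norm="l2")),
    ("select", SelectPercentile(chi2, percentile=10)),
    ("clf", LinearSVC()),
])

# Randomized search with 10-fold CV; the parameter grid is a placeholder.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    cv=10,
)
```

        <p>One such pipeline is fitted per classification model, i.e. once for the chapter level
and once per chapter for the block level.</p>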
        <p>[Table 1: number of final models (#models) per selected classifier type:
SVM_LinearSVC, RandomForestClassifier, LogisticRegression.]</p>
        <p>An overview of the final models, selected by best classification score, and their
occurrence counts is given in Table 1. Average P, R and F values across all classification
models, for the two hierarchical levels used, are given in Table 2.</p>
        <p>Level    P        R        F
Chapter  0.880237 0.890911 0.884852
Block    0.920025 0.911876 0.913499
Table 2. Average classification model performance across the ICD-10 dictionary
hierarchical levels.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results &amp; Discussion</title>
      <p>We applied our system to both language sets. Results on the French test set are well
above the average of all participating systems, while results on the English test
data show only average performance. Results for training and test data, and the performance
of the individual parts of our system, are shown in Table 3. The rule-based NER part,
referred to as candidate generation and explained in detail in Section 2.3, focuses on R. For
the French data sets, candidate generation reaches an R value of 0.860 for training and
0.844 for test data, while P is, as expected, low. After candidate ranking (explained in
Section 2.4) using the classifiers built on the ICD-10 hierarchy, the R value drops by 0.09
but the P value increases to 0.774 for training and 0.800 for test data. For the English data
sets we see a similar trend but an overall lower performance. Candidate generation only
reaches an R value of 0.76 for training and test sets. Again, after candidate ranking the R
value drops, here by 0.16, while the P value increases up to 0.61.</p>
      <p>[Table 3. P, R and F measures of candidate generation and candidate ranking on the
French and English data sets, together with the average and median scores over all
participating systems.]</p>
      <p>The different performance on French and English data may result from
differences between the datasets. For instance, we did not handle abbreviations or
resolve coordinated clauses. While both are present in both language sets, we have the
impression that the English data contains more abbreviations, which could explain the
poorer performance of the system on the English set. In general, spell checking may improve
the overall performance of both systems. Additionally, candidate generation may be
improved by taking context information into account.</p>
      <p>
        As far as candidate normalization and ranking is concerned, there are several
possibilities to improve the results. For instance, the current approach, based on
optimized language-independent ML models and character-level n-grams, ignores other
features available in the training data (e.g. sex, age, location). Including
such additional data in the classification models would be an interesting next step. One
could also exploit the entire hierarchical structure of ICD-10: our ML models used only the
first two hierarchical levels of the ICD-10 dictionaries. We also experimented with a more
fine-grained classification by creating models below the second level of the ICD-10
taxonomy. Unfortunately, those approaches failed to produce satisfactory results, which can
be attributed to the lack of sufficient training data for all
possible labels in the taxonomy. We also tested more complex features such as
word embeddings, which did not yield satisfactory results either. This can be explained by
the fact that we used available models not trained on in-domain documents. Using
in-language and in-domain documents to produce word embeddings could be expected to
perform considerably better. Even though the domain and language are slightly different
and the available corpora are small, one could try training word embeddings on available
biomedical French and English corpora such as Quaero [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], EMEA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or Mantra [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>EMEA corpus</article-title>
          . http://opus.lingfil.uu.se/EMEA.php.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>Mantra corpus</article-title>
          . http://biosemantics.org/index.php/resources/mantra-gsc.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <article-title>Quaero corpus</article-title>
          . https://quaerofrenchmed.limsi.fr/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Mohammed</given-names>
            <surname>Dermouche</surname>
          </string-name>
          , Vincent Looten, Rémy Flicoteaux, Sylvie Chevret, Julien Velcin, and
          <string-name>
            <given-names>Namik</given-names>
            <surname>Taright</surname>
          </string-name>
          .
          <article-title>ECSTRA-INSERM@ CLEF eHealth2016-task 2: ICD10 code extraction from death certificates</article-title>
          .
          <source>CLEF</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Liadh Kelly, Hanna Suominen, Leif Hanlen, Aurélie Névéol, Cyril Grouin, João Palotti, and
          <string-name>
            <given-names>Guido</given-names>
            <surname>Zuccon</surname>
          </string-name>
          .
          <source>Overview of the CLEF eHealth Evaluation Lab</source>
          <year>2015</year>
          , pages
          <fpage>429</fpage>
          -
          <lpage>443</lpage>
          . Springer International Publishing,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Liadh Kelly, Hanna Suominen, Aurélie Névéol, Aude Robert, Evangelos Kanoulas, Rene Spijker, João Palotti, and
          <string-name>
            <given-names>Guido</given-names>
            <surname>Zuccon</surname>
          </string-name>
          .
          <article-title>CLEF 2017 eHealth Evaluation Lab Overview</article-title>
          .
          <source>CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Lydia-Mai</given-names>
            <surname>Ho-Dac</surname>
          </string-name>
          , Ludovic Tanguy, Céline Grauby, Nkauj Hnub Aurore Heu Mby, Justine Malosse, Laura Rivière, Amélie Veltz-Mauclair, and Marine Wauquier.
          <article-title>LITL at CLEF eHealth2016: recognizing entities in french biomedical documents</article-title>
          .
          <source>Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum</source>
          , Évora, Portugal, 5-8 September 2016, pages
          <fpage>81</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Liadh</given-names>
            <surname>Kelly</surname>
          </string-name>
          , Lorraine Goeuriot, Hanna Suominen, Aurélie Névéol, João Palotti, and
          <string-name>
            <given-names>Guido</given-names>
            <surname>Zuccon</surname>
          </string-name>
          .
          <source>Overview of the CLEF eHealth Evaluation Lab</source>
          <year>2016</year>
          , pages
          <fpage>255</fpage>
          -
          <lpage>266</lpage>
          . Springer International Publishing,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , Lorraine Goeuriot, Liadh Kelly, Kevin Cohen, Cyril Grouin, Thierry Hamon, Thomas Lavergne, Grégoire Rey, Aude Robert,
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Tannier</surname>
          </string-name>
          , et al.
          <article-title>Clinical information extraction at the CLEF eHealth evaluation lab 2016</article-title>
          .
          <source>Proceedings of CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , Robert N. Anderson, K. Bretonnel Cohen, Cyril Grouin, Thomas Lavergne, Grégoire Rey, Aude Robert, Claire Rondet, and Pierre Zweigenbaum.
          <article-title>CLEF eHealth 2017 Multilingual Information Extraction task overview: ICD10 coding of death certificates in English and French</article-title>
          .
          <source>CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E</given-names>
            <surname>Van Mulligen</surname>
          </string-name>
          , Zubair Afzal, Saber A Akhondi, Dang Vo, and Jan A Kors.
          <article-title>Erasmus MC at CLEF eHealth 2016: Concept recognition and coding in French texts</article-title>
          .
          <source>CLEF</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Lavergne</surname>
          </string-name>
          .
          <article-title>LIMSI ICD10 coding experiments on CépiDC death certificate statements</article-title>
          .
          <source>CLEF</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>