<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IAM at CLEF eHealth 2018: Concept Annotation and Coding in French Death Certificates</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sébastien Cossin</string-name>
          <email>sebastien.cossin@u-bordeaux.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vianney Jouhet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fleur Mougin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gayo Diallo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frantz Thiessard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CHU de Bordeaux, Pôle de santé publique, Service d'information médicale, Informatique et Archivistique Médicales (IAM)</institution>
          ,
          <addr-line>F-33000 Bordeaux</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center</institution>
          ,
          <addr-line>team ERIAS, UMR 1219, F-33000 Bordeaux</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our approach and results for Task 1 (multilingual information extraction) of the CLEF eHealth 2018 challenge. We addressed the task of automatically assigning ICD-10 codes to French death certificates. We used a dictionary-based approach that relied on materials provided by the task organizers. The terms of the ICD-10 terminology were normalized, tokenized and stored in a tree data structure. The Levenshtein distance was used to detect typos. Frequent abbreviations were handled with a small, manually created list. Our system achieved an F-score of 0.786 (precision: 0.794, recall: 0.779). These scores were substantially higher than the average score of the systems that participated in the challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic annotation</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Death certificates</kwd>
        <kwd>Entity recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper, we describe our approach and present the results of our
participation in Task 1 (multilingual information extraction) of the CLEF
eHealth 2018 challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. More precisely, this task consists of automatically
coding death certificates using the International Classification of Diseases, 10th
revision (ICD-10) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        We addressed the challenge by matching ICD-10 terminology entries to text
phrases in death certificates. Matching text phrases to medical concepts
automatically is important to facilitate tasks such as search, classification or
organization of biomedical textual contents [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Many concept recognition systems
already exist [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ]. They use different approaches and some of them are open
source. We developed a general-purpose biomedical semantic annotation tool
for our own needs. The algorithm was initially implemented to detect drugs in
a social media corpus as part of the Drugs-Safe project [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We adapted the
algorithm for the ICD-10 coding task. Our main motivation for participating in
the challenge was to evaluate our system and compare it with others on a shared
task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>Corpora</title>
        <p>In the following subsections, we describe the corpora, the terminology used,
the steps of pre-processing and the matching algorithm.</p>
        <p>The data set for the coding of death certificates is called the CépiDC corpus.
The task organizers provided three CSV files (AlignedCauses) containing
annotated death certificates for different periods: 2006 to 2012, 2013 and 2014.
This training set contained 125,383 death certificates. Each certificate contains
one or more lines of text (medical causes that led to death) and some metadata.
Each CSV file contains a "Raw Text" column entered by a physician, a "Standard
Text" column entered by a human coder to support the selection of an
ICD-10 code, and the ICD-10 code itself in the last column. Table 1 presents an
excerpt of these files. Zero to multiple ICD-10 codes can be assigned to each line
of a death certificate.</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>One raw text sample with three selected columns of the training data.
Raw Text: text entered by a physician (duplicated in the file when multiple codes are
assigned). Standard Text: text entered by a human coder to support the selection of the
ICD-10 code.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Raw Text</th>
                <th>Standard Text</th>
                <th>ICD-10 code</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>SYNDROME DE GLISEMENT AVEC GRABATISATION DEPUIS OCTOBRE 2012</td>
                <td>syndrome glissement</td>
                <td>R453</td>
              </tr>
              <tr>
                <td>SYNDROME DE GLISEMENT AVEC GRABATISATION DEPUIS OCTOBRE 2012</td>
                <td>grabatisation 2 mois</td>
                <td>R263</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-2">
        <title>Dictionaries</title>
        <p>We constructed two dictionaries based on ICD-10. The first one, used in
the second run, contained all the terms of the "Standard Text" column of the
training set. The second one, used in the first run, added to this set of terms
the 2015 ICD-10 dictionary provided by the task organizers, which
contained terms that were not present in the training corpus. When
a term was associated with multiple ICD-10 codes in a dictionary, we kept the
most frequent one (Table 2).</p>
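        <p>As an illustration, the rule that keeps the most frequent code for each term can be sketched in Python as follows. This is not the Java code of our system: the input shape (pairs of standard text and code), the function name build_dictionary and the conflicting R54 row are assumptions made only for this example.</p>
        <preformat>
```python
from collections import Counter, defaultdict


def build_dictionary(rows):
    """Map each term to its most frequent ICD-10 code.

    `rows` is an iterable of (standard_text, icd10_code) pairs; this input
    shape is an assumption of the sketch, not the organizers' file format.
    """
    counts = defaultdict(Counter)
    for term, code in rows:
        if term and code:  # skip lines without a term or without a code
            counts[term.strip().lower()][code] += 1
    # most_common(1) returns the code seen most often for a given term.
    return {term: codes.most_common(1)[0][0] for term, codes in counts.items()}


rows = [
    ("syndrome glissement", "R453"),
    ("syndrome glissement", "R453"),
    ("syndrome glissement", "R54"),  # invented conflicting row, for illustration
    ("grabatisation 2 mois", "R263"),
]
dictionary = build_dictionary(rows)
print(dictionary["syndrome glissement"])  # R453
```
        </preformat>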
        <p>The first dictionary contained 42,439 terms and 3,539 ICD-10 codes (run 2)
and the second one 148,448 terms and 6,392 ICD-10 codes (run 1).</p>
        <p>The metadata of the death certificates (age, gender, location of death) were not used.</p>
        <p>
          All the terms were normalized by removing accents (diacritical marks) and
punctuation, lowercasing, and removing stopwords (we created a list of 25
stopwords for this task). Then, each term was tokenized and stored in a tree
data structure. Each token of an N-gram term is a node in the tree, and N-grams
correspond to different root-to-leaf paths [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (Figure 1).
        </p>
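        <p>The normalization steps and the tree of tokens can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not our Java implementation: the stopword list is hypothetical (the 25 stopwords are not listed here), the tree is represented with nested dictionaries, the reserved key "#code" is an invented marker for the end of a term, and the second term with its code is included only to show a shared prefix.</p>
        <preformat>
```python
import string
import unicodedata

# Hypothetical stopword list; the 25 stopwords of the paper are not given.
STOPWORDS = {"de", "du", "des", "la", "le", "les", "et", "avec"}


def normalize(text):
    """Strip accents and punctuation, lowercase, drop stopwords, tokenize."""
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = no_accents.translate(table).lower().split()
    return [t for t in tokens if t not in STOPWORDS]


def build_tree(terms):
    """Store each tokenized term as a path of nested dictionaries.

    The reserved key "#code" marks a node where a complete term ends; it
    cannot collide with a token because punctuation is removed first.
    """
    root = {}
    for term, code in terms.items():
        node = root
        for token in normalize(term):
            node = node.setdefault(token, {})
        node["#code"] = code
    return root


tree = build_tree({
    "insuffisance cardiaque aiguë": "I509",
    "insuffisance rénale": "N19",  # illustrative second entry
})
print(normalize("Insuffisance cardiaque aiguë !"))
print(sorted(tree["insuffisance"]))  # tokens available after "insuffisance"
```
        </preformat>
        <p>In this sketch, both terms share the node "insuffisance", and "cardiaque" is reachable only below it, not at the root.</p>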
        <p>The goal of our algorithm was to recognize one or many dictionary entries in
a raw text. An example is given in Figure 2. For each raw text entry, the same
normalization steps described above were performed first. The raw text was then
tokenized. For each token, the algorithm looked for an available dictionary token
depending on where it currently was in the tree. For example, the token
"cardiaque" was possible after the token "insuffisance" but was not available at the
root of the tree.</p>
        <p>For each token, the algorithm used three matching techniques: perfect match,
abbreviation match and Levenshtein match. The abbreviation match technique
used a dictionary of abbreviations. We manually added nine frequent
abbreviations after looking at some examples. The Levenshtein matching technique
used the Levenshtein distance, i.e. the minimum number of
single-character edits (insertions, deletions or substitutions) required to change one
token into the other. The Lucene™ implementation of the Levenshtein distance
was used.</p>
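        <p>The three techniques can be illustrated with the following Python sketch. It is an approximation under stated assumptions: the Levenshtein distance is computed with a textbook dynamic program instead of Lucene, the abbreviation table holds a single hypothetical entry, and the maximum distance of 1 is an assumed threshold that is not specified in this paper.</p>
        <preformat>
```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical abbreviation table; the nine entries we used are not listed.
ABBREVIATIONS = {"ins": "insuffisance"}


def match_token(token, candidates, max_distance=1):
    """Return the candidates matched by a perfect, abbreviation or
    Levenshtein match (max_distance is an assumed threshold)."""
    expanded = ABBREVIATIONS.get(token)
    matches = []
    for candidate in candidates:
        if (candidate == token
                or candidate == expanded
                or max_distance >= levenshtein(token, candidate)):
            matches.append(candidate)
    return matches


print(match_token("ins", ["insuffisance", "infarctus"]))  # abbreviation match
print(match_token("cardiaqu", ["cardiaque", "renale"]))   # Levenshtein match
print(match_token("aigue", ["aigue"]))                    # perfect match
```
        </preformat>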
        <p>In Figure 2, the algorithm used these three techniques to match the tokens "ins",
"cardiaqu", "aigue" to the dictionary term "insuffisance cardiaque aigue" whose
ICD-10 code is I509. As the following token "detresse" was not a dictionary entry
at this depth, the algorithm saved the previous and longest recognized term and
restarted from the root of the tree. At this new level, "detresse" was detected
but as no term was associated with this token alone, no ICD-10 code was saved.
Finally, only one term was recognized in this example.</p>
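        <p>The walk through the tree, with the longest recognized term saved before restarting from the root, can be sketched as follows. This simplified Python version assumes that tokens have already been normalized and resolved to dictionary tokens, so only exact lookups are shown; the nested-dictionary tree, the "#code" marker and all codes other than I509 are illustrative assumptions.</p>
        <preformat>
```python
END = "#code"  # reserved key marking the end of a dictionary term


def build_tree(terms):
    """Store each term as a path of nested dictionaries, one level per token."""
    root = {}
    for term, code in terms.items():
        node = root
        for token in term.split():
            node = node.setdefault(token, {})
        node[END] = code
    return root


def annotate(tokens, root):
    """Return (term, code) pairs for the longest dictionary terms found."""
    found = []
    i = 0
    while len(tokens) > i:
        node, j, best = root, i, None
        # Descend while the next token is available at the current depth.
        while len(tokens) > j and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best = (" ".join(tokens[i:j]), node[END], j)
        if best is None:
            i += 1  # no complete term starts here: restart at the next token
        else:
            found.append(best[:2])
            i = best[2]  # restart from the root after the longest term
    return found


tree = build_tree({
    "insuffisance cardiaque aigue": "I509",
    "detresse respiratoire": "J960",  # illustrative entry
})
print(annotate(["insuffisance", "cardiaque", "aigue", "detresse"], tree))
# [('insuffisance cardiaque aigue', 'I509')]
```
        </preformat>
        <p>As in the example of Figure 2, "detresse" is found at the root but no term ends at this token alone, so no second code is produced.</p>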
        <p>Besides unigrams, bigrams were also indexed in Lucene™ to resolve compound
words. For example, "meningoencephalite" matched the dictionary entry
"meningoencephalite" by a perfect match and "meningo encephalite" thanks to the
Levenshtein match (one deletion). Therefore, the algorithm entered two different
paths in the tree (Figure 3). By combining these different matching methods for
each token, the algorithm was able to detect multiple lexical variants. The
program was implemented in Java and the source code is available on GitHub
(<ext-link ext-link-type="uri" xlink:href="https://github.com/scossin/IAMsystem">https://github.com/scossin/IAMsystem</ext-link>).</p>
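        <p>The handling of compound words can be illustrated with the Python sketch below. It replaces the Lucene index with a plain dictionary that is scanned exhaustively, which is an assumption made for brevity; the function names, the example terms and the distance threshold of 1 are likewise hypothetical.</p>
        <preformat>
```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1,
                               row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]


def index_ngrams(terms):
    """Index every unigram and bigram of the dictionary terms.

    Keys are surface strings (bigrams joined by a space); values are the
    token tuples they stand for.
    """
    index = {}
    for term in terms:
        tokens = term.split()
        for token in tokens:
            index[token] = (token,)
        for first, second in zip(tokens, tokens[1:]):
            index[first + " " + second] = (first, second)
    return index


def lookup(token, index, max_distance=1):
    """Return the token tuples whose surface form is within max_distance."""
    return sorted(tokens for surface, tokens in index.items()
                  if max_distance >= edit_distance(token, surface))


index = index_ngrams(["meningoencephalite virale", "meningo encephalite"])
print(lookup("meningoencephalite", index))
# [('meningo', 'encephalite'), ('meningoencephalite',)]
```
        </preformat>
        <p>The single input token thus opens two paths in the tree: one through the unigram and one through the two tokens of the bigram.</p>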
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We submitted two runs on the CépiDC test set. Run 2 used only the terms entered
by human coders in the training set; run 1 added the
2015 ICD-10 dictionary provided by the task organizers to the set of terms of
run 2. We obtained our best precision (0.794) and recall (0.779) with run 2.</p>
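      <p>For reference, the F-score reported for run 2 is consistent with these two values, assuming the balanced F-measure (harmonic mean of precision and recall). A short Python check, with a function name chosen for this example:</p>
      <preformat>
```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall of run 2, as reported above.
print(round(f_score(0.794, 0.779), 3))  # 0.786
```
      </preformat>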
      <p>Table 3 compares the performance of our system with the median and average scores
of all participants in this task.</p>
      <p>Surprisingly, adding more terms (run 1) did not improve the recall, which
appears to be even slightly worse. The results were quite promising for our first
participation in this task, using a general-purpose annotation tool.</p>
      <p>A limitation of the proposed algorithm that impacted recall was the absence
of term detection when adjectives were isolated. For example, in the sentence
"metastase hepatique et renale", "metastase renale" was not recognized even
though the term existed. This situation seemed to be quite frequent.</p>
      <p>Some frequent abbreviations were manually added to improve the recall on
this corpus. Improvement at this stage may be possible by automating the
abbreviation detection or by adding more entries manually.</p>
      <p>
        In the past, other dictionary-based approaches performed better [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In 2016,
the Erasmus system [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] achieved an F-score of 0.848 without spelling correction
techniques. In 2017, the SIBM team [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] used a dictionary-based approach with
fuzzy matching methods and a phonetic matching algorithm to obtain an F-score
of 0.804.
      </p>
      <p>Further improvement may be possible by using a better curated terminology.
We are currently investigating frequent irrelevant codes that may have impacted
the precision. A post-processing filtering phase could improve the precision.</p>
      <p>We also plan to combine machine learning techniques with a dictionary-based
approach. Our system can already detect and replace typos and abbreviations
to help machine learning techniques increase their performance.
</p>
    </sec>
    <sec id="sec-4">
      <title>Affiliation</title>
      <p>DRUGS-SAFE National Platform of Pharmacoepidemiology, France</p>
    </sec>
    <sec id="sec-5">
      <title>Funding</title>
      <p>The present study is part of the Drugs Systematized Assessment in real-liFe
Environment (DRUGS-SAFE) research platform that is funded by the French
Medicines Agency (Agence Nationale de Sécurité du Médicament et des Produits
de Santé, ANSM). This platform aims at providing an integrated system
allowing the concomitant monitoring of drug use and safety in France. The funder
had no role in the design and conduct of the studies; collection, management,
analysis, and interpretation of the data; preparation, review, or approval of the
manuscript; and the decision to submit the manuscript for publication. This
publication represents the views of the authors and does not necessarily represent
the opinion of the French Medicines Agency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Spijker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ramadier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and Jimmy and Zuccon, G.:
          <article-title>Overview of the CLEF eHealth Evaluation Lab 2018</article-title>
          .
          <source>CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          . Springer. (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Grippo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lavergne</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Morgand</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Orsi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pelikán</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ramadier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rey</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>CLEF eHealth 2018 Multilingual Information Extraction task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian</article-title>
          .
          <source>CLEF 2018 Evaluation Labs and Workshop: Online Working Notes</source>
          , CEUR-WS, September
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jovanović</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bagheri</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Semantic Annotation in Biomedicine: The Current Landscape</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          (
          <year>2017</year>
          ). https://doi.org/10.1186/s13326-017-0153-x
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Tseytlin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Legowski</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Corrigan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Chavan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jacobson</surname>
            ,
            <given-names>RS.</given-names>
          </string-name>
          :
          <article-title>NOBLE - Flexible Concept Recognition for Large-Scale Biomedical Natural Language Processing</article-title>
          .
          <source>BMC Bioinformatics</source>
          (
          <year>2016</year>
          ). https://doi.org/10.1186/s12859-015-0871-y
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bigeard</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Construction de lexiques pour l'extraction des mentions de maladies dans les forums de santé</article-title>
          .
          <source>TALN</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pibiri</surname>
            ,
            <given-names>GE.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Venturini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Efficient Data Structures for Massive N-Gram Datasets</article-title>
          .
          <source>In: 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pp.
          <fpage>615</fpage>
          -
          <lpage>624</lpage>
          . ACM, Shinjuku, Tokyo, Japan (
          <year>2017</year>
          ) https://doi.org/10.1145/3077136.3080798
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>KB.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Grouin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hamon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lavergne</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rey</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Clinical Information Extraction at the CLEF eHealth Evaluation Lab 2016</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Van Mulligen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Afzal</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Akhondi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kors</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Erasmus MC at CLEF eHealth 2016: Concept Recognition and Coding in French Texts</article-title>
          .
          <source>Online Working Notes. CEUR-WS</source>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cabot</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Soualmia</surname>
            ,
            <given-names>LF.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Darmoni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SIBM at CLEF eHealth Evaluation Lab 2017: Multilingual Information Extraction with CIM-IND</article-title>
          .
          <source>CEUR-WS</source>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>