<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NoNLP: Annotating Medical Domain by combining NLP techniques with Semantic Technologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ghislain Auguste Atemezing</string-name>
          <email>ghislain.atemezing@mondeca.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mondeca</institution>
          ,
          <addr-line>35 Boulevard Strasbourg, 75010, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present in this work the process of annotating data from the medical domain using gazetteers as the reference for the annotation. The process combines Semantic Web technologies with NLP concepts. The application is proposed in this eHealth challenge for the multilingual extraction of ICD-10 codes. The first results give some directions on which aspects of the workflow to improve in order to build a better system.</p>
      </abstract>
      <kwd-group>
        <kwd>eHealth</kwd>
        <kwd>RDF</kwd>
        <kwd>NLP</kwd>
        <kwd>ICD-10</kwd>
        <kwd>multilingual extraction</kwd>
        <kwd>semantic annotation</kwd>
        <kwd>death certificates</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We have performed the following tasks:</p>
      <p>
        – Design a GATE [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ] workflow to annotate the RDF datasets based on gazetteers extracted from the dictionaries.
– Work on both the French (raw data) and English corpora in a single workflow, following a multilingual approach. We are thus able to handle more languages.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Main objectives of experiments</title>
      <p>The main objective is to retrieve the relevant ICD-10 codes in the text field of a
CertDC document, line by line, as they are provided in the dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>Description of the Approach</title>
      <p>
        The so-called “Not Only NLP” (NoNLP) approach is a combination of an NLP
technique for entity extraction from text, based on the GATE annotator, and the
extensive use of the RDF model for data manipulation within the system and for
further enriching the data as a knowledge base. NLP is leveraged through the GATE pipeline: we
use the Content Augmentation Manager (CA-Manager) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a semantic tool for
knowledge extraction from unstructured data (text, images, etc.) and a knowledge
manager able to handle both the model and the data in RDF.
      </p>
      <sec id="sec-3-1">
        <title>Content Annotation</title>
        <p>The annotation process employed is based on a central component: the Content
Augmentation Manager (CA-Manager). CA-Manager is in charge of processing
any type of content (plain text, XML, HTML, PDF, etc.). This module extracts
the concepts and entities detected using text-mining techniques with the text
input module. The strength of CA-Manager is to combine semantic technologies
with a UIMA-based infrastructure1 that has been enriched and customized to
address the specific needs of both semantic annotation and ontology population
tasks.</p>
        <p>In the scenario presented in this paper, we use the GATE framework for
the entity extraction. CA-Manager uses an ontology-based annotation schema
to transform heterogeneous content (text, image, video, etc.) into semantically
driven and organized content. The overall architecture of CA-Manager is depicted in
Figure 1. We first create the gazetteer with the SKOS document obtained from
the experts. We then process ten documents in parallel, in multiple threads, each containing
the text information represented by a row in the CSV. The annotation report
contains the valid knowledge section, an RDF/XML document containing the
Uniform Resource Identifiers (URIs) of the concepts detected by the annotator.
Finally, a SPARQL update query is launched to update the dataset containing
all the data in RDF.
1 Unstructured Information Management Architecture (http://uima.apache.org)
The workflow consists of three steps:
1. First, we generate the gazetteer to be used for the annotations through a
SKOSification process.
2. Second, we prepare a workflow suitable for our use case, defining, for example, how
to deal with languages, text accentuation, etc. (Figure 2). This includes:
– the conversion of all the test data from CSV into RDF by using a model
(ontology) to transform them into RDF;
– the actual annotation to extract the pertinent concepts that enrich the
initial test dataset in an endpoint.
3. Third, we enrich the dataset in the endpoint by associating
each URI detected in the annotation phase with the relevant code number
(which is a property) of the SKOS concept. Finally, a SPARQL CONSTRUCT
query is launched to create the result output in CSV according to the
specification of the challenge.</p>
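        <p>As a rough sketch of the CSV-to-RDF conversion step, each certificate text line can become its own RDF resource. The column layout (DocID;LineID;RawText), the "ex:" namespace prefix, and the property names below are hypothetical, since the actual CertDC ontology is not shown here:</p>

```python
import csv
import io

def csv_rows_to_turtle(csv_text):
    """Turn each death-certificate text line into its own RDF resource,
    so that every row gets a unique URI usable for later enrichment.
    Hypothetical layout: DocID;LineID;RawText."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=";")
    triples = []
    for doc_id, line_id, raw_text in reader:
        subject = "ex:doc%s_line%s" % (doc_id, line_id)
        literal = raw_text.replace('"', '\\"')  # keep the Turtle literal valid
        # The @prefix declaration for "ex:" is assumed to be prepended separately.
        triples.append('%s a ex:CertLine ; ex:rawText "%s" .' % (subject, literal))
    return "\n".join(triples)

print(csv_rows_to_turtle("1;1;douleur thoracique\n1;2;cancer du poumon"))
```

        <p>Each emitted subject URI can then carry the annotation results merged back by the SPARQL update step.</p>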
        <p>Figure 3 provides a big picture of the NoNLP approach. The process starts
with the conversion into RDF of the dictionaries to generate the gazetteers
using SKOS2GATE. The dictionaries are used, as configured in the GATE workflow,
for the annotation by CA-Manager, together with the test dataset already
converted into RDF. The result is a set of files representing the valid knowledge
in RDF/XML, with the entities detected and the associated URIs as described in
the gazetteer dataset. All the output files of CA-Manager are merged into the
triple store, where each piece of information is manipulated through a URI with the
associated ICD-10 code if available. The enriched dataset can be exported into
CSV according to the specifications of the challenge.
We first model all the dictionaries received in RDF using the SKOS vocabulary2.
Each element in the dictionary is a skos:Concept, and concepts from
different years have different skos:inScheme properties. Figure 4 shows a sample
view of the concept CANCER modeled in SKOS, with the attributes attached
to describe the concept in our scenario. This dataset is used as input to our tool
SKOS2GATE, which transforms the RDF file into lists of terms, or gazetteers, applying
some normalization within the configuration. The configuration file used
for creating the dictionary contains two major pieces of information:
– the name of the Java class of the morphological analyzer used to lemmatize
labels, one per language;
– the indication to convert all the labels in the dictionary to lower case. This
is the only normalization we made that affects the dictionary. The choice is
guided by the characteristics of the text being analyzed, as we are using exact
matching with the terms during the entity detection process.
NoNLP assumes that all the data manipulated are graphs. So we transform
into RDF all the test data to be used in our experiments. The benefit here is
that each element of the graph is described by a unique URI, which identifies a single
resource and is used for merging the information attached to it. The input of the
annotator is no longer a document, but a concept with properties representing
the actual document to be processed.
2 https://www.w3.org/TR/skos-reference/</p>
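        <p>A minimal sketch of the SKOS2GATE-style transformation described above: each (preferred label, concept URI) pair becomes one gazetteer entry, with lowercasing as the only normalization. The concept labels and URIs are illustrative, and a tab is assumed here as the gazetteer feature separator (this separator is configurable in GATE):</p>

```python
# Illustrative SKOS concepts: (preferred label, concept URI).
# The real entries come from the ICD-10 SKOS dictionaries.
concepts = [
    ("CANCER", "http://example.org/icd10/C80"),
    ("Douleur thoracique", "http://example.org/icd10/R074"),
]

def to_gazetteer_lines(concepts):
    """Produce one gazetteer entry per concept: the lowercased label,
    then a feature carrying the concept URI so the annotator can
    report which skos:Concept was matched."""
    lines = []
    for label, uri in concepts:
        lines.append("%s\turi=%s" % (label.lower(), uri))
    return lines

for line in to_gazetteer_lines(concepts):
    print(line)
```

        <p>With exact matching downstream, lowercasing both the gazetteer entries and the analyzed text is what makes case-insensitive detection possible.</p>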
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Resources</title>
      <p>We used solely the raw data received, without any additional data. Hence we
used the following datasets in their original version as received for the challenge:
– all the dictionaries in CSV for the French dataset, from 2006 to 2015;
– the test dataset “AlignedCauses_2014test.csv” in the French dataset;
– the American dictionary provided in CSV for the English terms;
– the test corpus with data for 2015.</p>
      <table-wrap id="tab2">
        <label>Table 2.</label>
        <caption>
          <p>Results of the run for the French raw corpus</p>
        </caption>
        <table>
          <thead>
            <tr><th/><th>Precision</th><th>Recall</th><th>F-measure</th></tr>
          </thead>
          <tbody>
            <tr><td>NoNLP-run</td><td>0.3751</td><td>0.1305</td><td>0.1936</td></tr>
            <tr><td>Average</td><td>0.4747</td><td>0.3583</td><td>0.4059</td></tr>
            <tr><td>Median</td><td>0.5411</td><td>0.4136</td><td>0.508</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>However, we did not use the training data, which is one of the weaknesses of the
approach described in these working notes. We leave that for future work, as it
would help detect patterns for writing suitable JAPE rules for the annotator.</p>
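      <p>As a sanity check on Table 2, the F-measure column is the harmonic mean of precision and recall; the NoNLP-run row can be reproduced as follows (the Average and Median rows are aggregates over runs, so they do not follow directly from their own P and R values):</p>

```python
def f_measure(precision, recall):
    """F-measure as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# NoNLP-run row of Table 2: P = 0.3751, R = 0.1305
print(round(f_measure(0.3751, 0.1305), 4))  # 0.1936, as reported
```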
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>With the help of the organizers, the scores of our runs were
computed using the evaluation program. The unofficial scores were obtained after
converting the outputs to the expected challenge CSV format. The scores are as follows.</p>
      <sec id="sec-5-1">
        <title>EN-RAW Data</title>
        <p>We ran the EN corpus, containing 14,833 pieces of text, through our annotator. The goal
was to find in each text the ICD-10 SKOS concept present in the gazetteer by
using an exactMatch approach. The score for this dataset is presented in Table
1. We obtained a precision of 69% and a recall of almost 31%.</p>
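        <p>A minimal sketch of the exactMatch lookup applied to each piece of text; the gazetteer entries and URIs below are illustrative, not the real dictionary content:</p>

```python
# Illustrative gazetteer (lowercased term -> ICD-10 concept URI);
# the real entries are generated from the SKOS dictionaries.
gazetteer = {
    "lung cancer": "http://example.org/icd10/C349",
    "heart failure": "http://example.org/icd10/I509",
}

def annotate(text, gazetteer):
    """Return the URI of every gazetteer term found verbatim
    (after lowercasing) in the text, i.e. an exact-match lookup."""
    lowered = text.lower()
    return [uri for term, uri in gazetteer.items() if term in lowered]

print(annotate("Metastatic LUNG CANCER with heart failure", gazetteer))
```

        <p>Exact matching explains the precision/recall trade-off reported above: detected terms are reliable, but any label variation absent from the gazetteer is missed.</p>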
      </sec>
      <sec id="sec-5-2">
        <title>FR-RAW data</title>
        <p>We ran the FR corpus (raw dataset), containing 59,176 pieces of text, through our
annotator. The results in Table 2 show a low precision (37.51%) compared to the
EN run.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Analysis of the results</title>
      <p>
        The results show that detection works better for the English corpus than for the French
one. We obtained both higher precision and higher recall on the English corpus, while
on the French corpus we obtained a lower recall. Overall, our approach
performs better on the English corpus than on the French one, by a factor of roughly two. This suggests that our
system can benefit from the training dataset, by adding more alternative labels
in general, and in particular by adding some extra normalization based on
patterns that could be observed in the training dataset. We could improve the
current scores by using the training dataset received during this challenge. Our
current approach did not make use of the development data, which would help detect
patterns for creating pattern-matching grammar (JAPE) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] rules. This
can be seen as a baseline result that needs further tuning through exploitation of the
training dataset. We also need to understand the French dataset better in order to enrich our
gazetteers.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Future Work</title>
      <p>The approach presented in these working notes does not exploit the full power
of semantic technologies, such as classification rules or inference.
We have considered the resources without any hierarchy or relations such as
broader, narrower, etc. It would have been possible to detect more entities
based on the relationships between the ICD-10 concepts. Also, we did not use the
training set to add a “learning module”, both to improve the gazetteers and
to detect more ICD-10 codes. We plan to add JAPE rules based on the patterns
detected in the “gold standard” to improve the detection in the GATE workflow, and to
complete the normalization of the gazetteers with additional label
variations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>H.</given-names>
            <surname>Cherfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coste</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Amardeilh</surname>
          </string-name>
          .
          <article-title>CA-Manager: a middleware for mutual enrichment between information extraction systems and knowledge repositories</article-title>
          .
          <source>In 4th workshop SOS-DLWD “Des Sources Ouvertes au Web de Données”</source>
          , pages
          <fpage>15</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>W. W. W.</given-names>
            <surname>Consortium</surname>
          </string-name>
          et al.
          <article-title>RDF 1.1 concepts and abstract syntax</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>H.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>GATE, a general architecture for text engineering</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>223</fpage>
          -
          <lpage>254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          , et al.
          <article-title>CLEF 2017 eHealth evaluation lab overview</article-title>
          .
          <source>In Lecture Notes in Computer Science</source>
          . Springer, Berlin / Heidelberg, Germany (in press),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T.</given-names>
            <surname>Kenter</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          .
          <article-title>Using GATE as an annotation tool</article-title>
          . University of Sheffield,
          <source>Natural language processing group</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Miles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Matthews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wilson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Brickley</surname>
          </string-name>
          .
          <article-title>SKOS Core: simple knowledge organisation for the Web</article-title>
          .
          <source>In International Conference on Dublin Core and Metadata Applications</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grouin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavergne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Robert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rondet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          .
          <article-title>CLEF eHealth 2017 multilingual information extraction task overview: ICD-10 coding of death certificates in English and French</article-title>
          .
          <source>In CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS</source>
          . Springer, Berlin / Heidelberg, Germany (in press),
          <year>September 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Thakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Osman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Lakin</surname>
          </string-name>
          .
          <article-title>GATE JAPE grammar tutorial</article-title>
          . Nottingham Trent University, UK, Version
          <volume>1</volume>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>