<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Parsing at the CLEF2014 IE Task (DFKI-Medical)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tigran Mkrtchyan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Sonntag</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for AI (DFKI) Stuhlsatzenhausweg 3</institution>
          ,
          <addr-line>66123 Saarbruecken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>138</fpage>
      <lpage>146</lpage>
      <abstract>
        <p>We present an information extraction system for patient records which has been submitted to the ShARe/CLEF eHealth Evaluation Lab 2014 Task 2. The task was information extraction from clinical text in terms of a disease/disorder template filling process. The system uses a lexicalized parser to annotate grammatical relations between diseases, disorders, and other constituents on a sentence level. Grammatical pattern matching rules are applied in order to annotate the specifics of individual disease/disorder cases. High accuracy is most important for clinical decision support; the comparative results suggest that a deep parsing approach is suitable for this task, as we achieved acc = 0.822 and acc = 0.804 for the two runs of the system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We engage in information extraction in the medical domain and are
supported by several national and international "smart data" initiatives1. We
are particularly interested in medical records which can be mined in
combination with clinical sensor data and image data towards multimedia information
extraction and knowledge capture in ontologies2 and medical cyber-physical
systems3. Technically, we use a deep parsing approach (dependency parsing) which
we will tune towards high-precision real-time information extraction over the next
three years. Deep linguistic processing approaches differ from "shallower"
methods in that they yield more expressive and structural representations which
directly capture long-distance dependencies and underlying predicate-argument
structures. The main objective is to provide a hybrid information extraction
(IE) platform based on handwritten rules in combination with semi-supervised
machine learning approaches. Constituency and dependency parsing is key to
our approach, revealing a multitude of linguistic features for making both
rule writing and classifier induction more effective. The features we obtain from the
sentence-level parsing step include, most notably, bilexical affinities, distant
dependencies, and verb head information to identify more complex relations (not
yet fully evaluated at CLEF) according to valency information of heads and
dependents.
1 Federal Ministry of Education and Research (BMBF), Federal Ministry for Economic
Affairs and Energy (BMWi), and European Institute of Innovation &amp; Technology
(EIT)
2 http://www.dfki.de/~sonntag/courses/SS14/IE.html
3 http://www.dfki.de/MedicalCPS/</p>
      <p>
        We are interested in extracting information from unstructured text,
particularly in the medical domain. Medical reports contain huge amounts of data about
medications, recommendations, procedures, etc., which are expressed mostly as
narrative text. Information in this form is difficult to access. One
approach is identifying medical semantic relations [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A technology which
extracts important data from text is the cornerstone of many clinical
applications, such as populating multimedia databases (PACS, picture archiving and
communication system), summarizing medical records, medical
insurance reporting, or clinical decision support [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Extracting modifiers for a given
disease/disorder is an important task. Structured information can be accessed
and processed more effectively, enabling the construction of more
intelligent medical systems. We have implemented a system which extracts
targeted information from clinical reports.
      </p>
      <p>
        We thereby extend Task 1 from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], focusing on disease/disorder template
filling. The challenge of the ShARe/CLEF eHealth Task 2 consists of extracting
10 semantic attributes from unstructured medical texts (440 patient records);
a list of expected attribute values has been provided (such as 'yes' and 'no'
for the negation indicator). For the Body Location indicator, for example, UMLS
concept unique identifiers (CUIs) should be extracted if mentioned; see
http://clefehealth2014.dcu.ie/task-2. In this task, a patient record consists of
2 documents: one unstructured text file of the patient record itself and another
pipe-delimited template with disease/disorder annotations (disorder text span
indexes are provided as well as default values of the attributes that modify
the disorder). Each patient record contains 60 disease/disorder mentions on average.
Because of the prevalent sentence structure in which a disease/disorder is mentioned,
we expect that NLP techniques like POS tagging, chunking and
especially syntax parsing are appropriate and key for obtaining a high accuracy
(which is most important for clinical decision support).
      </p>
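      <p>As a rough illustration of the pipe-delimited annotation side of a patient record, the following sketch parses one annotation line into span indexes and default attribute values. The column order, document id, and attribute names here are hypothetical, not the official CLEF file layout:</p>

```python
# Parse one pipe-delimited disease/disorder annotation line into a record.
# The column layout (document id, span start, span end, then attribute:value
# defaults) is illustrative only, not the official CLEF template format.

def parse_annotation(line):
    fields = line.strip().split("|")
    doc_id, start, end = fields[0], int(fields[1]), int(fields[2])
    attributes = dict(f.split(":", 1) for f in fields[3:])
    return {"doc": doc_id, "span": (start, end), "attributes": attributes}

record = parse_annotation("00211-027889|2362|2386|negation:no|uncertainty:no")
print(record["span"])        # span indexes into the narrative text file
print(record["attributes"])  # default slot values, to be overridden by the IE system
```

      <p>The span indexes point into the unstructured text file of the same patient record, which is how the two documents are linked.</p>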
      <p>The task of filling the IE template consists of providing slot values for each
given disease/disorder combination. This results in the task of checking whether
the sentences in the record contain modifiers in terms of attribute types (in our
deep NLP approach, the dependents). Table 1 describes the IE task in terms of
example sentences, given attributes and their norm slot values.</p>
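      <p>The template-filling task can be pictured as overriding default slot values whenever a detector finds a matching modifier in the mention's sentence. The attribute names, defaults, and the toy negation detector below are illustrative only:</p>

```python
# Fill IE template slots for one disease/disorder mention.
# Attribute names and defaults are examples in the spirit of the task, not
# the official list; each detector inspects the mention's sentence and
# returns a slot value, or None to keep the default.
DEFAULTS = {"negation": "no", "uncertainty": "no", "severity": "unmarked"}

def fill_template(sentence, detectors):
    slots = dict(DEFAULTS)
    for attribute, detect in detectors.items():
        value = detect(sentence)
        if value is not None:
            slots[attribute] = value
    return slots

# Toy lexical negation detector, standing in for the grammatical patterns.
detectors = {"negation": lambda s: "yes" if "no " in s.lower() else None}
slots = fill_template("No evidence of pneumothorax.", detectors)
print(slots["negation"])   # the default "no" was overridden to "yes"
```

      <p>In the actual system the detectors correspond to grammatical pattern rules rather than substring checks.</p>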
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>
        We will adopt the following terminology from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to refer to special types of
NLP components. Language Resources (LRs) refer to data-only resources such
as lexica, corpora, thesauri or ontologies. Processing Resources (PRs) refer to
resources whose character is principally programmatic or algorithmic, such as
text classifiers, part-of-speech taggers (POS taggers), named entity recognizers
(NERs) or grammatical parsers. PRs typically include LRs such as a lexicon.
For the information extraction task in the medical domain, we employ specific LRs
and PRs.
1. In our system, a first preprocessing step identifies the sentence in which
a disease/disorder mention occurs. Given the narrative text of a patient
record and the start and end indexes of the disease/disorder span, we capture
the exact sentence of the mentioned disorder.
2. The second step is POS tagging: we use an implementation of the Stanford
Log-linear Part-Of-Speech Tagger [
        <xref ref-type="bibr" rid="ref10">10</xref>
], which first tokenizes the text, then
generates the word lemmata for all tokens in the corpus and finally labels
tokens with their POS tag. These POS tags are then used by almost every
PR in the pipeline.
3. Then we run a rule-based PR for recognizing temporal expressions in order
to annotate temporal expression attributes based on SUTime. SUTime [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
is a library for recognizing and normalizing time expressions, which outputs
temporal tagging features. SUTime is a rule-based system which can easily be
extended and adapted to the special temporal expression extraction needs of
idiosyncratic datasets such as the provided patient records. We use SUTime
to find the time, date and duration occurrences within the records. We
have added several patterns as medical LRs for recognizing medical time
and date expressions, such as 14/10, 08-09, etc.; the rules also capture
mentions of pairs of type DD/MM or DD-MM.
4. The fourth and most important step is syntactic sentence parsing. First we
run a constituency parser, which outputs the noun and verb chunks of the
sentence. We consider only those chunks where the disease/disorder is
mentioned. After constituency parsing, a dependency parser is used to output the
grammatical relations in the sentence. Here we check for the disease/disorder
to be the governor or the dependent in the relation. The Stanford Parser [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
PR is used for both parsing steps. This parser is probabilistic, which means
that it outputs the most likely analyses of the sentences, retaining a lot of
ambiguity in the result set.
5. The fifth step consists of running constituency-tree-based regular
expressions on constituency trees and semantic-graph-based regular expressions on
dependency trees. Tregex [
        <xref ref-type="bibr" rid="ref6">6</xref>
] is a utility for performing pattern matching on
tree structures and tries to match regular expressions on constituency parse
tree nodes (the name is short for tree regular expressions); Semgrex is a
utility for identifying patterns in Stanford Dependencies [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. These pattern
matching approaches work very similarly to simple string-based regular
expression matching, i.e., regex or regexp, but run on acyclic graph structures
instead. The benefits are that expressions may involve not only the usual regular
expressions but also the grammatical relations in the sentence; POS tags of words
and their named entities can be used in the patterns, too, which allows for
very detailed and highly accurate extraction patterns. Figures 1 and 2 show
medical examples of IE-relevant constituency and dependency parses,
respectively. As one can see in figure 1, both diseases (underlined red) belong
to different noun and verb phrase constituents (chunks).
6. The sixth step is running a look-up through our own vocabularies
(medical LRs), in which synonyms of class labels and keywords of the remaining classes are
gathered; if an occurrence of such a word is found in the output of
Tregex and Semgrex, the default attribute value is changed to the class
label. In parallel to the vocabulary look-up, we run MetaMap [
        <xref ref-type="bibr" rid="ref1">1</xref>
] via a web service
API and check whether the lexical constituents within the disease/disorder
noun chunks and their dependencies contain any mentions of a "Body Location"
that can be identified and normalized (mapping of biomedical text to the UMLS
Metathesaurus). If this lookup is successful, we output the concept CUI as
the body location feature. The IE pipeline is depicted in figure 3.
      </p>
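      <p>Step 1 of the pipeline (capturing the exact sentence of the mentioned disorder from its span indexes) can be sketched as follows; the naive period-based splitting here stands in for a proper sentence segmenter:</p>

```python
# Given the narrative text and the start/end character indexes of a
# disease/disorder span, return the sentence containing that span.
# Splitting on periods is a simplification standing in for a real
# sentence segmenter; it fails on abbreviations like "Dr." etc.

def sentence_of_span(text, start, end):
    sent_start = text.rfind(".", 0, start) + 1  # character after previous period
    sent_end = text.find(".", end)              # next period, or end of text
    if sent_end == -1:
        sent_end = len(text)
    return text[sent_start:sent_end + 1].strip()

text = "Patient is stable. No acute distress was noted. Plan unchanged."
start = text.index("acute distress")
sent = sentence_of_span(text, start, start + len("acute distress"))
print(sent)   # "No acute distress was noted."
```

      <p>The captured sentence is what the later POS tagging, parsing and pattern matching steps operate on.</p>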
      <p>
        The derived relation extraction method based on syntax parsing has many
advantages over purely string-based methods: instead of writing a huge number
of rules (for long dependencies), we can simply extract negation information with
the correct scope; "no" is a modifier of the diabetes mention (Fig. 2). Dependency parsing
captures the semantic predicate-argument relationships between the entities in
addition to the syntactic relationships (e.g., the scope of negation information).
From the dependency parse tree we can infer that the modifier "may" (a modal
verb) has a grammatical relation to the head word "impaired", which is part
of the disorder according to the governing head-driven dependency structure.
This indicates a sentence-structure-driven appearance of an uncertainty indicator
(UI) with a clearly defined scope, and complements lexical uncertainty indicators
like "evaluation of x". Our dependency rule experiments suggest that the
semantic predicate-argument relationships captured by dependency parsing, in
addition to the syntactic relationships, are useful in this domain (similar
medical applications of dependency parses are reported in [
        <xref ref-type="bibr" rid="ref4">4</xref>
]). In addition, the rule set can be heavily reduced: simple
relation patterns between the modal verb and the disorder can already
annotate a comprehensive number of uncertain disorder indications. Overall, we
have written only 25 generic patterns to annotate the entire set of attributes
(10 attributes) of the given disease/disorder in the IE task, which results in a
"low-cost" manual rule-writing ratio of about 2.5 patterns per attribute for the
medical domain adaptation.
      </p>
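      <p>The core idea of matching attribute patterns against grammatical relations instead of surface strings can be illustrated on a toy dependency graph. The relation labels follow Stanford Dependencies naming, but the graph below is hand-built for illustration, not parser output:</p>

```python
# Match simple attribute patterns against dependency edges.
# Edges are (governor, relation, dependent) triples; "neg" and "aux"
# follow Stanford Dependencies naming, but this graph is hand-crafted
# to mimic a parse of a sentence like "renal function may not be impaired".
edges = [
    ("impaired", "neg", "not"),             # negation attaches to the head word
    ("impaired", "aux", "may"),             # modal auxiliary signals uncertainty
    ("impaired", "nsubj", "renal function"),
]

def modifiers(graph, head, relation):
    """Return the dependents of `head` connected via `relation`."""
    return [dep for gov, rel, dep in graph if gov == head and rel == relation]

negated = bool(modifiers(edges, "impaired", "neg"))
uncertain = "may" in modifiers(edges, "impaired", "aux")
print(negated, uncertain)   # both indicators found with the correct scope
```

      <p>A single such relation pattern covers every surface realization the parser maps onto that edge, which is why so few patterns suffice.</p>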
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The main evaluation score was overall average accuracy for the task. Accuracy
is the ratio of the number of correctly identified attribute:value slots (true
positives and true negatives) to the total number of slot attributes. The IE
task has been evaluated per attribute type and on average.</p>
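      <p>As a sketch, the accuracy measure described above reduces to a few lines (the slot values are made up for illustration):</p>

```python
# Overall accuracy: correctly filled attribute:value slots (true positives
# plus true negatives) divided by the total number of slots evaluated.

def accuracy(predicted, gold):
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold      = ["yes", "no", "no",  "severe", "no"]
predicted = ["yes", "no", "yes", "severe", "no"]
print(round(accuracy(predicted, gold), 3))   # 0.8: four of five slots match
```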
      <p>The overall accuracy of our system is acc = 0.822 for the second run and
acc = 0.804 for the first run. The attribute-based recall, precision, and F-measure
scores are shown in figures 4 and 5.</p>
      <p>Teams were allowed to submit up to two runs of systems. In our case, the
difference between the two runs is that in the first one, for the first seven attributes,
we used predefined (domain-independent) grammatical relations in order to
assess the performance of the IE engine on the attributes directly. For example,
to annotate the severity class, the attribute was searched for only in a JJ (adjective)
relation within the disease/disorder governor node. For the Time Expression
attribute we relied purely on SUTime; when looking for the body location, we sent all
linguistic types of a sentence of the disease/disorder to the MetaMap API; and
to output the document time attribute, we used document paragraph pattern
output (i.e., if there is a mention of the word "history", then the document time
is "before"). The second run works with more sophisticated and domain-specific
patterns, e.g., looking for the attributes not only in grammatical relations among
constituents, but also in noun phrase premodifiers. Here SUTime and MetaMap
are also involved, but first we check whether the word has a relation to the
disease/disorder and consider it only in the positive case. Additionally, in the second
run, for the Document Time attribute we rely on the verb's tense, which comes
from the POS tagger. The benefits of the parsing approach come through when
considering the accuracy for the body location indicator. For the first run the
accuracy was acc = 0.486 and for the second one acc = 0.586. Parsing the sentence
and sending only the noun chunk to the MetaMap API increases the accuracy
by 20%. Another important result is that the parsed attribute output results in
higher precision, whereas the whole-sentence approach results in higher recall.</p>
      <p>Concerning the temporal expression attribute results, we see that the parser
approach behaves better and has roughly 10% better overall accuracy compared
to pure lexical search results. In the first run the temporal expression was
searched for in the overall sentence, whereas the second run considers only
chunks where the disorder is mentioned. Just as for the body location, the phrase-chunked
method results in higher precision in comparison to the whole-sentence
approach, which results in higher recall. For the seven remaining attributes we
can see that with deep parsing rules (second run) we have better accuracy in the
majority of cases.</p>
    </sec>
    <sec id="sec-4">
      <title>Summary</title>
      <p>
        The system uses a lexicalized parser to annotate grammatical relations between
diseases, disorders, and other constituents on a sentence level. The features we
obtained from the sentence-level parsing step include, most notably, bilexical
affinities, distant dependencies, and verb head information to identify more
complex relations (the major part of which is not yet evaluated at CLEF) according to
valency information of heads and dependents. For writing domain-specific
extraction patterns, we used Semgrex, a utility for identifying patterns in
Stanford Dependencies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The benefits are that expressions may involve not only the usual
regular expressions but also the grammatical relations in the sentence; POS tags
of words and their named entities can be used in the patterns, too, which
allows for very detailed and highly accurate extraction patterns. The benefits of
the parsing approach are most evident when considering the accuracy for the body
location indicator: for the first run the accuracy was acc = 0.486 and for the
second one acc = 0.586. Overall, we have written only 25 generic (Semgrex)
patterns to annotate the entire set of attributes (10 attributes) of the given
disease/disorder in the IE task, which results in a "low-cost" manual rule-writing
ratio of about 2.5 patterns per attribute for the medical domain adaptation.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Aronson</surname>
          </string-name>
          .
          <article-title>Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</article-title>
          .
          <source>In Proceedings of the AMIA Symposium</source>
          , pages
          <volume>17</volume>
          -
          <fpage>21</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>SUTIME: A Library for Recognizing and Normalizing Time Expressions</article-title>
          .
          <source>In Proceedings of LREC</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.-C.</given-names>
            <surname>de Marneffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>MacCartney</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Generating typed dependency parses from phrase structure parses</article-title>
          .
          <source>In Proceedings of LREC</source>
          , pages
          <volume>449</volume>
          -
          <fpage>454</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>G.</given-names>
            <surname>Erkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ozgur</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Radev</surname>
          </string-name>
          .
          <article-title>Semi-supervised classification for extracting protein interaction sentences using dependency parsing</article-title>
          .
          <source>In Proceedings of EMNLP-CoNLL</source>
          , pages
          <volume>228</volume>
          -
          <fpage>237</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Accurate unlexicalized parsing</article-title>
          .
          <source>In Proceedings of ACL</source>
          , pages
          <volume>423</volume>
          -
          <fpage>430</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Andrew</surname>
          </string-name>
          .
          <article-title>Tregex and tsurgeon: tools for querying and manipulating tree data structures</article-title>
          .
          <source>In Proceedings of LREC</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Sonntag</surname>
          </string-name>
          .
          <article-title>Distributed NLP and Machine Learning for Question Answering Grid</article-title>
          .
          <source>In Proceedings of the workshop on Semantic Intelligent Middleware for the Web and the Grid at ECAI</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Sonntag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wennerberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zillner</surname>
          </string-name>
          .
          <article-title>Pillars of ontology treatment in the medical domain</article-title>
          .
          <source>Journal of Cases on Information Technology (JCIT)</source>
          ,
          <volume>11</volume>
          (
          <issue>4</issue>
          ):
          <volume>47</volume>
          -
          <fpage>73</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          , S. Salantera,
          <string-name>
            <given-names>S.</given-names>
            <surname>Velupillai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Savova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Elhadad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pradhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>South</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Mowery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leveling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martinez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          .
          <article-title>Overview of the ShARe/CLEF eHealth Evaluation Lab</article-title>
          . In P. Forner, H. Muller, R. Paredes,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , and B. Stein, editors,
          <source>Information Access Evaluation</source>
          , Multilinguality, Multimodality, and Visualization, Lecture Notes in Computer Science, pages
          <volume>212</volume>
          -
          <fpage>231</fpage>
          . Springer Berlin Heidelberg,
          <year>2013</year>
          . 4th International Conference of the CLEF Initiative.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
          .
          <source>In Proceedings of HLT-NAACL</source>
          , pages
          <volume>252</volume>
          -
          <fpage>259</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>S.</given-names>
            <surname>Vintar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Todorovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sonntag</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <article-title>Evaluating context features for medical relation mining</article-title>
          .
          <source>In Proceedings of the ECML/PKDD Workshop on Data Mining and Text Mining for Bioinformatics</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>