<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cocoa: Extending a rule-based system to tag disease attributes in clinical records</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>S. V. Ramanan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>P. Senthil Nathan</string-name>
          <email>senthil@relagent.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RelAgent Tech Pvt Ltd</institution>
          ,
          <addr-line>Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>150</fpage>
      <lpage>155</lpage>
      <abstract>
        <p>We extended Cocoa/Peaberry, our (RelAgent) existing rule-based entity and event tagger, to tag attributes associated with diseases in clinical records. The boolean attributes of Negation, Uncertainty and Conditional were handled by an extension of the NegEx algorithm. The multi-valued Course and Severity attributes were detected either within the extended disease spans as output by the system, or by event-based annotation using a predicate-argument framework. The anatomical attribute Body Location was marked up either by finding embedded body parts in the extended disease spans or by finding body parts located close to the disease span. UMLS IDs for anatomical locations were derived by using a small number of morphological lemmas, and by a few rules derived by manual inspection in case of multiple hits. We used the most frequent value in the training data for the Subject, Generic, and time-related attributes.</p>
      </abstract>
      <kwd-group>
        <kwd>rule-based tagger</kwd>
        <kwd>disease attributes</kwd>
        <kwd>clinical notes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Tracking the severity, anatomical location, and temporal factors, including the
course, pertaining to a disease/symptom is of significant value in diagnosis. The
incidence and progression of a disease are also relevant to tracking response to
treatment with various medications, which is useful both in clinical practice and
in phase trials.</p>
      <p>
        Tagging of diseases, signs and symptoms themselves as well as their
normalization to SNOMED terminology was the focus of the 2013 ShARe/CLEF
eHealth task 1, which covered a variety of clinical documents, such as discharge
summaries and echo/radiology/ECG reports. Disease-tagging tasks share a
degree of overlap with previous tasks which marked up radiology reports [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
discharge summaries [
        <xref ref-type="bibr" rid="ref11 ref13">11, 13</xref>
        ].
      </p>
      <p>
        The current ShARe/CLEF eHealth task 2 [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] addresses annotation of the
various attributes associated with diseases, signs and symptoms. Specifically, the
text span of the disease-related entity as well as its mapping to the SNOMED
subset of UMLS are "given" by the organizers, and the performance of systems is
evaluated only on the detection of the attributes of these given diseases. The
attributes are of several types. Some attributes are boolean; these are the
negation, speculation, and conditional attributes. Some attributes are multi-valued;
for example, the severity attribute can take "mild", "moderate" and "severe"
values. The progression or course attribute takes six values, some pertaining
to disease, such as improved/worsened/resolved, while others pertain more to
symptoms/signs, such as increased/decreased/changed. The anatomical location
of the disease or injury is in an attribute class of its own, with a very large value
set, as it is a normalization against the (large) sub-branch of UMLS dealing
with body parts. Other annotated attributes are: the bearer of the disease (e.g.,
the patient or a family member), time-related attributes, and generic symptoms
such as fever, which are system-wide and not confined to a discrete anatomic
location.
      </p>
      <p>
        Cocoa/Peaberry is our (RelAgent) existing named entity and event tagger
for published literature in the biomedical domain. The system performs
reasonably well in various tasks ranging from tagging entities and events in the
molecular/cellular domain [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to tagging disease-related entities in the clinical domain
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For diseases, the system is designed to tag the maximal compatible span;
thus "acute renal insufficiency" would be tagged as a single entity. As a result, many
of the attributes required for the current task are already pre-tagged inside the
extended entity (location = 'renal', severity = 'acute'). However, severity/course
attributes are in some cases ('His condition resolved') indicated by a verb rather
than an adjective, and we use the event-processing capability of the system to
tag these cases as well. Further, proximity is used to detect anatomical sites that
are distal from the disease mention. Finally, as clinical notes are often
syntactically opaque [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we use a NegEx-based strategy for detecting attributes such as
negation and conditionality.
      </p>
    </sec>
    <sec id="sec-2">
      <title>System description</title>
      <p>
        We previously used the Cocoa/Peaberry system to detect diseases, signs and symptoms
for the 2013 ShARe/CLEF eHealth task 1 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The system is composed of the following
modules running in succession: (a) sentence splitter, (b) acronym detector, (c)
POS module, (d) word-level entity tagger, (e) multi-word entity detector, (f)
coordination module, (g) shallow parser, (h) predicate-argument detector and,
finally, (i) intra-sentential discourse detection for resolution of argument sharing
across predicates. The system detects entities from a range of semantic classes,
including proteins, chemicals and clinical procedures in addition to those primary
to this task, namely anatomical parts and diseases.
      </p>
      <p>Briefly, the system marks up anatomical parts ('liver') and disease
headwords ('cancer') separately, and then merges them to get the extended disease
entity. Body parts tagged by the system are composed of maximal spans, such
that words corresponding to location ('left upper') or laterality are merged into
the body part entity. For disease entities, words describing severity ('acute'),
frequency ('recurrent', 'regular'), and state ('unresectable', 'disseminated') are also
merged if they are located proximally. When anatomical parts are in
coordination, and the last occurrence abuts a disease headword ('liver and breast cancer'),
all the coordinated anatomical parts are marked up as disease entities. Finally,
unusually named diseases or symptoms such as 'premature cardiac complex' and
'long QT syndrome' are also handled. The final tagged disease entity is thus of
maximal span, in that all adjacent words in any way pertaining to a disease are merged
into the entity.</p>
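      <p>As an illustration only, the coordination step can be sketched in Python as below. This is not the Cocoa/Peaberry implementation: the toy lexicons, the coordinator list and the function name expand_coordinated_diseases are assumptions introduced for the example, and input is assumed to be pre-tokenized.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Toy lexicons, hypothetical and far smaller than any real resource.
ANATOMY = {"liver", "breast", "lung"}
DISEASE_HEADS = {"cancer", "carcinoma"}
COORDINATORS = {"and", "or", ","}


def expand_coordinated_diseases(tokens):
    """Pair every coordinated anatomical part with the disease headword it shares."""
    entities = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in DISEASE_HEADS:
            continue
        # The last anatomical part must directly abut the headword.
        if i == 0 or tokens[i - 1].lower() not in ANATOMY:
            continue
        # Walk left over anatomy words and coordinators, collecting the parts.
        parts = []
        j = i - 1
        while j >= 0 and tokens[j].lower() in ANATOMY | COORDINATORS:
            if tokens[j].lower() in ANATOMY:
                parts.append(tokens[j])
            j -= 1
        entities.extend(f"{part} {tok}" for part in reversed(parts))
    return entities
```
]]></preformat>
      <p>Under these assumptions, the tokens of 'liver and breast cancer' yield the two entities 'liver cancer' and 'breast cancer', while a headword with no abutting anatomical part yields nothing.</p>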
      <p>Many disease entities have disjoint spans, especially when they correspond
to pathological changes in body parts (`enlargement of the left ventricle') or to a
source (`drainage from wound'). In a predicate-argument formalism, one of the
disjoint spans appears as an argument of the other span, where the predicate
appears either as a verb or in a nominal form. The system has an event detection
module to detect such cases.</p>
      <p>For the Severity attribute, we used the training set to make a list of trigger
words that correspond to the three severity classes, namely 'slight', 'moderate'
and 'severe'. The disease entity was marked up correspondingly for severity if the
extended disease span contained any of these trigger words. Similarly, embedded
trigger words associated with the various values for the Course attribute were
obtained from the training set. Examples are 'progressive' for 'worsened' and
'healed' for 'resolved'. However, the majority of the Course attribute data derived
from verbal markers for the disease entity. An example for the value 'improved'
is the fragment 'mental status changes that responded well to Haldol', which
follows the template 'Disease responds to Chemical'. We use the event detection
module to mark up the Course attribute for such cases with about 15
trigger-word-driven event templates (verbs or nominals). However, we ignored the value
'changed' for the Course attribute as it caused too many false positives, and its
occurrence in the training set was small (less than 2%).</p>
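      <p>A minimal Python sketch of the two mechanisms just described follows: embedded trigger words inside the extended disease span, and one verbal template applied to the text after the span. It is an approximation under stated assumptions, not the system's event module. The mapping of 'mild' to 'slight', the single regular-expression template and the function name tag_severity_and_course are all hypothetical; only the 'progressive' and 'healed' triggers come from the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Trigger word -> attribute value. 'mild' -> 'slight' is an assumed entry.
SEVERITY_TRIGGERS = {"slight": "slight", "mild": "slight",
                     "moderate": "moderate", "severe": "severe"}
COURSE_TRIGGERS = {"progressive": "worsened", "healed": "resolved"}

# One hypothetical stand-in for the 'Disease responds to Chemical' template.
COURSE_TEMPLATES = [
    (re.compile(r"\brespond(?:s|ed)?(?: well)? to\b"), "improved"),
]


def tag_severity_and_course(disease_span, following_text=""):
    """Return (severity, course); None means the attribute stays at its default."""
    words = re.findall(r"[a-z]+", disease_span.lower())
    severity = next((SEVERITY_TRIGGERS[w] for w in words if w in SEVERITY_TRIGGERS), None)
    course = next((COURSE_TRIGGERS[w] for w in words if w in COURSE_TRIGGERS), None)
    if course is None:
        # Fall back to verbal markers found after the disease mention.
        for pattern, value in COURSE_TEMPLATES:
            if pattern.search(following_text.lower()):
                course = value
                break
    return severity, course
```
]]></preformat>
      <p>A regular expression over the following text is a much cruder device than a predicate-argument analysis; it is used here only to keep the sketch self-contained.</p>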
      <p>As the system marks up anatomical parts before merging them with disease
headwords, embedded body locations are automatically output by the system.
Additionally, the event module also marks up cases where the disease entity is
linked to an anatomical part or location through an intervening preposition,
as in 'bleeding in your esophagus' or 'loculated effusion seen on the left side'.
However, there exist a large number of examples where the anatomical location
does not occur in a sentential context, but is implied by the discourse, as when
it heads the utterance with a following colon or a hyphen, as in 'Abdomen: no
masses'. Thus, for diseases which do not have an embedded or prepositionally
proximal body part, we look for occurrences of anatomical locations within 100
characters of the disease entity span. We constructed about 50 rules matching
the anatomical part with the disease; e.g., 'murmur' or 'gallop' match cardiac
entities such as 'CV'; 'atheroma' or 'extravasation' match 'artery'; 'clubbing'
and 'edema' match 'extremities'; and so on.</p>
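      <p>The proximity fallback can be sketched in Python as follows. The 100-character window and the example pairings come from the text; everything else is an assumption made for illustration, including the tiny anatomy pattern, the extra 'heart' entry, the choice to accept any anatomical mention when no compatibility rule applies, and the function name nearest_body_location.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

MAX_DISTANCE = 100  # characters, as stated in the text

# Hypothetical excerpt of the compatibility rules: disease word -> allowed anatomy.
COMPATIBLE = {
    "murmur": {"cv", "heart"},
    "gallop": {"cv", "heart"},
    "atheroma": {"artery"},
    "edema": {"extremities"},
    "clubbing": {"extremities"},
}
# Toy anatomy lexicon standing in for the system's body-part tagger.
ANATOMY_PATTERN = re.compile(r"\b(cv|heart|extremities|abdomen|artery)\b", re.IGNORECASE)


def nearest_body_location(text, disease_start, disease_end):
    """Return the closest compatible anatomy mention within MAX_DISTANCE, else None."""
    disease = text[disease_start:disease_end].lower()
    allowed = next((v for k, v in COMPATIBLE.items() if k in disease), None)
    best = None
    for m in ANATOMY_PATTERN.finditer(text):
        if allowed is not None and m.group(1).lower() not in allowed:
            continue
        # Character gap between the two spans; 0 if they touch or overlap.
        gap = max(m.start() - disease_end, disease_start - m.end(), 0)
        if gap <= MAX_DISTANCE and (best is None or gap < best[0]):
            best = (gap, m.group(1))
    return best[1] if best else None
```
]]></preformat>
      <p>In a line such as 'CV: regular rate, no murmur. Extremities: no edema.', this sketch links 'murmur' to 'CV' and 'edema' to 'Extremities', because the compatibility rules filter out the other candidate in each case.</p>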
      <p>
        We mapped anatomical entities to UMLS IDs through a collection of
morphological transformations which converted the entity as it occurred in the text
to a regular expression. The framework for this module is the same as the one we
have used in other shared tasks to map diseases to their UMLS or MeSH IDs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ];
both involve substitutions such as modifying 'facial' to the regex 'fac(e|ial)' and
'ventricular' to 'ventric(le|ular)', apart from generic suffix changes such
as 'ive' to '(ive|ion)'. Altogether, we have 120 rules for such morphological
transformations. The regular expression thus constructed was 'grep'ped against the
descriptive phrases for entities in the anatomical subsections of UMLS as given in
the task description. Where there were multiple matches, the match with the
lowest UMLS ID was chosen, as this was empirically found to best model the training
data. An additional set of priority rules was used at the end to reflect certain
preferences of the annotators; for example, the UMLS entity "C0278454|All
extremities|extremities" was preferred to "C0015385|Limb structure|Extremities"
when matching the term "extremities", while for the term 'organ', the UMLS
term 'C0229983|Body organ structure' was preferred to 'C0178784|Organ'. There
were about 130 priority rules for such preferences.
      </p>
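      <p>The mapping step can be illustrated with the short Python sketch below. It assumes pipe-delimited rows of the form 'ID|name|name', uses whole-phrase matching where the text describes a grep, and includes only the three suffix rules quoted above. The row with ID C9999999 is a made-up placeholder, and the names to_pattern and lookup are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Suffix -> regex alternatives, from the examples in the text (the system has about 120).
MORPH_RULES = [
    ("ial", "(e|ial)"),      # facial      -> fac(e|ial)
    ("ular", "(le|ular)"),   # ventricular -> ventric(le|ular)
    ("ive", "(ive|ion)"),
]


def to_pattern(term):
    """Turn a surface form into a regex that also matches its morphological variants."""
    term = term.lower()
    for suffix, alternatives in MORPH_RULES:
        if term.endswith(suffix):
            stem = term[: -len(suffix)]
            return re.compile(re.escape(stem) + alternatives, re.IGNORECASE)
    return re.compile(re.escape(term), re.IGNORECASE)


def lookup(term, rows, priority):
    """Pick a UMLS ID: a priority rule wins, otherwise the lowest matching ID."""
    pattern = to_pattern(term)
    hits = []
    for row in rows:
        cui, *names = row.split("|")
        if any(pattern.fullmatch(name) for name in names):
            hits.append(cui)
    preferred = priority.get(term.lower())
    if preferred in hits:
        return preferred
    # IDs share one fixed-width format, so string order equals numeric order.
    return min(hits) if hits else None


ROWS = [
    "C0278454|All extremities|extremities",
    "C0015385|Limb structure|Extremities",
    "C9999999|Face|face",  # placeholder ID, not a real UMLS entry
]
PRIORITY = {"extremities": "C0278454"}
```
]]></preformat>
      <p>With these rows, 'extremities' resolves to C0015385 by the lowest-ID rule alone and to C0278454 once the priority rule is supplied, which mirrors the annotator preference described above.</p>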
      <p>
        The Negation, Speculation and Conditional attributes were handled by
using the NegEx algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with additional trigger words derived from the
training data. A few modifications were also added, such as letting the words
'mild' and 'moderate' limit the scope of a 'no' negation to their left. In addition, the
event detection module marks negation for any event when the verb/action is
conjoined to the appropriate marker ('not seen'). Beyond these few changes, the
NegEx algorithm was used as-is.
      </p>
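      <p>The scope modification can be illustrated with a heavily simplified, NegEx-style check in Python. This is not the NegEx algorithm itself nor the system's extension of it: the trigger list, the terminator list, the five-token window and the function name is_negated are assumptions chosen to show how 'mild' and 'moderate' can end the scope of a preceding 'no'.</p>
      <preformat preformat-type="code"><![CDATA[
```python
NEGATION_TRIGGERS = {"no", "denies", "without"}
# 'mild' and 'moderate' act as scope terminators, as described in the text;
# 'but' and 'however' are assumed conventional terminators.
SCOPE_TERMINATORS = {"but", "however", "mild", "moderate"}
WINDOW = 5  # tokens searched to the left of the target; an assumed value


def is_negated(tokens, target_index):
    """True if a trigger precedes the target with no terminator in between."""
    start = max(0, target_index - WINDOW)
    for i in range(target_index - 1, start - 1, -1):
        word = tokens[i].lower().strip(".,;:")
        if word in SCOPE_TERMINATORS:
            return False  # the negation scope ended before reaching the target
        if word in NEGATION_TRIGGERS:
            return True
    return False
```
]]></preformat>
      <p>For the tokens of 'no stenosis , mild regurgitation', the sketch marks 'stenosis' as negated but not 'regurgitation', since 'mild' cuts off the scope of 'no'.</p>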
      <p>
        We did not address the other attributes. For the Subject and Generic
attributes, the data was somewhat sparse, and we decided to leave the default
value in place. Detecting time attributes is well known to be a difficult task [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
and given our own lack of time, we chose to simply insert the most frequent
value in the training set for these attributes, namely `none' for the Temporal
Expression attribute and 'overlap' for the DocTime attribute.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We tested and refined the performance of the system against the training set.
We then submitted two runs against the test set, which differed only in that the second
run split a word into two tokens if the word contained a slash ('/') character.
The two runs did not differ in overall accuracy (0.843),
and for each individual attribute we show the better result of the two runs
in the column titled 'Test set' in Table 1 below. The column titled
'Best' is the best result over all systems for each attribute, while the 'Baseline'
column shows the result if the gold template had been returned unaltered, i.e.,
the accuracy with the default value for each attribute. These 'Baseline' figures
were taken from the accuracy results for systems which had an F-score of 0.0 for
that attribute.</p>
      <p>
        One point of note is that we did not directly use the gold annotations
supplied by the task organizers (with one exception, described below). Instead, we
used the system to detect the disease entities itself while simultaneously marking
up the disease attributes. Then, for every disease span in the gold annotations,
we found the first disease span in our own annotations that overlapped with the
gold span, using the overlap algorithm in the 2013 ShARe/CLEF eHealth Task
1 evaluation script [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. If there was no overlap, we left the attribute template
unaltered. If there was an overlap, we copied the attributes from the
system-detected disease entity to the gold entity that it overlapped with. However, for
the anatomical part attribute alone, we used the algorithms described in the
System description section to find any embedded body part or, failing that, the nearest body
location compatible with the gold-tagged disease entity.
      </p>
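      <p>The transfer of attributes from system-detected spans to gold spans can be sketched in Python as below. The overlap test here is a plain character-offset check on half-open spans and is only assumed to approximate the one in the Task 1 evaluation script; the default template values and the names overlaps and transfer_attributes are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def overlaps(a, b):
    """Half-open (start, end) character spans overlap if they share a character."""
    return a[0] < b[1] and b[0] < a[1]


def transfer_attributes(gold_spans, system_entities, default_template):
    """Copy attributes from the first overlapping system entity onto each gold span.

    system_entities is a list of (span, attributes) pairs in document order.
    Gold spans with no overlapping system entity keep the default template.
    """
    results = {}
    for gold in gold_spans:
        template = dict(default_template)  # copy, so defaults stay untouched
        for span, attrs in system_entities:
            if overlaps(gold, span):
                template.update(attrs)
                break  # only the first overlapping system span is used
        results[gold] = template
    return results
```
]]></preformat>
      <p>Because only the first overlapping span is consulted, a gold span covered by two system entities inherits attributes from the earlier one alone, which matches the procedure described above.</p>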
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>We extended Cocoa/Peaberry, an existing multi-class entity tagger for the
biomedical domain, to detect attributes of disease entities in clinical records. With fairly
minor improvements, the system came second in overall accuracy and performed
reasonably well in most attributes.</p>
      <p>Except for the Body Location attribute, we did not directly use the gold
annotated disease entities, instead using the system itself to detect and tag the
disease entities and subsequently (and finally) transferring their attributes to
overlapping gold annotated entities. We believe therefore that our results on
the test set are likely to be close to results against unannotated data, where
disease entities are not tagged beforehand, and our results are encouraging in
this regard. However, improvement against the baseline is not very high for
Cocoa for many attributes, as reflected in the F-scores (not shown), indicating
that system performance could benefit from further improvement generally, but
particularly for some attributes such as Body Location.</p>
      <p>Acknowledgments. We thank Shereen Broido for discussions. The ShARe/CLEF
eHealth shared task was made possible by a grant to the task organizers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bridewell</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>G.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buchanan</surname>
            ,
            <given-names>B.G.</given-names>
          </string-name>
          ,
          <article-title>A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <year>2001</year>
          .
          <volume>34</volume>
          : p.
          <fpage>301</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Elhadad</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapman</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Gorman</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmer</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savova</surname>
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The ShARe Schema for the Syntactic and Semantic Annotation of Clinical Texts</article-title>
          . Under Review.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leroy</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schreck</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mowery</surname>
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velupillai</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapman</surname>
            <given-names>W.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palotti</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Overview of the ShARe/CLEF eHealth Evaluation Lab 2014</source>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Marsh</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sager</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Analysis and processing of compact texts</article-title>
          .
          <source>COLING 82: Proceedings of the Ninth International Conference on Computational Linguistics</source>
          . North-Holland,
          <year>1982</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pestian</surname>
            <given-names>J. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brew</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matykiewicz</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovermale</surname>
            <given-names>D. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            <given-names>K. B.</given-names>
          </string-name>
          :
          <article-title>A Shared Task Involving Multi-label Classification of Clinical Free Text</article-title>
          .
          <source>Association for Computational Linguistics (ACL)</source>
          ,
          <year>2007</year>
          :
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pradhan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>South</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christensen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vogel</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapman</surname>
            <given-names>W. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savova</surname>
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Task 1: ShARe/CLEF eHealth Evaluation Lab 2013</article-title>
          .
          <source>Proceedings of ShARe/CLEF eHealth Evaluation Labs</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>S. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senthil Nathan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Performance of a multi-class biomedical tagger on the BioCreative IV CTD task</article-title>
          .
          <source>Proceedings of the Fourth BioCreative Challenge Evaluation Workshop</source>
          , vol.
          <volume>1</volume>
          . Bethesda, MD,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>S. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senthil Nathan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Adapting Cocoa, a multi-class entity detector for the CHEMDNER task of BioCreative IV</article-title>
          .
          <source>Proceedings of the Fourth BioCreative Challenge Evaluation Workshop</source>
          , vol.
          <volume>2</volume>
          . Bethesda, MD,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>S. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senthil Nathan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Performance and limitations of the linguistically motivated Cocoa/Peaberry system in a broad biomedical domain</article-title>
          .
          <source>Proceedings of the BioNLP Shared Task 2013 Workshop</source>
          . ACL, Sofia,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>S. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Broido</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senthil Nathan</surname>
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Performance of a multi-class biomedical tagger on clinical records</article-title>
          .
          <source>Proceedings of ShARe/CLEF eHealth Evaluation Labs</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Evaluating temporal relations in clinical text: 2012 i2b2 Challenge</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          <year>2013</year>
          ;
          <volume>0</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Suominen</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salantera</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velupillai</surname>
            <given-names>S.</given-names>
          </string-name>
          et al.:
          <source>Three Shared Tasks on Clinical Natural Language Processing. Proceedings of CLEF 2013</source>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Uzuner</surname>
            <given-names>O.</given-names>
          </string-name>
          :
          <year>2011</year>
          i2b2/
          <article-title>VA co-reference annotation guidelines for the clinical domain</article-title>
          . Available from: https://www.i2b2.org/NLP/Coreference/assets/CoreferenceGuidelines.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>