TeamUEvora at CLEF eHealth 2014 Task2a

       João Sequeira, Nuno Miranda, Teresa Gonçalves, Paulo Quaresma

          Computer Science Department, School of Science and Technology
                       University of Évora, Évora, Portugal
                   {jsequeira,nmiranda,tcg,pq}@uevora.pt


       Abstract. We present our first participation in a ShARe/CLEF eHealth
       Lab contributing for task 2a. Task 2 is an extension of the 2013 lab
       task 1 and consists of information extraction from clinical texts for
       Disease/Disorder Template Filling; task 2a aims at predicting each at-
       tribute’s normalization value.
       This work constitutes a preliminary approach to the problem of extract-
       ing and handling information from clinical texts. More than getting a
       good result, our priority was to get a first hint on the questions and
       problems that are posed within this area.
       For that, we developed a system that combines information from cTAKES
       output and the training corpus. The performance was measured using ac-
       curacy. Our system ranked 7th with an accuracy of 0.802, a F1 of 0.214,
       a precision of 0.217 and a recall value of 0.211.

       Keywords: Clinical texts, Template filling, Text normalization, cTAKES,
       Medical Informatics


1     Introduction

The ShARe/CLEF eHealth Lab 20141 [1,2] task 2 is an extension of the task 1
of the same lab from 2013 [3] and consists of information extraction from clinical
texts with the goal of disease/disorder template filling. For each disease/disorder
present in each clinical report there is a template with ten different attributes
and participants have to predict the value for each attribute. There are two
subtasks: 2a) assign normalization values to the ten attributes; 2b) assign cue
values to the nine attributes with cues.
    This is our first participation in a ShARe/CLEF eHealth Lab and we con-
tributed to subtask 2a, building a system that uses previous implemented tech-
nologies. Being this the first time we work with medical information, our main
priority is to understand the problems associated with the extraction of infor-
mation in the area. In this paper we present the system architecture and the
decisions made; we also present and analyse the experimental results on the
training and test corpora.
1
    https://sites.google.com/a/dcu.ie/clefehealth2014/


                                        156
    The paper has the following structure: Section 2 introduces the task, the
training and test corpora in detail and Section 3 presents the implemented sys-
tem. The results are discussed in Section 4 and conclusions and a glimpse of
future work are presented in Section 5.


2     Task

As said in Section 1, task 2 is an extension of the 2013 task 1 lab aiming at
filling templates with attributes values and cues.
     Files with empty templates for each disease/disorder (mentioned in the cor-
responding clinical text) were provided to the participants. Each template in-
dicates the Unified Medical Language System Concept Unique Identifier (CUI),
mention boundaries and the ten attributes needed to be filled. Each attribute
has two slot types: the normalized value and the lexical cue from the sentence
where the normalized value occurred. Task 2a evaluates the systems’ ability to
predict the normalized value for each attribute and task 2b the ability to find
the right cue slot value for each attribute.
     Since we participated only on task 2a (that was mandatory), our templates
have default values in all the cue slots. Table 1 presents template information: a
header with the file name, the cue slot of the disease/disorder and its CUI, the
nine modifiers associated with the disease/disorder with normalized values (task
2a) and cue slots (task 2b) plus the DocTime modifier that only has a normalized
value.


2.1   Description of the training and test corpora

The train and test corpora provided are composed of clinical texts from four
different types: discharge summary, ECG report, ECHO report and radiology
report. Their distribution in each corpus is presented in Table 2.
    Analysing both corpora we can observe some differences. In the training cor-
pus the Discharge summary type has 45.82% of documents while the remaining
classes have an equal number, 18.06%; in the test corpus there are only Discharge
summary documents.


3     System Architecture

This section presents the implementation of our system and the approaches taken
to tackle the modifiers.


3.1   cTAKES

As said before, our system uses previous implemented technologies for clinical
texts analysis and information extraction (this method was also used in task
1 [6,7,8,9,10,11,12,13,14] of 2013 ShARe/CLEF eHealth Lab).


                                       157
    Table 1. Template representation with the default values identified by (*).


                                     Header
                                    File name
                                     Cue slot
                          Concept Unique Identifier (CUI)
                                    Modifiers
        Attribute                  2a) Normalized values            2b) Cue slot
Negation indicator (NI)                      yes/no*                 if value is yes
                             patient*, family member, other, null,     if different
Subject class (SC)
                              donor family member, donor other          of patient
Uncertainty indicator (UI)                   yes/no*                 if value is yes
                           unmarked*, changed, increased, decreased, if different
Course class (CC)
                                 improved, worsened, resolved         of unmarked
                                      unmarked*, severe,               if different
Severity class (SV)
                                        slight, moderate              of unmarked
Conditional class (CO)                     true/false*              if value is true
Generic class (GC)                         true/false*              if value is true
                                          NULL*, CUI,                  if different
Body location (BL)
                                            CUI-less                    of NULL
                                   unknown*, before, after,                 no
DocTime class (DT)
                                    overlap, before-overlap                slot
                                       none*, date, time,              if different
Temporal Expression (TE)
                                          duration, set                  of none


Table 2. Number and percentage of documents of each type in the train corpus and
test corpus.

                                   Train             Test
                   Type
                              no. docs    %    no. docs     %
            Discharge summary   137      45.82   133      100.00
            ECG report           54      18.06     0       0.00
            ECHO report          54      18.06     0       0.00
            Radiology report     54      18.06     0       0.00
            Total               299     100.00   133      100.00


                                       158
    We used the output of the clinical Text Analysis and Knowledge Extraction
System (cTAKES) [4] (version 3.1.1). cTAKES2 is a open source linguistic tool
kit from the Apache Software Foundation. Some operations done by cTAKES
include:
 – boundary detection;
 – tokenization;
 – morphological normalization;
 – POS tagging;
 – shallow parsing;
 – negation detection;
 – named entities detection with mapping to UMLS terms;
 – relations detection

3.2    Modifiers
Negation and Uncertainty Indicators, Subject and Conditional Classes
and Body Location. For the modifiers NI, SC, UI, CO and BL we extracted the
information from the cTAKES output. Among the attributes related with the
diseases/disorders identified by cTAKES we found information that could be
directly used for some of the modifiers: we used the polarity attribute from
cTAKES to identify if the diseases/disorders were negated and assigning a value
to NI; for the SC, UI and CO modifiers, cTAKES have attributes with the same
name and we only needed to convert that information into the normalized values
of the task modifiers.
    For the BL modifier we used a set of rules to know if there were body loca-
tions in the same sentence of the identified disease/disorders and extracted the
respective CUI. We tried to extract the CUI of the most specific body location
possible, so we searched the expression with a bigger number of words, using the
premise that more information means more specificity.

Course Class and Severity Class. For the CC and SV modifiers we used a
mapping approach. From the 299 clinical texts that compose the training corpus,
we extracted expressions (without repetition) related to each modifier value.
    When using expressions from a mapping approach, there is the risk of identi-
fying equal expressions from the text but not in the correct context. To determine
if the modifiers CC and SV had this problem we checked the expressions in each
mappings file and concluded that the expressions were not too common and the
probability of identifying wrong expressions was acceptable for our objectives.

Generic Class. The GC modifier had a particular characteristic – there was no
example of it in the training corpus; assuming that the test corpus would follow
this, few to none appearances of this modifier expressions would appear. Based
on this assumption we used the default value (false) in every template.
2
    http://ctakes.apache.org/


                                       159
DocTime. The DT modifier expresses the temporal relation between the dis-
ease/disorder and the time when the clinical text was written. It can have the
following values:
 – Before-overlaps: disease/disorder identified in the past and still present;
 – Before: disease/disorder identified and treated in the past;
 – Overlap: disease/disorder present but there is no information about when
   it was diagnosed or when it will pass;
 – After: one action or event that it is still to come;
 – Unknown: no temporal relation information.
   For this modifier we used a purely statistic approach, meaning that, for each
template we selected the most common value presented in the training corpus –
Overlap.
   Table 3 presents occurrence percentage for training corpus for each possible
DT value; it can be noticed that more than half of the occurrences (56.35%) has
the Overlap value, so this one was chosen to fill all the templates. The Before
value had also an expressive number, but Overlap more than doubles it.

           Table 3. DocTime values distribution in the training corpus.

                       Value       no. occurrences        %
                   Before-overlaps       2814           16.41
                   Before                4205           24.52
                   Overlap               9666           56.35
                   After                  442            2.58
                   Unknow                 25             0.14
                   Total                17152           100.00


Temporal Expressions. To identify dates and hours we used regular expres-
sions. At first we thought of using a mapping approach too, but dates and hours
are very specific and if an expression appear in the same format but with one
day apart, that expression wouldn’t be identified.
    Based in the training corpus, we created four regular expressions aiming to
identify DATE and two regular expressions to identify Time:
 – DATE
    • Day/Month/Year (dd/mm/yyyy);
    • Day-Month-Year (dd-mm-yyyy);
    • Year-Month-Day (yyyy-mm-dd);
    • Month-Year (mm-yy).
 – TIME
    • 24 hours time (hh:mm);
    • 12 hours time (hh:mm am/pm)
   We didn’t consider the identification of expressions associated with the re-
maining values of the modifier – duration and set.


                                      160
3.3     Implementation

Our system was implemented using the Java programming language. Figure 1
presents the system’s architecture – it uses mapping files, regular expressions,
decisions based on the training corpus and cTAKES.
    XML files were generated from cTAKES, and from them we extracted infor-
mation using a parser and applied the procedures described in the last subsec-
tion. With the obtained information, the system updated the modifiers’ values
and printed the templates with the final result.


                              Fig. 1. System description


      Next we explain the steps necessary to get the filled templates:

1. run cTAKES with the clinical texts as input;
2. load information from templates, namely the header (because the rest are
   the default values), and the map files built for CC and SV;
3. process the XML from cTAKES using a set of rules to extract information;
4. use the information previously gathered to substitute the default values from
   the templates.


Step 1. The first step can be also called a pre-processing one – the generation
of the XML files using cTAKES. It generates a XML file for each clinical text.


                                        161
   cTAKES has a large set of specific analysis engines and a set of aggregate ones
that combine the specific ones. These aggregate engines describe how particular
annotators can be combined using a set of rules that describe how each annotator
uses the analysis of the previous one.
   Several aggregate engines were tested and the one that offered the best results
(and was used for the participation run) was AggregatePlaintextUMLSProcessor.


Step 2. On startup, the system loads the mapping files of CC and SV modifiers
obtained from the training corpus. It also loads the templates information into
a data structure that the system can use during all process.


Step 3. After steps 1 and 2, the system processes the XML files. We used
xPath expressions to extract the information considered necessary to task 2a;
this information was stored in data structures suited for being subsequently
processed. The information is extracted using two approaches:

 – the ’strict’ one, where the system searches diseases/disorders with a perfect
   match the information gathered from cTAKES;
 – the ’relaxed’ one, that is used in case the ’strict’ fails. This one, although less
   accurate, verifies if the boundaries of the disease/disorder from the template
   header are inside the ones of the chunk identified by cTAKES.

   The CUI of the body locations associated to the disease/disorder is obtained
using a set of rules that joins information from the different data structures
mantained. In order to reach the most specific CUI, the system chooses the
longest body location term from the cTAKES output.


Step 4. The final step gathers all information from the previous steps, relying
mainly in the coordinates of the diseases/disorders in text.
    To extract the modifiers information, the system searches the sentences where
the diseases/disorders were identified, looks for the cTAKES gathered informa-
tion, replaces the info in the respective template, searches for terms in the map-
ping, applies the regular expressions and writes the found info in the template.
Finally it writes the info for the DT and GC modifiers (that is equal for all tem-
plates).


4   Results

Table 4 presents the accuracy obtained by our system for the train and test
corpora, and also the best accuracy obtained for each modifier in the task 2a.
    Analysing the table we see that the overall accuracy between the train and
test corpora have a difference less than 0.03. For most of the modifiers the
accuracy between the train and the test corpora don’t differ more than 0.02, but
in some of them the test corpus’s accuracy is better: BL has an improvement of


                                        162
Table 4. System’s accuracy for the train and test corpora and the best accuracy
reported on task 2a for each modifier.

                       modifier     train     test     best
                         NI         0.916     0.901   0.969
                         SC         0.991     0.987   0.995
                         UI         0.932     0.955   0.960
                         CC         0.866     0.859   0.971
                         SV         0.915     0.919   0.982
                         CO         0.978     0.975   0.978
                         GC         1.000     1.000   1.000
                         BL         0.469     0.540   0.797
                         DT          0.59     0.024   0.328
                         TE         0.715     0.857   0.864
                       Overall      0.837     0.802   0.884


0.071 and TE an improvement of 0.142. For DT modifier, the training presents a
better result with an improvement of 0.57 over the test corpus.
    Comparing the test corpus results with the best accuracy reported in task
2a we notice that in some modifiers like SC, UI, CO and TE the difference is lower
than 0.2 and the values for class GC are equal; for modifiers BL, DT and CC there
is a bigger discrepancy between the results. Nevertheless, in overall our system
stood behind 0.082 when compared with the overall value calculated.
    Table 5 presents the F1 , precision and recall values for both the train and
test corpora. There we can see that the values are not so different between the
train and test corpora among most of the modifiers. Modifiers like SC, UI, CO, BL
and TE have better results in the test corpus; on the other side NI, CC, SV and
DT modifiers have better results in the training corpus.


       Table 5. F1 , precision and recall for training corpus and test corpus.

                             train                        test
         modifier
                     F1   precision     recall    F1   precision     recall
           NI       0.744   0.914       0.628    0.723   0.862       0.622
           SC       0.495   0.408       0.631    0.556   0.532       0.581
           UI       0.409   0.886       0.266    0.451   0.813       0.312
          CC        0.385   0.257       0.771    0.264   0.165       0.661
           SV       0.670   0.546       0.868    0.547   0.400       0.866
          CO        0.723   0.942       0.587    0.760   0.955       0.631
          GC          0       0           0        0       0           0
           BL       0.232   0.255       0.213    0.253   0.265       0.243
          DT        0.592   0.590       0.593    0.024   0.024       0.024
          TE        0.155   0.581       0.089    0.233   0.425       0.161
         Overall    0.479   0.513       0.448    0.214   0.217       0.211


                                        163
    The DT modifier obtained widely different values with a F1 of 0.592 in the
train and a corresponding value of 0.024 in the test corpus. This can be explained
because the value of this modifier is always the same for every template of
the output; this decision was based on the modifier statistics from the training
corpus.
    We ranked seventh among all the participants of task 2a, as showed in Table
6. The best system had an overall accuracy of 0.868 and our system obtained an
overall accuracy of 0.802. This value is lower than the average accuracy value of
all participants. Our system also obtained values below the average in the F1 ,
precision and recall.


                    Table 6. Relative performance for task 2a.

                 system       accuracy         F1     precision   recall
          TeamUEvora (rank 7)   0.802         0.214     0.217     0.211
          Best system           0.868         0.499     0.485     0.514
          Average               0.814         0.273     0.308     0.269


    Table 7 shows the relative performance of full template accuracy of our sys-
tem, the best value obtained and the average of all participants. The best value
is below 0.2 and our system obtained a very low value of 0.007.


        Table 7. Relative performance for task 2a of full template accuracy.

                               system        accuracy
                        TeamUEvora (rank 11)   0.007
                        Best System            0.196
                        Average                0.056


5     Conclusions and Future work

This paper presents the design and the implementation of our system, devel-
oped for participating in the task 2a of 2014 ShARe/CLEF eHealth Lab. The
task’s main goal was to obtain normalized attributes values for disease/disorder
template filling.


5.1   Conclusions

Our participation’s main goal was to understand the problems associated with
the design and implementation of a system to extract information from medical


                                       164
data. The system gathers knowledge from already implemented technology in
the clinical area, namely cTAKES; it also uses resources based on the training
corpus, regular expressions and decisions based on modifiers statistics.
    Between 14 participants, it ranked 7th, with an accuracy value of 0.802.
Taking into account our goal, we consider this a good result; nevertheless there
is much space for improvement.


5.2   Future work

cTakes is one of the resources of our system and we intend to add more sources
of knowledge of the medical area so we can improve our system. One hypothesis
is MetaMap[5], widely used in task 1 of 2013 Lab. Last year, some participants
used only cTAKES [6,8], others used only MetaMap [7,9,10,11,12] and others
used a joint approach [13,14].
    On the other hand, we intend to complement or substitute the approach
taken to some modifiers:

 – for Course and Severity we want to try a machine learning approach;
 – for temporal expressions, we want to improve the system by also identifying
   duration and set expressions. For that we intend to use technologies in the
   area of clinical time identification;
 – for DocTime we intend to incorporate knowledge in order to give different
   values to different examples (instead of using the same value for all of them):
 – for Generic modifier, we aim to develop a more automatic way to detect
   this class. Nevertheless, to do that we need some examples of this modifier
   in the training corpus.


References
1. Kelly, L., Goeuriot, L., Leroy, G., Suominen, H., Schreck, T., Mowery D. L., Velupil-
   lai, S., Chapman, W. W., Zuccon, G., Palotti, J.: Overview of the ShARe/CLEF
   eHealth Evaluation Lab 2014. Springer-Verlag.(2014)
2. Elhadad, N., Chapman, W., O’Gorman, T., Palmer, M., Savova, G.: The ShARe
   Schema for the Syntactic and Semantic Annotation of Clinical Texts. (2014). (Under
   Review).
3. Suominen, H., Salanterä, S., Velupillai, S., Chapman, W. W., Savova, G., Elhadad,
   N., Pradhan, S., South, B. R., Mowery, D. L., Jones, G. J., Leveling, J., Kelly,
   L., Goeuriot, L., Martinez, D., Zuccon, G.: Overview of the ShARe/CLEF eHealth
   Evaluation Lab 2013. CLEF 2013, Valencia, Spain: Springer Berlin Heidelberg. In:
   Proceedings of ShARe/CLEF eHealth Evaluation Labs (2013).
4. Savova, G.K., Masanz, J.J., Ogren, P.V., Zheng, J., Sohn, S., Kipper-Schuler, K.C.,
   Chute, C.G.: Mayo clinical text analysis and knowledge extraction system (ctakes):
   architecture, component evaluation and applications. In: Journal of the American
   Medical Informatics Association 17 (2010) 507-513.
5. Aronson, A.R., Lang, F.M.: An overview of MetaMap: historical perspective and
   recent advances. JAMIA 17(3) (2010) 229-236.


                                         165
6. Cogley, J., Stokes, N., Carthy, J.: Medical Disorder Recognition with Structural
   Support Vector Machines. In: Proceedings of ShARe/CLEF eHealth Evaluation
   Labs (2013).
7. Leaman, R., Khare, R., Lu, Z.: NCBI at 2013 ShARe/CLEF eHealth Shared
   Task: Disorder Normalization in Clinical Notes with DNorm. In: Proceedings of
   ShARe/CLEF eHealth Evaluation Labs (2013).
8. Gung, J.: Using Relations for Identication and Normalization of Disorders: Team
   CLEAR in the ShARe/CLEF 2013 eHealth Evaluation Lab. In: Proceedings of
   ShARe/CLEF eHealth Evaluation Labs (2013).
9. Hervás, L., Martı́nez, V., Sánchez, I., Dı́az, A.: UCM at CLEF eHealth 2013 Shared
   Task1. In: Proceedings of ShARe/CLEF eHealth Evaluation Labs (2013).
10. Osborne, J. D., Gyawali, B., Solorio, T.: Evaluation of YTEX and MetaMap for
   clinical concept recognition. In: Proceedings of ShARe/CLEF eHealth Evaluation
   Labs (2013).
11. Wang, C., Akella, R.: UCSC’s System for CLEF eHealth 2013 Task 1. In: Proceed-
   ings of ShARe/CLEF eHealth Evaluation Labs (2013).
12. Zuccon, G., Holloway, A., Koopman, B., Nguyen A.: Identify Disorders in Health
   Records using Conditional Random Fields and Metamap; AEHRC at ShARe/CLEF
   2013 eHealth Evaluation Lab Task 1. In: Proceedings of ShARe/CLEF eHealth
   Evaluation Labs (2013).
13. Bodnari, A., Deleger, L., Lavergne, T., Neveol, A., Zweigenbaum, P.:A Su-
   pervised Named-Entity Extraction System for Medical Text. In: Proceedings of
   ShARe/CLEF eHealth Evaluation Labs (2013).
14. Xia, Y., Zhong, X., Liu, P., Tan, C., Na, S., Hu, Q., Huang, Y.: Combining
   MetaMap and cTAKES in Disorder Recognition: THCIB at CLEF eHealth Lab
   2013 Task 1. In: Proceedings of ShARe/CLEF eHealth Evaluation Labs (2013).


                                         166