Task 2: ShARe/CLEF eHealth Evaluation Lab 2014

Danielle L. Mowery1, Sumithra Velupillai2, Brett R. South3, Lee Christensen3,
David Martinez4, Liadh Kelly5, Lorraine Goeuriot5, Noemie Elhadad6,
Sameer Pradhan7, Guergana Savova7, and Wendy W. Chapman3 *

1 University of Pittsburgh, PA, USA, dlm31@pitt.edu
2 Stockholm University, Sweden, sumithra@dsv.su.se
3 University of Utah, UT, USA, brett.south@hsc.utah.edu, leenlp@q.com,
  wendy.chapman@utah.edu
4 University of Melbourne and MedWhat (CA,USA), VIC, Australia,
  davidm@csse.unimelb.edu.au
5 Dublin City University, Ireland, Firstname.Lastname@computing.dcu.ie
6 Columbia University, NY, USA, noemie.elhadad@columbia.edu
7 Harvard University, MA, USA, sameer.pradhan@childrens.harvard.edu,
  guergana.savova@childrens.harvard.edu


        Abstract. This paper reports on Task 2 of the 2014 ShARe/CLEF
        eHealth evaluation lab, which extended Task 1 of the 2013 ShARe/CLEF
        eHealth evaluation lab by focusing on template filling of disorder
        attributes. The task comprised two subtasks: attribute normalization
        (Task 2a) and cue identification (Task 2b). We instructed participants
        to develop a system which either kept or updated a default attribute
        value for each task. Participant systems were evaluated against a blind
        reference standard of 133 discharge summaries using Accuracy (Task 2a)
        and F-score (Task 2b). In total, ten teams participated in Task 2a, and
        three teams in Task 2b. For Tasks 2a and 2b, the HITACHI team systems
        (run 2) had the highest performance, with an overall average
        accuracy of 0.868 and F1-score (strict) of 0.676, respectively.


Keywords: Natural Language Processing, Template Filling, Information Ex-
traction, Clinical Text


1     Introduction

In recent years, healthcare initiatives such as the United States Meaningful Use
[1] and European Union Directive 2011/24/EU [2] have created policies and
legislation to promote patient involvement and understanding of their personal
health information. These policies and legislation have encouraged health care
organizations to provide patients open access to their medical records and
advocate for more patient-friendly technologies. Patient-friendly technologies
that could help patients understand their personal health information, e.g.,
clinical reports, include providing links for unfamiliar terms to
patient-friendly websites and generating patient summaries that use
consumer-friendly terms and simplified syntactic constructions. These summaries
could also limit the semantic content to the most salient events such as active
disorder mentions and their related discharge instructions. Natural Language
Processing (NLP) can help by filtering non-active disorder mentions using their
semantic attributes, e.g., negated symptoms (negation) or uncertain diagnoses
(certainty) [3], and by identifying the discharge instructions using text
segmentation [4, 5].

* DLM, SV, and WWC led the task; WWC, SV, DLM, NE, SP, and GS defined the task;
  SV, DLM, BRS, LC, and DM processed and distributed the dataset; and SV, DLM,
  and DM led result evaluations.

     In previous years, several NLP shared tasks have addressed related seman-
tic information extraction tasks such as automatically identifying concepts -
problems, treatments, and tests - and their related attributes (2010 i2B2/VA
Challenge [6]) as well as identifying temporal relationships between these clin-
ical events (2012 i2B2/VA Challenge [7]). The release of these semantically-
annotated datasets to the NLP community is important for promoting the de-
velopment and evaluation of automated NLP tools. Such tools can identify, ex-
tract, filter and generate information from clinical reports that assist patients
and their families in understanding the patient’s health status and their contin-
ued care. The ShARe/CLEF eHealth 2014 shared task [8] focused on facilitating
understanding of information in narrative clinical reports, such as discharge sum-
maries, by visualizing and interactively searching previous eHealth data (Task 1)
[9], identifying and normalizing disorder attributes (Task 2), and retrieving
documents from health and medicine websites to address questions that
mono- and multilingual patients may have about the diseases/disorders in the
clinical notes (Task 3) [10]. In this paper, we discuss Task 2: disorder
template filling.


2     Methods

We describe the ShARe annotation schema, the dataset, and the evaluation
methods used for the ShARe/CLEF eHealth Evaluation Lab Task 2.


2.1   ShARe Annotation Schema

As part of the ongoing Shared Annotated Resources (ShARe) project [11], disor-
der annotations consisting of disorder mention span offsets, their SNOMED CT
codes, and their contextual attributes were generated for community
distribution. For the 2013 ShARe/CLEF eHealth Challenge Task 1 [12], the
disorder mention span offsets and SNOMED CT codes were released. For the 2014
ShARe/CLEF eHealth Challenge Task 2, we released the disorder templates with 10
attributes
that represent a disorder’s contextual description in a report including Negation
Indicator, Subject Class, Uncertainty Indicator, Course Class, Severity Class,
Conditional Class, Generic Class, Body Location, DocTime Class, and Temporal
Expression. Each attribute contained two types of annotation values: a
normalization value and a cue detection value. For instance, if a disorder is
negated, e.g., “denies nausea”, the Negation Indicator attribute would
represent nausea with a normalization value of yes, indicating the presence of
a negation cue, and a cue value of start span-end span for denies. All
attributes contained a slot for a cue value with the exception of the DocTime
Class. Each note was annotated by two professional coders trained for this
task, followed by an open adjudication step.
    According to the ShARe guidelines [13], each disorder mention contained an
attribute cue, a text span representing a non-default normalization value
(* marks the default normalization value) [8]:

   Negation Indicator (NI): def. indicates a disorder was negated: *no, yes
Ex. “No cough.”

   Subject Class (SC): def. indicates who experienced a disorder: *patient,
family member, donor family member, donor other, null, other
Ex. “Dad had MI.”

    Uncertainty Indicator (UI): def. indicates a measure of doubt about the
disorder: *no, yes
Ex. “Possible pneumonia.”

   Course Class (CC): def. indicates progress or decline of a disorder: *un-
marked, changed, increased, decreased, improved, worsened, resolved
Ex. “Bleeding abated.”

    Severity Class (SV): def. indicates how severe a disorder is: *unmarked,
slight, moderate, severe
Ex. “Infection is severe.”

    Conditional Class (CO): def. indicates existence of disorder under certain
circumstances: *false, true
Ex. “Return if nausea occurs.”

   Generic Class (GC): def. indicates a generic mention of disorder: *false,
true
Ex. “Vertigo while walking.”

   Body Location (BL): def. represents an anatomical location: *NULL, CUI:
C0015450, CUI-less
Ex. “Facial lesions.”

   DocTime Class (DT): def. indicates temporal relation between a disorder
and document authoring time: before, after, overlap, before-overlap, *unknown
Ex. “Stroke in 1999.”

   Temporal Expression (TE): def. represents any TIMEX (TimeML) tem-
poral expression related to the disorder: *none, date, time, duration, set
Ex. “Flu on March 10.”
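
The attribute inventory above can be written down as a small lookup table. The following Python sketch is only an illustration: the names SCHEMA, default_value, and is_valid are hypothetical and not part of the ShARe distribution, values are compared case-insensitively as a convenience, and Body Location is handled separately on the assumption that a CUI is the letter C followed by seven digits.

```python
import re

# Allowed normalization values per attribute, transcribed from the list above.
# The first entry of each tuple is the default (starred) value.
# SCHEMA and the helper names below are illustrative, not part of ShARe.
SCHEMA = {
    "NI": ("no", "yes"),
    "SC": ("patient", "family member", "donor family member",
           "donor other", "null", "other"),
    "UI": ("no", "yes"),
    "CC": ("unmarked", "changed", "increased", "decreased",
           "improved", "worsened", "resolved"),
    "SV": ("unmarked", "slight", "moderate", "severe"),
    "CO": ("false", "true"),
    "GC": ("false", "true"),
    "DT": ("unknown", "before", "after", "overlap", "before-overlap"),
    "TE": ("none", "date", "time", "duration", "set"),
}

# Assumption: a CUI is "C" followed by seven digits, e.g. C0015450.
CUI_PATTERN = re.compile(r"C\d{7}")


def default_value(attr):
    """Return the default normalization value for an attribute code."""
    if attr == "BL":
        return "NULL"
    return SCHEMA[attr][0]


def is_valid(attr, value):
    """Check a normalization value against the attribute's allowed values."""
    if attr == "BL":
        return value in ("NULL", "CUI-less") or bool(CUI_PATTERN.fullmatch(value))
    return value.lower() in SCHEMA[attr]


print(default_value("CC"), is_valid("BL", "C0015450"), is_valid("SV", "extreme"))
```

Keeping the default first in each tuple makes the "keep or update" framing of the task explicit: a system that changes nothing simply emits default_value for every slot.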


2.2   Dataset
At the time of the challenge, the ShARe dataset consisted of 433 de-identified
clinical reports sampled from over 30,000 ICU patients stored in the MIMIC
(Multiparameter Intelligent Monitoring in Intensive Care) II database [14]. The
initial development set contained 300 documents of 4 clinical report types:
discharge summaries, radiology, electrocardiograms, and echocardiograms. The
unseen test set contained 133 documents of only discharge summaries. Partici-
pants were required to participate in Task 2a and had the option to participate
in Task 2b.

    For Tasks 2a and 2b, the dataset contained templates in a “|”-delimited
format with: a) the disorder CUI assigned to the template as well as the
character boundary of the named entity, and b) the default values for each of
the 10 attributes of the disorder. Each template had the following format [12]:

   DD DocName|DD Spans|DD CUI|Norm NI|Cue NI|
Norm SC|Cue SC|Norm UI|Cue UI|Norm CC|Cue CC|
Norm SV|Cue SV|Norm CO|Cue CO|Norm GC|Cue GC|
Norm BL|Cue BL|Norm DT|Norm TE|Cue TE

    For example, the following sentence, “The patient has an extensive thyroid
history.”, was represented to participants with the following disorder template
with default normalization and cue values:

   09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|
patient|*NULL|*no|*NULL|*false|*NULL|
unmarked|*NULL|*false|*NULL|*false|*NULL|
NULL|*NULL|*Unknown|*None|*NULL
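
As a rough illustration of the format, the following Python sketch splits one template line into named fields. The column names follow the header printed above, but the underscores are an assumption made here so the names work as dictionary keys, and parse_template is a hypothetical helper rather than a tool distributed with the task. The example template is joined onto a single line.

```python
# Column names follow the header printed above; the underscores are an
# assumption made here so that the names work as dictionary keys.
FIELDS = ["DD_DocName", "DD_Spans", "DD_CUI"]
for attr in ("NI", "SC", "UI", "CC", "SV", "CO", "GC", "BL"):
    FIELDS += [f"Norm_{attr}", f"Cue_{attr}"]
FIELDS += ["Norm_DT", "Norm_TE", "Cue_TE"]  # DocTime Class has no cue slot


def parse_template(line):
    """Split one "|"-delimited template line into a field-name -> value dict."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# The default template from the example, joined onto a single line.
DEFAULT_LINE = (
    "09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|"
    "patient|*NULL|*no|*NULL|*false|*NULL|"
    "unmarked|*NULL|*false|*NULL|*false|*NULL|"
    "NULL|*NULL|*Unknown|*None|*NULL"
)

template = parse_template(DEFAULT_LINE)
print(template["DD_CUI"], template["Norm_NI"], template["Norm_DT"])
# prints: C0040128 *no *Unknown
```

The length check is deliberate: a line with a missing or extra "|" would otherwise shift every later slot by one position without any visible error.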

   For Task 2a: Normalization, participants were asked to either keep or update
the normalization values for each attribute. For the example sentence, the
Task 2a output changes to:

   09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|
patient|*NULL|*no|*NULL|*false|*NULL|
unmarked|*NULL|severe|*NULL|*false|*NULL|
C0040132|*NULL|Before|*None|*NULL


   For Task 2b: Cue detection, participants were asked to either keep or update
the cue values for each attribute. For the example sentence, the Task 2b output
changes to:

   09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|
patient|*NULL|*no|*NULL|*false|*NULL|
unmarked|*NULL|severe|20-28|*false|*NULL|
C0040132|30-36|Before|*None|*NULL

   In this example, the Subject Class cue span is not annotated in ShARe since
*patient is an attribute default.
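
To see which slots a system updated, the default template and a system output can be compared column by column. The sketch below does this for the example above. The function name changed_slots is hypothetical, and the comparison reports 0-based column positions rather than attribute names, so it makes no claim about which attribute a given column belongs to.

```python
# Default template and Task 2b output from the example, each joined
# onto a single line.
DEFAULT = (
    "09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|"
    "patient|*NULL|*no|*NULL|*false|*NULL|"
    "unmarked|*NULL|*false|*NULL|*false|*NULL|"
    "NULL|*NULL|*Unknown|*None|*NULL"
)
TASK_2B = (
    "09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|"
    "patient|*NULL|*no|*NULL|*false|*NULL|"
    "unmarked|*NULL|severe|20-28|*false|*NULL|"
    "C0040132|30-36|Before|*None|*NULL"
)


def changed_slots(default_line, system_line):
    """Return (column, default value, system value) for every updated slot."""
    before = default_line.split("|")
    after = system_line.split("|")
    if len(before) != len(after):
        raise ValueError("templates have different numbers of fields")
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]


for column, old, new in changed_slots(DEFAULT, TASK_2B):
    print(column, old, "->", new)
```

For this pair, five slots differ: three normalization values and two cue spans. Every other slot keeps its default.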

2.3   Participant Recruitment and Registration
We recruited participants using listservs such as AMIA NLP Working Group,
AISWorld, BioNLP, TREC, CLEF, Corpora, NTCIR, and Health Informatics
World. Although the ShARe dataset is de-identified, it contains sensitive
patient information. After registration for Task 2 through the CLEF Evaluation
Lab, each participant completed the following data access procedure: (1)
obtaining a CITI [15] or NIH [16] training certificate in Human Subjects
Research, (2) registering on the Physionet.org site [17], and (3) signing a
Data Use Agreement to access the MIMIC II data.

2.4   Evaluation Metrics
For Tasks 2a and 2b, we determined system performance by comparing partic-
ipating system outputs against reference standard annotations. We evaluated
overall system performance and performance for each attribute type, e.g.,
Negation Indicator.

Task 2a: Normalization Since we defined all possible normalized values for
each attribute, we calculated system performance using Accuracy, defined as the
count of correct normalized values divided by the total count of disorder
templates.
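
The accuracy definition above can be sketched in a few lines of Python. This is a minimal illustration for a single attribute, not the official evaluation script. The function name accuracy and the toy value lists are hypothetical, and the case-insensitive comparison is an assumption.

```python
def accuracy(reference, predicted):
    """Share of templates whose predicted normalization value is correct.

    Both arguments hold one value per disorder template, in the same order.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one disorder template is required")
    # Assumption: values are compared case-insensitively.
    correct = sum(r.lower() == p.lower() for r, p in zip(reference, predicted))
    return correct / len(reference)


# Toy Negation Indicator values for four disorder templates.
gold = ["yes", "no", "no", "no"]
system = ["yes", "no", "yes", "no"]
print(accuracy(gold, system))  # 3 of 4 correct -> 0.75
```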

Task 2b: Cue Detection Since the number of strings not annotated as at-
tribute cues (i.e., true negatives (TN)) is very large, we followed [18] in calcu-
lating F1-score as a surrogate for kappa. F1-score is the harmonic mean of recall
and precision, calculated from true positive, false positive, and false negative
annotations, which were calculated as follows:

true positive (TP) = the annotation cue span from the participating system
    overlapped with the annotation cue span from the reference standard
false positive (FP) = an annotation cue span from the participating system
    did not exist in the reference standard annotations
false negative (FN) = an annotation cue span from the reference standard
    did not exist in the participating system annotations


Table 1: System Performance, Task 2a: predict each attribute’s normalization slot value.
Accuracy: overall (official ranking result)

Attribute        System ID ({team}.{system})  Accuracy
Overall Average  TeamHITACHI.2                0.868
                 TeamHITACHI.1                0.854
                 RelAgent.2                   0.843
                 RelAgent.1                   0.843
                 TeamHCMUS.1                  0.827
                 DFKI-Medical.2               0.822
                 LIMSI.1                      0.804
                 DFKI-Medical.1               0.804
                 TeamUEvora.1                 0.802
                 LIMSI.2                      0.801
                 ASNLP.1                      0.793
                 TeamCORAL.1.add              0.790
                 TeamGRIUM.1                  0.780
                 HPI.1                        0.769

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{1} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

\[ \text{F1-score} = 2 \cdot \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{3} \]
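
Equations (1) to (3) translate directly into code. The sketch below is illustrative rather than the official scorer. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption, made so that a system with no correct cues scores 0.000 instead of raising an error.

```python
def recall(tp, fn):
    """Equation (1): TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Equation (2): TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(p, r):
    """Equation (3): harmonic mean of precision and recall."""
    return 2 * (r * p) / (r + p) if (r + p) else 0.0


# From counts: 2 true positives, 1 false positive, 1 false negative.
p, r = precision(2, 1), recall(2, 1)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))  # 0.667 0.667 0.667

# From reported values: strict precision 0.620 and recall 0.743
# (TeamHITACHI.2, overall row of Table 4).
print(round(f1_score(0.620, 0.743), 3))  # 0.676
```

The second call reproduces the strict overall F1-score reported for TeamHITACHI.2 from its reported precision and recall.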


3    Results

Participating teams had between one and four members and competed from Canada
(team GRIUM), France (team LIMSI), Germany (teams HPI and DFKI-Medical),
India (teams RelAgent and HITACHI), Japan (team HITACHI), Portugal (team
UEvora), Taiwan (team ASNLP), Vietnam (team HCMUS) and USA (team
CORAL). Participants represented academic and industrial institutions includ-
ing LIMSI-CNRS, University of Alabama at Birmingham, Hasso Plattner Insti-
tute, University of Heidelberg, Academia Sinica, DIRO, University of Science,
RelAgent Tech Pvt Ltd, University of Evora, Hitachi, International Institute of
Information Technology, and German Research Center for AI (DFKI). In total,
ten teams submitted systems for Task 2a. Four teams submitted two runs. For
Task 2b, three teams submitted systems, one of which submitted two runs.


3.1   System Performance on Task 2a
As shown in Table 1, the HITACHI team system (run 2) had the highest
performance in Task 2a, with an overall average accuracy of 0.868. For the
individual attributes (Tables 2 and 3), team HITACHI had the highest
performance for Negation Indicator (0.969), Uncertainty Indicator (0.960),
Course Class (0.971), Severity Class (0.982), Conditional Class (0.978), Body
Location (0.797), and DocTime Class (0.328). The HCMUS team had the highest
performance for the attribute Subject Class (0.995), and three teams (HPI,
RelAgent, and CORAL) tied for the highest performance for the attribute
Temporal Expression (0.864). For the attribute Generic Class, most teams
correctly predicted no change in the normalization value.

3.2   System Performance on Task 2b
For Task 2b, the HITACHI team system (run 2) had the highest performance,
with an overall average F1-score (strict) of 0.676 (Table 4). Team HITACHI also
had the highest performance (strict) for the individual attributes Negation
Indicator (0.913), Uncertainty Indicator (0.561), Course Class (0.645),
Severity Class (0.847), Conditional Class (0.638), Generic Class (0.225), and
Body Location (0.854). The HCMUS team had the highest performance for the
attributes Subject Class (0.857) and Temporal Expression (0.287).


4     Discussion
We released an extended ShARe corpus through Task 2 of the ShARe/CLEF eHealth
Evaluation Lab. This corpus contains disease/disorder templates with ten se-
mantic attributes. In the evaluation lab, we evaluated systems on the task of
normalizing semantic attribute values overall and by attribute type (Task 2a),
as well as on the task of assigning attribute cue slot values (Task 2b). This is
a unique clinical NLP challenge - no previous challenge has targeted such rich
semantic annotations. Results show that high overall average accuracy can be
achieved by NLP systems on the task of normalizing semantic attribute values,
but that performance levels differ greatly between individual attribute types,
which was also reflected in the results for cue slot prediction (Task 2b). This
corpus and the participating team system results are an important contribu-
tion to the research community and the focus on rich semantic information is
unprecedented.


Acknowledgments
We greatly appreciate the hard work and feedback of our program committee
members. We also want to thank all participating teams. This shared task was
partially supported by the CLEF Initiative, the ShARe project funded by the
United States National Institutes of Health (R01GM090187), the US Office of the
National Coordinator of Healthcare Technology, Strategic Health IT Advanced
Research Projects (SHARP) 90TR0002, and the Swedish Research Council
(350-2012-6658).

References
 1. Center for Medicare, Medicaid Services: Eligible professional meaningful
    use menu set measures: Measure 5 of 10. http://www.cms.gov/Regulations-and-
    Guidance/Legislation/EHRIncentivePrograms/downloads/5 Patient Electronic Access.pdf
    Accessed: 2014-06-16.
 2. European Union: Directive 2011/24/EU of the European Parliament and of the
    Council of 9 March 2011. http://eur-
    lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2011:088:0045:0065:en:PDF
    Accessed: 2014-06-16.
 3. Mowery, D., Jordan, P., Wiebe, J., Harkema, H., Dowling, J., Chapman, W.: Se-
    mantic annotation of clinical events for generating a problem list. AMIA Annu
    Symp Proc (2013) 1032–1041
 4. Apostolova, E., Channin, D., Demner-Fushman, D., Furst, J., Lytinen, S., Raicu,
    D.: Automatic segmentation of clinical texts. Conf Proc IEEE Eng Med Biol Soc
    (2009) 5905–5908
 5. Engel, K., Buckley, B., Forth, V., McCarthy, D., Ellison, E., Schmidt, M., Adams,
    J.: Patient understanding of emergency department discharge summary instruc-
    tions: Where are knowledge deficits greatest? Acad Emerg Med 19(9) (2012)
    E1035–E1044
 6. Uzuner, Ö., Mailoa, J., Ryan, R., Sibanda, T.: Semantic relations for problem-
    oriented medical records. Artif Intell Med 50(2) (October 2010) 63–73
 7. Sun, W., Rumshisky, A., Uzuner, Ö.: Evaluating temporal relations in clinical text:
    2012 i2b2 Challenge. J Am Med Inform Assoc 20 (2013) 806–813
 8. Kelly, L., Goeuriot, L., Suominen, H., Schreck, T., Leroy, G., Mowery, D.,
    Velupillai, S., Martinez, D., Chapman, W., Zuccon, G., Palotti, J.: Overview
    of the ShARe/CLEF eHealth Evaluation Lab 2014. In: Lecture Notes in Computer
    Science (LNCS). (2014)
 9. Suominen, H., Schreck, T., Leroy, G., Hochheiser, H., Goeuriot, L., Kelly, L., Mow-
    ery, D., Nualart, J., Ferraro, G., Keim, D.: Task 1 of the CLEF eHealth Evaluation
    Lab 2014: visual-interactive search and exploration of eHealth data. In Cappel-
    lato, L., Ferro, N., Halvey, M., Kraaij, W., eds.: CLEF 2014 Evaluation Labs and
    Workshop: Online Working Notes, Sheffield, UK, CLEF (2014)
10. Goeuriot, L., Kelly, L., Lee, W., Palotti, J., Pecina, P., Zuccon, G., Hanbury, A.,
    Gareth J.F. Jones, H.M.: ShARe/CLEF eHealth Evaluation Lab 2014, Task 3:
    User-centred health information retrieval. In Cappellato, L., Ferro, N., Halvey,
    M., Kraaij, W., eds.: CLEF 2014 Evaluation Labs and Workshop: Online Working
    Notes, Sheffield, UK, CLEF (2014)
11. Elhadad, N., Chapman, W., O'Gorman, T., Palmer, M., Savova, G.: The ShARe
    schema for the syntactic and semantic annotation of clinical texts. Under review.
12. ShARe CLEF eHealth website task 2 information extraction.
    https://sites.google.com/a/dcu.ie/clefehealth2014/task-2/2014-dataset Accessed:
    2014-06-16.
13. ShARe CLEF eHealth website task 2 information extraction.
    https://drive.google.com/file/d/0B7oJZ-
    fwZvH5ZXFRTGl6U3Z6cVE/edit?usp=sharing Accessed: 2014-06-16.
14. Saeed, M., Lieu, C., Raber, G., Mark, R.: MIMIC II: a massive temporal ICU
    patient database to support research in intelligent patient monitoring. Comput
    Cardiol 29 (2002)
15. CITI: Collaborative Institutional Training Initiative.
    https://www.citiprogram.org/ Accessed: 2013-06-30.
16. NIH: National Institutes of Health - ethics training module.
    http://ethics.od.nih.gov/Training/AET.htm Accessed: 2013-06-30.
17. Physionet: Physionet site. http://www.physionet.org/ Accessed: 2013-06-30.
18. Hripcsak, G., Rothschild, A.: Agreement, the F-measure, and reliability in
    information retrieval. J Am Med Inform Assoc 12(3) (2005) 296–298


Table 2: System Performance, Task 2a: predict each attribute’s normalization slot value.
Accuracy per attribute type - Attributes Negation Indicator, Subject Class, Uncertainty
Indicator, Course Class, Severity Class, Conditional Class.

Attribute   System ID       Accuracy Attribute System ID         Accuracy
Negation    TeamHITACHI.2   0.969    Subject     TeamHCMUS.1     0.995
Indicator   RelAgent.2      0.944    Class       TeamHITACHI.2   0.993
            RelAgent.1      0.941                TeamHITACHI.1   0.990
            TeamASNLP       0.923                TeamUEvora.1    0.987
            TeamGRIUM.1     0.922                DFKI-Medical.1  0.985
            TeamHCMUS.1     0.910                DFKI-Medical.2  0.985
            LIMSI.1         0.902                LIMSI.1         0.984
            LIMSI.2         0.902                RelAgent.2      0.984
            TeamUEvora.1    0.901                RelAgent.1      0.984
            TeamHITACHI.1   0.883                LIMSI.2         0.984
            DFKI-Medical.2  0.879                TeamHPI         0.976
            DFKI-Medical.1  0.876                TeamCORAL.1.add 0.926
            TeamCORAL.1.add 0.807                TeamASNLP       0.921
            TeamHPI         0.762                TeamGRIUM.1     0.611
Uncertainty TeamHITACHI.1   0.960    Course      TeamHITACHI.2   0.971
Indicator   RelAgent.2      0.955    Class       TeamHITACHI.1   0.971
            RelAgent.1      0.955                RelAgent.1      0.970
            TeamUEvora.1    0.955                RelAgent.2      0.967
            TeamCORAL.1.add 0.941                TeamGRIUM.1     0.961
            DFKI-Medical.1  0.941                TeamCORAL.1.add 0.961
            DFKI-Medical.2  0.941                TeamASNLP       0.953
            TeamHITACHI.2   0.924                TeamHCMUS.1     0.937
            TeamGRIUM.1     0.923                DFKI-Medical.1  0.932
            TeamASNLP       0.912                DFKI-Medical.2  0.932
            TeamHPI         0.906                TeamHPI         0.899
            TeamHCMUS.1     0.877                TeamUEvora.1    0.859
            LIMSI.1         0.801                LIMSI.1         0.853
            LIMSI.2         0.801                LIMSI.2         0.853
Severity    TeamHITACHI.2   0.982    Conditional TeamHITACHI.1   0.978
Class       TeamHITACHI.1   0.982    Class       TeamUEvora.1    0.975
            RelAgent.2      0.975                RelAgent.2      0.963
            RelAgent.1      0.975                RelAgent.1      0.963
            TeamGRIUM.1     0.969                TeamHITACHI.2   0.954
            TeamHCMUS.1     0.961                TeamGRIUM.1     0.936
            DFKI-Medical.1  0.957                LIMSI.1         0.936
            DFKI-Medical.2  0.957                TeamASNLP       0.936
            TeamCORAL.1.add 0.942                LIMSI.2         0.936
            TeamUEvora.1    0.919                TeamCORAL.1.add 0.936
            TeamHPI         0.914                DFKI-Medical.1  0.936
            TeamASNLP       0.912                DFKI-Medical.2  0.936
            LIMSI.1         0.900                TeamHCMUS.1     0.899
            LIMSI.2         0.900                TeamHPI         0.819


Table 3: System Performance, Task 2a: predict each attribute’s normalization slot value.
Accuracy per attribute type - Attributes Generic Class, Body Location, DocTime Class
and Temporal Expression.

Attribute  System ID        Accuracy  Attribute   System ID        Accuracy
Generic    TeamGRIUM.1      1.000     Body        TeamHITACHI.2    0.797
Class      LIMSI.1          1.000     Location    TeamHITACHI.1    0.790
           TeamHPI          1.000                 RelAgent.2       0.756
           TeamHCMUS.1      1.000                 RelAgent.1       0.753
           RelAgent.2       1.000                 TeamGRIUM.1      0.635
           TeamASNLP        1.000                 DFKI-Medical.2   0.586
           RelAgent.1       1.000                 TeamHCMUS.1      0.551
           LIMSI.2          1.000                 TeamASNLP        0.546
           TeamUEvora.1     1.000                 TeamCORAL.1.add  0.546
           DFKI-Medical.1   1.000                 TeamUEvora.1     0.540
           DFKI-Medical.2   1.000                 LIMSI.1          0.504
           TeamHITACHI.2    0.990                 LIMSI.2          0.504
           TeamCORAL.1.add  0.974                 TeamHPI          0.494
           TeamHITACHI.1    0.895                 DFKI-Medical.1   0.486
DocTime    TeamHITACHI.2    0.328     Temporal    TeamHPI          0.864
Class      TeamHITACHI.1    0.324     Expression  RelAgent.2       0.864
           LIMSI.1          0.322                 RelAgent.1       0.864
           LIMSI.2          0.322                 TeamCORAL.1.add  0.864
           TeamHCMUS.1      0.306                 TeamUEvora.1     0.857
           DFKI-Medical.1   0.179                 DFKI-Medical.2   0.849
           DFKI-Medical.2   0.154                 LIMSI.1          0.839
           TeamHPI          0.060                 TeamHCMUS.1      0.830
           TeamGRIUM.1      0.024                 TeamASNLP        0.828
           RelAgent.2       0.024                 TeamGRIUM.1      0.824
           RelAgent.1       0.024                 LIMSI.2          0.806
           TeamUEvora.1     0.024                 TeamHITACHI.2    0.773
           TeamASNLP        0.001                 TeamHITACHI.1    0.766
           TeamCORAL.1.add  0.001                 DFKI-Medical.1   0.750


Table 4: System Performance, Task 2b: predict each attribute’s cue slot value. Strict
and Relaxed F1-score, Precision and Recall (overall and per attribute type)

Attribute    System ID      Strict                       Relaxed
                            F1-score Precision Recall    F1-score Precision Recall
Overall      TeamHITACHI.2  0.676    0.620     0.743     0.724    0.672     0.784
Average      TeamHITACHI.1  0.671    0.620     0.731     0.719    0.672     0.773
             TeamHCMUS.1    0.544    0.475     0.635     0.648    0.583     0.729
             HPI.1          0.190    0.184     0.197     0.323    0.314     0.332
Negation     TeamHITACHI.2  0.913    0.955     0.874     0.926    0.962     0.893
Indicator    TeamHITACHI.1  0.888    0.897     0.879     0.905    0.912     0.897
             TeamHCMUS.1    0.772    0.679     0.896     0.817    0.735     0.919
             HPI.1          0.383    0.405     0.363     0.465    0.488     0.444
Subject      TeamHCMUS.1    0.857    0.923     0.800     0.936    0.967     0.907
Class        TeamHITACHI.1  0.125    0.068     0.760     0.165    0.092     0.814
             TeamHITACHI.2  0.112    0.061     0.653     0.152    0.085     0.729
             HPI.1          0.106    0.059     0.520     0.151    0.086     0.620
Uncertainty  TeamHITACHI.2  0.561    0.496     0.647     0.672    0.612     0.746
Indicator    TeamHITACHI.1  0.514    0.693     0.408     0.655    0.802     0.553
             TeamHCMUS.1    0.252    0.169     0.494     0.386    0.275     0.646
             HPI.1          0.166    0.106     0.376     0.306    0.209     0.572
Course       TeamHITACHI.1  0.645    0.607     0.689     0.670    0.632     0.712
Class        TeamHITACHI.2  0.642    0.606     0.682     0.667    0.632     0.705
             TeamHCMUS.1    0.413    0.316     0.594     0.447    0.348     0.628
             HPI.1          0.226    0.153     0.435     0.283    0.196     0.510
Severity     TeamHITACHI.2  0.847    0.854     0.839     0.850    0.857     0.843
Class        TeamHITACHI.1  0.843    0.845     0.841     0.847    0.848     0.845
             TeamHCMUS.1    0.703    0.665     0.746     0.710    0.672     0.752
             HPI.1          0.364    0.306     0.448     0.396    0.336     0.483
Conditional  TeamHITACHI.1  0.638    0.744     0.559     0.801    0.869     0.743
Class        TeamHITACHI.2  0.548    0.478     0.643     0.729    0.669     0.800
             TeamHCMUS.1    0.307    0.225     0.484     0.441    0.340     0.625
             HPI.1          0.100    0.059     0.315     0.317    0.209     0.658
Generic      TeamHITACHI.1  0.225    0.239     0.213     0.304    0.320     0.289
Class        TeamHITACHI.2  0.192    0.385     0.128     0.263    0.484     0.181
             HPI.1          0.100    0.058     0.380     0.139    0.081     0.470
             TeamHCMUS.1    0.000    0.000     0.000     0.000    0.000     0.000
Body         TeamHITACHI.2  0.854    0.880     0.829     0.874    0.897     0.853
Location     TeamHITACHI.1  0.847    0.866     0.829     0.868    0.885     0.852
             TeamHCMUS.1    0.627    0.568     0.700     0.750    0.701     0.807
             HPI.1          0.134    0.298     0.086     0.363    0.611     0.258
Temporal     TeamHCMUS.1    0.287    0.313     0.265     0.354    0.383     0.329
Expression   TeamHITACHI.2  0.275    0.226     0.354     0.370    0.310     0.458
             TeamHITACHI.1  0.269    0.217     0.356     0.364    0.300     0.461
             HPI.1          0.000    0.000     0.000     0.000    0.000     0.000

