On Evaluation of Automatically Generated Clinical Discharge Summaries

Hans Moen1, Juho Heimonen2,5, Laura-Maria Murtola3,4, Antti Airola2, Tapio Pahikkala2,5, Virpi Terävä3,4, Riitta Danielsson-Ojala3,4, Tapio Salakoski2,5, and Sanna Salanterä3,4

1 Department of Computer and Information Science, Norwegian University of Science and Technology, Norway
2 Department of Information Technology, University of Turku, Finland
3 Department of Nursing Science, University of Turku, Finland
4 Turku University Hospital, Finland
5 Turku Centre for Computer Science, Finland

hans.moen@idi.ntnu.no
{juaheim,lmemur,ajairo,aatapa,vmater,rkdaoj}@utu.fi
{tapio.salakoski,sansala}@utu.fi

Abstract. Proper evaluation is crucial for developing high-quality computerized text summarization systems. In the clinical domain, the specialized information needs of the clinicians complicate the task of evaluating automatically produced clinical text summaries. In this paper we present and compare the results from both manual and automatic evaluation of computer-generated summaries. These are composed of sentence extracts from the free text in clinical daily notes, written by physicians concerning patient care, corresponding to individual care episodes. The purpose of this study is primarily to find out whether there is a correlation between the conducted automatic evaluation and the manual evaluation. We analyze which of the automatic evaluation metrics correlates the most with the scores from the manual evaluation. The manual evaluation is performed by domain experts who follow an evaluation tool that we developed as a part of this study. As a result, we hope to gain some insight into the reliability of the selected approach to automatic evaluation. Ultimately this study can help us assess the reliability of this evaluation approach, so that we can further develop the underlying summarization system. The evaluation results seem promising in that the ranking order of the various summarization methods, as ranked by all the automatic evaluation metrics, corresponds well with that of the manual evaluation. These preliminary results also indicate that the utilized automatic evaluation setup can be used as an automated and reliable way to rank clinical summarization methods internally in terms of their performance.

Keywords: Summarization Evaluation, Text Summarization, Clinical Text Processing, NLP

Copyright © 2014 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

With the large amount of information generated in health care organisations today, information overload is becoming an increasing problem for clinicians [1,2]. Much of the information that is generated in relation to care is stored in electronic health record (EHR) systems. The majority of this is free text – stored as clinical notes – written on a daily basis by clinicians about the care of individual patients. The rest of the information contained in EHRs is mainly images and structured information, such as medication, coded information and lab values. To tackle the problem of information overload, there is a need for EHR systems that are able to automatically generate an overview, or summary, of the information in these health records; this applies to both free text and structured information. Such systems would enable clinicians to spend more time treating patients and less time reading up on information about them.
However, in the process of developing such summarization systems, quick and reliable evaluation is crucial.

A typical situation where information overload is frequently encountered is when the attending physician is producing the discharge summary at the end of a care episode. Discharge summaries are an important part of the communication between the different professionals providing the health care services, and they aim to ensure the continuity of a patient's care. However, there are challenges with these discharge summaries, as they are often produced late and the information they contain tends to be insufficient. For example, one study showed that discharge summaries exchanged between the hospital and the primary care physicians are often lacking information, such as diagnostic test results (lacking in 33-63%), treatment progression (lacking in 7-22%), medications (lacking in 2-40%), pending test results (lacking in 65%), counseling (lacking in 90-92%) and follow-up proposals (lacking in 2-43%) [3]. One reason for this is that, during the discharge summary writing process, the physicians tend to simply not have the time to read everything that has been documented in the clinical daily notes. Another reason is the difficulty of identifying the most important information to include in the discharge summary.

Computer-assisted discharge summaries and standardized templates are measures for improving the transfer time and the quality of discharge information between the hospital and the primary care physicians [3]. Computer-assisted discharge summary writing using automatic text summarization could further improve the timeliness and quality of discharge summaries. Another, more general, user scenario where text summarization would be useful is when clinicians need to get an overview of the documented content in a care episode, in particular in critical situations when this information is needed without delay.

Automatic summarization of clinical information is a challenging task because of the different data types, the domain specificity of the language, and the special information needs of the clinicians [4]. Producing a comprehensive overview of the structured information is a rather trivial task [5]. However, that is not the case for the clinical notes and the free text they contain. Previously, Liu [6] applied the automated text summarization methods of the MEAD system [7] to Finnish intensive care nursing narratives. In this work the produced summaries were automatically evaluated against corresponding discharge reports. Some of the considered methods outperformed the random baseline method; however, the results were overall quite disappointing, and further work was needed in order to develop reliable evaluation methods for the task.

We have developed an extraction-based text summarization system that attempts to automatically produce a textual summary of the free text contained in all the clinical (daily) notes related to a – possibly ongoing – care episode, written by physicians. The focus of this paper is not on how the summarization system works, but rather on how to evaluate the summaries it produces.
In our ongoing work towards developing this system, we have so far seven different summarization methods to evaluate, including a Random method and an Oracle method. The latter represents an upper bound for the automatic evaluation score. Having a way to quickly and automatically evaluate the summaries that these methods produce is critical during the method development phase. Thus the focus of this paper is how to perform such automated evaluation in a reliable and cost-effective way.

Automatic evaluation of an extraction-based summary is typically done by having a gold standard, or "gold summary", for comparison. A gold summary is typically an extraction-based summary produced by human experts [8]. One then measures the textual overlap, or similarity, between a targeted summary and the corresponding gold summary, using some metric for this purpose. However, we do not have such manually tailored gold summaries available. Instead we explore the use of the original physician-made discharge summaries for evaluation purposes as a means of overcoming this problem. These discharge summaries contain sentence extracts, and possibly slightly rewritten sentences, from the clinical notes. They also typically contain information that has never been documented earlier in the corresponding care episode, which makes them possibly suboptimal for the task of automatic evaluation.

To explore whether this approach to automatic evaluation is viable, we have also conducted a manual evaluation of a set of summaries, and then compared this to the results from the automatic evaluation. A correlation between how the manual and the automatic evaluation rank the summarization methods would mean that further automatic evaluation with this approach can be considered somewhat reliable. In this study, automatic evaluation is mainly performed using the ROUGE evaluation package [9]. The manual evaluation was done by domain experts who followed an evaluation scheme/tool that we developed for this purpose. Figure 1 illustrates the evaluation experiment.

Fig. 1. Illustration of the evaluation experiment.

2 Data

The data set used in this study contained the electronic health records of approximately 26,000 patients admitted to a Finnish university hospital between the years 2005–2009 with any type of cardiac problem. An ethical statement (17.2.2009 §67) and the organisational permission (2/2009) from the hospital district were obtained before the collection of this data set.

To produce data suited for automatic summarization, discharge summaries were extracted and associated with the daily notes they summarize. There were two types of discharge summaries: internal (written when the patient is moved to another ward and summarizing the time spent on the given ward) and final (written when the patient leaves the hospital and summarizing the whole stay). Note that a final discharge summary also summarizes any internal summaries written during the stay. All notes and discharge summaries were lemmatized at the sentence level using the morphological analyser FinTWOL [10] and the disambiguator FinCG [11] by Lingsoft, Inc.6 Stopwords were also removed7. The preprocessed corpus contained 66,884 unique care episodes with 39 million words from a vocabulary of 0.6 million unique terms. The full corpus was utilized in deriving statistics about the language for some of the summarization methods.

6 http://www.lingsoft.fi
7 http://www.nettiapina.fi/finnish-stopword-list/
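As a rough, minimal sketch only: the preprocessing described above could be organized as below. The actual pipeline used the commercial FinTWOL and FinCG tools, which are not reproduced here; the `lemmatize_sentence` placeholder, the `finnish_stopwords.txt` path and all function names are our own assumptions.

```python
# Minimal preprocessing sketch (not the actual FinTWOL/FinCG pipeline).
def load_stopwords(path="finnish_stopwords.txt"):
    """Load a local stopword list (assumed to be one word per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def lemmatize_sentence(sentence):
    # Placeholder: in the real pipeline each token would be lemmatized by
    # FinTWOL and disambiguated by FinCG; here we only lowercase and split.
    return sentence.lower().split()

def preprocess_note(sentences, stopwords):
    """Lemmatize each sentence and drop stopwords, keeping sentence boundaries."""
    processed = []
    for sentence in sentences:
        lemmas = [w for w in lemmatize_sentence(sentence) if w not in stopwords]
        if lemmas:
            processed.append(lemmas)
    return processed
```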
For the summarization and evaluation experiment, the corpus was narrowed down to the care episodes having I25 (Chronic ischaemic heart disease, including its sub-codes) as the primary ICD-10 code and consisting of at least 8 clinical notes, including the discharge summary. The latter condition justifies the use of text summarization. The data were then split into training and test sets, containing 159 and 156 care episodes, for the parameter optimization and the evaluation of the summarization methods, respectively.

3 Text Summarization

All summarization methods used in this study are based on extraction-based multi-document summarization. This means that each summary consists of a subset of the content of the original sentences found in the various clinical notes that the summary is produced from [12]. This can be seen as a specialized type of multi-document summarization, since each document, or clinical note, belongs to the same patient, and together they constitute a connected sequence. In the presented evaluation, seven different summarization methods are used, including Random and Oracle, resulting in seven different summaries per care episode. The original physician-made discharge summary, Original, which is used as the gold summary for automatic evaluation, is also included in the manual evaluation. For the automatic evaluation, this gold summary is viewed as the perfect summary, thus having a perfect F-score (see Section 4.1). As stated earlier, the focus of this paper is on evaluating the text summaries produced by a summarization system. The utilized summarization system and the various methods used will therefore be described in more detail in a forthcoming extended version of this work. However, the two trivial control methods, Random and Oracle, deserve some explanation here.

Random. This is the baseline method, which composes a summary by simply selecting sentences randomly from the various clinical notes. This method should give some indication of the difficulty level of the summarization task at hand.

Oracle. This is a control method that has access to the gold summary during the summarization process. It basically tries to optimize the ROUGE-N2 F-score of the generated summary with respect to the gold summary, using a greedy search strategy. This summarization method can naturally not be used in a real user scenario, but it represents the upper limit for the score that an extraction-based summary, or summarization method, can achieve when using ROUGE for evaluation.

The other summarization methods are here referred to as SumMet1, SumMet2, SumMet3, SumMet4 and SumMet5.

For each individual care episode, the length of the corresponding gold summary served as the length limit for all the seven generated summaries. This was mainly done so that a sensible automatic evaluation score (F-score, see Section 4.1) could be calculated. In a more realistic user scenario, a fixed length could be used, or e.g. a length that is based on how many clinical notes the summaries are generated from. Each computer-generated summary is run through a post-processing step, where the sentences are sorted according to when they were written.
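To make the Oracle control method more concrete, the following is a minimal sketch of one way such a greedy search could work, under our reading of the description above: repeatedly add the candidate sentence that most improves a simplified ROUGE-N2-style bigram F-score against the gold summary, within the length limit. The actual system uses the ROUGE package described in Section 4.1; all function and variable names here are ours.

```python
from collections import Counter

def bigrams(tokens):
    """Bigram counts of a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def bigram_f_score(candidate_tokens, gold_tokens):
    """Simplified ROUGE-N2-style score: F = 2PR / (P + R) over clipped bigram overlap."""
    cand, gold = bigrams(candidate_tokens), bigrams(gold_tokens)
    overlap = sum((cand & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

def oracle_summary(sentences, gold_tokens, max_words):
    """Greedily pick sentences (token lists) that maximize the bigram F-score
    against the gold summary, without exceeding the word-length limit."""
    selected, summary_tokens = [], []
    remaining = list(sentences)
    while remaining:
        best, best_score = None, bigram_f_score(summary_tokens, gold_tokens)
        for sentence in remaining:
            if len(summary_tokens) + len(sentence) > max_words:
                continue
            score = bigram_f_score(summary_tokens + sentence, gold_tokens)
            if score > best_score:
                best, best_score = sentence, score
        if best is None:  # no remaining sentence fits and improves the score
            break
        selected.append(best)
        summary_tokens += best
        remaining.remove(best)
    return selected
```

In the same spirit, the Random baseline would simply draw sentences at random until the length limit is reached.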
4 Evaluation Experiment

We conducted and compared two types of evaluation, automatic and manual, in order to evaluate the different summarization methods. The purpose of this study is primarily to find out whether there is a correlation between the conducted automatic evaluation and the manual evaluation. This will further reveal which of the automatic evaluation metrics correlates the most with the scores from the manual evaluation. As a result, we would get some insight into the reliability of the selected approach to automatic evaluation. Ultimately this study can help us assess the reliability of this evaluation approach, so that we can further develop the underlying summarization system.

In the automatic evaluation we calculated the F-score for the overlap between the generated summaries and the corresponding gold summaries using the ROUGE evaluation package. As gold summaries we used the original discharge summaries written by physicians. The gold summary is thus considered to be the optimal summary8, so we assume it to always have an F-score of 1.0. The conducted evaluation can be classified as so-called intrinsic evaluation. This means that the summaries are evaluated independently of how they potentially affect some external task [8].

8 This is of course not always true from a clinical perspective, but we leave that to another discussion.

4.1 Automatic Evaluation

ROUGE metrics, provided by the ROUGE evaluation package [9] (see e.g. [13]), are widely used as automatic performance measures in the text summarization literature. To limit the number of evaluations, we selected four common variants:

– ROUGE-N1. Unigram co-occurrence statistics.
– ROUGE-N2. Bigram co-occurrence statistics.
– ROUGE-L. Longest common sub-sequence co-occurrence statistics.
– ROUGE-SU4. Skip-bigram and unigram co-occurrence statistics.

These metrics are all based on finding word n-gram co-occurrences, or overlaps, between a) one or more reference summaries, i.e. the gold summary, and b) the candidate summary to be evaluated. Each metric counts the number of overlapping units (the counting method of which depends on the variant) and uses that information to calculate recall (R), precision (P) and F-score (F). The recall is the ratio of overlapping units to the total number of units in the reference, while the precision is the ratio of overlapping units to the total number of units in the candidate. The former describes how well the candidate covers the reference and the latter describes the quality of the candidate. The F-score is then calculated as

F = 2PR / (P + R).   (1)

The evaluations were performed with the lemmatized texts, with common stopwords and numbers removed.
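As a rough illustration of how such an evaluation loop might be implemented, the sketch below uses the open-source `rouge-score` Python package as a stand-in for the original ROUGE evaluation package (note that it does not provide ROUGE-SU4); the function and variable names are our own assumptions.

```python
# Hypothetical evaluation loop; texts are assumed to be the lemmatized,
# stopword- and number-filtered strings described above.
from statistics import mean
from rouge_score import rouge_scorer

def evaluate_method(candidate_summaries, gold_summaries):
    """Average ROUGE F-scores of one summarization method over all care episodes."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
    f_scores = {"rouge1": [], "rouge2": [], "rougeL": []}
    for candidate, gold in zip(candidate_summaries, gold_summaries):
        scores = scorer.score(gold, candidate)  # each value has precision, recall, fmeasure
        for name, score in scores.items():
            f_scores[name].append(score.fmeasure)
    return {name: mean(values) for name, values in f_scores.items()}
```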
4.2 Manual Evaluation

The manual evaluation was conducted independently by three domain experts. Each did a blinded evaluation of the eight summary types (seven machine-generated ones plus the original discharge summary) for five care episodes. Hence, the total number of evaluated summaries per evaluator was 40. All summaries were evaluated with a 30-item schema, or evaluation tool (see Table 1). This tool was constructed based on the content criteria guideline for medical discharge summaries used in the region where the data was collected, so each item corresponds to a criterion in this guideline. In this way, the items were designed to evaluate the medical content of the summaries from the perspective of discharge summary writing. When evaluating a summary, each of these items was evaluated on a 4-class scale from -1 to 2, where -1 = not relevant, 0 = not included, 1 = partially included and 2 = fully included. The evaluators also had all the corresponding clinical notes at their disposal when performing the evaluation.

The items in our tool are somewhat comparable to the evaluation criteria used in an earlier study that evaluated computer-generated discharge summaries for neonates, where the summaries were generated using lists of diagnoses linked to ICD codes [14]. However, the data summarized in the aforementioned work is mainly structured and pre-classified data, so neither the summarization methods nor their performance are directly comparable to our work.

The evaluators found the manual evaluation, following the 30-item tool, to be very difficult and extremely time consuming. This was mainly because the evaluation tool, i.e. its items, was very detailed and required a lot of clinical judgement. Therefore, for this study, only five patient care episodes and their corresponding summaries, generated by the summarization system, were evaluated by all three evaluators. This number of evaluated summaries is too small for generalizing the results, but it should still give some indication of the quality of the various summarization methods in the summarization system. The 30 items in the manual evaluation tool are presented in Table 1.

Table 1. Items evaluated in the manual evaluation.

Evaluation criteria:
– Care period
– Care place
– Events (diagnoses/procedure codes) of care episode
– Procedures of care episode
– Long-term diagnoses
– Reason for admission
– Sender
– Current diseases, which have impact on care solutions
– Effects of current diseases, which have impact on care solutions
– Current diseases, which have impact on medical treatment solutions
– Effects of current diseases, which have impact on medical treatment solutions
– Course of the disease
– Test results in chronological order with reasons
– Test results in chronological order with consequences
– Procedures in chronological order with reasons
– Procedures in chronological order with consequences
– Conclusions
– Assessment of the future
– Status of the disease at the end of the treatment period
– Description of patient education
– Ability to work
– Medical certificates (including mention of content and duration)
– Prepared or requested medical statements
– A continued care plan
– Home medication
– Follow-up instructions
– Indications for re-admission
– Agreed follow-up treatments in the hospital district
– Other disease, symptom or problem that requires further assessment
– Information of responsible party for follow-up treatment

4.3 Evaluation Statistics

In order to test whether the differences in the automatic evaluation scores between the different summarization methods were statistically significant, we performed the Wilcoxon signed-rank test [15] for each evaluation measure and each pair of methods, at significance level p = 0.05. We also calculated the corresponding p-values for the manual evaluation. These were obtained with the paired Wilcoxon test computed over the 30 mean content criteria scores of the 30-item evaluation tool (see Table 1). The mean scores were calculated by averaging the manually entered scores of the three evaluators and five care episodes. The -1 values indicating non-relevance were treated as missing values (i.e. they were ignored when calculating the averages). Also here a significance level of p = 0.05 was used.
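A minimal sketch of the statistics described above, under our reading of the setup: for each summarization method, `ratings` is assumed to be a 30 x 15 array (items by evaluator-episode pairs) of scores in {-1, 0, 1, 2}; the array layout and all names are our own assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

def mean_item_scores(ratings):
    """Per-item means over evaluators and episodes, treating -1 (not relevant) as missing."""
    masked = np.where(ratings == -1, np.nan, ratings.astype(float))
    return np.nanmean(masked, axis=1)  # one mean score per evaluation item

def compare_methods(ratings_a, ratings_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test over the per-item mean scores of two methods."""
    means_a, means_b = mean_item_scores(ratings_a), mean_item_scores(ratings_b)
    keep = ~np.isnan(means_a) & ~np.isnan(means_b)  # drop items marked not relevant throughout
    statistic, p_value = wilcoxon(means_a[keep], means_b[keep])
    return p_value, p_value < alpha
```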
The agreement between the three evaluators was investigated by calculating the intraclass correlation coefficient (ICC) for all manually evaluated summaries, using a two-way mixed model. To identify which of the automatic evaluation metrics best follows the manual evaluation, the Pearson product-moment correlation coefficient (PPMCC) and Spearman's rank correlation coefficient (Spearman's rho) [16] were calculated between the normalized manual evaluation scores and each of the automatic evaluation scores (from Table 2).

5 Evaluation Results

The results from the automatic and the manual evaluations are presented in Table 2. The scores from the automatic evaluation are the average F-scores over the 156 test care episodes, while the results from the manual evaluation are the average scores from a subset of five care episodes (also included in the automatic evaluation), all evaluated by three domain experts. The latter scores have been normalized by dividing by the score of the highest-ranking method. This was done in an attempt to scale these scores to the F-scores from the automatic evaluation.

Table 2. Evaluation results; each column is ranked internally by score.

Rank | ROUGE-N1 F-score | ROUGE-N2 F-score | ROUGE-L F-score | ROUGE-SU4 F-score | Manual (norm_max)
1    | Original 1.0000  | Original 1.0000  | Original 1.0000 | Original 1.0000   | Original 1.0000
2    | Oracle 0.7964    | Oracle 0.7073    | Oracle 0.7916   | Oracle 0.6850     | SumMet2 0.6738
3    | SumMet2 0.6700   | SumMet2 0.5922   | SumMet2 0.6849  | SumMet2 0.5841    | Oracle 0.6616
4    | SumMet5 0.5957   | SumMet5 0.4838   | SumMet5 0.5902  | SumMet5 0.4723    | SumMet5 0.6419
5    | SumMet1 0.4785   | SumMet1 0.3293   | SumMet1 0.4717  | SumMet1 0.3115    | SumMet3 0.5326
6    | SumMet4 0.3790   | SumMet4 0.2363   | SumMet4 0.3725  | SumMet4 0.2297    | SumMet1 0.5167
7    | Random 0.3781    | Random 0.2094    | Random 0.3695   | SumMet3 0.2013    | Random 0.5161
8    | SumMet3 0.3582   | SumMet3 0.2041   | SumMet3 0.3521  | Random 0.2001     | SumMet4 0.5016

Note: the Original discharge summary is of course not a product of any of the summarization methods.

All automatic metrics and the manual evaluation agreed on which summarization methods belong to the top three and which to the bottom four. When calculating significance levels for the automatic evaluations of the five highest-ranked methods, the differences were always significant. However, in several cases the differences between the three lowest-ranked methods, SumMet4, SumMet3 and Random, were not statistically significant. This is consistent with all the evaluation measures agreeing on which the five best-performing methods were, whereas the three worst methods are equally bad, all performing at a level that does not significantly differ from picking the sentences for the summary randomly.

For the manual evaluation, the original discharge summaries scored significantly higher than any of the generated summaries. Furthermore, the summaries produced by the Oracle, SumMet2 and SumMet5 methods were evaluated to be significantly better than those produced by the four other methods. Thus, the manual evaluation divided the automated summarization methods into two distinct groups: one group that produced seemingly meaningful summaries, and another that did not work significantly better than the Random method. This division closely agrees with that of the automated evaluations, the difference being that in the manual evaluation also SumMet1 ended up in the bottom group of badly performing summarization methods.
Table 3. PPMCC and Spearman's rho results, indicating how the scoring by the automatic evaluation metrics correlates with the normalized manual evaluation scores.

Evaluation metric | PPMCC (p-value)  | Spearman's rho (p-value)
ROUGE-N1          | 0.9293 (0.00083) | 0.8095 (0.01490)
ROUGE-N2          | 0.9435 (0.00043) | 0.8095 (0.01490)
ROUGE-L           | 0.9291 (0.00084) | 0.8095 (0.01490)
ROUGE-SU4         | 0.9510 (0.00028) | 0.8571 (0.00653)

Fig. 2. Graph showing the average manual scores (norm_max), calculated from five care episodes (evaluated by three domain experts), and the average F-scores by ROUGE-SU4, calculated from 156 care episodes. The order, from left to right, is sorted according to descending manual scores.

Table 3 presents the PPMCC and Spearman's rho results, indicating how each automatic evaluation metric correlates with the manual evaluation scores. Spearman's rho is a rank-correlation measure, so it does not find any difference between most of the measures, since they rank the methods in exactly the same order (except ROUGE-SU4, which has a single rank difference compared to the others). In contrast, PPMCC measures linear dependence, taking the magnitudes of the scores into account in addition to the ranks, so it observes some additional differences between the measures. This shows that ROUGE-SU4 has the best correlation with the manual evaluation. Figure 2 illustrates the normalized manual evaluation scores together with the ROUGE-SU4 F-scores.

6 Discussion

All automatic evaluation metrics and the manual evaluation agreed that the top three automatic summarization methods, SumMet2, SumMet5 and Oracle, significantly outperform the Random method. Thus we can with a certain confidence assume that this reflects the actual performance of the utilized summarization methods. Oracle is of course not a proper method, given that it is "cheating", but it is a good indicator of what the upper performance limit is. The reliability of the manual evaluation is naturally rather weak, given that only five care episodes were evaluated by the three evaluators. The findings of the manual evaluation are thus not generalizable, and more manual evaluation is needed to confirm them.

On a more general level, the results indicate that the utilized approach – using the original discharge summaries as gold summaries – is viable. This also means that the same evaluation framework can potentially be transferred to clinical text in other countries and languages that follow a hospital practice similar to that in Finland.

The manual evaluation results show that the various summarization methods are less discriminated in terms of scores than in the automatic evaluation. We believe that this is partly due to the small evaluation set these scores are based on, and partly due to the evaluation tool that was utilized. For these reasons we are still looking into ways to improve the manual evaluation tool before we conduct further manual evaluation. It is interesting to see that SumMet2 is considered to outperform the Oracle method according to the manual evaluators.
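For illustration, the correlations in Table 3 can be recomputed from the per-method averages in Table 2. The sketch below does this for the ROUGE-SU4 column with SciPy; the variable names and the assumption that the correlations are taken over these eight per-method scores are ours, but the output should come close to the ROUGE-SU4 row of Table 3.

```python
from scipy.stats import pearsonr, spearmanr

# Per-method scores from Table 2, ordered: Original, Oracle, SumMet2, SumMet5,
# SumMet1, SumMet4, Random, SumMet3.
rouge_su4 = [1.0000, 0.6850, 0.5841, 0.4723, 0.3115, 0.2297, 0.2001, 0.2013]
manual    = [1.0000, 0.6616, 0.6738, 0.6419, 0.5167, 0.5016, 0.5161, 0.5326]

ppmcc, p_pearson = pearsonr(manual, rouge_su4)
rho, p_spearman = spearmanr(manual, rouge_su4)
print(f"PPMCC {ppmcc:.4f} (p={p_pearson:.5f}), Spearman's rho {rho:.4f} (p={p_spearman:.5f})")
# Expected to be close to the ROUGE-SU4 row of Table 3:
# PPMCC 0.9510 (p=0.00028), Spearman's rho 0.8571 (p=0.00653)
```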
6.1 Lessons learned from the manual evaluation

The agreement between the evaluators in the manual evaluation was calculated over the 40 different summaries evaluated by each of the three evaluators. The ICC value for absolute agreement was 0.68 (95% CI 0.247–0.853, p < 0.001). There is no definite limit in the literature on how to interpret ICC values, but there are guidelines suggesting that values below 0.4 are poor, values from 0.4 to 0.59 are fair, values from 0.6 to 0.74 are good and values from 0.75 to 1.0 are excellent in terms of the level of interrater agreement [17]. Based on these suggested limits, the agreement between the evaluators in the manual evaluation was good. Still, there were some differences between the evaluations conducted by the three evaluators, which indicates that the criteria used in the 30-item manual evaluation tool allowed this variance; therefore, the tool and its items need further development. Another aspect is that the evaluators would need more training concerning the use of the criteria, and possibly stricter guidelines.

Furthermore, the evaluators reported that the manual evaluation was difficult and very time consuming, due to the numerous and detailed items in the manual evaluation tool. They also reported that the judgement process necessary when evaluating the summaries was too demanding. It became obvious that several of the items in the evaluation tool were too specifically targeting structured information, that is, information that is already identified and classified in the health record system and therefore does not need to be present in the unstructured free text from which the summaries are generated. Examples are 'Care period', 'Care place' and 'Sender'. In the future, a shorter tool, i.e. one with fewer items, stricter criteria and more detailed guidelines for the evaluators, is needed. One important property of such a new tool would be that, when used by human evaluators, good and bad summaries (i.e. summarization methods) are properly discriminated in terms of scoring.

7 Conclusion and Future Work

In this paper we have presented the results from automatic and manual evaluation of seven different methods for automatically generating clinical text summaries. The summary documents were composed from the free text of the clinical daily notes written by physicians in relation to patient care. Among the seven evaluated summarization methods were the control methods Random and Oracle. For the automatic evaluation, the corresponding original discharge summaries were used as gold summaries, and four ROUGE metrics were used: ROUGE-N1, ROUGE-N2, ROUGE-L and ROUGE-SU4.

The evaluation results seem promising in that the ranking order of the various summarization methods, as ranked by all the automatic evaluation metrics, corresponds well with that of the manual evaluation. These preliminary results indicate that the utilized automatic evaluation setup can be used as an automated and reliable way to rank clinical summarization methods internally in terms of their performance.

More manual evaluation, on a larger sample of care episodes, is needed to confirm the findings in this study. In this context, more research is needed to develop a manual evaluation tool that better discriminates good from bad summaries, and that is easier for the evaluators to use.
This preliminary work provided us with good insight and ideas on how to further develop the manual evaluation tool so that it is suited for a larger-scale manual evaluation. As future work, we also plan to conduct a so-called extrinsic evaluation of the summarization methods, meaning that the various summaries produced by the system are evaluated in terms of their impact on clinical work.

8 Acknowledgements

This study was financially supported partly by the Research Council of Norway through the EviCare project (NFR project no. 193022), the Turku University Hospital (EVO 2014), and the Academy of Finland (project no. 140323). The study is a part of the research projects of the Ikitik consortium (http://www.ikitik.fi).

References

1. Hall, A., Walton, G.: Information overload within the health care system: a literature review. Health Information & Libraries Journal 21(2) (2004) 102–108
2. Van Vleck, T.T., Stein, D.M., Stetson, P.D., Johnson, S.B.: Assessing data relevance for automated generation of a clinical summary. In: AMIA Annual Symposium Proceedings. Volume 2007, American Medical Informatics Association (2007) 761
3. Kripalani, S., LeFevre, F., Phillips, C.O., Williams, M.V., Basaviah, P., Baker, D.W.: Deficits in communication and information transfer between hospital-based and primary care physicians: implications for patient safety and continuity of care. JAMA 297(8) (2007) 831–841
4. Feblowitz, J.C., Wright, A., Singh, H., Samal, L., Sittig, D.F.: Summarization of clinical information: A conceptual model. Journal of Biomedical Informatics 44(4) (2011) 688–699
5. Roque, F.S., Slaughter, L., Tkatšenko, A.: A comparison of several key information visualization systems for secondary use of electronic health record content. In: Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents, Association for Computational Linguistics (2010) 76–83
6. Liu, S.: Experiences and reflections on text summarization tools. International Journal of Computational Intelligence Systems 2(3) (2009) 202–218
7. Radev, D.R., Jing, H., Budzikowska, M.: Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies. In: Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization. Volume 4 of NAACL-ANLP-AutoSum '00, Association for Computational Linguistics (2000) 21–30
8. Afantenos, S., Karkaletsis, V., Stamatopoulos, P.: Summarization from medical documents: a survey. Artificial Intelligence in Medicine 33(2) (2005) 157–177
9. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In Moens, M.F., Szpakowicz, S., eds.: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, Association for Computational Linguistics (July 2004) 74–81
10. Koskenniemi, K.: Two-level model for morphological analysis. In Bundy, A., ed.: Proceedings of the 8th International Joint Conference on Artificial Intelligence, Karlsruhe, FRG, August 1983, William Kaufmann (1983) 683–685
11. Karlsson, F.: Constraint grammar as a framework for parsing running text. In: Proceedings of the 13th Conference on Computational Linguistics - Volume 3, COLING '90, Stroudsburg, PA, USA, Association for Computational Linguistics (1990) 168–173
12. Nenkova, A., McKeown, K.: Automatic summarization. Foundations and Trends in Information Retrieval 5(2–3) (2011) 103–233
13. Dang, H.T., Owczarzak, K.: Overview of the TAC 2008 update summarization task. In: Proceedings of the Text Analysis Conference (2008) 1–16
14. Lissauer, T., Paterson, C., Simons, A., Beard, R.: Evaluation of computer generated neonatal discharge summaries. Archives of Disease in Childhood 66(4 Spec No) (1991) 433–436
15. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics 1 (1945) 80–83
16. Lehman, A.: JMP for Basic Univariate and Multivariate Statistics: A Step-by-Step Guide. SAS Institute (2005)
17. Cicchetti, D.V.: Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment 6(4) (1994) 284