Evaluating classification power of linked admission data sources with text mining

Simon KOCBEK a,b,1, Lawrence CAVEDON a, David MARTINEZ b,c, Christopher BAIN d,e, Chris MAC MANUS d, Gholamreza HAFFARI e, Ingrid ZUKERMAN e, Karin VERSPOOR b

a Computer Science & Info Tech, RMIT University, Melbourne
b Dept of Computing and Information Systems, University of Melbourne, Melbourne
c MedWhat.com, San Francisco
d Health Informatics Department, Alfred Hospital, Melbourne
e Faculty of Information Technology, Monash University, Melbourne

1 Corresponding Author.

Abstract. Lung cancer is a leading cause of death in developed countries. This paper presents a text mining system using Support Vector Machines for detecting lung cancer admissions. The performance of the system using different clinical data sources is evaluated. We use radiology reports as an initial data source and add other sources, such as pathology reports, patient demographic information and hospital admission information. Results show that mining over linked data sources significantly improves classification performance, with a maximum F-Score improvement of 0.057.

Keywords. Text mining, natural language processing, lung cancer, linked hospital data

Introduction

Text and data mining are proving to be increasingly important and powerful techniques for extracting information and insights from Health and Hospital Information Systems [1-5]. Mining hospital data holds the potential for new discoveries as well as improved efficiencies and communication within hospital systems. Much valuable information in hospital records is represented in free-text format, e.g., radiology and pathology reports, requiring the application of Text Mining (TM) and Natural Language Processing (NLP) techniques.

Most previous clinical text mining applications have made use of a single textual data source, e.g., radiology reports, in order to identify or mine information related to a single condition (e.g., [1,2]). However, the increase in data linkage (i.e., multiple data sources being linked by patient id) in Hospital Information Systems is creating opportunities for more powerful and accurate text mining techniques that combine insights from multiple data sources [6].

In this paper, we evaluate the performance of text mining for the task of identifying patients admitted to a hospital for treatment of lung cancer. Lung cancer is a leading cause of death in developed countries, and automatically mapping patient admissions to ICD (International Classification of Diseases) codes directly from hospital records is a precursor to automated ICD coding, a massively time-consuming manual process at the core of the procedure followed to fund hospitals. The focus of this paper is to evaluate the value of data linkage and to investigate the source of value within different hospital data sources. In particular, we consider a large collection of radiology and pathology reports, along with associated metadata sources, and build classifiers for each type of data source, as well as for their combination. Our results confirm that, as might be expected, jointly mining multiple linked data sources improves text classification performance. Our analysis also identifies which information source is most valuable for mining for the specified disease, although we expect this to vary with different diseases.

1. Related work

A substantial amount of relevant disease information exists in various types of medical records.
Much of this information is in the form of free text; hence, text mining represents a promising strategy for building machine learning classifiers that take advantage of the richness of such records. Both radiology and pathology reports have been studied as sources of specific clinical information in previous text mining studies. A pathology report describes the results of examining cells and tissues under a microscope after a biopsy or surgery. A radiology report represents a specialist's interpretation of images related to a patient's signs and symptoms.

Hripcsak et al. [1] used NLP techniques to evaluate the automatic coding of 889,921 chest radiology reports. Nguyen et al. [2] performed classification of lung cancer stages from pathology reports. In their follow-up work [3], a rule-based system was used to classify cancer-notifiable pathology reports from a small corpus (approx. 500 reports), obtaining very high sensitivity, specificity and Positive Predictive Value (PPV). Pathology reports have also been analysed to extract breast cancer characteristics into a knowledge model [4] and to identify relevant named entities [5].

In previous work [7], we built a system for detecting lung cancer admissions based on radiology reports linked to patient metadata for the financial years 2012-2013 and 2013-2014. A similar approach is adopted in this paper, where we use TM techniques to extract useful information about lung cancer. We extend the prior work by exploring the impact of incorporating two additional data sources: pathology reports and radiology questions (i.e., the purpose stated by the clinician for requesting a scan). We also measure the statistical significance of classification performance using the different data sources. Note that the goal of this paper is not to achieve better classification performance than previous systems, but to achieve comparable performance and to explore the value of various data sources in mining information related to a specified question.

2. Methods

2.1. Data source

The data for this study was extracted from the Alfred Health Informatics Platform, called REASON [8], which provides a single data warehouse view of multiple data sources within the Alfred Health system, linked by a unique anonymised patient id. Data for the current study was extracted from REASON under ethics approval from the Alfred Health Human Research Ethics Committee, in the form of a de-identified set. A high-level architecture of the REASON platform is shown in Figure 1. Table 1 provides an overview of some of the key record types (and numbers of records) in the platform relevant to our current task, though it is not a complete listing.

For the purpose of this study, we extracted the textual form of radiology and pathology reports for the financial years 2012-2013 and 2013-2014. Each report was assigned an admission identifier, which is in turn linked to patient metadata. The following metadata associated with each admission were extracted: patient demographic data (gender, age, ethnic origin, country, language, marital status, religion, and death date) and hospital-related admission data (hospital code, admission date and time, discharge date and time, length of stay, reason for the admission, admission unit, discharge unit, admission type, source, destination and criteria). Radiology reports were also associated with radiology questions, i.e., a short description of the reason given by the clinician for requesting the scan.
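To make this linkage concrete, the sketch below joins radiology reports, pathology reports and admission metadata on a shared admission identifier. It is a minimal illustration only, not the study pipeline; the data values and the table and column names (admission_id, radiology_text, etc.) are invented for the example and do not reflect the REASON schema.

  import pandas as pd

  # Invented miniature extracts; names and values are illustrative only.
  radiology = pd.DataFrame({
      "admission_id": [101, 102],
      "radiology_text": ["opacity in right upper lobe", "clear lung fields"],
  })
  pathology = pd.DataFrame({
      "admission_id": [101],
      "pathology_text": ["adenocarcinoma confirmed on biopsy"],
  })
  metadata = pd.DataFrame({
      "admission_id": [101, 102],
      "age": [67, 54],
      "length_of_stay_days": [12, 3],
  })

  # Left joins on the shared admission identifier: every admission keeps
  # its radiology report, while pathology and metadata fields are attached
  # where available and left missing otherwise.
  linked = (radiology
            .merge(pathology, on="admission_id", how="left")
            .merge(metadata, on="admission_id", how="left"))
  print(linked)

The left join mirrors the situation in our data, where some admissions lack pathology reports or metadata (see Section 2.2).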
The initial numbers of admission records used in this study were as follows: 40,800 radiology reports; 20,872 pathology reports; and 121,700 metadata entries.

Figure 1. A high-level architectural view of REASON.

Table 1. Examples of numbers of records by type in REASON.

  Data                                    Record numbers
  Admissions                                     881,653
  Emergency Encounter                            912,931
  Pathology Results - Atomic                  43,606,065
  Pathology Results - Textual                    667,303
  Patients                                     1,884,527
  Pharmacy Drug Dispense Transactions          4,131,227
  Radiology Reports                              756,164
  Radiology Test Orders                          792,312
  Surgeries Performed                            158,853

2.2. Gold Standard data set

Each admission is associated with a set of ICD-10 codes, which are annotated in the admission record by an internal clinical coder for reporting purposes. These are used in our study as the ground truth for building the gold standard data set. The ICD codes are ignored when testing the classifiers, i.e., the classification task consists of identifying those records which contain the ICD code of interest in the gold standard data set. To identify positive lung cancer cases, we used the ICD-10 code C34.*: Malignant neoplasm of bronchus and lung.

In our dataset, only 496 of the 40,800 admissions with radiology reports were positive for lung cancer. The highly skewed nature of the data poses a specific challenge to automated machine learning approaches, which generally perform better over balanced class distributions. To address this problem, we performed subsampling, randomly selecting a subset of negative admissions to balance the dataset. Other, more computationally expensive methods (such as oversampling [9]) could have been used; however, due to time constraints and the high number of experiments to be run, these methods were not appropriate for this work. The final gold standard dataset therefore contained 992 admissions. All admissions contained a radiology report and a radiology question; 833 admissions also contained metadata, and 518 admissions also contained a pathology report.

2.3. Data representation

Machine learning algorithms require a representation of the relevant features of each data point that can be used to build a predictive classifier. The feature representation we adopted for our task combines characteristics obtained from the text reports, along with the patient and hospital metadata linked to each admission.

Text in radiology reports, radiology questions and pathology reports was processed with the MetaMap tool [10] from the US National Library of Medicine. MetaMap is a program that identifies and normalises biomedical terminology from the Unified Medical Language System (UMLS) Metathesaurus in biomedical text. Below is a short sample of MetaMap-annotated phrases from the sentence "replaced with a right frontal approach".

  Meta Mapping (701):
    748  C0559956: Replaced (Replacement) [Functional Concept]
    748  C0205090: Right [Spatial Concept]
    778  C2316681: Frontal approach [Functional Concept]

We employed the NegEx module to identify the polarity (negative or positive, e.g., "Non contrast in the brain") of phrases. NegEx is a simple algorithm, included in MetaMap, that implements several regular expressions indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases [11]. We collected the phrases mapped to UMLS concepts in each sentence. Identified phrases were marked with whether the concepts were found in a positive or negative context.
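To illustrate the kind of decision NegEx makes, the following is a highly simplified, NegEx-inspired sketch. It is not the actual NegEx implementation bundled with MetaMap; the trigger list and the 40-character look-back window are invented for this illustration.

  import re

  # A few negation triggers of the kind NegEx matches with regular
  # expressions; the real algorithm uses a much larger curated list.
  NEGATION_TRIGGERS = re.compile(
      r"\b(no|not|non|without|denies|negative for)\b", re.IGNORECASE)

  def phrase_polarity(sentence, phrase):
      """Return 'negative' if a negation trigger appears shortly before
      the phrase in the sentence, and 'positive' otherwise."""
      idx = sentence.lower().find(phrase.lower())
      if idx == -1:
          return "positive"   # phrase not found; default to positive
      window = sentence[max(0, idx - 40):idx]   # limited negation scope
      return "negative" if NEGATION_TRIGGERS.search(window) else "positive"

  print(phrase_polarity("Non contrast in the brain", "contrast"))  # negative
  print(phrase_polarity("Mass in the right upper lobe", "mass"))   # positive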
Phrases from different reports of the same kind (e.g., radiology reports) belonging to the same admission were merged so that a repeated phrase was counted only once. We then built a series of feature vectors, denoted r[_q][_p][_m]. The r feature vector represents our baseline and contains a "bag" (i.e., an unordered list) of biomedical phrases from radiology reports only. The other feature vectors add the following optional sources: q - radiology questions, p - pathology reports, and m - metadata.

2.4. Classification and evaluation

We treated ICD codes as targets for classification. To identify the data sources that contain the most valuable information for identifying lung cancer admissions, a classification framework was built for each feature vector described above. We used the Weka Toolkit [12] implementation of the Support Vector Machine algorithm, since it performed robustly in our previous work [7].

Evaluation of TM and NLP systems typically involves three metrics: precision, recall and F-Score. Precision of the positive/negative class (also called positive/negative predictive value) is the ratio of correctly classified positive/negative instances to the number of all instances classified as positive/negative. Recall of the positive/negative class is computed as the number of correctly classified instances from the positive/negative class divided by the total number of instances in that class; this is also known as sensitivity. F-Score is the weighted harmonic mean of precision and recall.

We performed 10-fold cross-validation: the data was randomly partitioned into 10 folds, and in each iteration the classifier was trained on 9 folds and tested on the remaining fold. We measured precision, recall and F-Score for each fold. We calculated the statistical significance of F-Score differences using the Wilcoxon signed-rank test, as recommended in [13].

3. Results

Table 3 shows precision, recall and F-Score measurements for the SVM classifiers built for 8 different combinations of feature vectors. The classifier with the lowest score (r) correctly classified 801 admissions, while the classifier with the highest score (r_q_p_m) correctly classified 915 admissions.

Table 3. Precision, recall and F-Score for classifiers built on 8 different feature vectors.

             r      r_q    r_p    r_m    r_q_p  r_q_m  r_p_m  r_q_p_m
  Precision  0.875  0.898  0.888  0.902  0.906  0.912  0.920  0.932
  Recall     0.873  0.896  0.886  0.901  0.904  0.911  0.917  0.930
  F-Score    0.873  0.896  0.886  0.901  0.904  0.911  0.917  0.930

Table 4 shows F-Score differences between pairs of classifiers with different combinations of feature vectors. Column names denote the initial feature combination, and row names denote the added information source. Values marked with an asterisk are statistically significant; comparisons that are not applicable are left blank. For example, the top left cell gives the F-Score difference between the classifier built on phrases from radiology reports only (r) and the classifier with radiology question phrases added (r_q).

Table 4. F-Score differences between pairs of classifiers and statistical significance (* = statistically significant).

        r        r_q      r_p      r_m      r_q_p    r_q_m    r_p_m
  +q    +0.023*           +0.018*  +0.010                     +0.013
  +p    +0.013   +0.008            +0.016            +0.019*
  +m    +0.028*  +0.015   +0.031*           +0.026*

As can be seen, this enhanced classifier (r_q) performed significantly better than the baseline system (r), with an F-Score difference of +0.023. Similarly, the top right cell shows that the classifier using all features (r_q_p_m) performs better than the classifier without radiology question phrases (r_p_m); however, this difference was not statistically significant.
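The per-fold comparison underlying Table 4 can be sketched as follows. The per-fold F-Scores below are invented placeholders (only aggregate scores are reported in Table 3), and scipy's wilcoxon is a stand-in for whatever statistics implementation was actually used.

  from scipy.stats import wilcoxon

  # Invented per-fold F-Scores for the baseline (r) and the full feature
  # set (r_q_p_m) across the 10 cross-validation folds.
  f_r       = [0.861, 0.874, 0.869, 0.882, 0.858,
               0.876, 0.871, 0.879, 0.867, 0.873]
  f_r_q_p_m = [0.920, 0.936, 0.926, 0.941, 0.919,
               0.934, 0.931, 0.938, 0.923, 0.932]

  # Paired, non-parametric comparison of the two classifiers over folds,
  # following the recommendation in [13].
  statistic, p_value = wilcoxon(f_r, f_r_q_p_m)
  print("Wilcoxon statistic = %.1f, p = %.4f" % (statistic, p_value))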
4. Discussion

Our baseline classifier for automatically identifying cases of lung cancer, built on radiology report phrases only, shows performance comparable to that in our previous work [7] (the results are not directly comparable, since the two datasets cover different timeframes). Precision, recall and F-Score yield similar results for each single feature vector (a single column in Table 3), which indicates that our classifiers misclassified similar numbers of positive and negative examples.

Including additional admission data sources improved classification performance. The classifier with the highest performance was built using features from all four data sources. However, statistical tests showed that not all performance increases were significant. An example of a non-significant improvement is combining radiology reports with pathology reports (first column of Table 4, row +p). In contrast, adding metadata or radiology questions to radiology reports significantly improved performance. In addition, these two data sources significantly improved performance when added to the already combined radiology and pathology reports (third column of Table 4). Finally, adding metadata to the combination of radiology reports, pathology reports and radiology questions further improves performance (column 5 of Table 4). Pathology reports significantly increased performance only when added to the combination of radiology reports, radiology questions, and metadata.

Not unexpectedly, our results indicate that more informed systems can be built by including multiple data sources. Radiology questions and metadata seem to contain crucial information for detecting lung cancer cases, significantly improving performance when added to radiology reports or to the combination of radiology and pathology reports. The lack of statistical significance when adding pathology reports to train the system may be due to a dearth of pathology reports (only 518 of the 992 admissions with a radiology report had associated pathology reports).

5. Conclusion

We have shown that mining multiple linked data sources improves the classification performance of lung cancer ICD-10 codes from textual data, compared to using a single data source. We expect similar results for other diseases, and plan to use different ICD-10 codes as classification targets in our future work. In addition, we plan to use other techniques to address the problem of highly skewed data sets, such as oversampling [9] or cost-sensitive learning [14]. Finally, we plan to use methods for identifying the features from specific data sources that most influence classification performance. Our data have a high number of features compared to the number of samples, and we expect that some of these features are redundant or irrelevant: we plan to apply feature selection methods [15], which should also shorten model training times on the whole dataset and reduce the potential of over-fitting to the data.
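As an indication of this planned direction, the sketch below applies chi-squared feature selection to a toy bag-of-phrases representation. It was not used in this study; the phrase strings, the labels, and the choice of scikit-learn are illustrative assumptions only.

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.feature_selection import SelectKBest, chi2

  # Toy data: each string stands for the merged UMLS phrases of one
  # admission; labels mark lung cancer (1) vs. other (0). All invented.
  admissions = [
      "pulmonary_nodule mass_of_lung biopsy",
      "clear_lung_fields no_acute_disease",
      "pulmonary_nodule hilar_mass",
      "no_acute_disease normal_chest",
  ]
  labels = [1, 0, 1, 0]

  X = CountVectorizer().fit_transform(admissions)   # sparse phrase counts
  selector = SelectKBest(chi2, k=3).fit(X, labels)  # keep 3 top-scoring phrases
  X_reduced = selector.transform(X)                 # smaller feature space

Chi-squared scoring is a common choice for sparse count features of the kind produced by the bag-of-phrases representation in Section 2.3.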
References

[1] G. Hripcsak, J.H. Austin, P.O. Alderson, C. Friedman, Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports, Radiology 224 (2002), 157-163.
[2] A.N. Nguyen, M.J. Lawley, D.P. Hansen, R.V. Bowman, B.E. Clarke, E.E. Duhig, et al., Symbolic rule-based classification of lung cancer stages from free-text pathology reports, J. Am. Med. Inform. Assoc. 17 (2010), 440-445.
[3] A. Nguyen, J. Moore, G. Zuccon, M. Lawley, S. Colquist, Classification of pathology reports for cancer registry notifications, in: Health Informatics: Building a Healthcare Future Through Trusted Information - Selected Papers from the 20th Australian National Health Informatics Conference (HIC 2012) 178 (2012), 150.
[4] A. Coden, G. Savova, I. Sominsky, M. Tanenblatt, J. Masanz, K. Schuler, J. Cooper, et al., Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model, J. Biomed. Inform. 42 (2009), 937-949.
[5] M. Tanenblatt, A. Coden, I. Sominsky, The ConceptMapper approach to named entity recognition, in: Proceedings of LREC 2010, European Language Resources Association, Malta (2010), 546-551.
[6] J. Sorace, D.R. Aberle, D. Elimam, S. Lawvere, O. Tawfik, W.D. Wallace, Integrating pathology and radiology disciplines: an emerging opportunity?, BMC Medicine 10(1) (2012), 100.
[7] D. Martinez, L. Cavedon, Z. Alam, C. Bain, K. Verspoor, Text mining for lung cancer cases over large patient admission data, Big Data Conference, Abstract Book, Melbourne, April 2014, 24-25.
[8] C. Bain, C. MacManus, Advancing data management and usage in a major Australian health service: The REASON discovery platform™, in: 2014 International Conference on Data Science & Engineering (ICDSE) (2014), 38-43.
[9] H. Han, W.Y. Wang, B.H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: Advances in Intelligent Computing, Springer, Berlin Heidelberg (2005), 878-887.
[10] A.R. Aronson, Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program, in: AMIA Annual Symposium Proceedings, Washington DC (2001), 17-21.
[11] W.W. Chapman, W. Bridewell, P. Hanbury, et al., A simple algorithm for identifying negated findings and diseases in discharge summaries, J. Biomed. Inform. 34 (2001).
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explorations 11(1) (2009).
[13] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1-30.
[14] N. Thai-Nghe, Z. Gantner, L. Schmidt-Thieme, Cost-sensitive learning methods for imbalanced data, in: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2010), (2010).
[15] D.D. Lewis, Feature selection and feature extraction for text categorization, in: Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics (1992).