<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating classification power of linked admission data sources with text mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simon KOCBEK</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lawrence CAVEDON</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David MARTINEZ</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher BAIN</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris MAC MANUS</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gholamreza HAFFARI</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingrid ZUKERMAN</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karin VERSPOOR</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science &amp; Info Tech, RMIT University</institution>
          ,
          <addr-line>Melbourne</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept of Computing and Information Systems, University of Melbourne</institution>
          ,
          <addr-line>Melbourne</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Information Technology, Monash University</institution>
          ,
          <addr-line>Melbourne</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Health Informatics Department, Alfred Hospital</institution>
          ,
          <addr-line>Melbourne</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>MedWhat.com</institution>
          ,
          <addr-line>San Francisco</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Lung cancer is a leading cause of death in developed countries. This paper presents a text mining system using Support Vector Machines for detecting lung cancer admissions. Performance of the system using different clinical data sources is evaluated. We use radiology reports as an initial data source and add other sources, such as pathology reports, patient demographic information and hospital admission information. Results show that mining over linked data sources significantly improves classification performance with a maximum F-Score improvement of 0.057.</p>
      </abstract>
      <kwd-group>
        <kwd>Text mining</kwd>
        <kwd>natural language processing</kwd>
        <kwd>lung cancer</kwd>
        <kwd>linked hospital data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Text and data mining are proving to be increasingly important and powerful techniques
for extracting information and insights from Health and Hospital Information Systems
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1-5</xref>
        ]. Mining hospital data holds the potential for new discoveries as well as improved
efficiencies and communication within hospital systems. Much valuable information in
hospital records is represented in free text format, e.g., radiology and pathology reports,
requiring the application of Text Mining (TM) and Natural Language Processing (NLP)
techniques.
      </p>
      <p>
        Most previous clinical text mining applications have made use of a single textual
data source, e.g., radiology reports, in order to identify or mine information related to a
single condition (e.g., [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ]). However, the increase in data linkage (i.e., multiple data
sources being linked by patient id) in Hospital Information Systems is creating
opportunities for more powerful and accurate text mining techniques that combine
insights from multiple data sources [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In this paper, we describe the performance of text mining in the context of the
challenge of identifying patients admitted to a hospital for treatment of lung cancer.
Lung cancer is a leading cause of death in developed countries, and automatically
mapping patient admissions to ICD (International Classification of Diseases) codes directly from
hospital records is a precursor to automated ICD coding, a massively time-consuming
manual process at the core of the procedure followed to fund hospitals.</p>
      <p>The focus of this paper is to evaluate the value of data linkage and investigate the
source of value within different hospital data sources. In particular, we consider a large
collection of radiology and pathology reports, along with associated metadata sources,
and build classifiers for each type of data source, as well as their combination. Our
results confirm that, as might be expected, jointly mining multiple linked data
sources improves text classification performance. Analysis also identifies which
information source is most valuable for mining for the specified disease, although we
expect this to vary with different diseases.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Related work</title>
      <p>A substantial amount of relevant disease information exists in various types of medical
records. Much of this information is in the form of free text; hence text mining
represents a promising strategy for building machine learning classifiers that take
advantage of the richness of such records. Both radiology and pathology reports have
been studied as a source of specific clinical information in previous text mining studies.
A pathology report describes the results of examining cells and tissues under a
microscope after a biopsy or surgery. A radiology report represents a specialist’s
interpretation of images related to a patient’s signs and symptoms.</p>
      <p>
        Hripcsak et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] used NLP techniques to evaluate the automatic coding of
889,921 chest radiology reports. Nguyen et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] performed classification of lung
cancer stages from pathology reports. In their follow-up work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a rule-based system
was used to classify cancer-notifiable pathology reports from a small corpus (approx.
500 reports), obtaining very high sensitivity, specificity and Positive Predictive Value
(PPV). Pathology reports have also been analysed to extract breast cancer
characteristics into a knowledge model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and to identify relevant named entities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In previous work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] we built a system for detecting lung cancer admissions based
on radiology reports linked to patient metadata for the financial years 2012-2013 and
2013-2014. A similar approach is adopted in this paper, where we use TM techniques
to extract useful information about lung cancer. We extend the prior work by exploring
the impact of incorporating two additional data sources: pathology reports and
radiology questions (i.e., the purpose stated by the clinician for requesting a scan). We
also measure statistical significance of classification performance using the different
data sources. Note that the goal of this paper is not to achieve better classification
performance than previous systems, but to achieve comparable performance and
explore the value of various data sources in mining information related to a specified
question.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Methods</title>
      <sec id="sec-3-1">
        <title>2.1. Data source</title>
        <p>
          The data for this study was extracted from the Alfred Health Informatics Platform,
called REASON [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which provides a single data warehouse view of multiple data
sources within the Alfred Health system, linked by unique anonymised patient id. Data
for the current study was extracted from REASON under ethics approval from the
Alfred Health Human Research Ethics Committee, in the form of a de-identified set. A
high-level architecture of the REASON platform is shown in Figure 1. Table 1
provides an overview of some of the key record types (and number of records) in the
platform relevant to our current task, though it is not a complete listing.
        </p>
        <p>For the purpose of this study, we extracted the textual form of radiology and pathology
reports for the financial years 2012-2013 and 2013-2014. Each report was assigned an
admission identifier, which is in turn linked to patient metadata. The following
metadata associated with each admission were extracted: patient’s demographic data
(gender, age, ethnic origin, country, language, marital status, religion, and death date)
and hospital-related admission data (hospital code, admission date and time, discharge
date and time, length of stay, reason for the admission, admission unit, discharge unit,
admission type, source, destination and criteria). Radiology reports were also
associated with radiology questions, i.e., a short description of the reason given by the
clinician for requesting the scan. The initial number of admission records used in this
study was as follows: 40,800 radiology reports; 20,872 pathology reports; and 121,700
metadata entries.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Gold Standard data set</title>
        <p>Each admission is associated with a set of ICD-10 codes, which are annotated in the
admission record by an internal clinical coder for reporting purposes. These are used in
our study as ground truth to build the gold standard data set. The ICD codes are ignored
when testing the classifiers – i.e., the classification task consists of identifying those
records which contain the ICD code of interest in the gold standard data set.</p>
        <p>
          To identify positive lung cancer cases we used the ICD-10 code C34.*: Malignant
neoplasm of bronchus and lung. In our dataset, only 496 out of 40,800 admissions with
radiology reports were positive for lung cancer. The highly skewed nature of the data
poses a specific challenge to automated machine learning approaches, which generally
perform better over balanced class distributions. To address this problem, we
performed subsampling, randomly selecting a subset of negative admissions to balance
the datasets. Other, more time-consuming methods (such as oversampling [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]) could have
been used; however, due to time constraints and the high number of experiments to be
run, these methods were not appropriate for this work. The final gold standard dataset
therefore contained 992 admissions (the 496 positive admissions plus an equal number of
randomly selected negative ones). All admissions contained a radiology report and a
radiology question, 833 admissions also contained metadata, and 518 admissions also
contained a pathology report.
        </p>
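        <p>The subsampling step described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names, the dictionary layout and the placeholder negative code J18.9 are assumptions, and the toy data only reproduce the reported counts (496 positive admissions out of 40,800).</p>
        <preformat><![CDATA[
```python
import random


def is_lung_cancer(icd_codes):
    """Return True if any ICD-10 code falls under C34.* (hypothetical helper)."""
    return any(code.upper().startswith("C34") for code in icd_codes)


def balance_by_subsampling(admissions, seed=0):
    """Keep all positive admissions and an equally sized random sample of negatives.

    `admissions` maps an admission id to its list of ICD-10 codes
    (an assumed layout, not the REASON schema).
    """
    positives = [a for a, codes in admissions.items() if is_lung_cancer(codes)]
    negatives = [a for a, codes in admissions.items() if not is_lung_cancer(codes)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(negatives, k=min(len(positives), len(negatives)))
    return positives, sampled


# Toy data mirroring the reported class counts; J18.9 is only a placeholder
# for "some code other than C34.*".
toy = {i: (["C34.1"] if i < 496 else ["J18.9"]) for i in range(40800)}
positives, negatives = balance_by_subsampling(toy)
print(len(positives) + len(negatives))  # 992
```
]]></preformat>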
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Data representation</title>
        <p>Machine learning algorithms require a representation of relevant features of each data
point that can be used to build a predictive classifier. The feature representation we
adopted for our task combines characteristics obtained from text reports, along with the
patient and hospital metadata linked to each admission.</p>
        <p>
          Text in radiology reports, radiology questions and pathology reports was
processed with the MetaMap tool [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] from the US National Library of Medicine.
MetaMap is a program that identifies and normalises biomedical terminology from the
Unified Medical Language System (UMLS) Metathesaurus in biomedical text. Below
is a short sample of MetaMap-annotated phrases from the sentence “replaced with a
right frontal approach”.
        </p>
        <preformat>Meta Mapping (701):
  748 C0559956: Replaced (Replacement) [Functional Concept]
  748 C0205090: Right [Spatial Concept]
  778 C2316681: Frontal approach [Functional Concept]</preformat>
        <p>
          We employed the NegEx module to identify the polarity (negative or positive, e.g.,
“Non contrast in the brain”) of phrases. NegEx is a simple algorithm included in
MetaMap that implements several regular expressions that indicate negation, filters out
sentences containing phrases that falsely appear to be negation phrases, and limits the
scope of the negation phrases [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
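        <p>To make the three ingredients of negation detection concrete (trigger expressions, filtering of pseudo-negations, and a limited scope), a heavily simplified sketch is given below. It is not the NegEx implementation shipped with MetaMap: the trigger lists, the five-token scope and the function name are assumptions chosen for illustration, and the real lists are considerably longer.</p>
        <preformat><![CDATA[
```python
import re

# Illustrative trigger lists only; these are assumptions, not the NegEx lists.
TRIGGERS = ["no evidence of", "negative for", "without", "non", "not", "no"]
PSEUDO_TRIGGERS = ["no increase", "not only", "no change"]
SCOPE = 5  # assumed maximum number of tokens that a trigger can negate


def polarity(sentence, phrase):
    """Return "negative" if `phrase` falls inside the scope of a trigger."""
    text = sentence.lower()
    # Mask pseudo-negations so they are not mistaken for real triggers.
    for pseudo in PSEUDO_TRIGGERS:
        mask = " ".join("_" * len(word) for word in pseudo.split())
        text = text.replace(pseudo, mask)
    target = phrase.lower()
    for trigger in TRIGGERS:
        for match in re.finditer(r"\b" + re.escape(trigger) + r"\b", text):
            # Only the next SCOPE tokens after the trigger can be negated.
            window = text[match.end():].split()[:SCOPE]
            if target in " ".join(window):
                return "negative"
    return "positive"


print(polarity("Non contrast in the brain", "contrast"))  # negative
```
]]></preformat>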
        <p>We collected phrases mapped into UMLS concepts for each sentence. Identified
phrases were marked according to whether the concepts were found in a positive or negative
context. Phrases from different reports of the same kind (e.g., radiology reports)
belonging to the same admission were merged so that a repeated phrase was counted
only once. We then built a series of feature vectors r[_q][_p][_m]. The r feature vector
represents our baseline and contains a “bag” (i.e., an unordered list) of biomedical
phrases from radiology reports only. Other feature vectors add the following optional
sources: q – radiology questions, p – pathology reports, and m – metadata.</p>
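        <p>The r[_q][_p][_m] notation and the merging of repeated phrases can be illustrated with the sketch below. The data layout (lists of concept/polarity pairs per source, a dictionary for metadata) and the source prefix on each feature are assumptions made for the example; the paper does not specify how features from different sources were encoded.</p>
        <preformat><![CDATA[
```python
from itertools import combinations

OPTIONAL_SOURCES = ["q", "p", "m"]  # questions, pathology reports, metadata


def configurations():
    """List every feature-vector name of the form r[_q][_p][_m]."""
    names = []
    for size in range(len(OPTIONAL_SOURCES) + 1):
        for combo in combinations(OPTIONAL_SOURCES, size):
            names.append("_".join(("r",) + combo))
    return names


def build_features(admission, config):
    """Build a bag (set) of features for one admission.

    `admission` maps "r", "q" and "p" to lists of (concept_id, polarity)
    pairs and "m" to a metadata dictionary (assumed layout). Using a set
    means a phrase repeated across reports is counted only once.
    """
    features = set()
    for source in config.split("_"):
        if source == "m":
            metadata = admission.get("m", {})
            features.update(f"m:{name}={value}" for name, value in metadata.items())
        else:
            # A missing source (e.g. no pathology report) adds no features.
            phrases = admission.get(source, [])
            features.update(f"{source}:{cui}:{polarity}" for cui, polarity in phrases)
    return features


print(configurations())
```
]]></preformat>
        <p>The notation admits up to eight combinations, from the baseline r to the full r_q_p_m vector.</p>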
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Classification and evaluation</title>
        <p>
          We treated ICD-codes as targets for classification. To identify those data sources that
contain the most valuable information for identifying lung cancer admissions, a
classification framework was built for each feature vector described above.
We used the Weka Toolkit [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] implementation of the Support Vector Machine
algorithm, since it has performed robustly in our previous work [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Evaluation of TM and NLP systems typically involves the following three metrics:
precision, recall and F-Score. Precision of the positive/negative class (also called
positive/negative predictive value) is the ratio of correctly classified positive/negative
instances to the number of all instances classified as positive/negative. Recall of the
positive/negative class is computed as the number of correctly classified instances from
the positive/negative class divided by the number of all instances from the
positive/negative class; for the positive class this is also known as sensitivity, and for
the negative class as specificity. F-Score is the weighted
harmonic mean of precision and recall.</p>
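        <p>These definitions can be written out as a short function. The sketch assumes equal weighting of precision and recall (beta = 1) by default, which the text does not state explicitly; the function name and the toy labels are illustrative only.</p>
        <preformat><![CDATA[
```python
def precision_recall_f(gold, predicted, target, beta=1.0):
    """Precision, recall and F-Score for one class (`target`).

    `beta` weights recall relative to precision; beta = 1 gives the
    balanced harmonic mean (an assumption, see the lead-in).
    """
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == target and p == target)
    predicted_as_target = sum(1 for p in predicted if p == target)
    actual_target = sum(1 for g in gold if g == target)
    precision = true_pos / predicted_as_target if predicted_as_target else 0.0
    recall = true_pos / actual_target if actual_target else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Toy labels: 1 = lung cancer admission, 0 = other admission.
gold = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 0]
print(precision_recall_f(gold, predicted, target=1))
```
]]></preformat>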
        <p>
          We performed 10-fold cross-validation, where we randomly split data into
train/test halves 10 times. We measured precision, recall and F-Score for each fold. We
calculated statistical significance for F-Score using the Wilcoxon signed-rank test, as
recommended in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Results</title>
      <p>As can be seen, the enhanced classifier performed significantly better than the baseline
system (r), with an F-Score difference of +0.023. Similarly, the top right cell shows
that a classifier which uses all the features (r_q_p_m) performs better than the classifier
without radiology question phrases (r_p_m); however, this difference was not
statistically significant.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Discussion</title>
      <p>
        Our baseline classifier for automatically identifying cases of lung cancer built on only
radiology report phrases shows comparable performance to that in our previous work
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (results are not directly comparable since the two datasets involve different
timeframes). Precision, recall and F-Score yield similar results for a single feature vector
(single column in Table 3), which indicates that our classifiers misclassified a similar
number of positive and negative examples. Including additional admission data sources
improved classification performance. The classifier with the highest performance was
built using features from all four data sources. However, statistical tests showed that
not all performance increases were significant. An example of a non-significant
improvement is combining radiology reports with pathology reports (first column in
Table 4, r+p). In contrast, adding metadata or radiology questions to radiology reports
significantly improved performance. In addition, these two data sources significantly
improved performance when added to the already combined radiology and pathology
reports (third column in Table 4). Finally, adding metadata to the already combined
radiology reports, pathology reports and radiology questions further improved
performance (fifth column of Table 4). Pathology reports significantly increased
performance only when added to the combination of radiology reports, radiology
questions, and metadata.
      </p>
      <p>Not unexpectedly, our results indicate that more informed systems can be built by
including multiple data sources. Radiology questions and metadata seem to contain
crucial information for detecting lung cancer cases, significantly improving
performance when added to radiology reports or to the combination of radiology and
pathology reports. The lack of statistical significance when adding pathology
reports to train the system may be due to a dearth of pathology reports (only 518 of 992
admissions with a radiology report had pathology reports associated with them).</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>
        We have shown that mining multiple linked data sources improves classification
performance of lung cancer ICD-10 codes from textual data, as compared to using a
single data source. We expect similar results for other diseases and plan to use different
ICD-10 codes as targets for classification in our future work. In addition, we plan to
use other techniques to address the problem of highly skewed data sets such as
oversampling [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or cost-sensitive learning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Finally, we plan to use methods for
identifying features from specific data sources that most influence classification
performance. Our data have a high number of features compared to the number of samples,
and we expect that some of these features are redundant or irrelevant: we plan to apply
feature selection methods [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which should also shorten model training times on the
whole dataset and reduce the potential of over-fitting to the data.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hripcsak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.O.</given-names>
            <surname>Alderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <article-title>Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports</article-title>
          ,
          <source>Radiology</source>
          ,
          <volume>224</volume>
          (
          <year>2002</year>
          ), pp.
          <fpage>157</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.N.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Lawley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.P.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.V.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.E.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.E.</given-names>
            <surname>Duhig</surname>
          </string-name>
          , et al.,
          <article-title>Symbolic rule-based classification of lung cancer stages from free-text pathology reports</article-title>
          ,
          <source>J. Am. Med. Inform. Assoc.</source>
          ,
          <volume>17</volume>
          (
          <year>2010</year>
          ), pp.
          <fpage>440</fpage>
          -
          <lpage>445</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lawley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Colquist</surname>
          </string-name>
          ,
          <article-title>Classification of pathology reports for cancer registry notifications</article-title>
          ,
          In:
          <source>Health Informatics: Building a Healthcare Future Through Trusted Information - Selected Papers from the 20th Australian National Health Informatics Conference (HIC 2012)</source>
          ,
          <volume>178</volume>
          (
          <year>2012</year>
          ),
          <fpage>150</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Coden</surname>
          </string-name>
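          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Savova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sominsky</surname>
          </string-name>
          ,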
          <string-name>
            <given-names>M.</given-names>
            <surname>Tanenblatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Masanz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.S.J.</given-names>
            <surname>Cooper</surname>
          </string-name>
          , et al.
          <article-title>Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model</article-title>
          ,
          <source>J. Biomed. Inform.</source>
          ,
          <volume>42</volume>
          (
          <year>2009</year>
          ), pp.
          <fpage>937</fpage>
          -
          <lpage>949</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tanenblatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Coden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sominsky</surname>
          </string-name>
          ,
          <article-title>The ConceptMapper approach to named entity recognition</article-title>
          ,
          <source>Language Resources and Evaluation, European Language Resources Association</source>
          , Malta
          (
          <year>2010</year>
          ), pp.
          <fpage>546</fpage>
          -
          <lpage>551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sorace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.R.</given-names>
            <surname>Aberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Elimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lawvere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tawfik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.D.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>Integrating pathology and radiology disciplines: an emerging opportunity?</article-title>
          ,
          <source>BMC medicine</source>
          ,
          <volume>10</volume>
          (
          <issue>1</issue>
          ), (
          <year>2012</year>
          )
          <fpage>100</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavedon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verspoor</surname>
          </string-name>
          ,
          <article-title>Text mining for lung cancer cases over large patient admission data</article-title>
          ,
          <source>Big Data Conference, Abstract Book. Big Data Conference</source>
          , Melbourne April. (
          <year>2014</year>
          ), pp.
          <fpage>24</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>MacManus</surname>
          </string-name>
          ,
          <article-title>Advancing data management and usage in a major Australian health service: The REASON discovery platform™</article-title>
          ,
          <source>Data Science &amp; Engineering (ICDSE)</source>
          , 2014 International Conference on (
          <year>2014</year>
          ),
          <fpage>38</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <article-title>Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning</article-title>
          .
          <source>Advances in intelligent computing</source>
          . Springer Berlin Heidelberg, (
          <year>2005</year>
          ).
          <fpage>878</fpage>
          -
          <lpage>887</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Aronson</surname>
          </string-name>
          ,
          <article-title>Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</article-title>
          .
          <source>AMIA Annual Symposium Proceedings</source>
          , Washington DC, (
          <year>2001</year>
          )
          <fpage>17</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.W.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bridewell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , et al.
          <article-title>A simple algorithm for identifying negated findings and diseases in discharge summaries</article-title>
          .
          <source>J Biomed Inform</source>
          <volume>34</volume>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          ,
          <source>SIGKDD Explorations</source>
          , Volume
          <volume>11</volume>
          , Issue
          <issue>1</issue>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          ,
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>7</volume>
          (
          <year>2006</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thai-Nghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          ,
          <article-title>Cost-sensitive learning methods for imbalanced data</article-title>
          ,
          <source>in Proceeding of IEEE International Joint Conference on Neural Networks (IJCNN10)</source>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <article-title>Feature selection and feature extraction for text categorization</article-title>
          .
          <source>Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics</source>
          , (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>