<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text mining for lung cancer cases over large patient admission data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Martinez</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lawrence Cavedon</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zaf Alam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Bain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karin Verspoor</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alfred Health</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Monash University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RMIT University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The University of Melbourne</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>24</fpage>
      <lpage>25</lpage>
      <abstract>
        <title>SUMMARY</title>
        <p>We describe a text mining system that runs over a large clinical repository to detect lung cancer admissions, and evaluate its performance in different scenarios. Our results show that a machine learning classifier obtains significant gains over a keyword-matching approach, and that combining patient metadata with the textual content further improves performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Dr Lawrence Cavedon is a Senior Lecturer in the School
of Computer Science and IT at RMIT University, and
until recently a Senior Researcher at NICTA’s Victorian
Research Laboratory, where he was a member of the
Biomedical Informatics team. Lawrence’s current research
includes text mining for biomedical applications, spoken
dialogue management, and other topics in Artificial
Intelligence.</p>
    </sec>
    <sec id="sec-2">
      <title>RESULTS</title>
      <p>We constructed a baseline system using a simple term/phrase-matching approach with the following manually constructed list of terms: “lung cancer”, “lung malignancy”, “lung malignant”, “lung neoplasm”, “lung tumour”, and “lung carcinoma”. The performance of this approach is shown in the bottom row of Table 1, using the standard metrics of precision (i.e., positive predictive value), recall (i.e., sensitivity), and F-score (the harmonic mean of precision and recall). Precision in particular is low (0.643), indicating that many matched phrases were negated or neutral with respect to lung cancer. Recall is higher (0.742), but the baseline still fails to identify over one quarter of relevant admissions.</p>
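      <p>The baseline and its metrics can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the term list is the one given above, while the function names, the case-insensitive substring matching, and the four toy reports are assumptions made for the example.</p>
      <preformat preformat-type="code">
```python
# Illustrative sketch only: function names and the toy reports are hypothetical.
TERMS = [
    "lung cancer", "lung malignancy", "lung malignant",
    "lung neoplasm", "lung tumour", "lung carcinoma",
]


def keyword_baseline(report: str) -> bool:
    """Flag a report if any term occurs in it.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    text = report.lower()
    return any(term in text for term in TERMS)


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(gold, predicted):
    """Compute (precision, recall, F-score) from parallel lists of booleans."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if p and not g)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)


reports = [
    "Findings consistent with lung carcinoma in the right upper lobe.",
    "No evidence of lung cancer.",  # negated mention: a false positive
    "Spiculated mass, suspicious for primary malignancy.",  # no listed term: a miss
    "Clear lung fields.",
]
gold = [True, False, True, False]
predicted = [keyword_baseline(r) for r in reports]
print(predicted)
print(precision_recall_f(gold, predicted))
```
      </preformat>
      <p>The second toy report shows how a negated mention lowers precision, and the third shows how a relevant report without any listed term lowers recall. Applied to the baseline row of Table 1, f_score(0.643, 0.742) gives approximately 0.689, which agrees with the reported F-score.</p>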
      <p>We applied the ML approach outlined above. We report here the results of the basic pipeline without feature selection: applying feature selection actually reduced performance, possibly because of the low proportion of positive instances in our dataset. Evaluation used random stratified 10-fold cross-validation. The results are shown in the top two rows of Table 1 for two settings: (i) the full feature set (including the metadata described above), and (ii) textual features only. Both settings clearly improve on the baseline, particularly in precision. Adding metadata yields higher performance on all three metrics, which illustrates the importance of linking different sources of data.</p>
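      <p>For readers unfamiliar with the evaluation protocol, random stratified k-fold splitting can be sketched in plain Python as below. This is a hedged illustration, not the authors' pipeline: the function names, the fixed seed, the round-robin fold assignment, and the toy label distribution are all assumptions.</p>
      <preformat preformat-type="code">
```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign instance indices to k folds with roughly equal class proportions.

    Indices are shuffled within each class and dealt out round-robin;
    both choices are illustrative.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train, test


# Toy labels with a low proportion of positives (20 of 200); hypothetical data.
labels = [1] * 20 + [0] * 180
splits = list(cross_validation_splits(labels))
for train, test in splits:
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)
```
      </preformat>
      <p>Stratification matters when positives are rare: without it, some folds could contain very few positive admissions, which makes per-fold precision and recall unstable.</p>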
    </sec>
    <sec id="sec-6">
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>CLASSIFIER</th>
              <th>PRECISION</th>
              <th>RECALL</th>
              <th>F-SCORE</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Full feature set (including metadata)</td>
              <td>0.871 (0.047)</td>
              <td>0.820 (0.057)</td>
              <td>0.843 (0.041)</td>
            </tr>
            <tr>
              <td>Textual features only</td>
              <td>0.855 (0.048)</td>
              <td>0.800 (0.052)</td>
              <td>0.825 (0.034)</td>
            </tr>
            <tr>
              <td>Baseline</td>
              <td>0.643</td>
              <td>0.742</td>
              <td>0.689</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As a final experiment, we split the data into 3-month periods and performed two tests: (i) testing on each period using all previous history as training data; and (ii) testing on each period using only the previous 3-month block as training data. The results of this evaluation (using the full feature set) are shown in Figure 1, along with the keyword-matching baseline. Once enough training data has accumulated, using the full history produces a higher F-score than using only the previous quarter. However, performance reaches a peak and then decreases over the final quarter, suggesting possible changes in reporting that the model does not capture; further analysis is required to build a robust system.</p>
      <p>Figure 1. Time-series performance over the different classifiers.</p>
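      <p>The two temporal training regimes compared in this experiment can be sketched as follows. This is an illustrative outline under stated assumptions: the function name, the mode strings, and the quarter labels are hypothetical, and the actual date range of the dataset is not assumed.</p>
      <preformat preformat-type="code">
```python
def temporal_splits(periods, mode):
    """Return (train_periods, test_period) pairs for consecutive time blocks.

    mode="history": train on all blocks before the test block.
    mode="previous": train only on the block immediately before it.
    """
    if mode not in ("history", "previous"):
        raise ValueError("mode must be 'history' or 'previous'")
    splits = []
    for position in range(1, len(periods)):
        if mode == "history":
            train = periods[:position]
        else:
            train = periods[position - 1:position]
        splits.append((train, periods[position]))
    return splits


# Hypothetical labels for consecutive 3-month blocks.
quarters = ["Q1", "Q2", "Q3", "Q4"]
for train, test in temporal_splits(quarters, "history"):
    print("history ", train, "->", test)
for train, test in temporal_splits(quarters, "previous"):
    print("previous", train, "->", test)
```
      </preformat>
      <p>In both regimes the first block is used only for training, so the two settings coincide on the first test period and diverge as more history accumulates.</p>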
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION</title>
      <p>Our analysis shows promising results for automatically identifying cases of lung cancer from radiology reports, with results clearly superior to a simple keyword-matching baseline. The experiments also highlight that the model does not always improve with more data, and error analysis is required to interpret the drop in performance for the last 3-month subset of our dataset. While the techniques themselves are fairly standard, an interesting finding is the performance improvement when using metadata on top of the textual features, illustrating the importance of relying on different data sources in building more informed systems. In future work, we plan to integrate other types of clinical information in textual form, such as pathology reports, and to evaluate using other disease codes.</p>
      <p><sup>1</sup> Martinez, Cavedon and Verspoor are no longer affiliated with NICTA. NICTA is funded by the Australian Government through the Dept. of Communications and the Australian Research Council through the ICT Centre of Excellence Program.</p>
      <p><sup>2</sup> International Classification of Diseases: http://www.who.int/classifications/icd/en/</p>
    </sec>
  </body>
  <back>
  </back>
</article>