<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Dr Lawrence Cavedon Senior Lecturer RMIT University</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><roleName>Dr</roleName><forename type="first">Lawrence</forename><surname>Cavedon</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">The University of Melbourne</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">RMIT University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">David</forename><surname>Martinez</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">The University of Melbourne</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zaf</forename><surname>Alam</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">Alfred Health; Monash University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christopher</forename><surname>Bain</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">Alfred Health; Monash University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Karin</forename><surname>Verspoor</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">The University of Melbourne</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="laboratory">Victorian Research Laboratory</orgName>
								<orgName type="institution">NICTA</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Dr Lawrence Cavedon Senior Lecturer RMIT University</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">BF92CF733251CAD07E6128C455FD449D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:15+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>where he was a member of the Biomedical Informatics team. Lawrence's current research includes text mining for biomedical applications, spoken dialogue management, and other topics in Artificial Intelligence.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>The increasing availability of linked electronic patient data creates opportunities for analysis, prediction, and automation of tasks. A challenge is that much of this data remains in free-text form, requiring Natural Language Processing (NLP) techniques to extract actionable information. Classifying text by disease is a crucial technique for retrieving specific cases or creating patient cohorts, for enabling analytics and detecting patterns of disease occurrence, and for supporting resource planning in a hospital system. It can also be a prelude to automatic ICD coding, supporting what is otherwise an extremely time-consuming manual process.</p><p>We describe initial work using data from an Informatics Platform developed at Alfred Health in Melbourne. We investigate the task of automatically assigning the ICD-10 code for lung cancer (C34, Malignant neoplasm of bronchus and lung) to a patient admission record by applying a Machine Learning (ML) text classifier. Two years of radiology reports from a hospital (756,520 text reports, along with associated metadata) are used for training and evaluation. We use manually assigned ICD codes to evaluate performance in different scenarios, using both cross-validation and time-series views of the dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>METHOD</head><p>The dataset for this study was extracted from the Alfred Health Informatics Platform (called REASON); it consists of all radiology reports for the financial years 2011-2012 and 2012-2013. Each report is assigned an admission identifier, which is in turn linked to patient metadata, including demographics and reason for admission. The metadata includes the ICD-10 codes assigned to the admission, which we use as ground truth to build a gold standard. We define the task as a binary classification problem: determine whether each admission in the test set is associated with the ICD-10 code for lung cancer (C34, Malignant neoplasm of bronchus and lung). An admission is represented by the reports of the radiology scans linked to it, along with associated metadata.</p><p>Classification of lung cancer is challenging for automatic systems for two reasons: (i) manually crafted keywords and phrases produce large numbers of false negatives, and also several false positives; and (ii) in our dataset only 0.8% of the admissions were positive for lung cancer. This highly skewed class distribution poses a specific challenge to automated ML approaches, which generally perform better over balanced class distributions.</p><p>We developed a classifier using a classical supervised learning framework. For feature representation we combined characteristics obtained from the text with the metadata linked to each admission, leaving out any ICD codes since those are the target of prediction. Text in the reports was processed using the MetaMap tool <ref type="bibr" target="#b0">1</ref> from the US National Library of Medicine, which identifies phrases and, using its integrated NegEx module, the polarity (negative or positive) of each. We created a feature vector combining the phrases obtained from MetaMap, the Bag-of-Words (BOW) representation of the text, and the metadata fields. We used the Weka Toolkit <ref type="bibr" target="#b1">2</ref> implementation of the Support Vector Machine algorithm, since this has performed robustly in our previous work (e.g. <ref type="bibr" target="#b2">3</ref>). We also tested the effect of applying a greedy correlation-based feature subset selection filter <ref type="bibr" target="#b3">4</ref>.</p></div>
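<div xmlns="http://www.tei-c.org/ns/1.0"><p>The feature combination described above can be sketched as follows. This is a minimal illustration in Python, not the Weka-based pipeline used in the study: the function build_features, the feature-name prefixes, the (phrase, is_negated) input format and the metadata field names are all assumptions made for the example. The phrase list is assumed to have been produced beforehand by a concept-mapping step with negation detection.</p>
<p>
```python
import re
from collections import Counter


def build_features(report_text, phrases, metadata):
    """Merge three feature groups into one sparse feature dict.

    phrases: list of (phrase, is_negated) pairs, assumed to come from an
    external concept-mapping step with negation detection (hypothetical format).
    metadata: dict of admission fields; ICD codes must already be excluded,
    since they are the prediction target.
    """
    features = Counter()
    # Bag-of-words: lowercase alphabetic tokens with their counts.
    for token in re.findall(r"[a-z]+", report_text.lower()):
        features["bow=" + token] += 1
    # Phrase features carry polarity so that negated mentions stay distinct.
    for phrase, is_negated in phrases:
        polarity = "neg" if is_negated else "pos"
        features["phrase_" + polarity + "=" + phrase.lower()] += 1
    # Metadata fields become categorical indicator features.
    for field, value in metadata.items():
        features["meta_" + field + "=" + str(value)] = 1
    return dict(features)


# Toy input invented for illustration; not taken from the study data.
example = build_features(
    "No evidence of lung tumour. Lung fields clear.",
    [("lung tumour", True)],
    {"sex": "F", "admission_type": "emergency"},
)
```
</p>
<p>For this toy report, example holds the count 2 under the key bow=lung and a single negated phrase feature, phrase_neg=lung tumour, alongside the metadata indicators. Keeping polarity in the feature name lets a classifier treat a negated mention differently from an affirmed one, which plain term matching cannot do.</p></div>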
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3-4 April 2014 | Melbourne</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RESULTS</head><p>We constructed a baseline system using a simple term/phrase-matching approach with the following manually constructed list of terms: "lung cancer", "lung malignancy", "lung malignant", "lung neoplasm", "lung tumour", and "lung carcinoma". The performance of this approach is shown in the bottom row of Table <ref type="table" target="#tab_0">1</ref>, using the standard metrics of precision (i.e., positive predictive value), recall (i.e., sensitivity), and F-score (the harmonic mean of precision and recall). Precision in particular is low, indicating that many identified phrases were negated or neutral with respect to lung cancer. Recall is higher, but the baseline still fails to identify over one quarter of relevant admissions.</p><p>We applied the ML approach outlined above. We report here the results of the basic pipeline without feature selection: applying feature selection actually reduced performance, possibly because of the low proportion of positive instances in our dataset. We evaluated using random stratified 10-fold cross-validation. The results of this experiment are shown in the top two rows of Table <ref type="table" target="#tab_0">1</ref> for two settings: (i) the full feature set (including the metadata described above), and (ii) textual features only. There is clear improvement over the baseline in both cases, particularly in precision. The use of metadata contributes to higher performance, which illustrates the importance of linking different sources of data.</p><p>As a final experiment, we split the data into 3-month periods and performed two tests: (i) testing on each period using all previous history as training data; and (ii) testing on each period using only the previous 3-month block as training data. The results of this evaluation (using the full feature set) are shown in Figure <ref type="figure" target="#fig_0">1</ref>, along with the keyword-matching baseline. We can see that, once enough training data has accumulated, using the full history produces a higher F-score than using only the previous quarter. However, performance reaches a peak and then decreases over the final quarter, suggesting the possibility of changes in reporting that the model does not capture; further analysis is required to build a robust system.</p></div>
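<div xmlns="http://www.tei-c.org/ns/1.0"><p>The keyword-matching baseline and the evaluation metrics can be illustrated with a short Python sketch. The term list is the one given above. The matching rule (case-insensitive substring search), the function names and the four toy reports are assumptions made for illustration and are not taken from the study.</p>
<p>
```python
BASELINE_TERMS = (
    "lung cancer",
    "lung malignancy",
    "lung malignant",
    "lung neoplasm",
    "lung tumour",
    "lung carcinoma",
)


def keyword_baseline(report_text):
    """Flag a report as positive if any baseline term occurs in it.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    lowered = report_text.lower()
    return any(term in lowered for term in BASELINE_TERMS)


def precision_recall_f(predicted, gold):
    """Compute precision, recall and F-score for parallel boolean lists."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy reports and labels invented for illustration only.
reports = [
    "Findings consistent with primary lung carcinoma.",
    "No evidence of lung cancer.",
    "Spiculated mass in the right upper lobe, suspicious for malignancy.",
    "Clear lung fields.",
]
gold = [True, False, True, False]
predicted = [keyword_baseline(text) for text in reports]
scores = precision_recall_f(predicted, gold)
```
</p>
<p>On these toy reports the baseline flags the negated second report (a false positive) and misses the third, which describes a malignancy without using any listed term (a false negative). These are the two error types discussed above. The same harmonic-mean formula applied to the baseline precision of 0.643 and recall of 0.742 gives the reported F-score of 0.689.</p></div>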
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CONCLUSION</head><p>Our analysis shows promising results for automatically identifying cases of lung cancer from radiology reports, clearly outperforming a simple keyword-matching baseline. The experiments also show that the model does not always improve with more data; error analysis is required to interpret the drop in performance for the last 3-month subset of our dataset. While the techniques themselves are fairly standard, an interesting finding is the performance improvement from adding metadata to the textual features, which illustrates the value of drawing on different data sources to build more informed systems. In future work, we plan to integrate other types of clinical information in textual form, such as pathology reports, and to evaluate on other disease codes.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. Time-series performance over the different classifiers</figDesc><graphic coords="2,223.50,471.98,232.88,122.31" type="bitmap" /></figure>
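<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two time-series settings compared in Figure 1 differ only in which earlier periods are used for training. A minimal Python sketch of the two split schemes follows. The function name period_splits, the zero-based period indices and the choice of eight periods (two years in 3-month blocks) are assumptions for illustration.</p>
<p>
```python
def period_splits(n_periods, full_history):
    """Return (train_periods, test_period) pairs over time-ordered periods.

    full_history=True trains on every period before the test period;
    full_history=False trains only on the immediately preceding period.
    The first period has no history, so it is never used as a test period.
    """
    splits = []
    for test_period in range(1, n_periods):
        if full_history:
            train_periods = list(range(test_period))
        else:
            train_periods = [test_period - 1]
        splits.append((train_periods, test_period))
    return splits


# Eight 3-month periods is an assumption derived from two years of data.
full = period_splits(8, full_history=True)
recent = period_splits(8, full_history=False)
```
</p>
<p>Both schemes test on the same seven periods, so their F-scores can be compared period by period. In the full-history setting the training set grows with each step, while in the other setting it stays at one 3-month block.</p></div>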
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Results for the different evaluations. Standard deviation is shown in parentheses.</figDesc><table><row><cell>CLASSIFIER</cell><cell>PRECISION</cell><cell>RECALL</cell><cell>F-SCORE</cell></row><row><cell>Full feature set (including metadata)</cell><cell>0.871 (0.047)</cell><cell>0.820 (0.057)</cell><cell>0.843 (0.041)</cell></row><row><cell>Textual features only</cell><cell>0.855 (0.048)</cell><cell>0.800 (0.052)</cell><cell>0.825 (0.034)</cell></row><row><cell>Baseline</cell><cell>0.643</cell><cell>0.742</cell><cell>0.689</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">lawrence.cavedon@rmit.edu.au</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1">Martinez, Cavedon and Verspoor are no longer affiliated with NICTA. NICTA is funded by the Australian Government through the Dept. of Communications and the Australian Research Council through the ICT Centre of Excellence Program.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2">International Classification of Diseases: http://www.who.int/classifications/icd/en/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Aronson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AMIA Annual Symposium Proceedings</title>
				<meeting><address><addrLine>Washington DC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="17" to="21" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The WEKA Data Mining Software: An Update</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pfahringer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Reutemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGKDD Explorations</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">11</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Biosurveillance for Invasive Fungal Infections via text mining</title>
		<author>
			<persName><forename type="first">D</forename><surname>Martinez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Suominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ananda-Rajah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cavedon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF Wshop on Cross-Language Eval of Methods, Applications, Resources for eHealth Document Analysis</title>
				<meeting><address><addrLine>Rome</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Correlation-based Feature Subset Selection for Machine Learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
		<respStmt>
			<orgName>Dept. Comp. Sci., U. Waikato</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
