AIH 2012

Automatic Classification of Cancer Notifiable Death Certificates

Luke Butt1, Guido Zuccon1, Anthony Nguyen1, Anton Bergheim2, Narelle Grayson2

1 The Australian e-Health Research Centre, Brisbane, Queensland, Australia; 2 Cancer Institute NSW, Alexandria, New South Wales, Australia.
{luke.butt, guido.zuccon, anthony.nguyen}@csiro.au
{anton.bergheim, narelle.grayson}@cancerinstitute.org.au

Abstract. The timely notification of cancer cases is crucial for cancer monitoring and prevention. However, the abstraction and classification of cancer from the free text of pathology reports and other relevant documents, such as death certificates, are complex and time-consuming activities. In this paper we investigate approaches for the automatic detection of cases where the cause of death is a notifiable cancer from free-text death certificates supplied to Cancer Registries. A number of machine learning classifiers were investigated. A large set of features was also extracted using natural language processing techniques and the Medtex toolkit; features include stemmed words, bi-grams, and concepts from the SNOMED CT medical terminology. The investigated approaches were found to be very effective in identifying death certificates where the cause of death was a notifiable cancer. Best performance was achieved by a Support Vector Machine (SVM) classifier, with an overall F-measure of 0.9647 when evaluated on a set of 5,000 free-text death certificates. This classifier uses as features stemmed token bigrams and information from SNOMED CT concepts filtered by morphological abnormalities and disorders. However, our analysis shows that it is the selection of features, rather than the type of classifier or the feature weighting schema, that most influences the performance of the classifiers. Specifically, we found that stemmed token bigrams, with or without SNOMED CT concepts, are the most effective features. In addition, the combination of token bigrams and SNOMED CT information was found to yield the best overall performance.

Keywords: death certificates, Cancer Registry, cancer monitoring and reporting, machine learning, natural language processing, SNOMED CT

1 Introduction

Cancer notification and reporting is an important and fundamental process for providing an accurate picture of the impact, nature and extent of cancer, and for directing research efforts towards its cure. Cancer Registries collect and interpret data from a large number of sources, helping to improve cancer prevention and control, as well as treatments and survival rates for patients with cancer.

The manual coding of documents, such as pathology reports and death certificates, with respect to notifiable cancers and corresponding synoptic factors (such as primary site, morphology, etc.) is a laborious and time-consuming process. Cancer Registries strive to provide timely and accurate information on cancer incidence and mortality in the community. They receive large quantities of data from a range of sources, including hospitals, pathology laboratories and Registries of Births, Deaths and Marriages (which release death certificates). It is estimated that incident cases within Cancer Registries that have death-certificate-only notifications amount to about 1-5% of the total cases; delays in the processing of these data may cause underestimation of the incidence of cancer.
Computational methods for the automatic abstraction of relevant information have the potential to enhance a Cancer Registry's workflow, providing time and cost savings as well as timely cancer incidence and mortality information. This automatic process is, however, challenging, both because of the complex nature of the language used in the reports and because of the high levels of recall and accuracy required.

Previous works have attempted to provide automatic cancer coding from free-text pathology reports collected by Cancer Registries. For example, Nguyen et al. [1] used natural language processing techniques and a rule-based system to automatically extract relevant synoptic factors from electronic pathology reports. Similarly, Zuccon et al. [2] showed how these techniques could cope with character recognition errors generated by scanning free-text pathology reports stored in paper form. Machine learning approaches have also been considered; for instance, D'Avolio et al. [3] tested approaches based on supervised machine learning (Conditional Random Fields and Maximum Entropy) and showed their effectiveness for the classification of pathology reports consistent with cancer in the colorectal, prostate, and lung cancer domains.

Cancer Registries have access to a number of data sources beyond pathology reports. One such data source is death certificates. Death certificates are a rich source of data that can support cancer surveillance, monitoring and reporting. These certificates contain free-text sections that report the cause of death of an individual. An example of the free-text content of a death certificate where the cause of death is a notifiable cancer is given in Figure 1, while Figure 2 is an example of a non-notifiable death certificate.

(I) A) MAXILLARY TUMOR, 2 YEARS
    B) PULMONARY OEDEMA, 1 WEEK
(II) CEREBROVASCULAR ACCIDENT/DYSPLASIA, 20 YEARS
     ASTHMA

Fig. 1. A de-identified death certificate where the cause of death is a notifiable cancer.

I (A) CEREBROVASCULAR ACCIDENT 48 HOURS
  (B) CEREBRAL ARTERIOSCLEROSIS YEARS
  (C) HYPERTENSION YEARS
II CHRONIC ALCOHOLISM YEARS

Fig. 2. A de-identified death certificate where the cause of death is not a notifiable cancer.

Few works have focused on computational methods for automatically classifying death certificates with respect to the cause of death. The SuperMICAR system and its related tools1 provide semi-automatic coding of the cause of death in death certificates. The system identifies keywords and expressions in the free-text documents that indicate possible causes of death; this is done through a standard set of expressions encoded in a predefined vocabulary. Extracted free-text expressions are then converted to one or more ICD-10 codes, which are then aggregated into a single ICD-10 underlying cause of death through the use of a rule base. While doctor-reported death certificates can be fed directly into the system, Coroner-reported ones require additional pre-processing. A substantial proportion of death certificates (between 15 and 20 percent according to a US study [4]) cannot be coded through SuperMICAR and related tools, and thus require manual coding. A recent work successfully classified death certificates related to pneumonia and influenza using a natural language processing pipeline and a rule-based system [5].

1 Consult http://www.cdc.gov/nchs/nvss/mmds/super_micar.htm (last visited 19th November 2012) for further details.
However, to the best of our knowledge, no previous research has investigated fully automatic methods that go beyond spotting standard cause-of-death expressions to classify death certificates, in particular certificates where the main cause of death is cancer. Furthermore, while Australian Cancer Registries can acquire free-text death certificates on a fortnightly basis from the Registry of Births, Deaths and Marriages, coded causes of death produced by SuperMICAR (and related products) are released by the Australian Bureau of Statistics only on a yearly basis. Computational methods able to rapidly identify death certificates where the cause of death is a notifiable cancer would therefore enhance the cancer reporting and monitoring capabilities of Cancer Registries.

In this paper, we focus on the problem of automatically identifying death certificates where the main cause of death is cancer. The problem is cast as a binary classification problem, i.e. death certificates are classified as either containing or not containing a cause of death related to cancer. Several machine learning classifiers were investigated for this task, including support vector machines, Naive Bayes, decision trees, and boosting algorithms. A state-of-the-art information extraction tool (Medtex [6]) is used to create different sets of features that are used to train the classifiers; different feature weighting schemas were also considered. Features include stemmed tokens and n-grams, as well as SNOMED CT concept ids and tokens from the fully specified names of SNOMED CT concepts, among others. SNOMED CT is a medical terminology that formally describes the concepts and terminology used in the medical domain.

Our approaches are tested on 5,000 de-identified death certificates acquired from an Australian Cancer Registry, using 10-fold cross-validation to allow robust training and testing. Our experimental results demonstrate that the choice of classifier and weighting schema, although important, is not critical for achieving high classification effectiveness. Instead, the choice of features used to represent the content of death certificates is the determining factor. Specifically, stemmed token bigrams are found to be the single most important features among those extracted. Furthermore, we found that SNOMED CT features provide consistent, if modest, increments in classification effectiveness when used along with stemmed token bigrams; the combined use of stemmed token bigrams and SNOMED CT morphology provides the best classification effectiveness in our experiments.

Next, we detail the approaches adopted in this paper. Then, in Section 3 we outline our empirical evaluation methodology; classification results obtained by the investigated approaches are reported in Section 4. An analysis of the results is developed in Section 4.1. The paper concludes in Section 5, summarising our main contributions and directions for future work.

2 Approaches for Automatic Classification of Death Certificates

In this paper we investigate supervised machine learning approaches for the detection of death certificates where the cause of death is a notifiable cancer.
These approaches are characterised by three main variables: (1) the features extracted from the documents (Section 2.1), (2) the weighting schemas applied to the features to represent documents (Section 2.2), and (3) the specific binary classifier used to identify certificates where the cause of death is a notifiable cancer (Section 2.3).

2.1 Automatic Feature Extraction

Machine learning algorithms require data to be represented by features, such as the words that occur in a text document. We used the information extraction capabilities of the Medtex system2 to obtain a set of meaningful features from the free text of the death certificates. The feature sets investigated in this paper are:

stem: a token stem, i.e. the stemmed version of a word contained in a certificate
stemBigram: the bigram formed by two token stems, i.e. a pair of adjacent stemmed words as found in a certificate
concept: SNOMED CT concepts as found in the free text of the certificates using the Medtex system
conceptFull: the tokens of the fully specified name of the extracted SNOMED CT concepts
concFullMorph: the tokens of the fully specified name of extracted SNOMED CT concepts that are morphologic abnormalities or disorders
concBigram: the bigram formed by two adjacent SNOMED CT concept ids
concFullBigram: the bigram formed by two adjacent tokens in the fully specified name of concepts extracted from SNOMED CT

2 Medtex comprises both information extraction capabilities (extracting both low-level information such as word tokens and stems, punctuation, etc., and higher-level semantic information such as UMLS and SNOMED CT concepts [1]) and classification capabilities integrated via its rule-based engine.

While features like stem and stemBigram are commonly used for classifying free-text documents, features based on SNOMED CT concepts and their properties, such as tokens from the fully specified name, have not been exploited by previous works that attempted to classify free-text death certificates. SNOMED CT provides a standard clinical terminology used to map various descriptions of a clinical concept to a single standard clinical concept. In this work, the SNOMED CT ontology was used as an underlying mechanism to classify free text through semantically matching SNOMED CT concepts.

In addition, we also considered pair-wise combinations of features. In this paper we report the results obtained by each feature used on its own, and by the combinations concept + stem, concept + stemBigram, concFullMorph + stemBigram, and concBigram + stemBigram, which showed promise in preliminary investigations.

Next, we consider the example death certificates given in Figure 1 and Figure 2 to describe how a feature set is constructed. To build the feature representations, we examine each death certificate and, for each instance of a feature occurring in the certificate, assign a value of 1, while the absence of a feature is marked by a zero entry; these values are subsequently modified according to the feature weighting functions described in Section 2.2. After all certificates have been processed in this manner, we add a final feature, cancerNotifiable, whose value is obtained from the ground truth judgements supplied with the data.
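To make this construction concrete, the sketch below shows one possible way to derive stem and stemBigram features and assemble the binary presence/absence matrix illustrated in Table 1. It is an illustrative approximation only: in our experiments feature extraction is performed by Medtex, whereas here NLTK's Porter stemmer and a simple tokeniser stand in for it, and SNOMED CT concept features are omitted because they require a terminology lookup.

import re
from nltk.stem import PorterStemmer   # stands in for Medtex's stemming step

stemmer = PorterStemmer()

def stem_tokens(text):
    """Lower-case, tokenise and stem the free text of one certificate."""
    return [stemmer.stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]

def extract_features(text):
    """Return the set of stem and stemBigram features for one certificate."""
    stems = stem_tokens(text)
    features = set(stems)                                            # stem
    features.update(f"{a} {b}" for a, b in zip(stems, stems[1:]))    # stemBigram
    return features

def build_feature_matrix(certificates):
    """Binary presence/absence matrix over all observed features, as in Table 1."""
    vocab = sorted(set().union(*(extract_features(c) for c in certificates)))
    index = {f: i for i, f in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in certificates]
    for row, cert in zip(matrix, certificates):
        for f in extract_features(cert):
            row[index[f]] = 1
    return vocab, matrix

certs = [
    "(I) A) MAXILLARY TUMOR, 2 YEARS B) PULMONARY OEDEMA, 1 WEEK",
    "I (A) CEREBROVASCULAR ACCIDENT 48 HOURS (B) CEREBRAL ARTERIOSCLEROSIS YEARS",
]
vocab, matrix = build_feature_matrix(certs)

In practice the class label (cancerNotifiable) would be appended as a final column taken from the ground truth judgements, as described above.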
Table 1 shows an extract of the feature data constructed for the two example death certificates; the task of the machine learning classifiers is to predict the value of the cancerNotifiable feature, given this learning data.

Table 1. Feature data built from the two example death certificates (extract). Each row corresponds to a certificate (the documents of Figure 1 and Figure 2) and each column to a feature: stems such as ACCID, DYSPLASIA, TUMOR, YEAR, ASTHMA and ALCOHOL; stem bigrams such as ACCID 48 and 20 YEAR; SNOMED CT concept ids such as 126550004 and 230690007; and tokens of fully specified names such as Cerebrovascular accident, Cerebral arteriosclerosis and Neoplasm of maxilla. An entry is 1 when the feature occurs in the certificate and 0 otherwise; the final cancerNotifiable column holds the class label (1 for the certificate of Figure 1, 0 for that of Figure 2).

Note that no further processing is applied to the text, for example to remove punctuation, identify section or list labels, or remove or correct typographical errors present in the free text. While adequate text pre-processing may enhance the quality of the text itself, and thus of the extracted features, we left this for future work and instead focused on investigating weighting schemas for the selected features and binary classifiers.

2.2 Feature Weighting

A number of weighting schemas for capturing the local importance of a feature in a report were tested. Binary coefficients were used to encode the presence or absence of a feature; we refer to this schema as binary. The feature frequency f(F) of feature F was used to capture the number of times a specific feature appeared within a document; we refer to this schema as frequency. In this schema, feature frequencies are directly translated into weights, i.e. weights are linear in the frequencies.

Variations of the frequency schema, which consider non-linear functions of the frequency of a feature, were also experimented with. A first variation was to scale the appearance of feature F in a free-text death certificate by the function 1 + log(f(F)) if f(F) ≥ 1, and 0 if the feature was absent. This function captures the intuition that little additional importance should be given to subsequent appearances of a feature F in a document, since the logarithm grows slowly for arguments greater than one. In the following, we refer to this weighting schema as logF, i.e. the logarithm of the frequency.

A second variation was to assign increasing weights to features that appear with high frequencies within the death certificate. To this aim, the appearance of feature F was weighted according to the function exp(f(F)), while a zero value was assigned to absent features. The intuition is that, given the short length of the considered death certificates, the unexpected multiple occurrence of a feature provides strong evidence that that feature is important for the document. Using the exponential function to weight occurrences of a feature assigns dominating scores to features that occur frequently in a document. We refer to this weighting schema as expF.

Note that only local weighting functions were used to assign scores to features, that is, weights were computed only by taking into account the frequencies of appearance of a feature in a text, thus ignoring the distribution of that feature at a global level, i.e. across the dataset. The incorporation of global occurrence statistics within the weighting schemas is left to future work.
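As a concrete illustration of these four schemas, the short sketch below implements the binary, frequency, logF and expF weighting functions; it is our own rendering of the formulas above, not the code used in the experiments.

import math
from collections import Counter

def feature_weights(features, schema="frequency"):
    """Map each feature occurring in a certificate to its local weight.

    `features` is the list of feature instances extracted from one
    certificate; absent features implicitly receive a weight of 0.
    """
    counts = Counter(features)
    if schema == "binary":
        return {f: 1 for f in counts}                           # presence/absence
    if schema == "frequency":
        return dict(counts)                                     # raw frequency f(F)
    if schema == "logF":
        return {f: 1 + math.log(c) for f, c in counts.items()}  # 1 + log f(F)
    if schema == "expF":
        return {f: math.exp(c) for f, c in counts.items()}      # exp(f(F))
    raise ValueError(f"unknown weighting schema: {schema}")

# A feature occurring twice in a short certificate is dampened by logF
# but strongly boosted by expF.
print(feature_weights(["year", "year", "tumor"], schema="logF"))
print(feature_weights(["year", "year", "tumor"], schema="expF"))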
2.3 Automatic Classification Methodology

A number of common classifiers were evaluated, comprising statistical models (Naive Bayes), support vector machines (SPegasos), decision trees (C4.5), and boosting algorithms (AdaBoost). We considered the implementations of these algorithms provided in the Weka toolkit [7].

The multinomial Naive Bayes classifier determines the class of a death certificate according to the features that occur in the text and their weights. The SPegasos classifier uses a stochastic gradient descent algorithm with a hinge loss function to produce the separating hyperplane of a linear support vector machine. In the C4.5 classifier, information gain is used to choose, at each level of the decision tree, the feature that best splits the data into the two classes considered here (i.e. death certificates related to cancer and those not related to cancer). AdaBoost minimises a convex loss function built from the predictions of a base weak classifier; a simple decision tree classifier that constructs one-level trees (decision stumps) was used as the base classifier. Parameters of all classifiers were set to the default values described in Witten et al. [7].

3 Experimental Methodology

3.1 Data

A set of 5,000 free-text death certificates was acquired from Cancer Institute NSW, the institutional entity responsible for maintaining the Central Cancer Registry in New South Wales. Ethics approval for this study, including the use of the de-identified data, was granted by the NSW Population & Health Services Research Ethics Committee. The free-text documents were short, containing on average 13.08 words; the (unstemmed) vocabulary contained 3,751 unique words (including section headings and labels).

Cause of death classifications based on ICD-10 codes accompanied the reports. This coding set was acquired from the Australian Bureau of Statistics, which releases coded data yearly. The ICD-10 codings were used to determine the class to which each death certificate belonged; a list of ICD-10 codes that are cancer notifiable was provided by Cancer Institute NSW.

The 5,000 death certificates were extracted from Cancer Institute NSW archives so that documents were uniformly split across the two classes: 2,500 certificates were coded with ICD-10 codes that correspond to notifiable cancers according to the business rules of Cancer Institute NSW, while the remaining 2,500 were not cancer notifiable. The causes of death of the 2,500 death certificates for notifiable cancers span a total of 367 unique ICD-10 codes.

3.2 Evaluation

A 10-fold cross-validation methodology was used to train and test the classification algorithms. In this methodology, the dataset was randomly divided into 10 stratified3 folds of equal size. A model for each classifier was then learnt on nine of these folds, leaving one fold out for testing. The process was repeated by selecting a new fold for testing, while a new model was learnt from the remaining folds. Classification effectiveness was then averaged across the folds left out for testing in each iteration.

3 Folds were automatically stratified with respect to the two target classes, not the ICD-10 codes.

F-measure (F-m) was used as the primary metric to evaluate the efficacy of the implemented classifiers; accuracy, recall (sensitivity, Rec) and precision (positive predictive value, Prec) were also recorded, along with the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) classifications.
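For illustration, the sketch below reproduces this evaluation protocol with scikit-learn, using stratified 10-fold cross-validation and the F-measure as the scoring metric. The classifiers are stand-ins for the Weka implementations used in the paper (LinearSVC for SPegasos, MultinomialNB for Naive Bayes, DecisionTreeClassifier for C4.5, and AdaBoostClassifier, whose default base learner is a one-level decision tree); the random feature matrix is a placeholder, since the actual death certificate data cannot be shared.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 5,000 documents with binary-weighted features and a
# balanced cancerNotifiable label, mirroring the dataset of Section 3.1.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 200))
y = np.array([0, 1] * 2500)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Decision tree": DecisionTreeClassifier(),
    "AdaBoost (decision stumps)": AdaBoostClassifier(),
}

# Stratified 10-fold cross-validation; effectiveness is the F-measure
# averaged over the 10 held-out folds, as described in Section 3.2.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F-measure = {scores.mean():.4f}")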
4 Results and Discussion

The combination of 10 features, 4 weighting schemas, and 4 classifiers requires the evaluation of a total of 160 classifier settings (referred to as runs in the following) on the dataset of 5,000 death certificates. While we evaluated all combinations of features, weighting schemas and classifiers, given the large number of combinations it is not feasible to report the individual results of each run. Thus, we report only the settings of the 40 most effective runs in terms of F-measure, our primary evaluation metric (Table 2), with the F-measure of each classifier over all experimented settings graphically shown in Figure 3. Later in the paper we consider a summary evaluation of the variability of results provided by features, weighting schemas, and classifiers; this analysis comprises the results from all runs.

Fig. 3. Boxplot summarising the F-measure performance of the investigated classifiers (Naive Bayes, support vector machine, C4.5, AdaBoost) over all considered settings.

The results reported in Table 2 suggest that the tested approaches are highly effective in discriminating between those death certificates that contain a cancer notifiable cause of death and those that do not.

Classifier | Feature | Weight | Prec | Rec | F-m | TP | FN | FP | TN
SPegasos | concFullMorph + stemBigram | frequency | .9794 | .9504 | .9647 | 2376 | 124 | 50 | 2450
SPegasos | concFullMorph + stemBigram | logF | .9786 | .9500 | .9641 | 2375 | 125 | 52 | 2448
SPegasos | concept + stemBigram | logF | .9770 | .9508 | .9637 | 2377 | 123 | 56 | 2444
SPegasos | concFullMorph + stemBigram | binary | .9770 | .9504 | .9635 | 2376 | 124 | 56 | 2444
SPegasos | concept + stemBigram | binary | .9766 | .9504 | .9633 | 2376 | 124 | 57 | 2443
SPegasos | concept + stemBigram | frequency | .9766 | .9504 | .9633 | 2376 | 124 | 57 | 2443
SPegasos | stemBigram | binary | .9761 | .9488 | .9623 | 2372 | 128 | 58 | 2442
SPegasos | concFullMorph + stemBigram | expF | .9773 | .9476 | .9622 | 2369 | 131 | 55 | 2445
SPegasos | stemBigram | logF | .9753 | .9476 | .9612 | 2369 | 131 | 60 | 2440
SPegasos | stemBigram | expF | .9785 | .9444 | .9611 | 2361 | 139 | 52 | 2448
SPegasos | stemBigram | frequency | .9764 | .9452 | .9606 | 2363 | 137 | 57 | 2443
SPegasos | concept + stemBigram | expF | .9741 | .9460 | .9598 | 2365 | 135 | 63 | 2437
C4.5 | concept + stemBigram | logF | .9800 | .9392 | .9592 | 2348 | 152 | 48 | 2452
C4.5 | concept + stemBigram | expF | .9800 | .9392 | .9592 | 2348 | 152 | 48 | 2452
C4.5 | concept + stemBigram | frequency | .9800 | .9392 | .9592 | 2348 | 152 | 48 | 2452
C4.5 | concept + stemBigram | binary | .9799 | .9384 | .9587 | 2346 | 154 | 48 | 2452
C4.5 | concFullMorph + stemBigram | logF | .9856 | .9324 | .9583 | 2331 | 169 | 34 | 2466
C4.5 | concFullMorph + stemBigram | expF | .9856 | .9324 | .9583 | 2331 | 169 | 34 | 2466
C4.5 | concFullMorph + stemBigram | frequency | .9856 | .9324 | .9583 | 2331 | 169 | 34 | 2466
C4.5 | stemBigram | logF | .9848 | .9320 | .9577 | 2330 | 170 | 36 | 2464
C4.5 | stemBigram | expF | .9848 | .9320 | .9577 | 2330 | 170 | 36 | 2464
C4.5 | stemBigram | frequency | .9848 | .9320 | .9577 | 2330 | 170 | 36 | 2464
C4.5 | concFullMorph + stemBigram | binary | .9848 | .9320 | .9577 | 2330 | 170 | 36 | 2464
C4.5 | stemBigram | binary | .9848 | .9308 | .9570 | 2327 | 173 | 36 | 2464
AdaBoost | concept + stemBigram | binary | 1 | .8816 | .9371 | 2204 | 296 | 0 | 2500
AdaBoost | concept + stemBigram | logF | 1 | .8816 | .9371 | 2204 | 296 | 0 | 2500
AdaBoost | concept + stemBigram | expF | 1 | .8816 | .9371 | 2204 | 296 | 0 | 2500
AdaBoost | concept + stemBigram | frequency | 1 | .8816 | .9371 | 2204 | 296 | 0 | 2500
AdaBoost | concFullMorph + stemBigram | binary | 1 | .8816 | .9371 | 2204 | 296 | 0 | 2500
AdaBoost | concFullMorph + stemBigram | logF | 1 | .8816 | .9371 | 2204 | 296 | 0 | 2500
AdaBoost | concFullMorph + stemBigram | expF | 1 | .8816 | .9371 | 2204 | 296 | 0 | 2500
AdaBoost | concFullMorph + stemBigram | frequency | 1 | .8816 | .9371 | 2204 | 296 | 0 | 2500
AdaBoost | stemBigram | binary | 1 | .8784 | .9353 | 2196 | 304 | 0 | 2500
AdaBoost | stemBigram | logF | 1 | .8784 | .9353 | 2196 | 304 | 0 | 2500
AdaBoost | stemBigram | expF | 1 | .8784 | .9353 | 2196 | 304 | 0 | 2500
AdaBoost | stemBigram | frequency | 1 | .8784 | .9353 | 2196 | 304 | 0 | 2500
SPegasos | stem | logF | .9588 | .9120 | .9348 | 2280 | 220 | 98 | 2402
SPegasos | stem | frequency | .9611 | .9096 | .9346 | 2274 | 226 | 92 | 2408
Naive Bayes | stemBigram | binary | .9658 | .9036 | .9337 | 2259 | 241 | 80 | 2420
Naive Bayes | concept + stemBigram | binary | .9606 | .9076 | .9334 | 2269 | 231 | 93 | 2407

Table 2. Top 40 results ordered by decreasing F-measure (F-m).

Overall, the best classifier is the support vector machine implementation provided by SPegasos when used on concFullMorph + stemBigram features, i.e. the fully specified names of concepts associated with morphological abnormalities and disorders as encoded in SNOMED CT, weighted using raw frequencies. SPegasos is also found to be very effective when other combinations of weighting schemas and features are considered. In addition, this support vector machine classifier shows the smallest variance across all considered settings (Figure 3).

Among the best performing classifiers, AdaBoost used in conjunction with stemmed bigram features achieved perfect precision (Prec = 1), at the expense of recall. Although these results are remarkable, high precision may be considered less important than high recall in such a task. In a Cancer Registry setting, it is preferable to have high recall and to review death certificates that are incorrectly reported as containing a cancer notifiable cause of death than to miss cancer cases. This becomes particularly important if the missed cancer cases refer to rare cancers. AdaBoost also exhibits the highest variance across experimental settings among the considered classifiers (see Figure 3).

4.1 The Impact of Classifiers, Weighting Schemas, and Features

To better understand the role of specific features, weighting schemas, and classifiers in the effectiveness of the tested approaches, we analyse the empirical results treating each of the three key characteristics in turn as the controlled variable.

We start by examining the impact of each classification model on the overall effectiveness of the approaches. Table 3 reports the maximum (Max(F-m)), minimum (Min(F-m)), difference (∆), and variance of F-measure over all runs of each classifier model. SPegasos is found to be the classifier achieving the highest maximum and minimum F-measure values, thus extending the observations made on this classifier when examining the results of Table 2. While the Naive Bayes classifier was not found to be amongst the most effective classification models in our experiments, its robustness is second only to that of SPegasos, with performance ranging between 0.9337 and 0.7428 in F-measure. While models such as C4.5 and AdaBoost achieve higher values of F-measure than Naive Bayes, their minimum performances are lower than that recorded for Naive Bayes.

Classifier | Max(F-m) | Min(F-m) | ∆ | Variance
SPegasos | 0.9647 | 0.7767 | 0.1880 | 5.10 · 10^-3
Naive Bayes | 0.9337 | 0.7428 | 0.1909 | 5.10 · 10^-3
C4.5 | 0.9592 | 0.7355 | 0.2237 | 7.35 · 10^-3
AdaBoostM1 | 0.9371 | 0.6954 | 0.2417 | 7.88 · 10^-3

Table 3. Classification effectiveness across the four classifiers, ordered by increasing max-min F-measure range (∆).

We continue by analysing the influence of weighting schemas on the classification results of the approaches investigated in this work.
Simple raw frequency weighting, i.e. frequency, is found to be the most effective weighting schema. However, no weighting schema appears to be significantly better than another: while frequency achieves the best performance with an F-measure of 0.9647, the highest F-measure of the worst performing schema (expF) is 0.9622, just 0.0025 lower than frequency. Furthermore, all weighting schemas exhibit the same effectiveness when considering the worst performing settings; thus the range of performance differences and their variance do not differ substantially across weighting schemas. This may be due to the fact that death certificates are in general short documents, in which features occur fairly uniformly.

Weight | Max(F-m) | Min(F-m) | ∆ | Variance
binary | 0.9635 | 0.6954 | 0.2681 | 6.81 · 10^-3
frequency | 0.9647 | 0.6954 | 0.2693 | 6.74 · 10^-3
logF | 0.9641 | 0.6954 | 0.2687 | 6.80 · 10^-3
expF | 0.9622 | 0.6954 | 0.2668 | 6.53 · 10^-3

Table 4. Classification effectiveness across the four weighting schemas, ordered by increasing max-min F-measure range (∆).

Feature | Max(F-m) | Min(F-m) | ∆ | Variance
stemBigram | 0.9623 | 0.9275 | 0.0348 | 2.02 · 10^-4
concept + bigramStem | 0.9637 | 0.9267 | 0.0370 | 2.16 · 10^-4
concFullMorph + stemBigram | 0.9647 | 0.9255 | 0.0392 | 2.33 · 10^-4
concBigram + stemBigram | 0.8443 | 0.7677 | 0.0766 | 8.01 · 10^-4
concBigram | 0.8443 | 0.7677 | 0.0766 | 8.01 · 10^-4
concFullBigram | 0.7768 | 0.6954 | 0.0814 | 8.93 · 10^-4
conceptFull | 0.8090 | 0.7177 | 0.0913 | 1.17 · 10^-3
concept + stemBigram | 0.9302 | 0.8380 | 0.0922 | 8.39 · 10^-4
concept | 0.8743 | 0.7792 | 0.0951 | 1.13 · 10^-3
stem | 0.9348 | 0.8131 | 0.1217 | 1.36 · 10^-3

Table 5. Classification effectiveness across the ten features, ordered by increasing max-min F-measure range (∆).

The feature set is the final variable of our analysis, and the one with the greatest impact on classification results. The use of the concFullMorph + stemBigram feature provides the highest F-measure (0.9647), while concFullBigram yields the lowest maximal F-measure (0.7768): a substantial difference of 19.48%. The smallest variance was exhibited by stemBigram (2.02 · 10^-4), making it the most robust feature in our experiments; in addition, this feature yielded a maximal F-measure only 0.0024 lower than the best value recorded in our experiments. The minimal F-measure yielded by the stemBigram feature was also greater than the highest F-measure obtained with half of the features investigated in our study. These results provide a strong indication that, of the variables analysed, the choice of feature provides the greatest contribution to classification effectiveness.

5 Conclusions

Timely processing of cancer notifications is critical for timely reporting of cancer incidence and mortality. Death certificates are a rich source of data on cancer mortality, and Cancer Registries acquire free-text death certificates on a regular (e.g. fortnightly) basis. However, the cause of death information needs to be classified to facilitate reporting of cancer mortality, and cause of death information classified using ICD-10 codes is only available on an annual basis. In this paper we investigated the automatic classification of death certificates to identify cancer notifiable causes of death. The investigated approaches achieved overall strong classification effectiveness; a support vector machine classifier trained with token bigram features and information from the SNOMED CT medical terminology, weighted by their frequency in the documents, yielded an F-measure of 0.9647.
The choice of features, rather than that of classifiers or weighting schemas, was found to be the determining factor for high effectiveness. Future efforts will be directed towards an in-depth error analysis, in particular examining the distance between the prediction produced by a classifier and the decision threshold. We also plan to extend the investigation to predict the actual ICD-10 codes associated with causes of death related to cancer, so as to further assist clinical coders in processing cancer notifications.

References

1. Nguyen, A., Moore, J., Lawley, M., Hansen, D., Colquist, S.: Automatic extraction of cancer characteristics from free-text pathology reports for cancer notifications. In: Health Informatics Conference (2011) 117-124
2. Zuccon, G., Nguyen, A., Bergheim, A., Wickman, S., Grayson, N.: The impact of OCR accuracy on automated cancer classification of pathology reports. Studies in Health Technology and Informatics 178 (2012) 250
3. D'Avolio, L., Nguyen, T., Farwell, W., Chen, Y., Fitzmeyer, F., Harris, O., Fiore, L.: Evaluation of a generalizable approach to clinical information retrieval using the automated retrieval console (ARC). Journal of the American Medical Informatics Association 17(4) (2010) 375-382
4. Harris, K.: Selected data editing procedures in an automated multiple cause of death coding system. In: Proceedings of the Conference of European Statistics (1999)
5. Davis, K., Staes, C., Duncan, J., Igo, S., Facelli, J.: Identification of pneumonia and influenza deaths using the death certificate pipeline. BMC Medical Informatics and Decision Making 12(1) (2012) 37
6. Nguyen, A.N., Lawley, M.J., Hansen, D.P., Bowman, R.V., Clarke, B.E., Duhig, E.E., Colquist, S.: Symbolic rule-based classification of lung cancer stages from free-text pathology reports. Journal of the American Medical Informatics Association 17(4) (2010) 440-445
7. Witten, I., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2011)