Classifying Encounter Notes in the Primary Care Patient Record

Thomas Brox Røst, Øystein Nytrø and Anders Grimsmo
Department of Computer and Information Science and the Norwegian EHR Research Centre, Norwegian University of Science and Technology, Trondheim, Norway. Emails: {brox, nytroe}@idi.ntnu.no, anders.grimsmo@medisin.ntnu.no

Abstract. The ability to automate the assignment of primary care medical diagnoses from free text holds many interesting possibilities. We have collected a dataset of free-text clinical encounter notes and their corresponding manually coded diagnoses and used it to build a document classifier. Classifying a test set of 2,000 random encounter notes yielded a coding accuracy rate of 49.7 %. Automated coding of primary care encounter notes is a novel application area, and though imperfect, our method proves interesting enough to warrant further research.

1 Introduction

In this study we attempt to classify primary care clinical encounter notes into their corresponding diagnoses. We do so by learning document classifiers from a manually coded dataset collected from a Norwegian primary care center. Research has shown that the manual diagnosis coding of primary care encounter notes tends to be of high quality [20]. This, coupled with the size of the dataset, makes the application area interesting from an information retrieval and document classification point of view. In the long term, being able to infer diagnoses from written text might prove useful in, for example, detecting incorrect diagnoses and improving electronic patient record systems. We consider this study an initial exploration into applying proven document classification techniques to a novel application area.

The electronic patient record (EPR) has gradually attained widespread usage in primary care. In Norway, more than 90 % of primary care physicians routinely use computer-based patient-record systems [3], and many have been doing so for more than 15 years. A typical feature of most commercial EPR systems in use today is that the encounter note, which is the main documentation of the doctor-patient consultation, is written as free-text narrative. There are perfectly practical reasons for this: unstructured free text is easy to write and represents the traditional way of documenting patient treatment. However, it makes the information within less suitable for automated processing and thereby keeps the EPR from fulfilling its full potential as a useful tool for both research and clinical practice. Attempts have been made to create EPRs that impose varying degrees of structure on the clinical narrative, but with limited success so far.

To alleviate this problem, many researchers have applied natural language processing (NLP), text classification and text mining techniques to clinical narrative. Some NLP systems have proven very useful in a number of clearly defined domains, such as detection of bacterial pneumonia from chest X-ray reports [4], finding adverse drug events in outpatient medical records [10] and discharge summaries [19], and identifying suspicious findings in mammogram reports [12]. A common feature of such systems is that they restrict themselves to a narrow clinical domain with a clearly defined vocabulary and a limited form of discourse, such as one would find in specialized hospital reports. Our long-term goal is to draw on research from these areas and explore the usefulness of similar techniques on the primary care patient record. However, the lack of empirical knowledge about the content of primary care documentation raises the need for preliminary investigations of the narrative structure found therein. This initial study uses supervised document classification to explore whether there is a correspondence between the diagnosis and the documented encounter. Besides the previously mentioned possible benefits of automated coding, a secondary purpose is to learn more about the informational value and underlying documentational patterns in primary care encounter notes.
2 Background

Among the characteristic features of primary care encounter notes are sparseness, brevity, heavy use of abbreviations and many spelling mistakes. The notes are normally written during the consultation by the treating physician, in contrast with hospital patient records, which are usually dictated by the physician and then transcribed by a secretary. A typical encounter note might look something like this (translated from the Norwegian):

    Inflamed wounds over the entire body. Was treated w/ apocillin and fucidin cream 1 mth. ago. Still using fucidin. Taking sample for bact. Beginning tmnt. with bactroban. Call in 1 week for test results.

To classify such notes we rely on the presence of manually coded diagnosis codes. The use of clinical codes in primary care is common in the United Kingdom, the Netherlands, and Norway [16]. The motivation for coding is both reimbursement and statistical purposes. In our experimental dataset the notes are coded according to the ICPC-2 coding system. ICPC-2 is the second edition of the International Classification of Primary Care, a coding system whose purpose is to provide a classification that reflects the particular needs and aspects of primary care [11]. Using a single ICPC code, each health care encounter can be classified so that the reasons for encounter, diagnoses or problems, and process of care are all evident. Together, these elements make up the core constituent parts of the health care encounter in primary care. Moreover, one or more encounters associated with the same health problem or disease form an episode of care [9].

ICPC-2 follows a bi-axial structure with 17 chapters along one axis and 7 components along the other. The chapters are single-letter representations of body systems (Table 1) while the components are two-digit numeric values (Table 2). As an example, "R02" is the ICPC code for shortness of breath.

Table 1. ICPC chapter codes.

    Chapter code   Description
    A              General and unspecified
    B              Blood, blood-forming organs and immune mechanism
    D              Digestive
    F              Eye
    H              Ear
    K              Circulatory
    L              Musculoskeletal
    N              Neurological
    P              Psychological
    R              Respiratory
    S              Skin
    T              Endocrine, metabolic and nutritional
    U              Urological
    W              Pregnancy, child-bearing, family planning
    X              Female genital
    Y              Male genital
    Z              Social problems

Table 2. ICPC component codes.

    Number   Range   Description
    1        01-29   Complaint and symptom component
    2        30-49   Diagnostic, screening, and preventive component
    3        50-59   Medication, treatment, procedures component
    4        60-61   Test results component
    5        62-63   Administrative component
    6        64-69   Referrals and other reasons for encounter
    7        70-99   Diagnosis/disease component
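Since ICPC-2 codes are purely positional, the two axes of a code can be recovered mechanically. The following Python sketch (an illustration, not part of any actual coding workflow) decomposes a code accordingly; the chapter table is abbreviated here, and the code is assumed to follow the letter-plus-two-digits pattern described above.

    # Minimal sketch: decomposing an ICPC-2 code into its two axes.
    # Chapter table abbreviated; see Table 1 for all 17 chapters.
    CHAPTERS = {"A": "General and unspecified", "R": "Respiratory",
                "S": "Skin", "Z": "Social problems"}

    COMPONENTS = [  # (low, high, description), from Table 2
        (1, 29, "Complaint and symptom"),
        (30, 49, "Diagnostic, screening, and preventive"),
        (50, 59, "Medication, treatment, procedures"),
        (60, 61, "Test results"),
        (62, 63, "Administrative"),
        (64, 69, "Referrals and other reasons for encounter"),
        (70, 99, "Diagnosis/disease"),
    ]

    def parse_icpc2(code: str) -> tuple[str, str]:
        """Split e.g. 'R02' into its chapter and component descriptions."""
        chapter, number = code[0].upper(), int(code[1:])
        component = next(desc for low, high, desc in COMPONENTS
                         if low <= number <= high)
        return CHAPTERS.get(chapter, "unknown chapter"), component

    print(parse_icpc2("R02"))  # ('Respiratory', 'Complaint and symptom')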
There are several examples of attempts to automate the coding of diagnoses [5, 15, 18, 21, 23], all of which concern themselves with the alternative ICD code. ICD is a more complex code than ICPC and is more suited for specialized usage in hospitals. March [18] describes the use of Bayesian learning to achieve automated ICD coding of discharge diagnoses. Franz [5] compares coding methods with and without the use of an underlying lexicon and concludes that lexicon-based methods perform no better than lexicon-free methods, unless one adds conceptual knowledge. Larkey [15] found that using a combination of different classifiers yielded improved automatic assignment of ICD codes. There is a practical purpose to automated ICD coding: because of the code's complexity, manual ICD encoding takes up a lot of time.

There have also been other approaches to automated coding of clinical text. Hersh [8] attempted to predict trauma registry procedure codes from emergency room dictations. Aronow [2] classified encounter notes in order to find acute exacerbations of asthma, and radiology reports for certain findings, through the use of Bayesian inference networks and the ID3 decision tree algorithm. Document classification and information retrieval have been applied in other medical domains as well, such as clustering of medical paper abstracts [17].

Examples of automated ICPC coding are harder to come by. Letrilliart [16] describes a string matching system that assigns ICPC codes from free-text sentences containing hospital referral reasons, based on a manually created look-up table. We have not found examples of similar attempts at automated ICPC classification in the literature.

As for classification techniques, this study uses support vector machines (SVM). SVMs have proved useful and have shown good general performance for text classification tasks when compared with other classifiers [13]. Our goal for this study is not to compare classification methods; this will be explored further in future work.

3 Methods and Data

We have collected a dataset from a medium-sized general practice office in Norway. The data consists of encounter notes for a total of 10,859 patients in the period from 1992 to 2004. All in all, there are 482,902 unique encounters. The Norwegian Health Personnel Act [1] requires that caregivers provide "relevant and necessary information about the patient and about the health care" in the patient record. In practice, this manifests itself as a combination of structured and unstructured information about the encounter. Information such as personal details about the patient, prescriptions, laboratory results, medical certificates and diagnosis codes is typically available in structured format, while encounter notes, referrals and discharge notes come in the form of unstructured free text. For the purposes of this paper, we have only considered the encounter notes and the accompanying ICPC-2 diagnosis code.

A known source of noise is that a minority of the notes are likely to be written in Danish or nynorsk (literally "New Norwegian") rather than standard Norwegian (bokmål). There are also more than 20 different authors, so there may be differences in documentational style as well. Interns fresh out of medical school may, for example, be more inclined to document thoroughly than an experienced physician.

The dataset has been automatically anonymized using a custom-built anonymization tool [22]. Each word or token is checked against a database of words that are known to be insensitive, and against a set of rules that deal with alphanumeric patterns such as medication doses, date ranges, and laboratory test values. Sensitive tokens are replaced with a general identifier or an identifier that shows the type of token that was replaced.
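The anonymization tool itself is described in [22]; the sketch below only illustrates the general word-list-plus-pattern-rules idea. The word list, the patterns and the placeholder identifiers are hypothetical stand-ins, not those of the actual tool.

    import re

    # Hypothetical whitelist of known-insensitive words; the real tool [22]
    # checks tokens against a database of such words.
    SAFE_WORDS = {"inflamed", "wounds", "over", "the", "entire", "body",
                  "treated", "with"}

    # Illustrative pattern rules for alphanumeric tokens, each mapping to a
    # type identifier that replaces the matched token.
    PATTERN_RULES = [
        (re.compile(r"^\d+(mg|ml|g)$"), "<DOSE>"),                   # doses
        (re.compile(r"^\d{1,2}[./]\d{1,2}[./]\d{2,4}$"), "<DATE>"),  # dates
        (re.compile(r"^\d+([.,]\d+)?$"), "<VALUE>"),                 # lab values
    ]

    def anonymize(note: str) -> str:
        out = []
        for token in note.split():
            word = token.lower().strip(".,;:")
            if word in SAFE_WORDS:
                out.append(token)
                continue
            for pattern, type_id in PATTERN_RULES:
                if pattern.match(word):
                    out.append(type_id)
                    break
            else:
                out.append("<UNKNOWN>")  # potentially sensitive token
        return " ".join(out)

    print(anonymize("Treated with 500mg apocillin 12.03.2003"))
    # -> "Treated with <DOSE> <UNKNOWN> <DATE>"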
Each encounter will typically consist of a written note of highly variable length and zero or more accompanying ICPC codes; 287,868 of the available encounters have one or more ICPC codes (Table 3).

Table 3. Number of ICPC codes per encounter.

    Number of ICPC codes   Number of encounters
    1                      235,860
    2                      44,651
    3                      6,037
    ≥4                     1,320

There are some notable differences in code use between hospital and primary care settings. Larkey [15] describes a test set of discharge summaries with a mean of 4.43 ICD-9 codes per document, while Nilsson [20] notes that a set of Swedish general practice patient records has a mean of 1.1 ICD-10 codes per record. While there may be regional and cultural differences with respect to coding practice, the latter corresponds with our finding of 1.2 ICPC-2 codes per note (Table 3).

Since we concern ourselves with the relation between the encounter note and the ICPC code, we discard all encounters with more than one code in order to avoid ambiguity in the training data. Of the 235,860 encounters that are left, 175,167 have an accompanying encounter note.

The use of ICPC codes as classification bins for encounter notes is essentially a multi-class classification problem. Since there are 726 distinct ICPC codes, it becomes practical to reduce the class dimensionality. We choose to group codes according to their chapter value, so that we are left with the 17 single-letter body system codes as classes.

When grouping encounter notes by their ICPC chapter value, we note that there is a varying degree of verbosity. Sparse encounter notes are common in primary care, for instance when renewing recurring prescriptions. To determine average note verbosity for each ICPC chapter, all relevant encounter notes are tokenized. After removing stop words, whitespace and other noisy elements, the average length and standard deviation are calculated (Table 4).

Table 4. Average note length by ICPC chapter.

    Chapter               Avg. no. words   St. dev.   Samples
    N (Neurological)      40               33.2       5,637
    D (Digestive)         39               30.0       11,386
    Z (Social)            36               35.1       570
    X (Female genital)    36               27.1       6,244
    P (Psychological)     32               35.6       9,939
    A (General)           32               28.9       12,052
    Y (Male genital)      31               24.9       1,993
    F (Eye)               31               23.5       4,998
    L (Musculoskeletal)   29               26.8       36,493
    R (Respiratory)       28               21.8       22,846
    K (Circulatory)       27               25.6       21,089
    H (Ear)               27               21.3       5,526
    W (Pregnancy)         26               24.5       5,614
    U (Urological)        26               25.2       4,502
    T (Endocrine)         26               22.4       5,498
    S (Skin)              26               20.3       18,432
    N/A                   23               20.6       6,545
    B (Blood)             22               23.3       2,348

We note that Larkey's discharge summaries [15] have a mean length of 633 words, which is more than an order of magnitude higher than our notes. Notwithstanding cultural and institutional differences, this highlights how hospital discharge summaries usually provide a more self-contained description of the patient and his ailments. In the Norwegian health care system, the patient will typically use just one primary care physician, who acts as a gatekeeper for specialized hospital care when necessary. Accordingly, descriptions of the patient's state may span several encounter notes in the primary care patient record.
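The statistics in Table 4 follow from a straightforward procedure: tokenize each note, drop stop words and noise, then compute per-chapter means and standard deviations. A minimal sketch, with a simplified tokenizer and a hypothetical stop-word list:

    import re
    from statistics import mean, stdev

    # Simplified stand-in for a Norwegian stop-word list.
    STOP_WORDS = {"og", "i", "på", "for", "med", "er"}

    def tokenize(note: str) -> list[str]:
        # Lowercased word tokens; strips whitespace, digits and punctuation.
        return [t for t in re.findall(r"[a-zæøå]+", note.lower())
                if t not in STOP_WORDS]

    def length_stats(notes_by_chapter: dict[str, list[str]]) -> None:
        # notes_by_chapter maps an ICPC chapter letter to its encounter notes.
        for chapter, notes in notes_by_chapter.items():
            lengths = [len(tokenize(n)) for n in notes]
            print(chapter, round(mean(lengths)),
                  round(stdev(lengths), 1), len(lengths))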
Since many classification techniques, including support vector machines, are restricted to binary classification tasks, we have to reduce our multi-class classification task to a set of binary tasks. For each pair of classes (i, j) with i, j ∈ {A, B, ..., Z} and j ≠ i, we create a two-class classifier <i, j>. If c is the number of classes, we end up with c(c − 1) binary classifiers, or 17 × 16 = 272 in this case. This technique is known as double round robin classification [6]. The classifier <i, j> is trained solely on encounter notes with ICPC chapter codes i and j. To determine the final predicted class of any given note, we feed it through each classifier and record the result. The class that receives the highest number of predictions is chosen as the most likely one. In case of ties we choose the class with the highest number of occurrences in the training set or, as a last resort, pick one at random. To build and run the classifiers we used the SVM-Light toolkit (http://svmlight.joachims.org/).

We use word and phrase frequencies as the base components when constructing feature vectors for the classifiers. If we were to rely on single words alone we would lose some contextual information [8], so frequency counts are performed on all unigrams, bigrams and trigrams in the encounter note, excluding stop words. The occurrence of an n-gram is recorded as a true value in the feature vector. While n-grams may be a simplistic way of representing context, they still allow us to catch phrases and turns of words that may have discerning qualities.

As is common with word-based feature vectors, it is useful to apply some dimension-reducing technique to limit the size of the vector. The challenge lies in pruning those features that are the most inconsequential to the classifier's predictive qualities. For this experiment we adapt a technique described in [14]. For each classifier, the frequencies of all unigrams, bigrams and trigrams occurring in the training notes for both classes are counted. If an n-gram occurs in more than 7.5 % of either the true or the false class notes, it is tagged as a likely candidate for inclusion. All candidates are then ranked according to the ratio of their true class frequency to their false class frequency. Finally, the top 100 candidates are chosen as the most relevant features. As an example, Table 5 shows the first 20 selected features from the F (Eye) versus P (Psychological) classifier.

Table 5. F versus P classifier, 20 most relevant features.

    Original n-gram   Approx. English translation   Comment
    kloramf           chloramph                     Abbreviation
    cornea            cornea
    øyelokk           eyelid
    rusk              dust
    hø øye            right eye                     Abbreviation
    kloramfenikol     chloramphenicol
    rdt               red
    ve øye            left eye                      Abbreviation
    øye               eye
    øyet              the eye
    injeksjon         injection
    puss              pus
    øyne              eyes
    hø                right                         Abbreviation
    ve                left                          Abbreviation
    begge             both                          Abbreviation
    ved us            after examination             Abbreviation
    us                examination                   Abbreviation
    lett              easily
    ser               sees

2,000 notes were selected at random from the 175,167 available notes to be used as a test set; the remaining notes were used to train the classifiers. As seen from Table 4, this implies that the amount of training data available for each classifier differs.
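To make the feature construction and selection concrete, the sketch below extracts unigram-to-trigram occurrences per note and applies the 7.5 % candidate threshold and ratio ranking described above. It is an illustrative reading of the procedure adapted from [14], not the exact implementation; notes are assumed to be pre-tokenized with stop words removed.

    from collections import Counter

    def ngrams(tokens: list[str], max_n: int = 3) -> set[str]:
        # All unigrams, bigrams and trigrams in one note; a set, since we
        # only record whether an n-gram occurs in a note, not how often.
        return {" ".join(tokens[i:i + n])
                for n in range(1, max_n + 1)
                for i in range(len(tokens) - n + 1)}

    def select_features(true_notes, false_notes, threshold=0.075, top_k=100):
        # Document frequency of each n-gram within each class.
        true_df, false_df = Counter(), Counter()
        for note in true_notes:
            true_df.update(ngrams(note))
        for note in false_notes:
            false_df.update(ngrams(note))

        # Candidates: n-grams occurring in more than 7.5 % of the notes
        # of either the true or the false class.
        candidates = {g for g in true_df | false_df
                      if true_df[g] / len(true_notes) > threshold
                      or false_df[g] / len(false_notes) > threshold}

        # Rank by the ratio of true-class to false-class document frequency.
        def ratio(g):
            false_freq = false_df[g] / len(false_notes)
            return (true_df[g] / len(true_notes)) / (false_freq or 1e-9)

        return sorted(candidates, key=ratio, reverse=True)[:top_k]

The selected n-grams would then become the boolean dimensions of the feature vectors handed to SVM-Light.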
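The double round robin voting scheme can likewise be sketched as follows. Here, classify stands in for a trained pairwise SVM-Light model and train_counts for the class frequencies used in tie-breaking; both are assumptions of this sketch rather than the actual interfaces used.

    import random
    from collections import Counter
    from itertools import permutations

    CHAPTERS = "ABDFHKLNPRSTUWXYZ"  # the 17 ICPC chapter classes

    def predict_chapter(note, classify, train_counts):
        # classify(i, j, note) -> True if the <i, j> model prefers class i.
        # One vote per ordered pair: c(c - 1) = 272 classifiers in total.
        votes = Counter()
        for i, j in permutations(CHAPTERS, 2):
            votes[i if classify(i, j, note) else j] += 1

        best = max(votes.values())
        tied = [c for c, v in votes.items() if v == best]
        if len(tied) == 1:
            return tied[0]
        # Ties: prefer the class most frequent in the training set,
        # then, as a last resort, pick one of the remaining at random.
        most_frequent = max(train_counts[c] for c in tied)
        tied = [c for c in tied if train_counts[c] == most_frequent]
        return random.choice(tied)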
4 Results

Table 6 shows the results of attempting to classify the 2,000 test cases. A total of 994 cases were classified correctly, giving an overall accuracy rate of 49.7 %. As a comparison, always guessing the most frequent chapter code (L) would yield an accuracy of 20.8 %. The displayed results are from a single test run.

Table 6. Predicted classes of 2,000 notes in test set.

    Correct                        Predicted ICPC chapter
    ICPC chapter    A  B   D   F   H    K    L  N   P    R    S  T  U   W   X  Y  Z    Sum   % correct
    A              13  0  10   0   0   13   71  0   3   25   12  0  0   0   2  0  0    149       8.7 %
    B               0  0   0   0   0    1   25  0   0    6    0  0  0   0   0  0  0     32       0.0 %
    D               1  0  64   0   0    1   47  0   0    4    9  0  0   0   1  0  0    127      50.3 %
    F               0  0   0  19   0    1   30  1   0    5    2  0  0   0   0  0  0     58      32.7 %
    H               0  0   0   0  16    2   29  0   0   10    4  0  0   0   1  0  0     62      25.8 %
    K               0  0   3   0   0  158   56  0   0    5    0  0  0   0   1  0  0    223      70.8 %
    L               0  0   3   0   0    5  348  1   0    5    9  0  1   0   1  0  0    373      93.2 %
    N               2  0   2   0   0    9   42  4   3    1    0  0  0   0   3  0  0     66       6.0 %
    P               1  0   2   0   0    5   93  0  33    4    0  0  0   0   3  0  0    141      23.4 %
    R               3  0   3   0   0    5   73  0   0  170    2  0  0   0   2  0  0    258      65.8 %
    S               0  0   2   0   3    2   84  0   1    3  128  0  0   0   0  0  0    223      57.3 %
    T               1  0   2   0   0    8   30  1   5    2    0  2  0   0   2  0  0     53       3.7 %
    U               0  0   0   0   0    2   31  0   5    1    2  0  1   0   0  0  0     42       2.3 %
    W               0  0   0   0   0    7   56  0   1    0    0  0  0  15   4  0  0     83      18.0 %
    X               0  0   6   0   0    8   45  0   1    3    1  0  0   3  23  0  0     90      25.5 %
    Y               0  0   1   0   0    2   14  0   1    0    0  0  1   0   0  0  0     19       0.0 %
    Z               0  0   0   0   0    0    1  0   0    0    0  0  0   0   0  0  0      1       0.0 %

5 Discussion and Future Work

When considering the results, we must bear in mind that they are from a single run. To verify their validity, they should be averaged over several test runs on independent samples.

Even though the accuracy varies a lot between the individual chapters, the results are still quite promising. The most notable feature is how the L (Musculoskeletal) class appears to soak up the majority of the misclassified cases. We are not sure why this happens. The L group constitutes the largest group in the training set, followed by the R, K and S groups. When attempting to perform the same classification task without the L cases, the S group became the major misclassification bin, but in a less dramatic fashion; the overall accuracy rate rose to 57.5 %. In general, our naive, largely domain-ignorant approach produced results that are interesting enough to legitimate further work in this area.

There are several possible approaches to improving the predictive quality of the classifier. We made no attempt to normalize the vocabulary in the training data. Techniques such as stemming or mapping terms to a common controlled vocabulary would reduce the number of relevant features. This would also involve dealing with common misspellings [7] and dialect terms, both of which are quite common in our dataset. Wilcox [24] notes that the use of expert knowledge can provide a significant boost to medical text report classifiers.
It would also be worth investigating whether the use of accompanying information from the EPR, such as lab results and prescriptions, can help improve classification quality. Another possible approach is to view the encounter note in its longitudinal context by also considering notes from previous (and following) encounters.

We made no effort to control the amount of noise in the classifiers or to screen the notes in the test set. Very short notes and notes with non-standard language use were not discarded. Also, the influence of the n-gram feature selection threshold on the quality of the results could have been evaluated. Similarly, the effect of using additional parameters such as average note length and partial n-gram coincidence would have been worth investigating.

The a priori anonymization could also influence the results. Since the anonymization tool only allows known non-sensitive words through, it is likely that special and unusual words are lost. Such words may have a higher predictive effect than more common words. Running the classifier on a non-anonymized dataset could indicate how much of a destructive effect is incurred by anonymization.

The choice of ICPC chapter codes as class indicators is not necessarily a natural one. Indeed, it may be seen as a simplification of the problem of diagnosis prediction. Alternatives include grouping according to ICPC component codes or, as a natural follow-up, attempting to classify into the full ICPC code set of 726 distinct codes.

ACKNOWLEDGEMENTS

Thanks go to Amund Tveit, Ole Edsberg, Inger Dybdahl Sørby and Gisle Bjørndal Tveit for comments and suggestions.

REFERENCES

[1] Act of 18 May 2001 No. 24 on personal health data filing systems and the processing of personal health data, (2001).
[2] D. B. Aronow, S. Soderland, J. M. Ponte, F. Feng, W. B. Croft, and W. G. Lehnert, 'Automated classification of encounter notes in a computer based medical record', Medinfo, 8 Pt 1, 8-12, (1995).
[3] Elisabeth Bayegan, Knowledge Representation for Relevance Ranking of Patient-Record Contents in Primary-Care Situations, Ph.D. dissertation, Norwegian University of Science and Technology (NTNU), 2002.
[4] M. Fiszman, W. W. Chapman, D. Aronsky, R. S. Evans, and P. J. Haug, 'Automatic detection of acute bacterial pneumonia from chest x-ray reports', J Am Med Inform Assoc, 7(6), 593-604, (2000).
[5] Pius Franz, Albrecht Zaiss, Stefan Schulz, Udo Hahn, and Rüdiger Klar, 'Automated coding of diagnoses - three methods compared', in Proceedings of the Annual Symposium of the American Medical Informatics Association (AMIA), Los Angeles, CA, USA, (2000).
[6] Johannes Fürnkranz, 'Round robin classification', J. Mach. Learn. Res., 2, 721-47, (2002).
[7] W. R. Hersh, E. M. Campbell, and S. E. Malveau, 'Assessing the feasibility of large-scale natural language processing in a corpus of ordinary medical records: a lexical analysis', Proc AMIA Annu Fall Symp, 580-4, (1997).
[8] W. R. Hersh, T. K. Leen, P. S. Rehfuss, and S. Malveau, 'Automatic prediction of trauma registry procedure codes from emergency room dictations', Medinfo, 9 Pt 1, 665-9, (1998).
[9] I. M. Hofmans-Okkes and H. Lamberts, 'The international classification of primary care (ICPC): new applications in research and computer-based patient records in family practice', Fam Pract, 13(3), 294-302, (1996).
[10] B. Honigman, P. Light, R. M. Pulling, and D. W. Bates, 'A computerized method for identifying incidents associated with adverse drug events in outpatients', Int J Med Inform, 61(1), 21-32, (2001).
[11] WONCA International, ICPC-2: International Classification of Primary Care, Oxford Medical Publications, 2nd edn., 1998.
[12] N. L. Jain and C. Friedman, 'Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports', Proc AMIA Annu Fall Symp, 829-33, (1997).
[13] Thorsten Joachims, 'Text categorization with support vector machines: learning with many relevant features', in ECML '98: Proceedings of the 10th European Conference on Machine Learning, pp. 137-142, London, UK, (1998). Springer-Verlag.
[14] Andries Kruger, C. Lee Giles, Frans Coetzee, Eric Glover, Gary Flake, Steve Lawrence, and Cristian Omlin, 'Deadliner: building a new niche search engine', in Ninth International Conference on Information and Knowledge Management, CIKM 2000, Washington, DC, (2000).
[15] Leah S. Larkey and W. Bruce Croft, 'Combining classifiers in text categorization', in SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 289-97, Zurich, Switzerland, (1996). ACM Press.
[16] L. Letrilliart, C. Viboud, P. Y. Boelle, and A. Flahault, 'Automatic coding of reasons for hospital referral from general medicine free-text reports', Proc AMIA Symp, 487-91, (2000).
[17] Pavel Makagonov, Mikhail Alexandrov, and Alexander Gelbukh, 'Clustering abstracts instead of full texts', Lecture Notes in Computer Science, 3206, 129-35, (2004).
[18] Alan D. March, Eitel J. M. Lauría, and Jorge Lantos, 'Automated ICD9-CM coding employing Bayesian machine learning: a preliminary exploration', in Simposio de Informática y Salud 2004, (2004).
[19] G. B. Melton and G. Hripcsak, 'Automated detection of adverse events using natural language processing of discharge summaries', J Am Med Inform Assoc, 12(4), 448-57, (2005).
[20] G. Nilsson, H. Ahlfeldt, and L. E. Strender, 'Textual content, health problems and diagnostic codes in electronic patient records in general practice', Scand J Prim Health Care, 21(1), 33-6, (2003).
[21] Y. Satomura and M. B. do Amaral, 'Automated diagnostic indexing by natural language processing', Med Inform (Lond), 17(3), 149-63, (1992).
[22] Amund Tveit, Ole Edsberg, Thomas Brox Røst, Arild Faxvaag, Øystein Nytrø, Torbjørn Nordgård, Martin Thorsen Ranang, and Anders Grimsmo, 'Anonymization of general practitioners' patient records', in Proceedings of the HelsIT'04 Conference, Trondheim, Norway, (2004).
[23] Rodrigo F. Vale, Berthier A. Ribeiro-Neto, Luciano R. S. de Lima, Alberto H. F. Laender, and Hermes R. F. Junior, 'Improving text retrieval in medical collections through automatic categorization', Lecture Notes in Computer Science, 2857, 197-210, (2003).
[24] A. B. Wilcox and G. Hripcsak, 'The role of domain knowledge in automating medical text report classification', J Am Med Inform Assoc, 10(4), 330-8, (2003).