<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Medical Record Retrieval and Extraction for Professional Information Access</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chia-Chun Lee</string-name>
          <email>cclee@nlg.csie.ntu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hen-Hsen Huang</string-name>
          <email>hhhuang@nlg.csie.ntu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hsin-Hsi Chen</string-name>
          <email>hhchen@ntu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering National Taiwan University Taipei</institution>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <fpage>17</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>This paper analyzes linguistic phenomena in the medical records of different departments, including average record size, vocabulary, the entropy of the medical language, grammaticality, and so on. Five retrieval models with six pre-processing strategies are explored on different parts of medical records in an NTUH medical record dataset. Both coarse-grained relevance evaluation at the department level and fine-grained relevance evaluation at the course-and-treatment level are conducted. Queries against medical records written in medical language of smaller entropy tend to perform better. Departments related to generic parts of the body, such as the Departments of Internal Medicine and Surgery, may confuse the retrieval, in particular for the Departments of Oncology and Neurology. The Okapi model with stemming achieves the best performance at both the department and the course-and-treatment levels.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Case studies are indispensable for learning medical knowledge. The courses and treatments of similar cases provide
important references, in particular for medical students and junior physicians. How to retrieve relevant medical
records effectively and efficiently is thus an essential research topic. The TREC 2011 [Voo11] and 2012 [Voo12] Medical
Records tracks provide test collections for patient retrieval based on a set of clinical criteria. Several approaches,
such as concept-based retrieval [Koo11], query expansion [Din11], and knowledge-based retrieval [Dem11], have been
proposed to improve retrieval performance. In this paper, we investigate medical record retrieval on an NTUH dataset
provided by National Taiwan University Hospital. Given a chief complaint and/or a brief history, we would like to
find the related medical records and propose the examinations, medicines, and surgeries that may be performed for the
input case.</p>
      <p>The structure of this paper is organized as follows. The characteristics of the domain-specific dataset are
addressed and analyzed in Section 2. Several information retrieval models and medical term extraction methods
are explored on the dataset in Section 3. Both coarse-grained relevance evaluation on department level and
fine-grained relevance evaluation on course and treatment level are conducted and discussed in Section 4. Finally,
Section 5 concludes this paper.</p>
    </sec>
    <sec id="sec-1-2">
      <title>Description of the NTUH Medical Record Dataset</title>
      <p>In the NTUH dataset, almost all medical records are written in English. A medical record is composed of three
major parts, including a chief complaint, a brief history, and a course and treatment. A chief complaint is a short
statement specifying the purpose of a patient’s visit and the patient’s physical discomfort, e.g., Epigastralgia for 10
days, Tarry stool twice since last night, and so on. It describes the symptoms found by the patient and the duration
of these symptoms. A brief history summarizes the personal information, the physical conditions, and the past
medical treatment of the patient. In an example shown in Figure 1, the first paragraph lists the personal information
and the physical conditions, and the second paragraph shows the past medical treatment. A course and treatment
describes the treatment processes and the treatment outcomes in detail. Figure 2 is an example of a course and
treatment, where medication administration, inspection, and surgery are enclosed in &lt;a&gt;&lt;/a&gt;, &lt;i&gt;&lt;/i&gt;, and &lt;s&gt;&lt;/s&gt;
pairs, respectively.</p>
      <p>There are 113,625 medical records in the NTUH experimental dataset after removing records of scheduled
cases, records with empty complaints or with complaints written in Chinese, and records whose treatments do not
mention any examination, medicine, or surgery. Table 1 lists the mean (µ) and standard deviation (σ) of the chief
complaint (CC), brief history (BH), course and treatment (CT), and whole medical record (MR) in terms of the number
of words in the corresponding part. Here a word is defined as a character string separated by spaces. Patient and
physician names are removed from the dataset for privacy reasons. The brief history is the longest part, while the
chief complaint is the shortest.</p>
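      <p>As a quick illustration of the statistics reported in Table 1, the per-part word counts can be computed as follows. This is a minimal sketch over hypothetical mini-records; the real dataset is not shown here.</p>

```python
import statistics

# A "word" is defined as in the paper: a character string separated by spaces.
def word_count(text):
    return len(text.split())

# Hypothetical mini-records with the three parts: chief complaint (CC),
# brief history (BH), and course and treatment (CT).
records = [
    {"CC": "Epigastralgia for 10 days",
     "BH": "This 53-year-old man had underlying hypertension and old CVA.",
     "CT": "After admission, Heparin was given immediately."},
    {"CC": "Tarry stool twice since last night",
     "BH": "The patient suffered from right upper abdominal pain after lunch.",
     "CT": "Angiography showed total occlusion of left iliac vein."},
]

def part_stats(records, part):
    # Mean and (population) standard deviation of word counts for one part.
    counts = [word_count(r[part]) for r in records]
    return statistics.mean(counts), statistics.pstdev(counts)

for part in ("CC", "BH", "CT"):
    mu, sigma = part_stats(records, part)
    print(f"{part}: mean={mu:.2f} stdev={sigma:.2f}")
```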
      <p>The 113,625 medical records are categorized into 14 departments based on patients’ visits. The statistics are
shown in Table 2. The Departments of Internal Medicine and Surgery have the largest and the second largest amounts of
data, while the Departments of Dentistry and Dermatology have the smallest. Table 3 shows the length
distributions of these 14 departments. For the chief complaint, the Department of Urology has the smallest mean, and
the Department of Dermatology has the largest mean. For the brief history, the Department of Ophthalmology has the
smallest mean and standard deviation, and the Department of Psychiatry has the largest mean. Overall, the Department of
Dentistry has the smallest mean, and the Department of Psychiatry has the largest mean as well as standard deviation.</p>
      <p>Figure 1. An example of a brief history:</p>
      <p>This 53-year-old man had underlying hypertension, and old CVA. He suffered from
gallbladder stone with cholecystitis about one month ago. He was treated medically in
hospital and then was discharged with a stable condition.</p>
      <p>The patient suffered from right upper abdominal pain after lunch with nausea and
vomiting suddenly on Jan 4th 2006. There was no aggravating or relieving factor noted.</p>
      <p>The abdominal pain was radiated to the back. He visited our ER immediately. …
PAST HISTORY
1. HTN(+), DM(-); Old CVA 3 years ago, Low back pain suspected spondylopathy Acute
…</p>
      <p>Figure 2. An example of a course and treatment:</p>
      <p>After admission, &lt;a&gt; Heparin &lt;/a&gt; was given immediately. Venous duplex showed
left common iliac vein partial stenosis. Pelvic-lower extremity revealed bilateral mid.
femoral vein occlusion. &lt;i&gt; Angiography &lt;/i&gt; showed total occlusion of left iliac vein,
femoral vein and popliteal vein. IVC filter was implanted. Transcatheter intravenous
urokinase therapy was started on 1/11 for 24 hours infusion. Follow up &lt;i&gt; angiography
&lt;/i&gt; showed partial recanalization of left iliac vein. Stenting was done from distal IVC
through left common iliac vein to external iliac vein. &lt;s&gt; Ballooming &lt;/s&gt; was also
performed. …</p>
      <p>From the linguistic point of view, we also investigate the vocabulary size and entropy of the medical language,
both overall for the dataset and individually for each department. Table 4 summarizes the statistics. Shannon [Sha51]
estimated the word entropy of English as 11.82 bits per word, but there has been some debate about this estimate, with
Grignetti [Gri64] estimating it to be 9.8 bits per word. In the NTUH medical dataset, the entropy is 11.15
bits per word, a little smaller than Shannon's estimate and larger than Grignetti's. Departments related to
definite parts of the body, e.g., Dentistry, Ear, Nose &amp; Throat, Ophthalmology, and Orthopedics, have lower entropy.</p>
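      <p>The word entropy of a department's medical language can be estimated from unigram frequencies. A minimal sketch (the paper does not specify its exact estimator, so a plain maximum-likelihood unigram estimate is assumed here):</p>

```python
import math
from collections import Counter

def word_entropy(text):
    # Empirical unigram word entropy in bits per word:
    # H = -sum_i p_i * log2(p_i), with p_i the relative frequency of word i.
    words = text.split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Four equiprobable words give exactly 2 bits per word.
print(word_entropy("pain fever cough nausea"))  # 2.0
```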
      <p>Comparatively, departments related to generic parts of the body have larger entropy. In particular, the Department of
Ophthalmology has the lowest entropy, while the Department of Internal Medicine has the largest. Medical
records are frequently below par in grammaticality. Spelling errors are very common in this dataset. Some common
erroneous words, with their correct forms enclosed in parentheses, are listed below for reference: histropy (history),
ag (ago/age), withour (without), denid (denied), and recieved (received). Some erroneous forms are ambiguous,
e.g., “ag” can be interpreted as “ago” or “age” depending on its context. Besides grammatical
problems, shorthand notations and abbreviations occur very often. For example, “opd” is an abbreviation of
“outpatient department” and “yrs” is a shorthand notation of “years-old”. Furthermore, physicians tend to mix
English and Chinese in the NTUH dataset. In the Department of Psychiatry, the chief complaints of psychiatric
patients are more descriptive, and it is hard to write the descriptions down completely in English. Physicians
express the patients’ descriptions bilingually, e.g., “Chronic insomnia for 10+ years 10 FM2 …”.
Physicians also tend to name hospitals in Chinese.</p>
      <p>The retrieval scenario is specified as follows. Given a chief complaint and/or a brief history, physicians plan to
retrieve similar cases from the historical medical records and refer to their possible courses and treatments.
Chief complaints and/or brief histories in the historical medical records can be regarded as queries. Words may be
stemmed and stop words may be removed before indexing. A spelling checker is introduced to deal with the
grammaticality issues. Besides words, medical terms are also recognized as indices. Different IR models can be
explored on different parts of medical records. In the empirical study, the Lemur Toolkit (http://www.lemurproject.org/)
is adopted, and five retrieval models, including tf-idf, Okapi, KL-divergence, cosine, and Indri, are experimented with.</p>
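      <p>The Lemur Toolkit implements these models internally. As an illustration of the best-performing one, a minimal Okapi BM25 scorer might look like the following sketch; the parameter values k1 and b are common defaults, not necessarily those used by Lemur, and the toy documents are hypothetical.</p>

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    # Okapi BM25 over a list of tokenized documents.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Hypothetical chief complaints used as documents.
docs = [
    "epigastralgia for 10 days".split(),
    "tarry stool twice since last night".split(),
    "abdominal pain and tarry stool".split(),
]
print(bm25_scores("tarry stool".split(), docs))
```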
      <p>The medical terms, such as examinations, medicines, and surgeries, are extracted from the courses and treatments of the
retrieved medical records. Medical term recognition [Aba11] is required, and both ontology-based and pattern-based
approaches are adopted. The ontology-based approach adopts resources from the Unified Medical Language
System (UMLS) maintained by the National Library of Medicine. The UMLS covers a wide range of terms in the
medical domain and the relations between these terms. Among its resources, the Metathesaurus organizes
medical terms into groups of concepts. Moreover, each concept is assigned at least one Semantic Type. Semantic
Types provide a categorization of concepts at a more general level and are therefore well suited to be incorporated.
The pattern-based approach adopts patterns such as “SURGERY was performed on DATE” to extract medical
terms. The idea comes from the special written styles of medical records: a number of patterns repeat
frequently in them. The following are some instances of the pattern “SURGERY was performed on
DATE”: paracentesis was performed on 2010-01-08, repositioning was performed on 2008/04/03, incision and
drainage was performed on 2010-01-15, and tracheostomy was performed on 2010/1/11.</p>
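      <p>A sketch of the pattern-based extractor over the examples above; the regular expression is illustrative and is not the authors' exact pattern set.</p>

```python
import re

# Instances of "SURGERY was performed on DATE": capture a surgery name
# (letters and spaces) followed by a date in YYYY-MM-DD or YYYY/M/D form.
PATTERN = re.compile(
    r"([A-Za-z][A-Za-z ]*?) was performed on (\d{4}[-/]\d{1,2}[-/]\d{1,2})")

text = ("paracentesis was performed on 2010-01-08. "
        "incision and drainage was performed on 2010-01-15. "
        "tracheostomy was performed on 2010/1/11.")

for surgery, date in PATTERN.findall(text):
    print(surgery, "|", date)
```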
      <p>We follow our previous work [Che12] to extract frequent patterns from the medical record dataset and apply them to
recognize medical terms. The overall procedure is summarized as follows.</p>
      <p>(a) Medical Entity Classification: Recognize medical named entities, including surgeries, diseases, drugs, etc., by
the ontology-based approach, transform them into the corresponding medical classes, and derive a new corpus.
(b) Frequent Pattern Extraction: Employ n-gram models on the new corpus to extract a set of frequent patterns.
(c) Linguistic Pattern Extraction: For each pattern, randomly sample sentences containing the pattern, parse these
sentences, and keep the pattern if there is at least one parsing sub-tree for it.
(d) Pattern Coverage Finding: Check coverage relations among higher-order and lower-order patterns,
and remove the lower-order patterns that are covered.</p>
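      <p>Steps (a) and (b) above can be sketched as follows: entities are replaced by class labels, and frequent n-grams over the class-substituted corpus become candidate patterns. The tiny entity lexicon is a toy stand-in for UMLS.</p>

```python
from collections import Counter

# Toy stand-in for UMLS: maps entity strings to their medical classes.
LEXICON = {"paracentesis": "SURGERY", "tracheostomy": "SURGERY",
           "2010-01-08": "DATE", "2010/1/11": "DATE"}

def substitute(tokens):
    # Step (a): replace recognized entities by class labels.
    return [LEXICON.get(t, t) for t in tokens]

def frequent_ngrams(sentences, n=5, min_count=2):
    # Step (b): count n-grams over the class-substituted corpus and
    # keep those occurring at least min_count times.
    counts = Counter()
    for sent in sentences:
        toks = substitute(sent.split())
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return [g for g, c in counts.items() if c >= min_count]

sentences = ["paracentesis was performed on 2010-01-08",
             "tracheostomy was performed on 2010/1/11"]
# Both sentences reduce to the same class-level 5-gram:
# SURGERY was performed on DATE
print(frequent_ngrams(sentences))
```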
      <p>To evaluate the performance of the retrieval and extraction models, 10-fold cross-validation is adopted. We
conduct a two-phase evaluation. In the first phase, the input query is a chief complaint and the output is the top-n
retrieved medical records. We aim to evaluate the quality of the returned n medical records. Because no ground
truth or relevance judgments are available, surrogate relevance judgments are used. Recall that each medical
record belongs to a department. Let the input chief complaint belong to department d, and let the departments of the
top-n retrieved medical records be d1, d2, …, dn. We postulate that medical record i is relevant to the input
chief complaint if di is equal to d. In this way, we can compute precision@k, mean average
precision (MAP), and nDCG as in traditional IR. In addition, we can regard the returned n medical records as a
cluster and compute the department distribution of the cluster. The retrieval is regarded as correct if the dominant
department of the cluster is the same as the department of the input query (i.e., of the input chief complaint). In this
way, we can compute the confusion matrix between actual and proposed departments and observe the effects on
retrieval performance.</p>
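      <p>The surrogate relevance judgment and the resulting ranking metrics can be sketched as follows; the department labels and the ranked list are hypothetical.</p>

```python
def precision_at_k(ranked_depts, query_dept, k):
    # A retrieved record is judged relevant when its department
    # equals the department of the input chief complaint.
    return sum(1 for d in ranked_depts[:k] if d == query_dept) / k

def average_precision(ranked_depts, query_dept):
    # Mean of precision values at each rank where a relevant record appears.
    hits, total = 0, 0.0
    for i, d in enumerate(ranked_depts, start=1):
        if d == query_dept:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

# Hypothetical top-5 departments for a query from Internal Medicine ("IM").
ranked = ["IM", "Surgery", "IM", "IM", "Oncology"]
print(precision_at_k(ranked, "IM", 5))  # 0.6
print(average_precision(ranked, "IM"))  # (1/1 + 2/3 + 3/4) / 3
```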
      <p>In the second phase, we conduct a much finer evaluation. The input is a chief complaint and a brief history, and the
output is the top-1 course and treatment selected from the historical NTUH medical records. Recall that examinations,
medicines, and surgeries are the three key types of medical entities specified in a course and treatment. We would like to
know whether the retrieved medical record adopts a similar course and treatment to the input query. Thus the
evaluation unit is the three types of entities. We extract examinations, medicines, and surgeries from the courses
and treatments of the input query and the retrieved medical record, respectively, by medical term recognition.
They are named GE, GM, and GS for the ground truth (i.e., the course and treatment of the input query), and PE, PM,
and PS for the proposed treatment (i.e., the course and treatment of the returned medical record), respectively.
The Jaccard coefficient between the ground truth and the proposed treatment is a metric indicating whether the returned
medical records are relevant and interesting to physicians. For each query, it is defined as the number of entities
common to the ground truth and the proposed answer divided by the number of entities in their union.
The evaluation is done for each medical entity type. That is, the Jaccard coefficient for
examination is |GE∩PE|/|GE∪PE|, that for medicine is |GM∩PM|/|GM∪PM|, and that for
surgery is |GS∩PS|/|GS∪PS|. Note that the denominator will be zero if neither the ground truth nor
the proposed answer contains any medical entity of the designated type. In this case, we set the Jaccard
coefficient to 1. The average of the Jaccard coefficients over all input queries is taken as a metric to
evaluate the performance of the retrieval model at the treatment level.</p>
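      <p>The Jaccard coefficient with the empty-set convention described above can be implemented directly; the entity sets below are hypothetical.</p>

```python
def jaccard(ground, proposed):
    # Jaccard coefficient |G∩P| / |G∪P|, with the paper's convention:
    # if both entity sets are empty, the coefficient is defined to be 1.
    ground, proposed = set(ground), set(proposed)
    union = ground | proposed
    if not union:
        return 1.0
    return len(ground & proposed) / len(union)

GE = {"angiography", "venous duplex"}  # ground-truth examinations
PE = {"angiography"}                   # proposed examinations
print(jaccard(GE, PE))        # 0.5
print(jaccard(set(), set()))  # 1.0
```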
      <p>Table 5 shows MAP and nDCG of the five retrieval models under each strategy when the top n
medical records are retrieved and compared. For strategies S5 and S6, we extract gender (male/female), age (0-15,
16-45, 46-60, 61+), and other information from brief histories in addition to chief complaints.</p>
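      <p>The gender and age-bucket features for S5 and S6 can be sketched with a simple regular expression; the phrasing matched here is illustrative, not an exhaustive pattern set.</p>

```python
import bisect
import re

# Match phrases such as "53-year-old man" in a brief history.
AGE_PAT = re.compile(r"(\d{1,3})-year-old (man|woman|boy|girl)")

# Age buckets used by strategies S5/S6: 0-15, 16-45, 46-60, 61+.
BUCKETS = ["0-15", "16-45", "46-60", "61+"]
UPPER_BOUNDS = [15, 45, 60]

def extract_features(brief_history):
    m = AGE_PAT.search(brief_history)
    if not m:
        return None
    age = int(m.group(1))
    gender = "male" if m.group(2) in ("man", "boy") else "female"
    bucket = BUCKETS[bisect.bisect_left(UPPER_BOUNDS, age)]
    return gender, bucket

print(extract_features("This 53-year-old man had underlying hypertension."))
# ('male', '46-60')
```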
      <p>S1: using chief complaints
S2: S1 with stop word removal
S3: S1 with Porter stemming
S4: S1 with both stop word removal and Porter stemming
S5: using chief complaints and the first two sentences of brief histories
S6: S5 with Porter stemming</p>
      <p>Overall, the performance tendency is okapi &gt; tf-idf &gt; cos &gt; kl &gt; indri no matter which strategy is used. Removing stop
words tends to decrease the performance. Porter stemming is useful when only chief complaints are employed.
Introducing brief histories decreases the performance. The Okapi retrieval model with strategy S3 performs best
when the top 5 medical records are retrieved. In fact, Okapi+S3 is not significantly better than Okapi+S1, but both are
significantly better than Okapi with the other strategies (p &lt; 0.0001) on MAP and nDCG. When S3 is adopted,
Okapi is significantly better than the other models.</p>
      <p>We further evaluate the retrieval models with precision@k, shown in Table 6. The five retrieval models at the
setting k=1 are significantly better than at k=3 and k=5. Most of the precision@k values are larger than 0.7 at k=1,
meaning that the first medical record retrieved is often relevant. Okapi with strategy S3 is still the best under
precision@k. Moreover, we examine the effect of the parameter n in medical record retrieval. Only the best
two retrieval models in the above experiments, i.e., tf-idf and Okapi with strategy S3, are shown in Figure 3. We
find that MAP decreases as n becomes larger in both models, meaning that noise is introduced when more medical
records are reported. The Okapi+S3 model is better than the tf-idf+S3 model in all settings.</p>
      <p>This paper studies medical record retrieval and extraction with different retrieval models under different
strategies at the department and course-and-treatment levels. Both coarse-grained and fine-grained relevance
evaluations with various metrics are conducted. The medical records written in medical language of smaller entropy tend
to have better retrieval performance. The departments related to generic parts of the body, such as the Departments of
Internal Medicine and Surgery, may confuse the retrieval, in particular for the Departments of Oncology and Neurology.
The Okapi model achieves the best performance at both the department and the treatment levels (in particular, for
medicine prediction and surgery prediction). Constructing an evaluation dataset for medical record retrieval and
extraction is challenging because the assessors must be domain experts, which is costly. In this paper, we postulate
that the medical records that belong to the same department as the input query are relevant. Such an evaluation may
underestimate performance, because a cross-department result is not necessarily wrong in real cases. For example,
the treatment of tumors may be related to more than one department. A real user study is necessary for more advanced
evaluation. Besides, medical records may be written in more than one language. Cross-language medical record
retrieval will be explored in the future.</p>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
      <p>Research of this paper was partially supported by National Science Council (Taiwan) under the contract NSC
101-2221-E-002-195-MY3. We are very thankful to National Taiwan University Hospital for providing the NTUH
medical record dataset. The authors also thank the anonymous reviewers for their helpful comments.</p>
      <p>[Aba11] A. B. Abacha and P. Zweigenbaum. Medical entity recognition: a comparison of semantic and statistical
methods. Proceedings of the 2011 Workshop on Biomedical Natural Language Processing, 2011, 56-64.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Che12]
          <string-name><given-names>H.-B.</given-names> <surname>Chen</surname></string-name>
          ,
          <string-name><given-names>H.-H.</given-names> <surname>Huang</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Tjiu</surname></string-name>
          ,
          <string-name><given-names>C.-T.</given-names> <surname>Tan</surname></string-name>
          and
          <string-name><given-names>H.-H.</given-names> <surname>Chen</surname></string-name>
          .
          <article-title>A statistical medical summary translation system</article-title>
          .
          <source>Proceedings of 2012 ACM SIGHIT International Health Informatics Symposium</source>
          ,
          <year>January 2012</year>
          ,
          <fpage>101</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Dem11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abhyankar</surname>
          </string-name>
          , et al.
          <article-title>A Knowledge-Based Approach to Medical Records Retrieval</article-title>
          .
          <source>Proceedings of TREC</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Din11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dinh</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tamine</surname>
          </string-name>
          . IRIT at TREC 2011:
          <article-title>Evaluation of Query Expansion Techniques for Medical Record Retrieval</article-title>
          .
          <source>Proceedings of TREC</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Got12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Goth</surname>
          </string-name>
          .
          <article-title>Analyzing medical data</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>55</volume>
          (
          <issue>6</issue>
          ):
          <fpage>13</fpage>
          -
          <lpage>15</lpage>
          , June,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Gri64]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Grignetti</surname>
          </string-name>
          .
          <article-title>A note on the entropy of words in printed English</article-title>
          .
          <source>Information and Control</source>
          ,
          <volume>7</volume>
          :
          <fpage>304</fpage>
          -
          <lpage>306</lpage>
          ,
          <year>1964</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Hei01]
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Heinze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Morsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Holbrook</surname>
          </string-name>
          .
          <article-title>Mining free-text medical records</article-title>
          .
          <source>Proceedings of AMIA Annual Symposium</source>
          ,
          <year>2001</year>
          ,
          <fpage>254</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Her09]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hersh</surname>
          </string-name>
          .
          <article-title>Information retrieval: A health and biomedical perspective</article-title>
          , 3rd ed. Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>[Hua12] H.-H. Huang</surname>
            ,
            <given-names>C.-C.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          , and H.
          <string-name>
            <surname>-H. Chen</surname>
          </string-name>
          .
          <article-title>Outpatient department recommendation based on medical summaries</article-title>
          .
          <source>Proceedings of the Eighth Asia Information Retrieval Societies Conference</source>
          , LNCS
          <volume>7675</volume>
          ,
          <fpage>518</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Jen06]
          <string-name><given-names>L. J.</given-names> <surname>Jensen</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Saric</surname></string-name>
          and
          <string-name><given-names>P.</given-names> <surname>Bork</surname></string-name>
          .
          <article-title>Literature mining for the biologist: from information retrieval to biological discovery</article-title>
          .
          <source>Nature Reviews Genetics</source>
          ,
          <volume>7</volume>
          :
          <fpage>119</fpage>
          -
          <lpage>129</lpage>
          ,
          <year>February 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Koo11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lawley</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bruza</surname>
          </string-name>
          .
          <article-title>AEHRC &amp; QUT at TREC 2011 Medical Track: A Concept-Based Information Retrieval Approach</article-title>
          .
          <source>Proceedings of TREC</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Ram11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ramos</surname>
          </string-name>
          .
          <article-title>Acute myocardial infarction patient data to assess healthcare utilization and treatments</article-title>
          .
          <source>ProQuest, UMI Dissertation Publishing</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Sha51]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          .
          <article-title>Prediction and entropy of printed English</article-title>
          .
          <source>Bell System Technical Journal</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <fpage>50</fpage>
          -
          <lpage>64</lpage>
          ,
          <year>1951</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Voo12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Hersh</surname>
          </string-name>
          .
          <article-title>Overview of the TREC 2012 Medical Records Track</article-title>
          .
          <source>Proceedings of TREC</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Voo11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Tong</surname>
          </string-name>
          .
          <article-title>Overview of the TREC 2011 Medical Records Track</article-title>
          .
          <source>Proceedings of TREC</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>