=Paper=
{{Paper
|id=Vol-2142/paper8
|storemode=property
|title=Identification of serious illness conversations in unstructured clinical notes using deep neural networks
|pdfUrl=https://ceur-ws.org/Vol-2142/paper8.pdf
|volume=Vol-2142
|authors=Isabel Chien,Alvin Shi,Alex Chan,Charlotta Lindvall
|dblpUrl=https://dblp.org/rec/conf/ijcai/ChienSCL18
}}
==Identification of serious illness conversations in unstructured clinical notes using deep neural networks==
Identification of Serious Illness Conversations in Unstructured Clinical Notes using Deep Neural Networks

Isabel Chien (1), Alvin Shi (1), Alex Chan (2), and Charlotta Lindvall (3,4)

(1) Massachusetts Institute of Technology, Cambridge, USA (chieni@mit.edu, alvinshi@mit.edu)
(2) Harvard T.H. Chan School of Public Health, Boston, USA (alexchan@mail.harvard.edu)
(3) Dana-Farber Cancer Institute, Boston, USA
(4) Brigham and Women's Hospital, Boston, USA (clindvall@mail.harvard.edu)

Abstract. Advance care planning, which includes clarifying and documenting goals of care and preferences for future care, is essential for achieving end-of-life care that is consistent with the preferences of dying patients and their families. Physicians document their communication about these preferences as unstructured free text in clinical notes; as a result, routine assessment of this quality indicator is time-consuming and costly. In this study, we trained and validated a deep neural network to detect documentation of advance care planning conversations in clinical notes from electronic health records. We assessed its performance against rigorous manual chart review and rule-based regular expressions. For detecting documentation of patient care preferences at the note level, the algorithm performed well, with an F1-score of 92.0 (95% CI, 89.1-95.1), sensitivity of 93.5% (95% CI, 90.0%-98.0%), positive predictive value of 90.5% (95% CI, 86.4%-95.1%), and specificity of 91.0% (95% CI, 86.4%-95.3%), and it consistently outperformed regular expressions. Deep learning methods offer an efficient and scalable way to improve the visibility of documented serious illness conversations within electronic health record data, helping to improve the quality of care.

Keywords: deep learning, end-of-life care, palliative care, natural language processing, clinical notes, electronic health records

1 Introduction and Related Work

To ensure that patients receive care that is consistent with their goals, clinicians must communicate with seriously ill patients about their treatment preferences. More than 80% of Americans say they would prefer to die at home, if possible. Despite this, 60% of Americans die in acute care hospitals and 20% die in an Intensive Care Unit (ICU)[1]. Advance care planning, which includes clarifying and documenting goals of care and preferences for future care, is essential for achieving end-of-life care that is consistent with the preferences of seriously ill patients and their families. Inadequate communication is associated with more aggressive care near the time of death, decreased use of hospice, and increased anxiety and depression in surviving family members[2-5]. Several studies have demonstrated the potential of advance care planning to improve end-of-life outcomes (e.g., reducing unintended ICU admissions and increasing hospice enrollment). In the absence of explicit goals of care decisions, clinicians may provide clinical care[6] that does not provide a meaningful benefit to the patient[7] and, in the worst case, interferes with the treatment of other patients[6]. For these reasons, it is recommended that care preferences be discussed and documented in the EHR within the first 48 hours of an ICU admission[8, 9].

In recent years a consensus has emerged that such conversations are an essential component of practice and must be monitored to improve care quality.
However, the difficulty of retrieving documentation about these conversations from the electronic health record has limited rigorous research on the prevalence and quality of clinical communication. For example, the National Quality Forum (NQF) recommends that goals of care be discussed and documented in the EHR within the first 48 hours of an ICU admission, especially for frail and seriously ill patients. This was one of only two Centers for Medicare and Medicaid Services recommended palliative care quality measures for the Medicare Hospital Inpatient Quality Reporting program[10]. Yet, despite widespread support, routine assessment of this and similar quality measures has proven nearly impossible because the information is embedded as non-discrete free text within clinical notes. Manual chart review is time-consuming and expensive to scale[11-13]. Consequently, many end-of-life quality metrics are simply not assessed, and their impact on distal and important patient outcomes has been insufficiently evaluated.

The emergence of omnipresent EHRs and powerful computers presents novel opportunities to apply advanced computational methods such as natural language processing (NLP)[14] to assess end-of-life quality metrics, including documentation of advance care planning. NLP enables machines to process or understand natural language in order to perform tasks like extracting communication quality embedded as non-discrete free text within clinical notes[15].

Two main approaches to NLP information extraction exist. Rule-based extraction uses a pre-designed set of rules[14], applying curated rules specified by experts to produce algorithms that detect specific words or phrases. This approach works well for small, well-defined extraction targets, such as searching for all the brand names of a generic medication (e.g., if X is present, then Y=1). However, rule-based approaches fail when the desired information appears in a large variety of contexts within the free text[16].

Recent advances in machine learning coupled with increasingly powerful computers have created an opportunity to apply advanced computational methods, such as deep learning, to assess the content of free-text documentation within clinical notes. Such approaches possess the potential to broaden the scope of research on serious illness communication and, when implemented in real time, to change clinical practice.

In contrast to rule-based methods, deep learning does not depend upon a predefined set of rules. Instead, these algorithms learn patterns from a labeled set of free-text notes and apply them to future datasets[16]. A deep learning-based approach works well for tasks for which the set of extraction rules is very large, unknown, or both. In deep learning, algorithms can learn feature representations that aid in interpreting varied language.

In this study, we used deep learning[17] to train models to detect documentation of serious illness conversations, and we assessed the performance of these deep learning models against manual chart review and rule-based regular expressions.

2 Data

2.1 Data Source

We derived our sample from the publicly available ICU database, the Medical Information Mart for Intensive Care (MIMIC-III), developed by the Massachusetts Institute of Technology (MIT) Lab for Computational Physiology and Beth Israel Deaconess Medical Center (BIDMC)[18]. It is a repository of de-identified administrative, clinical, and survival outcome data from more than 58,000 ICU admissions at BIDMC from 2001 through 2012. Between 2008 and 2012, the dataset also included clinical notes associated with each ICU admission. The Institutional Review Boards of BIDMC and MIT have approved the use of the MIMIC-III database by any investigator who fulfills data-user requirements. The study was deemed exempt by the Partners Institutional Review Board.
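To make the data source concrete, below is a minimal sketch, not the authors' code, of how physician notes charted within 48 hours of ICU admission could be pulled from the standard MIMIC-III v1.4 CSV files with pandas. The table and column names are the published MIMIC-III schema; the exact note categories and cohort filters used in the study may differ.

<syntaxhighlight lang="python">
# Minimal sketch (not the authors' code): physician notes charted within
# 48 hours of ICU admission, assuming the standard MIMIC-III v1.4 CSVs.
import pandas as pd

notes = pd.read_csv("NOTEEVENTS.csv", parse_dates=["CHARTTIME"],
                    usecols=["SUBJECT_ID", "HADM_ID", "CHARTTIME", "CATEGORY", "TEXT"])
stays = pd.read_csv("ICUSTAYS.csv", parse_dates=["INTIME"],
                    usecols=["SUBJECT_ID", "HADM_ID", "INTIME", "FIRST_CAREUNIT"])

# Keep physician notes only (the v1.4 category string carries trailing
# whitespace), and only notes with a usable chart time.
phys = notes[notes["CATEGORY"].str.strip() == "Physician"].dropna(subset=["CHARTTIME"])

# Attach each note to its ICU stay and keep notes from the first 48 hours.
merged = phys.merge(stays, on=["SUBJECT_ID", "HADM_ID"], how="inner")
first_48h = merged[(merged["CHARTTIME"] >= merged["INTIME"]) &
                   (merged["CHARTTIME"] <= merged["INTIME"] + pd.Timedelta(hours=48))]
print(f"{len(first_48h)} physician notes within 48 hours of ICU admission")
</syntaxhighlight>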
2.2 Cohort

The study population included adult patients (age ≥ 18) who were admitted to the medical, surgical, coronary care, or cardiac surgery ICU. The training and validation set included physician notes from patients who died during the hospital admission, to ensure that we would have sufficient examples of documentation of care preferences. We excluded patients who did not have physician notes within the first 48 hours, because these patients either died shortly after admission or transferred out of the ICU.

2.3 Clinical domains

Our main outcome was to identify documentation of care preferences within 48 hours of an ICU admission in seriously ill patients. We aimed to detect the binary absence or presence of any clinical text that fit specified documentation domains: patient care preferences (goals of care conversations or code status limitations), goals of care conversations, code status limitations, family communication (which included communication or attempted communication with family that did not result in documented care preferences), and full code status. Domains were chosen by board-certified, experienced palliative care clinicians through a lengthy and iterative process. They determined categories that are both relevant to widespread existing palliative care quality measures and interesting for future research questions. The specifications of each domain are outlined in Table 1.

Table 1. Clinical domain specifications.
- Patient care preferences: Fulfills the criteria for goals of care conversations and/or code status limitations.
- Goals of care conversations: Explicitly shown preferences about the patient's goals, values, or priorities for treatment and outcomes. Does NOT include presumed full code status or preferences obtained from other sources.
- Code status limitations: Explicitly shown preference of the patient restricting invasive care. Includes preferences carried over from a previous admission.
- Communication with family: Explicit conversations held during the ICU stay with the patient or family members about the patient's goals, values, or priorities for treatment and outcomes.
- Full code status: Explicitly or implicitly shown preference for the full set of invasive care, including intubation and resuscitation. Includes presumed full code status or preferences obtained from other sources.

2.4 Annotation

We developed a set of abstraction guidelines to ensure reliable abstraction between annotators. Each annotator identified clinical text that fit the specified communication domains and labeled the portions of text identified for a domain, with no restrictions on the length of a single annotation. A gold standard dataset, considered to contain true positives and true negatives, was developed through manual annotation by a panel of four clinicians. Annotation was done using PyCCI, a clinical text annotation software developed by our team. Each note was annotated by at least two clinicians, and annotations were then validated by a third clinician.
Similar to previously published chart abstraction studies performed for this measure, the abstraction team had real-time access to a US board-certified hospice and palliative medicine attending physician-expert reviewer, met weekly, and used a log to document common questions and answers to facilitate consistency[11, 19].

The clinician coders manually annotated an average of 239 notes each (SD, 196), for a total of 641 notes. Each note contained an average of 1397 tokens (IQR, 1004-1710). The inter-rater reliability among the four clinician annotators was kappa > 0.65 at the note level for each domain. The performance of the individual clinician coders varied; for example, they identified documentation of care preferences with a sensitivity ranging from 77-92% (in comparison to the final gold standard).

3 Methods

3.1 Pre-processing

Annotated notes were pre-processed for both the rule-based regular expression and neural network methods. First, texts were cleaned to remove any extraneous spaces, lines, or characters. Each cleaned note was then tokenized, that is, split into identifiable elements, in this case words and punctuation. We used the Python module spaCy in order to tokenize intelligently, based on the structure of the English language[20]. Labels were associated with individual tokens, and datasets were split out by domain, as each method was run separately.

3.2 Regular expression

Our baseline model is a simple regular expression based on pre-curated rules for each domain. Appendix A shows the rules used for each domain. These rules are keywords that the regular expression program identifies as belonging to the corresponding domain, taking into account variations in punctuation and case. To create the regular expression library, we identified tokens that were sensitive and specific for each prediction task. We calculated sensitivity as the proportion of a token's total number of occurrences that were labeled for each domain. We calculated specificity as the proportion of a token's total number of occurrences that appeared in unlabeled notes for each domain. A board-certified clinician used these data points (sensitivity, specificity, frequency with which each token appeared in the labeled text, and frequency in texts outside of the domain) and their clinical knowledge to generate a list of terms that could likely be generalized.

Regular expressions identify patterns of characters exactly as they are specified in a set of rules. If text in the note matches a keyword in the regular expression library for the domain, it is labeled as positive for that concept. This method acts as a baseline to compare our algorithm against. We used a regular expression program, ClinicalRegex, also developed by our lab[30]. ClinicalRegex is easily accessible and intuitive to navigate, which makes it an efficient choice for groups that are not able to employ computer scientists. We chose to compare our deep learning methods against an easily understandable and accessible method to illustrate the benefits of more complex methods.
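The sketch below illustrates both steps: spaCy tokenization (Section 3.1) and keyword matching in the spirit of the regular expression baseline (Section 3.2). It is not the ClinicalRegex implementation; the keyword list is a small subset of the code status limitations library in Appendix A, and the spaCy model name is an assumption.

<syntaxhighlight lang="python">
# Minimal sketch of cleaning + spaCy tokenization and a keyword baseline.
# Keywords are an illustrative subset of Appendix A, not the full library.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

CODE_STATUS_KEYWORDS = ["dnr", "dni", "dnr/dni", "do not resuscitate",
                        "do-not-resuscitate", "comfort measures", "cmo"]

def keyword_pattern(keywords):
    """One case-insensitive pattern per domain, tolerating variable
    whitespace inside multi-word phrases."""
    parts = [r"\s+".join(re.escape(w) for w in k.split()) for k in keywords]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

PATTERN = keyword_pattern(CODE_STATUS_KEYWORDS)

def tokenize(text):
    """Collapse extraneous whitespace, then tokenize with spaCy."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    return [tok.text for tok in nlp(cleaned)]

def note_level_match(text):
    """Rule-based note-level prediction: positive if any keyword appears."""
    return PATTERN.search(text) is not None

note = "CODE:  DNR/DNI, confirmed with healthcare   manager."
print(tokenize(note))
print(note_level_match(note))  # True
</syntaxhighlight>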
3.3 Artificial neural network

Deep learning involves training a neural network to learn a data representation and fulfill a specified task. We trained algorithms to identify clinical text documentation of serious illness communication. During the training process, the neural network learns to identify and categorize tokens (individual words and symbols) as belonging to each of the pre-specified domains and maximizes probability across predicted token labels[21].

The specific neural network used, NeuroNER, was developed by Dernoncourt et al. for the purpose of named-entity recognition[22]. NeuroNER has been evaluated for use in the de-identification of patient notes[21]. It allows each token to be labeled with only a single label. However, tokens in our study were often associated with multiple labels. For example, a sentence could indicate both that communication with family occurred and that goals of care were discussed. In order to allow for multi-class labeling, a separate, independent model was trained per domain.

For each domain, the data set was split into randomized training and validation sets, with 70% (449 notes) of the set in training and 30% (192 notes) in validation. With the parameters derived from this training process, the model is run on the validation data set to examine its performance on a data set it was not specifically tuned to fit. Performance on the validation set also determines when training converges, indicating that the model is optimally trained. Training converges when there has been no improvement in validation set performance for ten epochs. The neural network ultimately determines domain labels for each token. From the predicted token-level results, a note-level classification is determined by the presence or absence of labeled tokens, by domain, in each note.

We used TensorFlow version 1.4.1 and trained our models on an NVIDIA Titan X Pascal GPU. The hyperparameters selected for our use were:
- character embedding dimension: 25
- character-based token embedding LSTM dimension: 25
- token embedding dimension: 100
- label prediction LSTM dimension: 100
- dropout probability: 0.5

For our experiments, we compared our gold standard labels, derived from manual annotation by clinicians as described in Section 2.4, to the predicted output to evaluate the performance of the neural network and the regular expression method.

4 Results

4.1 Evaluation metrics

Algorithm performance was determined at two levels, token-level and note-level, referring to the binary absence or presence of a label at each level. Token-level results are more specific and allow accurate identification of relevant text within clinical notes. Note-level results allow determination of whether documentation of communication occurred. At both of these levels, we calculated the following metrics: sensitivity, specificity, positive predictive value, accuracy, and F1-score. The F1-score is the harmonic mean of positive predictive value and sensitivity; it summarizes the algorithm's success in identifying true positives while penalizing both false positives and false negatives.

The 95% confidence intervals for all metrics were determined via bootstrapping[23]; each trained network model was validated for 1,000 trials in addition to the reported performance point. During each trial, a validation set of 192 notes was created by random sampling, with replacement, from the original validation set of 192 unique notes. This creates an approximate distribution of performance for the model. Following the basic bootstrap technique, the 2.5th and 97.5th percentiles of the distribution of each metric are taken as the 95% confidence interval[24].
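A minimal sketch of this bootstrap procedure, assuming note-level binary labels and predictions stored as NumPy arrays (the data below are simulated, not the study's):

<syntaxhighlight lang="python">
# Minimal sketch of the bootstrap described above: resample the 192
# validation notes with replacement 1,000 times, recompute the metric,
# and take the 2.5th/97.5th percentiles as the 95% confidence interval.
import numpy as np

rng = np.random.default_rng(0)

def f1_score(y_true, y_pred):
    """Harmonic mean of positive predictive value and sensitivity."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_trials=1000):
    n = len(y_true)
    scores = []
    for _ in range(n_trials):
        idx = rng.integers(0, n, size=n)  # sample notes with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(scores, [2.5, 97.5])

# Toy example: 192 simulated note-level labels with ~90% agreement.
y_true = rng.integers(0, 2, size=192)
y_pred = np.where(rng.random(192) < 0.9, y_true, 1 - y_true)
print(bootstrap_ci(y_true, y_pred, f1_score))
</syntaxhighlight>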
4.2 Performance

Table 2 summarizes the performance of the regular expression method and Table 3 summarizes the performance of the neural networks in identifying documentation of serious illness communication at the note level, for each clinical domain, on the validation set. Figure 1 compares the F1-scores for each domain. For identification of documentation of patient care preferences, the algorithm achieved an F1-score of 92.0 (95% CI, 89.1-95.1), with 93.5% (95% CI, 90.0%-98.0%) sensitivity, 90.5% (95% CI, 86.4%-95.1%) positive predictive value, and 91.0% (95% CI, 86.4%-95.3%) specificity. For identification of family communication without documentation of preferences, the algorithm achieved an F1-score of 91.0 (95% CI, 87.0-94.0), with 90.7% (95% CI, 86.0%-95.9%) sensitivity, 90.7% (95% CI, 86.5%-94.8%) positive predictive value, and 92.5% (95% CI, 89.2%-97.8%) specificity. Token-level performance is displayed in Appendix B.

Table 2. Performance (%) of the regular expression method on the validation data set.

Domain | F1-score | Accuracy | Sensitivity | Positive Predictive Value | Specificity
Patient care preferences | 76.0 | 78.6 | 70.7 | 82.3 | 86.0
Goals of care conversations | 37.2 | 57.8 | 26.1 | 64.9 | 87.0
Code status limitations | 94.3 | 96.4 | 98.3 | 90.6 | 95.5
Communication with family | 43.6 | 67.7 | 27.9 | 100.0 | 100.0
Full code status | 90.9 | 88.5 | 84.6 | 98.2 | 96.8

Table 3. Performance (%) of the neural networks on the validation data set. Values in parentheses are 95% confidence intervals.

Domain | F1-score | Accuracy | Sensitivity | Positive Predictive Value | Specificity
Patient care preferences | 92.0 (89.1-95.1) | 92.2 (89.6-95.1) | 93.5 (90.0-98.0) | 90.5 (86.4-95.1) | 91.0 (86.4-95.3)
Goals of care conversations | 85.7 (80.4-90.3) | 89.1 (85.6-92.4) | 85.1 (78.4-91.5) | 86.3 (80.0-93.0) | 91.5 (87.7-95.7)
Code status limitations | 95.9 (93.0-98.7) | 97.4 (95.8-99.2) | 98.3 (96.9-100.0) | 93.5 (89.2-97.7) | 97.0 (95.0-98.9)
Communication with family | 90.7 (87.4-93.9) | 91.7 (89.1-94.4) | 90.7 (86.0-95.9) | 90.7 (86.5-94.8) | 92.5 (89.1-95.9)
Full code status | 98.5 (97.5-99.4) | 97.9 (96.6-99.2) | 100.0 (100.0-100.0) | 97.0 (95.1-98.9) | 93.5 (89.2-97.7)

Fig. 1. Comparison between the F1-scores of the regular expression method and the neural networks by domain.

At the note level, we achieved high accuracy for all domains, and in the validation set the neural network outperforms the regular expression method in every domain on F1-score, significantly so in identifying patient care preferences, goals of care conversations, and communication with family. These domains contain more complex and diverse language, which the neural network successfully identifies. A static library is not able to capture the diversity of these domains, necessitating the use of machine learning.

4.3 Error analysis

A review of documentation that the neural networks identified as serious illness conversations but that was not labeled as such in the gold standard (false positives) showed that the algorithm identified documentation that clinician coders had missed. Though our gold standard was rigorously reviewed and validated, there still remains room for human error. Comparing the text identified by the neural network and regular expression methods, we found that, as expected, the neural network was able to identify complex and unique language that the regular expression method was not. Doctors employ diverse and non-standardized language in clinical notes; we require more flexible and extensible methods in order to efficiently process this information. Static libraries cannot capture the full complexity of language without sacrificing sensitivity or specificity: they must be curated such that library terms are not too broad, and they cannot utilize context. All note-level identification can be traced to the detection of specific words, with examples of text for each method provided in Appendix C.
4.4 Effect of training set size

To determine how smaller training sets affect the performance of the trained algorithms, we trained multiple networks with varying numbers of notes. We plotted training dataset size against algorithm performance for 8 sample sizes (Figure 2). Performance appeared to plateau at around 200 notes (around 250,000 tokens), which suggests that annotation efforts can be efficiently leveraged to generalize the models to varied health systems.

Fig. 2. Neural network performance on the validation set for detection of note-level documentation of patient care preferences, by number of notes used for training.
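The shape of this experiment can be sketched with a lightweight stand-in model. The paper retrains NeuroNER at each training set size; the sketch below substitutes a bag-of-words logistic regression on synthetic notes so the loop runs end to end, and only the experimental design (train on growing subsets, evaluate F1 on a fixed validation set) mirrors the study.

<syntaxhighlight lang="python">
# Minimal learning-curve sketch. NeuroNER is replaced by a bag-of-words
# logistic regression and the notes are synthetic; only the design of the
# Section 4.4 experiment is reproduced, not its results.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
POS = "family meeting held goals of care discussed dnr dni comfort measures".split()
NEG = "lungs clear heart regular rate rhythm continue current home medications".split()

labels = rng.integers(0, 2, size=641)                  # 641 notes, as in Section 2.4
notes = [" ".join(rng.choice(POS if y else NEG, size=30)) for y in labels]
X_train, y_train = notes[:449], labels[:449]           # 70/30 split, as in Section 3.3
X_val, y_val = notes[449:], labels[449:]

for n in (25, 50, 100, 200, 300, 449):                 # growing training subsets
    vec = CountVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train[:n]), y_train[:n])
    pred = clf.predict(vec.transform(X_val))
    print(n, round(f1_score(y_val, pred), 3))
</syntaxhighlight>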
5 Discussion and future work

We describe a novel use of deep learning algorithms to rapidly and accurately identify documentation of serious illness conversations within clinical notes. When applied to identifying documentation of patient care preferences, our algorithm demonstrated high sensitivity (93.5%), positive predictive value (90.5%), and specificity (91.0%), with an F1-score of 92.0. In fact, we found that deep learning outperformed individual clinician coders, both in identifying the documentation and in doing so many thousands of times faster.

Existing work has shown that machine learning can extract structured entities like medical problems, tests, and treatments from clinical notes[25, 26], and unstructured image-based information in radiology, pathology, and ophthalmology[27-29]. Our study extends this line of work and demonstrates that deep learning can also perform accurate automated text-based information classification.

Until now, extracting goals of care documentation nested within free-text clinical notes has relied on labor-intensive and imperfect manual coding[11]. Using the capabilities of deep learning as demonstrated in this paper would allow for rapid audit and feedback regarding documentation at the system and individual practitioner level. This would open significant opportunities for quality improvement that are currently not being met. Deep learning models could also improve patient care in real time by broadening what is available at the point of care in the EHR. For example, clinicians could view displays of all documented goals of care conversations, or be prompted to complete documentation that is not yet available.

Important limitations must be noted. Deep learning algorithms only detect what is documented. It is not fully understood to what extent documentation reflects the actual content of a patient-clinician conversation surrounding serious illness care goals. However, documentation is the best proxy we have to understand and to track these conversations. This is also a single-institution study, which may limit its generalizability. Future work will involve investigating how extensible the models are to clinical notes from different health systems. Variations in EHR software and in the structure of clinical notes across institutions make it essential to further train and validate our methods using data from multiple healthcare systems. This should be imminently possible, as our learning curve suggested that the neural network needed to train on as few as 200 clinician-coded notes to perform well. Future research should also focus on optimizing deep neural networks to further improve performance, and on determining the feasibility of operationalizing this algorithm across institutions.

6 Conclusion

This is, to our knowledge, the first report of employing deep learning to identify serious illness conversations. The potential of this technology to improve the visibility of documented goals of care conversations within the EHR and to support quality improvement has far-reaching implications. We hope such methods will become an important tool for evaluating and improving the quality of serious illness care from a population health perspective.

Acknowledgements

We are particularly grateful to Tristan Naumann, Franck Dernoncourt, Elena Sergeeva, Edward Moseley, and Alistair Johnson for helpful guidance and advice during the development of this research. Additionally, we would like to thank Peter Szolovits for providing computing resources, as well as Saad Salman, Sarah Kaminar Bourland, Haruki Matsumoto, and Dickson Lui for annotating clinical notes. This research was facilitated by preliminary work done as part of course HST.953 in the Harvard-MIT Division of Health Sciences and Technology (HST) at the Massachusetts Institute of Technology (MIT), Boston, MA.

References

1. Cook D, Rocker G. Dying with dignity in the intensive care unit. N Engl J Med. 2014;370:2506-2514.
2. Wright AA, Zhang B, Ray A, et al. Associations between end-of-life discussions, patient mental health, medical care near death, and caregiver bereavement adjustment. JAMA. 2008;300(14):1665-1673.
3. Nicholas LH, Langa KM, Iwashyna TJ, Weir DR. Regional variation in the association between advance directives and end-of-life Medicare expenditures. JAMA. 2011;306(13):1447-1453.
4. Teno JM, Gruneir A, Schwartz Z, Nanda A, Wetle T. Association between advance directives and quality of end-of-life care: a national study. Journal of the American Geriatrics Society. 2007;55(2):189-194.
5. Detering KM, Hancock AD, Reade MC, Silvester W. The impact of advance care planning on end of life care in elderly patients: randomised controlled trial. BMJ. 2010;340:c1345.
6. Huynh TN, Kleerup EC, Raj PP, Wenger NS. The opportunity cost of futile treatment in the intensive care unit. Critical Care Medicine. 2014;42(9):1977-1982. doi:10.1097/CCM.0000000000000402.
7. Huynh TN, Kleerup EC, Wiley JF, Savitsky TD, Guse D, Garber BJ, Wenger NS. The frequency and cost of treatment perceived to be futile in critical care. JAMA Intern Med.
8. NQF #1626: Patients admitted to ICU who have care preferences documented. National Quality Forum.
9. Khandelwal N, Kross E, Engelberg R, Coe N, Long A, Curtis J. Estimating the effect of palliative care interventions and advance care planning on ICU utilization: a systematic review. Crit Care Med. 2015 May. doi:10.1097/CCM.0000000000000852.
10. Rising J, Valuck T. Building Additional Serious Illness Measures Into Medicare Programs. The Pew Charitable Trusts. 2017.
11. Walling AM, Tisnado D, Asch SM, et al. The quality of supportive cancer care in the Veterans Affairs health system and targets for improvement. JAMA Internal Medicine. 2013;173(22):2071-2079.
12. Dy SM, Lorenz KA, O'Neill SM, et al. Cancer Quality-ASSIST supportive oncology quality indicator set: feasibility, reliability, and validity testing. Cancer. 2010;116(13):3267-3275.
13. Aldridge MD, Meier DE. It is possible: quality measurement during serious illness. JAMA Intern Med. 2013;173(22):2080-2081.
14. Melton GB, Hripcsak G. Automated detection of adverse events using natural language processing of discharge summaries. Journal of the American Medical Informatics Association. 2005;12(4):448-457.
15. Honnibal M, Johnson M. An improved non-monotonic transition system for dependency parsing. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373-1378, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
16. Carrell DS, Schoen RE, Leffler DA, Morris M, Rose S, Baer A, Crockett SD, Gourevitch RA, Dean KM, Mehrotra A. Challenges in adapting existing clinical natural language processing systems to multiple, diverse healthcare settings. Journal of the American Medical Informatics Association (JAMIA).
17. Schmidhuber J. Deep learning in neural networks: an overview. Neural Networks. 2015;61:85-117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637.
18. Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
19. Walling AM, Asch SM, Lorenz KA, Roth CP, Barry T, Kahn KL, Wenger NS. The quality of care provided to hospitalized patients at the end of life. Archives of Internal Medicine. 2010;170(12):1057-1063.
20. Honnibal M, Johnson M. An improved non-monotonic transition system for dependency parsing. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; 2015; Lisbon, Portugal.
21. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association. 2017;24(3):596-606.
22. Dernoncourt F, Lee JY, Szolovits P. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. Conference on Empirical Methods in Natural Language Processing (EMNLP). 2017.
23. Efron B. Better bootstrap confidence intervals. Journal of the American Statistical Association. 1987;82(397):171-185.
24. Davison AC, Hinkley DV. Bootstrap Methods and Their Application. Cambridge University Press; 1997.
25. D'Avolio LW, Nguyen TM, Goryachev S, Fiore LD. Automated concept-level information extraction to reduce the need for custom software and rules development. Journal of the American Medical Informatics Association. 2011;18(5):607-613.
26. Xu H, Jiang M, Oetjens M, Bowton EA, Ramirez AH, Jeff JM, Basford MA, Pulley JM, Cowan JD, Wang X, Ritchie MD, Masys DR, Roden DM, Crawford DC, Denny JC. Facilitating pharmacogenetic studies using electronic health records and natural-language processing: a case study of warfarin. Journal of the American Medical Informatics Association. 2011;18(4):387-391.
27. Bejnordi BE, Veta M, van Diest PJ, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Journal of the American Medical Association. 2017;318(22):2199-2210.
28. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Journal of the American Medical Association. 2016;316(22):2402-2410.
29. Ting DSW, Cheung CY, Lim G, Tan GSW, Quang ND, Gan A, Hamzah H, Garcia-Franco R, San Yeo IY, Lee SY. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. Journal of the American Medical Association.
30. Lindvall C, Lilley EJ, Zupanc SN, Chien I, Forsyth AW, Walling A, Cooper Z, Tulsky JA. Natural language processing to assess palliative care processes in cancer patients receiving palliative surgery. In preparation.

A Regular expression library

- Patient care preferences: goc, goals of care, goals for care, goals of treatment, goals for treatment, treatment goals, family meeting, family discussion, family discussions, patient goals, dnr, dni, dnrdni, dnr/dni, DNI/R, do not resuscitate, do-not-resuscitate, do not intubate, do-not-intubate, chest compressions, no defibrillation, no endotracheal intubation, no mechanical intubation, shocks, cmo, comfort measures
- Goals of care conversations: goc, goals of care, goals for care, goals of treatment, goals for treatment, treatment goals, family meeting, family discussion, family discussions, patient goals
- Code status limitations: dnr, dni, dnrdni, dnr/dni, DNI/R, do not resuscitate, do-not-resuscitate, do not intubate, do-not-intubate, chest compressions, no defibrillation, no endotracheal intubation, no mechanical intubation, shocks, cmo, comfort measures
- Communication with family: Explicit conversations held during the ICU stay with the patient or family members about the patient's goals, values, or priorities for treatment and outcomes.
- Full code status: full code

B Token-level performance

Table 4. Performance (%) of the neural network on the validation data set at the token level.

Domain | F1-score | Accuracy | Sensitivity | Positive Predictive Value | Specificity
Patient care preferences | 76.0 | 99.6 | 75.8 | 75.2 | 99.8
Goals of care conversations | 70.4 | 99.6 | 70.0 | 69.9 | 99.8
Code status limitations | 76.3 | 99.8 | 72.7 | 80.5 | 99.9
Communication with family | 68.2 | 99.7 | 62.0 | 76.4 | 99.9
Full code status | 90.9 | 99.8 | 88.3 | 93.6 | 99.8
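To connect Table 4 to the note-level results in Tables 2 and 3: per Section 3.3, a note is classified positive for a domain when at least one of its tokens carries that domain's predicted label. A minimal sketch of that roll-up follows; the prediction triples and domain name are hypothetical.

<syntaxhighlight lang="python">
# Minimal sketch of the token-to-note roll-up described in Section 3.3:
# a note is positive for every domain that labels at least one of its tokens.
# The (note_id, token, domain) triples here are hypothetical examples.
from collections import defaultdict

def note_level(token_predictions):
    """Map each note_id to the set of domains predicted on its tokens.
    Notes absent from the result are negative for all domains."""
    notes = defaultdict(set)
    for note_id, _token, domain in token_predictions:
        if domain is not None:
            notes[note_id].add(domain)
    return dict(notes)

preds = [("note_1", "CODE:", None),
         ("note_1", "DNR/DNI", "code_status_limitations"),
         ("note_2", "lungs", None),
         ("note_2", "clear", None)]
print(note_level(preds))  # {'note_1': {'code_status_limitations'}}
</syntaxhighlight>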
C Examples of identified text

Below are examples of clinical text in which the neural network and regular expression methods correctly identified serious illness documentation in the validation dataset; in some passages, the neural network identified documentation that the regular expression method missed. Typographical errors are from the original notes.

Goals of care conversations:
- "Hypercarbic resp failure: family meeting was held with son/HCP and in keeping with patients goals of care, there was no plan for intubation. Family was brought in and we explained the graveness of her ABG and her worsened mental status which had failed to improve with BiPAP. Family was comfortable with removing Bipap and providing comfort care including morphine prn."
- "family open to cmo but pt wants full code but also doesn't want treatment or to be disturbed."

Code status limitations:
- "CODE: DNR/DNI, confirmed with healthcare manager who will be discussing with official HCP"

Communication with family:
- "Dr. [**First Name (STitle) **] from neurosurgery held family meeting and explained grave prognosis to the family."
- "lengthy discussion with the son who is health care proxy he wishes to pursue comfort measures due to severe and unrevascularizable cad daughter is not in agreement at this time but is not the proxy due to underlying psychiatric illness"

Full code status:
- "Code: FULL; Discussed with daughter and HCP who says that patient is in a Hospice program with a "bridge" to DNR/DNI/CMO, but despite multiple conversations, the patient insists on being full code"
- "CODE: Presumed full"