Automated Narrative Extraction from Administrative Records*,**

Karine Megerdoomian, The MITRE Corporation, McLean, VA, USA, karine@mitre.org
Karl Branting, The MITRE Corporation, McLean, VA, USA, lbranting@mitre.org
Charles E. Horowitz, The MITRE Corporation, McLean, VA, USA, chorowitz@mitre.org
Amy B. Marsh, The MITRE Corporation, McLean, VA, USA, amarsh@mitre.org
Nick Modly, The MITRE Corporation, McLean, VA, USA, nmodly@mitre.org
Stacy J. Petersen, The MITRE Corporation, McLean, VA, USA, spetersen@mitre.org
Eric O. Scott, The MITRE Corporation, McLean, VA, USA, escott@mitre.org
Sujit B. Wariyar, The MITRE Corporation, McLean, VA, USA, swariyar@mitre.org

* In Proceedings of the Workshop on Artificial Intelligence and the Administrative State (AIAS 2019), June 17, 2019, Montreal, QC, Canada. Copyright © 2019 for this paper by The MITRE Corporation. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org
** Approved for Public Release; Distribution Unlimited 19-1482. Throughout this document, all names of people, places, facilities and dates are replaced with fictitious ones to anonymize the information.

AIAS’19, June, 2019, Montreal, Quebec Canada. K. Megerdoomian et al.

ABSTRACT

The U.S. Probation and Pretrial Services Office staff produce billions of pages of information on defendants’ and offenders’ profile and conduct. While it is critical for probation officers and district chiefs to have up-to-date knowledge on their clients to better assist and reduce risk of recidivism, the data are often stored in narrative texts in multiple large documents. As a result, these records remain mostly out of reach without the use of painstaking manual review. This paper describes an analytic prototype developed to automatically acquire structured information from natural language text in probation office documents through the application of PDF content extraction, text mining, and language analytics. Since serious mental illness is very prevalent in the U.S. corrections system, the first phase of the project focused on extracting information and constructing timelines from narrative text regarding the defendants’ mental health conditions, substance use and treatment history.

Automated narrative extraction and the construction of an event timeline for defendants’ mental and emotional health history have allowed the probation office to have a better understanding of their client population and to perform analyses that were previously unavailable to the organization. This technical approach can be applied across organizations, legal institutions, clinical administrations, and government agencies that maintain large amounts of information in the form of free text narratives.

1 Introduction

The U.S. Probation and Pretrial Services Office (PPSO) staff supervise more than 300,000 people a year and collect and produce billions of pages of information on defendants’ and offenders’ profile and conduct, as well as on the strategies and actions of officers and their outcomes. While it is critical for probation officers to have up-to-date knowledge on their clients to reduce the risk of recidivism, the data are often stored in narrative texts in multiple large documents, making it very challenging and time-consuming to collect all relevant case information manually. This renders 70 terabytes of mostly unstructured data on more than a million defendants, and strategies used by thousands of officers over decades, mostly unusable by PPSO [1]. As a result, policy makers, program evaluators, and probation and pretrial services staff have been denied valuable data with which to do their jobs.

A significant number of offenders supervised by the U.S. probation services have a current mental health condition, most of them with co-occurring substance use disorders.
Defendants who suffer from mental disorders often require more intensive monitoring and specialized treatment [2]. We therefore focus on addressing important PPSO business questions to better understand the nature of the mental conditions in the officers’ caseload and gain knowledge of the defendants’ diagnosis and treatment history. The information was automatically obtained from the free text sections of Presentence Investigation Reports (PSIR), which represent investigations into the history of the person convicted of a crime before sentencing to determine if there are extenuating circumstances. To automatically extract and analyze the free text information in the PSIRs, we applied language analytics technology to detect the events of interest (substance use, diagnosis, treatment sessions, prescriptions) in the defendant’s life and visualized them as a timeline of activities that could be reviewed by the probation and parole officers.

The system leverages Apache cTAKES (clinical Text Analysis and Knowledge Extraction System), an open-source Natural Language Processing (NLP) system developed specifically to extract and analyze clinical information from unstructured text [3]. cTAKES identifies clinical terms such as drugs, diseases and disorders, symptoms, and medical and treatment procedures. It also performs deep textual analysis and can identify, for instance, whether a sentence is negated, or whether the person being discussed is the patient or a family member. The prototype system combines the results of cTAKES with rich linguistic analysis from other open source systems such as concept ontologies and the Stanford CoreNLP parser and entity recognizer [4]. These syntactic and semantic analyses are then enhanced to adapt to the use case, by identifying significant terms for the events of interest for the mental health domain, applying linguistic analysis to improve argument and negation detection, and implementing recent advances in NLP to improve precision (e.g., vector space semantics, algorithms for building a narrative timeline).

All extracted information on a defendant’s narrative is stored in a graph database and displayed on a dynamic map, allowing filtering of results based on judicial district, defendants’ demographic information (age, education, citizenship), criminal category, mental conditions or medications prescribed.

As large amounts of information in business, government and administration are maintained in the form of narratives (clinical records, legal and financial summaries, progress reports, human resources assessments, etc.), the approach described in this paper for acquiring structured information from narrative text can be reapplied across organizations and government agencies.

2 Background

Past clinical information extraction systems have tended to rely on shallow NLP techniques (pattern-matching, simple parses, linear pattern interpretation rules). More recently, however, several projects have adopted knowledge-based approaches adapted for the clinical domain.

While the advantages of machine learning methods for information extraction cannot be denied, they also present a number of limitations in applications for narrative extraction from clinical data. To begin with, machine learning algorithms require large amounts of training data which are pre-tagged for the relevant features and parameters. Preparing the pre-annotated data sets can be time-consuming and expensive. In addition, such probabilistic approaches might miss rare phenomena that need to be identified, since they do not occur often enough in the training data to be picked up by the learning algorithms. Another challenge for using machine learning methods in the clinical domain is that users often expect a high level of consistency in the results and precise information on how the computational decisions were made. In such instances, a rule-based approach might be more transparent and easier to understand and modify.

The approach described in this paper leverages in-depth linguistic and semantic analysis to detect the domain information in narrative text, more in line with recent knowledge-based approaches [5] [6]. Machine learning approaches often require a large amount of pre-annotated data on which to train the algorithms. Since the PSIR data had not previously been tagged for the events of interest and mental conditions, a purely machine learning approach was not readily available. Hence, the prototype applies a hybrid method. It leverages rich linguistic and semantic information through the application of open-source Natural Language Processing systems, adapted for the existing use case by applying a combination of rule-based linguistic analysis, vector space semantics, and machine learning techniques to enhance the results. These were used to improve negation detection and argument identification (i.e., entities the events refer to), and to develop temporal reasoning algorithms. Ontologies (lexicons) of mental health and medication terms, vetted by a subject matter expert, were used for concept identification. The rest of this section provides a detailed description of the technical steps in building the analytic prototype.

3 Technical Approach

The technical approach is a hybrid one, leveraging open source NLP applications often developed by training machine learning algorithms, and refining the syntactic and semantic analyses with a combination of knowledge-based and probabilistic approaches.

3.1 Analytic Pipeline
The presentence reports undergo several steps in order to extract the defendant’s mental health and substance use narratives. These are shown in Figure 1 and are described in detail in the rest of this section. The specific steps involved are:

1. Content Extraction: parsing the different sections of the PDF documents and extracting the structured profile and criminal information as well as all free text content. This component also “cleans” the data by normalizing the textual content to maximize processing.
2. Language Analytics: The extracted text for each PSIR is run through the Natural Language Processing components, providing a full linguistic parse, a list of entities and events of interest, and semantic relationships.
3. Knowledge Discovery: This step is the heart of the textual analytics where the system identifies all concepts, events, and their relationships for the domain of interest.
   • Identifies the events of interest associated with the defendant (arrests, diagnoses, treatments, prescriptions, drug use, suffering from a mental condition);
   • Determines whether the information is obtained from medical records or if it is reported by the defendant, by a medical professional, or by a third party;
   • Provides full event description including date, location, persons involved, treatment provider, nature of treatment and medication prescribed;
   • Computes the temporal relationships between the various events to build a narrative timeline for a defendant.
4. Neo4j Database: Neo4j is a graph database management system and is available as open source software. All extracted information from the Knowledge Discovery component, as well as the client demographic metadata and the structured information on arrest history and federal offenses extracted from the presentence reports, is loaded into the database.
5. User Interface (UI): This component interacts with the Neo4j database and displays results on a Google Earth map. The UI allows the user to run queries, to review the details on particular defendants, and to see aggregate results on the data set.

Figure 1: Analytic pipeline for narrative extraction and timeline development

3.2 Content Extraction

The Content Extraction component parses the PDF presentence reports, identifies all subsections and extracts the textual content. To analyze the mental health and substance use information of defendants, the text content of the Mental and Emotional Health (MEH) and Substance Abuse (SA) sections in presentence reports is automatically extracted. In addition, this step identifies and extracts all federal charges from the cover sheet of the PSIR, criminal history information from the Juvenile Adjudications and Adult Criminal Convictions sections of the report, Arrest Dates and associated charges from the Criminal History information, and Criminal History Score and Category from the Criminal History Computation section.

The prototype’s Content Extraction component successfully extracted information from 92% of the original PDF documents, providing us with a data set of 11,243 extracted narrative text documents to analyze. Given that some defendants have more than one presentence report associated with them, the successfully extracted content corresponds to 10,973 defendants. The free text content extracted from the MEH and SA sections amounts to 22,486 text items. These can range from a few sentences to several paragraphs depending on the report.

3.3 Language Analytics

The Language Analytics component leverages existing Natural Language Processing software to perform various linguistic analyses on a piece of text. NLP is a subset of Artificial Intelligence (AI) and is fast becoming an essential technology in modern-day organizations to gain significant insights from unstructured content, such as email communications, social media, videos, customer reviews, customer support requests, and administrative records in business and government. Natural Language Processing tools and techniques help to automatically process, analyze, and understand large amounts of data, providing structure and meaning to information that originally was in unstructured form.

In this step of the analysis, the texts extracted from the Mental and Emotional Health and Substance Abuse sections of the PSIRs are run through several NLP software tools. The software packages currently in use are Apache cTAKES (clinical Text Analysis and Knowledge Extraction System), Stanford Named Entity Recognizer, and FONS (Framework for Operation NLP Services), a software pipeline that leverages open source tools and was built by a research team at MITRE to detect events of interest to national security.

cTAKES output forms the primary basis for further analytics. It was chosen primarily because of its entity recognition capabilities in the clinical domain, which aligned with the desire to obtain data about PPSO clients’ mental and emotional health and substance use. Entities identified by cTAKES include medical conditions, drugs/medications, medical procedures, and medical symptoms. The entities identified by cTAKES out-of-the-box were supplemented with additional entities frequently encountered by analysts in PSIRs. We worked closely with a PPSO subject matter expert to review the list of conditions and medications that cTAKES recognized, and identify the ones that were of interest in the mental and emotional health and substance use domain. The subject matter expert also identified a more general superclass for each of these specific mental and emotional conditions so that further analysis could be conducted at the appropriate level of granularity. For example, conditions such as depression, chronic depression, and major depressive disorder were all mapped to the more general term depressive disorder.

cTAKES also provides domain-independent NLP capabilities of syntactic parsing, dependency parsing, and semantic role labelling: it can give the base forms of words and their parts of speech, mark up the structure of sentences in terms of phrases and syntactic relations, detect negation in the sentence, and identify the role of the entities in a sentence (e.g., agent of event). The results of all these capabilities were used to identify events of interest in a client’s mental and emotional health and substance use history. However, we found it useful to supplement the cTAKES output with other natural language processing systems to achieve the most accurate analysis. The Stanford Named Entity Recognizer was applied to identify people, organizations, locations, dates, and times, none of which are identified by cTAKES. Additionally, the FONS system, which also generates entities, syntactic parsing and dependency parsing output, was used to supplement cTAKES’ output to obtain a higher level of accuracy. In particular, FONS was applied to the PSIR text data to tag entities (people, facilities, locations, dates and times), and to categorize all events into conceptual classes by detecting event types (e.g., state, transfer, communication) and different verb meanings (e.g., prescribe can either be the verb denoting the prescription of medication by a medical professional or a communication event meaning ‘to advise’, ‘to recommend’).

3.4 Domain-Specific Entity and Event Identification

The Knowledge Discovery phase of the analytics involves processing the output from the Natural Language Processing systems to perform several steps in knowledge discovery in natural language text:

1. Identify concepts (entities and events) of interest associated with the client, including mentions of a client suffering from a mental condition, diagnoses, treatments, prescriptions and drug use.
2. Detect the event description such as the date and location when it occurred, the persons involved, the treatment provider, the nature of treatment (e.g., inpatient or outpatient, anger management, drug rehabilitation) and the medication prescribed.
3. Detect the source of the information: was the information reported by the client, was it obtained from medical records or a medical professional, or reported by a third party?

As described, cTAKES detects these entities of interest in the mental and emotional health domain. However, to identify whether a client is suffering from a mental condition, it does not suffice to simply retrieve sentences with a mental condition mention. It is also important to detect the subject of the sentence to distinguish cases where a family member is mentioned to suffer from a mental condition (e.g., “the defendant’s mother suffered from Schizophrenia”), and to exclude any negated events (e.g., “the defendant does not suffer from a severe mental disease or defect”). Fortunately, when cTAKES identifies a concept, it also identifies that sentence’s polarity (whether the entity appears in a negated context or not), and the event’s subject (whether that event or concept should be ascribed to the client described in the text, a family member of the client, or someone else). Some modifications to the cTAKES source code were made to improve the accuracy of these attribute identifications.

While the cTAKES entities can be counted to obtain statistics on the prevalence of various mental conditions among the defendant population, further processing is necessary to identify more complicated events, such as receiving a diagnosis, attending treatment, being prescribed medication, or using drugs. To identify the events of interest, a small sample of PSIRs was reviewed to identify the verbs commonly associated with these events. An iterative process was used in reviewing the event detection results and updating the predicates for the domain. The verbal predicates associated with each type of event are listed in Table 1.

Event Type      Predicate
Diagnosis       diagnose
Prescription    prescribe, treat (with)
Treatment       admit, attend, complete, discharge, enroll, enter, hospitalize, meet, participate, place, receive, see, seek, speak, treat, undergo
Usage           abuse, addict, consume, drink, experiment, ingest, inhale, relapse, smoke, snort, take, try, use

Table 1: Verbs used to identify events related to mental and emotional health and substance use

Once the predicates are identified, the semantic roles associated with each occurrence of the predicate are automatically extracted to enable the identification of the predicate’s agent, affected entity, and whether the predicate was negated. The sentence in which the predicate appeared was also examined to identify medications, drugs, mental conditions, medical procedures, and treatments associated with that event.

To detect the source of the information, all sentences with Communication events identified by the FONS software package were analyzed and the subject of the verbs extracted. For example, in “Dr. Gray stated that the defendant has never been hospitalized for emotional disorders of any kind”, the communication verb stated is detected and its subject, Dr. Gray (a medical professional), is identified as the source of the information. Similarly, in the example “the defendant’s mother also reported he was diagnosed with Bi-Polar Disorder several years ago”, the source of information is identified as the

of the source of information. The top verbs identified as Communication events are listed in Table 2.

Event Type          Predicate
Communication       state, indicate, note, explain, report, say, acknowledge, discuss, identify, confirm, deny, address, agree, communicate, question, suggest, tell, describe, claim, mention, inform, disclose
Other formulation   according to

Table 2: Terms used to identify the source of information.

This linguistically rich event-based narrative analysis methodology allows the Language Analytics component to extract information of interest including the people involved in the event, the time it occurred, and the places mentioned. A sample analyzed sentence is shown in the following example:

The defendant reported she was diagnosed at the age of 14