=Paper=
{{Paper
|id=Vol-2471/paper7
|storemode=property
|title=Automated Narrative Extraction from Administrative Records
|pdfUrl=https://ceur-ws.org/Vol-2471/paper7.pdf
|volume=Vol-2471
|authors=Karine Megerdoomian,Karl Branting,Charles Horowitz,Amy Marsh,Stacy Petersen,Eric Scott
|dblpUrl=https://dblp.org/rec/conf/icail/MegerdoomianBHM19
}}
==Automated Narrative Extraction from Administrative Records==
Automated Narrative Extraction from Administrative Records*,**
Karine Megerdoomian, The MITRE Corporation, McLean, VA, USA, karine@mitre.org
Karl Branting, The MITRE Corporation, McLean, VA, USA, lbranting@mitre.org
Charles E. Horowitz, The MITRE Corporation, McLean, VA, USA, chorowitz@mitre.org
Amy B. Marsh, The MITRE Corporation, McLean, VA, USA, amarsh@mitre.org
Nick Modly, The MITRE Corporation, McLean, VA, USA, nmodly@mitre.org
Stacy J. Petersen, The MITRE Corporation, McLean, VA, USA, spetersen@mitre.org
Eric O. Scott, The MITRE Corporation, McLean, VA, USA, escott@mitre.org
Sujit B. Wariyar, The MITRE Corporation, McLean, VA, USA, swariyar@mitre.org
ABSTRACT

The U.S. Probation and Pretrial Services Office staff produce billions of pages of information on defendants’ and offenders’ profile and conduct. While it is critical for probation officers and district chiefs to have up-to-date knowledge on their clients to better assist and reduce risk of recidivism, the data are often stored in narrative texts in multiple large documents. As a result, these records remain mostly out of reach without the use of painstaking manual review. This paper describes an analytic prototype developed to automatically acquire structured information from natural language text in probation office documents through the application of PDF content extraction, text mining, and language analytics. Since serious mental illness is very prevalent in the U.S. corrections system, the first phase of the project focused on extracting information and constructing timelines from narrative text regarding the defendants’ mental health conditions, substance use and treatment history.

Automated narrative extraction and the construction of an event timeline for defendants’ mental and emotional health history have allowed the probation office to have a better understanding of their client population and to perform analyses that were previously unavailable to the organization. This technical approach can be applied across organizations, legal institutions, clinical administrations, and government agencies that maintain large amounts of information in the form of free text narratives.

* In Proceedings of the Workshop on Artificial Intelligence and the Administrative State (AIAS 2019), June 17, 2019, Montreal, QC, Canada. Copyright © 2019 for this paper by The MITRE Corporation. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org
** Approved for Public Release; Distribution Unlimited 19-1482. Throughout this document, all names of people, places, facilities and dates are replaced with fictitious ones to anonymize the information.

1 Introduction

The U.S. Probation and Pretrial Services Office (PPSO) staff supervise more than 300,000 people a year and collect and produce billions of pages of information on defendants’ and offenders’ profile and conduct, as well as on the strategies and actions of officers and their outcomes. While it is critical for probation officers to have up-to-date knowledge on their clients to reduce the risk of recidivism, the data are often stored in narrative texts in multiple large documents, making it very challenging and time-consuming to collect all relevant case information manually. This renders 70 terabytes of mostly unstructured data on more than a million defendants, and strategies used by thousands of officers over decades, mostly unusable by PPSO [1]. As a result, policy makers, program evaluators, and probation and pretrial services staff have been denied valuable data with which to do their jobs.

A significant number of offenders supervised by the U.S. probation services have a current mental health condition, most of them with co-occurring substance use disorders.
AIAS’19, June, 2019, Montreal, Quebec Canada K. Megerdoomian et al.
Defendants who suffer from mental disorders often require more intensive monitoring and specialized treatment [2]. We therefore focus on addressing important PPSO business questions to better understand the nature of the mental conditions in the officers’ caseload and gain knowledge of the defendants’ diagnosis and treatment history. The information was automatically obtained from the free text sections of Presentence Investigation Reports (PSIR), which represent investigations into the history of the person convicted of a crime before sentencing to determine if there are extenuating circumstances. To automatically extract and analyze the free text information in the PSIRs, we applied language analytics technology to detect the events of interest (substance use, diagnosis, treatment sessions, prescriptions) in the defendant’s life and visualized them as a timeline of activities that could be reviewed by the probation and parole officers.

The system leverages Apache cTAKES (clinical Text Analysis and Knowledge Extraction System), an open-source Natural Language Processing (NLP) system developed specifically to extract and analyze clinical information from unstructured text [3]. cTAKES identifies clinical terms such as drugs, diseases and disorders, symptoms, and medical and treatment procedures. It also performs deep textual analysis and can identify, for instance, if a sentence is negated or not, or if the person being discussed is the patient or a family member. The prototype system combines the results of cTAKES with rich linguistic analysis from other open source systems such as concept ontologies and the Stanford CoreNLP parser and entity recognizer [4]. These syntactic and semantic analyses are then enhanced to adapt to the use case, by identifying significant terms for the events of interest for the mental health domain, applying linguistic analysis to improve argument and negation detection, and implementing recent advances in NLP to improve precision (e.g., vector space semantics, algorithms for building a narrative timeline).

All extracted information on a defendant’s narrative is stored in a graph database and displayed on a dynamic map, allowing filtering of results based on judicial district, defendants’ demographic information (age, education, citizenship), criminal category, mental conditions or medications prescribed.

As large amounts of information in business, government and administration are maintained in the form of narratives (clinical records, legal and financial summaries, progress reports, human resources assessments, etc.), the approach described in this paper for acquiring structured information from narrative text can be reapplied across organizations and government agencies.

2 Background

Past clinical information extraction systems have tended to rely on shallow NLP techniques (pattern-matching, simple parses, linear pattern interpretation rules). More recently, however, several projects have adopted knowledge-based approaches adapted for the clinical domain.

While the advantages of machine learning methods for information extraction cannot be denied, they also present a number of limitations in applications for narrative extraction from clinical data. To begin with, machine learning algorithms require large amounts of training data which are pre-tagged for the relevant features and parameters. Preparing the pre-annotated data sets can be time-consuming and expensive. In addition, such probabilistic approaches might miss rare phenomena that need to be identified since they do not occur often enough in the training data to be picked up by the learning algorithms. Another challenge for using machine learning methods in the clinical domain is that users often expect a high level of consistency in the results and precise information on how the computational decisions were made. In such instances, a rule-based approach might be more transparent and easier to understand and modify.

The approach described in this paper leverages in-depth linguistic and semantic analysis to detect the domain information in narrative text, more in line with recent knowledge-based approaches [5] [6]. Machine learning approaches often require a large amount of pre-annotated data on which to train the algorithms. Since the PSIR data had not previously been tagged for the events of interest and mental conditions, a purely machine learning approach was not readily available. Hence, the prototype applies a hybrid method. It leverages rich linguistic and semantic information through the application of open-source Natural Language Processing systems, adapted for the existing use case by applying a combination of rule-based linguistic analysis, vector space semantics, and machine learning techniques to enhance the results. These were used to improve negation detection and argument identification (i.e., entities the events refer to), and to develop temporal reasoning algorithms. Ontologies (lexicons) of mental health and medication terms, vetted by a subject matter expert, were used for concept identification. The next section provides a detailed description of the technical steps in building the analytic prototype.

3 Technical Approach

The technical approach is a hybrid one, leveraging open source NLP applications often developed by training machine learning algorithms, and refining the syntactic and semantic analyses with a combination of knowledge-based and probabilistic approaches.
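
The knowledge-based side of such a hybrid design can be pictured with a toy example. The sketch below is only an illustration, assuming a hand-made term lexicon and a short list of negation cues scanned in a fixed window before each term. The lexicon entries, the cue list, the window size, and the function name are all hypothetical; the prototype itself relies on cTAKES and expert-vetted ontologies rather than on this logic.

```python
import re

# Hypothetical mini-lexicon; the paper's ontologies were vetted by a subject matter expert.
LEXICON = {"depression", "schizophrenia", "anxiety"}
# Hypothetical negation cues, checked in the words that precede a term.
NEGATION_CUES = {"no", "not", "never", "denies", "denied", "without"}
WINDOW = 5  # number of preceding tokens scanned for a cue (arbitrary choice)


def find_concepts(sentence):
    """Return (term, negated) pairs for lexicon terms found in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    results = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            preceding = tokens[max(0, i - WINDOW):i]
            negated = any(cue in preceding for cue in NEGATION_CUES)
            results.append((tok, negated))
    return results


affirmed = find_concepts("He was diagnosed with schizophrenia in 2010.")
negated = find_concepts("The defendant does not suffer from depression.")
```

A rule of this kind is easy to inspect and adjust, which is the transparency argument made for rule-based methods above, but it is far cruder than the syntactic analysis the paper describes.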
3.1 Analytic Pipeline

The presentence reports undergo several steps in order to extract the defendant’s mental health and substance use narratives. These are shown in Figure 1 and are described in detail in the rest of this section. The specific steps involved are:

1. Content Extraction: parsing the different sections of the PDF documents and extracting the structured profile and criminal information as well as all free text content. This component also “cleans” the data by normalizing the textual content to maximize processing.
2. Language Analytics: The extracted text for each PSIR is run through the Natural Language Processing components, providing a full linguistic parse, a list of entities and events of interest, and semantic relationships.
3. Knowledge Discovery: This step is the heart of the textual analytics where the system identifies all concepts, events, and their relationships for the domain of interest.
   • Identifies the events of interest associated with the defendant (arrests, diagnoses, treatments, prescriptions, drug use, suffering from a mental condition);
   • Determines whether the information is obtained from medical records or if it is reported by the defendant, by a medical professional, or by a third party;
   • Provides full event description including date, location, persons involved, treatment provider, nature of treatment and medication prescribed;
   • Computes the temporal relationships between the various events to build a narrative timeline for a defendant.
4. Neo4j Database: Neo4j is a graph database management system and is available as open source software. All extracted information from the Knowledge Discovery component, as well as the client demographic metadata, and structured information on arrest history and federal offenses extracted from the presentence reports are loaded into the database.
5. User Interface (UI): This component interacts with the Neo4j database and displays results on a Google Earth map. The UI allows the user to run queries, to review the details on particular defendants, and to see aggregate results on the data set.

Figure 1: Analytic pipeline for narrative extraction and timeline development

3.2 Content Extraction

The Content Extraction component parses the PDF presentence reports, identifies all subsections and extracts the textual content. To analyze the mental health and substance use information of defendants, the text content of the Mental and Emotional Health (MEH) and Substance Abuse (SA) sections in presentence reports is automatically extracted. In addition, this step identifies and extracts all federal charges from the cover sheet of the PSIR, criminal history information from the Juvenile Adjudications and Adult Criminal Convictions sections of the report, Arrest Dates and associated charges from the Criminal History information, and Criminal History Score and Category from the Criminal History Computation section.

The prototype’s Content Extraction component successfully extracted information from 92% of the original PDF documents, providing us with a data set of 11,243 extracted narrative text documents to analyze. Given that some defendants have more than one presentence report associated with them, the successfully extracted content corresponds to 10,973 defendants. The free text content extracted from the MEH and SA sections amounts to 22,486 text items. These can range from a few sentences to several paragraphs depending on the report.

3.3 Language Analytics

The Language Analytics component leverages existing Natural Language Processing software to perform various linguistic analyses on a piece of text. NLP is a subset of Artificial Intelligence (AI) and is fast becoming an essential technology in modern-day organizations to gain significant insights from unstructured content, such as email communications, social media, videos, customer reviews, customer support requests, and administrative records in business and government. Natural Language Processing tools and techniques help to automatically process, analyze, and understand large amounts of data, providing structure and meaning to information that originally was in unstructured form.

In this step of the analysis, the texts extracted from the Mental and Emotional Health and Substance Abuse sections of the PSIRs are run through several NLP software tools. The software packages currently in use are Apache cTAKES (clinical Text Analysis and Knowledge Extraction System), Stanford Named Entity Recognizer, and FONS (Framework for Operation NLP Services), a software pipeline that leverages open source tools and was built by a research team at MITRE to detect events of interest to national security.
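
Running several tools over the same text implies some way of pooling their output. The sketch below is one simple possibility, assuming a shared character-span record. The `Annotation` type, the `annotate` stand-in, the labels, and the sample sentence are hypothetical and do not reflect the actual cTAKES, Stanford Named Entity Recognizer, or FONS interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str   # e.g. "MEDICATION", "PERSON", "DATE"
    tool: str    # name of the tool that produced the annotation


def annotate(text, phrase, label, tool):
    """Stand-in for a real tool call: annotate the first occurrence of phrase."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), label, tool)


def merge(*tool_outputs):
    """Pool annotations from several tools, keeping one per (start, end, label)."""
    pooled = {}
    for output in tool_outputs:
        for ann in output:
            pooled.setdefault((ann.start, ann.end, ann.label), ann)
    return sorted(pooled.values(), key=lambda a: (a.start, a.end))


text = "Dr. Gray prescribed an antidepressant in March 2015."
clinical = [annotate(text, "antidepressant", "MEDICATION", "clinical")]
general = [annotate(text, "Dr. Gray", "PERSON", "ner"),
           annotate(text, "March 2015", "DATE", "ner")]
merged = merge(clinical, general)
```

Keying on the span and label means that two tools reporting the same mention yield a single record, while mentions found by only one tool are retained.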
cTAKES output forms the primary basis for further analytics. It was chosen primarily because of its entity recognition capabilities in the clinical domain, which aligned with the desire to obtain data about PPSO clients’ mental and emotional health and substance use. Entities identified by cTAKES include medical conditions, drugs/medications, medical procedures, and medical symptoms. The entities identified by cTAKES out-of-the-box were supplemented with additional entities frequently encountered by analysts in PSIRs. We worked closely with a PPSO subject matter expert to review the list of conditions and medications that cTAKES recognized, and identify the ones that were of interest in the mental and emotional health and substance use domain. The subject matter expert also identified a more general superclass for each of these specific mental and emotional conditions so that further analysis could be conducted at the appropriate level of granularity. For example, conditions such as depression, chronic depression, and major depressive disorder were all mapped to the more general term depressive disorder.

cTAKES also provides domain-independent NLP capabilities of syntactic parsing, dependency parsing, and semantic role labelling – it can give the base forms of words, their parts of speech, mark up the structure of sentences in terms of phrases and syntactic relations, detect negation in the sentence and identify the role of the entities in a sentence (e.g., agent of event). The results of all these capabilities were used to identify events of interest in a client’s mental and emotional health and substance use history. However, we found it useful to supplement the cTAKES output with other natural language processing systems to achieve the most accurate analysis. The Stanford Named Entity Recognizer was applied to identify people, places, organizations, dates, times, and locations, none of which are identified by cTAKES. Additionally, the FONS system, which also generates entities, syntactic parsing and dependency parsing output, was used to supplement cTAKES’ output to obtain a higher level of accuracy. In particular, FONS was applied to the PSIR text data to tag entities (people, facilities, locations, dates and times), and to categorize all events into conceptual classes by detecting event types (e.g., state, transfer, communication) and different verb meanings (e.g., prescribe can either be the verb denoting the prescription of medication by a medical professional or a communication event meaning ‘to advise’, ‘to recommend’).

3.4 Domain-Specific Entity and Event Identification

The Knowledge Discovery phase of the analytics involves processing the output from the Natural Language Processing systems to perform several steps in knowledge discovery in natural language text:

1. Identify concepts (entities and events) of interest associated with the client, including mentions of a client suffering from a mental condition, diagnoses, treatments, prescriptions and drug use.
2. Detect the event description such as the date and location when it occurred, the persons involved, the treatment provider, the nature of treatment (e.g., inpatient or outpatient, anger management, drug rehabilitation) and the medication prescribed.
3. Detect the source of the information – was the information reported by the client, was it obtained from medical records or a medical professional, or reported by a third party?

As described, cTAKES detects these entities of interest in the mental and emotional health domain. However, to identify whether a client is suffering from a mental condition, it does not suffice to simply retrieve sentences with a mental condition mention. It is also important to detect the subject of the sentence to distinguish cases where a family member is mentioned to suffer from a mental condition (e.g., “the defendant’s mother suffered from Schizophrenia”), and to exclude any negated events (e.g., “the defendant does not suffer from a severe mental disease or defect”). Fortunately, when cTAKES identifies a concept, it also identifies that sentence’s polarity (whether the entity appears in a negated context or not), and the event’s subject (whether that event or concept should be ascribed to the client described in the text, a family member of the client, or someone else). Some modifications to the cTAKES source code were made to improve the accuracy of these attribute identifications.

While the cTAKES entities can be counted to obtain statistics on the prevalence of various mental conditions among the defendant population, further processing is necessary to identify more complicated events, such as receiving a diagnosis, attending treatment, being prescribed medication, or using drugs. To identify the events of interest, a small sample of PSIRs was reviewed to identify the verbs commonly associated with these events. An iterative process was used in reviewing the event detection results and updating the predicates for the domain. The verbal predicates associated with each type of event are listed in Table 1.

Event Type | Predicate
Diagnosis | diagnose
Prescription | prescribe, treat (with)
Treatment | admit, attend, complete, discharge, enroll, enter, hospitalize, meet, participate, place, receive, see, seek, speak, treat, undergo
Usage | abuse, addict, consume, drink, experiment, ingest, inhale, relapse, smoke, snort, take, try, use
Table 1: Verbs used to identify events related to mental and emotional health and substance use

Once the predicates are identified, the semantic roles associated with each occurrence of the predicate are automatically extracted to enable the identification of the predicate’s agent, affected entity, and whether the predicate was negated. The sentence in which the predicate appeared was also examined to identify medications, drugs, mental conditions, medical procedures, and treatments associated with that event.

of the source of information. The top verbs identified as Communication events are listed in Table 2.

Event Type | Predicate
Communication | state, indicate, note, explain, report, say, acknowledge, discuss, identify, confirm, deny, address, agree, communicate, question, suggest, tell, describe, claim, mention, inform, disclose
Other formulation | according to
Table 2: Terms used to identify the source of information.
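
The verb lists in Tables 1 and 2 lend themselves to a simple lookup from verb lemma to event type. The sketch below is a minimal illustration under that assumption: the verb lemmas are copied from the two tables, while the `Predicate` record, the `followed_by_with` flag used to separate “treat (with)” from plain “treat”, and the function names are hypothetical and not taken from the prototype.

```python
from dataclasses import dataclass
from typing import Optional

# Verb lemmas copied from Table 1 and Table 2; everything else is illustrative.
EVENT_VERBS = {
    "Diagnosis": {"diagnose"},
    "Prescription": {"prescribe"},
    "Treatment": {"admit", "attend", "complete", "discharge", "enroll", "enter",
                  "hospitalize", "meet", "participate", "place", "receive", "see",
                  "seek", "speak", "treat", "undergo"},
    "Usage": {"abuse", "addict", "consume", "drink", "experiment", "ingest",
              "inhale", "relapse", "smoke", "snort", "take", "try", "use"},
    "Communication": {"state", "indicate", "note", "explain", "report", "say",
                      "acknowledge", "discuss", "identify", "confirm", "deny",
                      "address", "agree", "communicate", "question", "suggest",
                      "tell", "describe", "claim", "mention", "inform", "disclose"},
}
LEMMA_TO_EVENT = {lemma: etype
                  for etype, lemmas in EVENT_VERBS.items() for lemma in lemmas}


@dataclass
class Predicate:
    lemma: str
    agent: Optional[str] = None      # who performs the action, if known
    affected: Optional[str] = None   # who or what the action applies to
    negated: bool = False
    followed_by_with: bool = False   # hypothetical flag for "treat with ..."


def classify_predicate(lemma, followed_by_with=False):
    """Return the event type for a verb lemma, or None if the verb is not listed."""
    lemma = lemma.lower()
    # Table 1 lists "treat (with)" under Prescription and plain "treat" under Treatment.
    if lemma == "treat" and followed_by_with:
        return "Prescription"
    return LEMMA_TO_EVENT.get(lemma)


def extract_events(predicates):
    """Keep affirmed predicates whose verb maps to a known event type."""
    events = []
    for p in predicates:
        etype = classify_predicate(p.lemma, p.followed_by_with)
        if etype is None or p.negated:
            continue
        events.append((etype, p.agent, p.affected))
    return events


sample = [
    Predicate("hospitalize", affected="defendant", negated=True),
    Predicate("diagnose", affected="defendant"),
    Predicate("treat", agent="Dr. Gray", affected="defendant", followed_by_with=True),
]
events = extract_events(sample)
```

A flat lookup like this cannot resolve verbs with several senses, such as prescribe used in the sense of ‘to advise’; the paper relies on the FONS event typing for that distinction.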
To detect the source of the information, all sentences with Communication events identified by the FONS software package were analyzed and the subject of the verbs extracted. For example, in “Dr. Gray stated that the defendant has never been hospitalized for emotional disorders of any kind”, the communication verb stated is detected and its subject, Dr. Gray (a medical professional), is identified as the source of the information. Similarly, in the example “the defendant’s mother also reported he was diagnosed with Bi-Polar Disorder several years ago”, the source of information is identified as the

This linguistically rich event-based narrative analysis methodology allows the Language Analytics component to extract information of interest including the people involved in the event, the time it occurred, and the places mentioned. A sample analyzed sentence is shown in the following example:

The defendant reported she was diagnosed at the age of 14