=Paper= {{Paper |id=Vol-2471/paper7 |storemode=property |title=Automated Narrative Extraction from Administrative Records |pdfUrl=https://ceur-ws.org/Vol-2471/paper7.pdf |volume=Vol-2471 |authors=Karine Megerdoomian,Karl Branting,Charles Horowitz,Amy Marsh,Stacy Petersen,Eric Scott |dblpUrl=https://dblp.org/rec/conf/icail/MegerdoomianBHM19 }} ==Automated Narrative Extraction from Administrative Records== https://ceur-ws.org/Vol-2471/paper7.pdf
      Automated Narrative Extraction from Administrative Records*,**


     Karine Megerdoomian                                                 Karl Branting                              Charles E. Horowitz
       The MITRE Corporation                                        The MITRE Corporation                           The MITRE Corporation
          McLean, VA, USA                                              McLean, VA, USA                                 McLean, VA, USA
         karine@mitre.org                                            lbranting@mitre.org                             chorowitz@mitre.org

            Amy B. Marsh                                                   Nick Modly                                  Stacy J. Petersen
       The MITRE Corporation                                        The MITRE Corporation                           The MITRE Corporation
          McLean, VA, USA                                              McLean, VA, USA                                 McLean, VA, USA
         amarsh@mitre.org                                             nmodly@mitre.org                               spetersen@mitre.org

              Eric O. Scott                                            Sujit B. Wariyar
       The MITRE Corporation                                        The MITRE Corporation
          McLean, VA, USA                                              McLean, VA, USA
         escott@mitre.org                                            swariyar@mitre.org




*In Proceedings of the Workshop on Artificial Intelligence and the Administrative State (AIAS 2019), June 17, 2019, Montreal, QC, Canada. Copyright © 2019 for this paper by The MITRE Corporation. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org
** Approved for Public Release; Distribution Unlimited 19-1482. Throughout this document, all names of people, places, facilities and dates are replaced with fictitious ones to anonymize the information.

ABSTRACT

The U.S. Probation and Pretrial Services Office staff produce billions of pages of information on defendants’ and offenders’ profile and conduct. While it is critical for probation officers and district chiefs to have up-to-date knowledge on their clients to better assist and reduce risk of recidivism, the data are often stored in narrative texts in multiple large documents. As a result, these records remain mostly out of reach without the use of painstaking manual review. This paper describes an analytic prototype developed to automatically acquire structured information from natural language text in probation office documents through the application of PDF content extraction, text mining, and language analytics. Since serious mental illness is very prevalent in the U.S. corrections system, the first phase of the project focused on extracting information and constructing timelines from narrative text regarding the defendants’ mental health conditions, substance use and treatment history.

Automated narrative extraction and the construction of an event timeline for defendants’ mental and emotional health history have allowed the probation office to have a better understanding of their client population and to perform analyses that were previously unavailable to the organization. This technical approach can be applied across organizations, legal institutions, clinical administrations, and government agencies that maintain large amounts of information in the form of free text narratives.

1 Introduction

The U.S. Probation and Pretrial Services Office (PPSO) staff supervise more than 300,000 people a year and collect and produce billions of pages of information on defendants’ and offenders’ profile and conduct, as well as on the strategies and actions of officers and their outcomes. While it is critical for probation officers to have up-to-date knowledge on their clients to reduce the risk of recidivism, the data are often stored in narrative texts in multiple large documents, making it very challenging and time-consuming to collect all relevant case information manually. This renders 70 terabytes of mostly unstructured data on more than a million defendants, and strategies used by thousands of officers over decades, mostly unusable by PPSO [1]. As a result, policy makers, program evaluators, and probation and pretrial services staff have been denied valuable data with which to do their jobs.

A significant number of offenders supervised by the U.S. probation services have a current mental health condition, most of them with co-occurring substance use disorders.
 AIAS’19, June, 2019, Montreal, Quebec Canada                                                                  K. Megerdoomian et al.

Defendants who suffer from mental disorders often require more intensive monitoring and specialized treatment [2]. We therefore focus on addressing important PPSO business questions to better understand the nature of the mental conditions in the officers’ caseload and gain knowledge of the defendants’ diagnosis and treatment history. The information was automatically obtained from the free text sections of Presentence Investigation Reports (PSIR), which represent investigations into the history of the person convicted of a crime before sentencing to determine if there are extenuating circumstances. To automatically extract and analyze the free text information in the PSIRs, we applied language analytics technology to detect the events of interest (substance use, diagnosis, treatment sessions, prescriptions) in the defendant’s life and visualized them as a timeline of activities that could be reviewed by the probation and parole officers.

The system leverages Apache cTAKES (clinical Text Analysis and Knowledge Extraction System), an open-source Natural Language Processing (NLP) system developed specifically to extract and analyze clinical information from unstructured text [3]. cTAKES identifies clinical terms such as drugs, diseases and disorders, symptoms, and medical and treatment procedures. It also performs deep textual analysis and can identify, for instance, if a sentence is negated or not, or if the person being discussed is the patient or a family member. The prototype system combines the results of cTAKES with rich linguistic analysis from other open source systems such as concept ontologies and the Stanford CoreNLP parser and entity recognizer [4]. These syntactic and semantic analyses are then enhanced to adapt to the use case, by identifying significant terms for the events of interest for the mental health domain, applying linguistic analysis to improve argument and negation detection, and implementing recent advances in NLP to improve precision (e.g., vector space semantics, algorithms for building a narrative timeline).

All extracted information on a defendant’s narrative is stored in a graph database and displayed on a dynamic map, allowing filtering of results based on judicial district, defendants’ demographic information (age, education, citizenship), criminal category, mental conditions or medications prescribed.

As large amounts of information in business, government and administration are maintained in the form of narratives (clinical records, legal and financial summaries, progress reports, human resources assessments, etc.), the approach described in this paper for acquiring structured information from narrative text can be reapplied across organizations and government agencies.

2 Background

Past clinical information extraction systems have tended to rely on shallow NLP techniques (pattern-matching, simple parses, linear pattern interpretation rules). More recently, however, several projects have adopted knowledge-based approaches adapted for the clinical domain.

While the advantages of machine learning methods for information extraction cannot be denied, they also present a number of limitations in applications for narrative extraction from clinical data. To begin with, machine learning algorithms require large amounts of training data which are pre-tagged for the relevant features and parameters. Preparing the pre-annotated data sets can be time-consuming and expensive. In addition, such probabilistic approaches might miss rare phenomena that need to be identified since they do not occur often enough in the training data to be picked up by the learning algorithms. Another challenge for using machine learning methods in the clinical domain is that users often expect a high level of consistency in the results and precise information on how the computational decisions were made. In such instances, a rule-based approach might be more transparent and easier to understand and modify.

The approach described in this paper leverages in-depth linguistic and semantic analysis to detect the domain information in narrative text, more in line with recent knowledge-based approaches [5] [6]. Machine learning approaches often require a large amount of pre-annotated data on which to train the algorithms. Since the PSIR data had not previously been tagged for the events of interest and mental conditions, a purely machine learning approach was not readily available. Hence, the prototype applies a hybrid method. It leverages rich linguistic and semantic information through the application of open-source Natural Language Processing systems, adapted for the existing use case by applying a combination of rule-based linguistic analysis, vector space semantics, and machine learning techniques to enhance the results. These were used to improve negation detection and argument identification (i.e., entities the events refer to), and to develop temporal reasoning algorithms. Ontologies (lexicons) of mental health and medication terms, vetted by a subject matter expert, were used for concept identification. The rest of this section provides a detailed description of the technical steps in building the analytic prototype.

3 Technical Approach

The technical approach is a hybrid one, leveraging open source NLP applications often developed by training machine learning algorithms, and refining the syntactic and semantic analyses with a combination of knowledge-based and probabilistic approaches.

3.1 Analytic Pipeline

The presentence reports undergo several steps in order to extract the defendant’s mental health and substance use narratives. These are shown in Figure 1 and are described in detail in the rest of this section. The specific steps involved are:

1.   Content Extraction: parsing the different sections of the PDF documents and extracting the structured profile and criminal information as well as all free text content. This component also “cleans” the data by normalizing the textual content to facilitate downstream processing.
2.   Language Analytics: The extracted text for each PSIR is run through the Natural Language Processing components, providing a full linguistic parse, a list of entities and events of interest, and semantic relationships.
3.   Knowledge Discovery: This step is the heart of the textual analytics where the system identifies all concepts, events, and their relationships for the domain of interest.
     •    Identifies the events of interest associated with the defendant (arrests, diagnoses, treatments, prescriptions, drug use, suffering from a mental condition);
     •    Determines whether the information is obtained from medical records or if it is reported by the defendant, by a medical professional, or by a third party;
     •    Provides full event description including date, location, persons involved, treatment provider, nature of treatment and medication prescribed;
     •    Computes the temporal relationships between the various events to build a narrative timeline for a defendant.
4.   Neo4j Database: Neo4j is a graph database management system and is available as open source software. All extracted information from the Knowledge Discovery component, as well as the client demographic metadata, and structured information on arrest history and federal offenses extracted from the presentence reports are loaded into the database.
5.   User Interface (UI): This component interacts with the Neo4j database and displays results on a Google Earth map. The UI allows the user to run queries, to review the details on particular defendants, and to see aggregate results on the data set.

Figure 1: Analytic pipeline for narrative extraction and timeline development

3.2 Content Extraction

The Content Extraction component parses the PDF presentence reports, identifies all subsections and extracts the textual content. To analyze the mental health and substance use information of defendants, the text content of the Mental and Emotional Health (MEH) and Substance Abuse (SA) sections in presentence reports is automatically extracted. In addition, this step identifies and extracts all federal charges from the cover sheet of the PSIR, criminal history information from the Juvenile Adjudications and Adult Criminal Convictions sections of the report, Arrest Dates and associated charges from the Criminal History information, and Criminal History Score and Category from the Criminal History Computation section.

The prototype’s Content Extraction component successfully extracted information from 92% of the original PDF documents, providing us with a data set of 11,243 extracted narrative text documents to analyze. Given that some defendants have more than one presentence report associated with them, the successfully extracted content corresponds to 10,973 defendants. The free text content extracted from the MEH and SA sections amounts to 22,486 text items. These can range from a few sentences to several paragraphs depending on the report.

3.3 Language Analytics

The Language Analytics component leverages existing Natural Language Processing software to perform various linguistic analyses on a piece of text. NLP is a subset of Artificial Intelligence (AI) and is fast becoming an essential technology in modern-day organizations to gain significant insights from unstructured content, such as email communications, social media, videos, customer reviews, customer support requests, and administrative records in business and government. Natural Language Processing tools and techniques help to automatically process, analyze, and understand large amounts of data, providing structure and meaning to information that originally was in unstructured form.

In this step of the analysis, the texts extracted from the Mental and Emotional Health and Substance Abuse sections of the PSIRs are run through several NLP software tools. The software packages currently in use are Apache cTAKES (clinical Text Analysis and Knowledge Extraction System), Stanford Named Entity Recognizer, and FONS (Framework for Operation NLP Services), a software pipeline that leverages open source tools and was built by a research team at MITRE to detect events of interest to national security.
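
As a rough illustration of this step, the sketch below merges the annotations produced by several tools for one text item into a single list ordered by character offset. The `Annotation` type and the two toy annotators are hypothetical stand-ins introduced for illustration only; they are not the cTAKES, Stanford NER, or FONS interfaces, which this paper does not describe.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str   # e.g. "CONDITION" or "DATE"
    source: str  # name of the tool that produced the annotation


# Hypothetical toy annotators standing in for real NLP tools.
def toy_condition_annotator(text: str) -> List[Annotation]:
    pattern = r"\b(depression|anxiety)\b"
    return [Annotation(m.start(), m.end(), "CONDITION", "toy-clinical")
            for m in re.finditer(pattern, text, re.IGNORECASE)]


def toy_date_annotator(text: str) -> List[Annotation]:
    pattern = r"\b(19|20)\d{2}\b"
    return [Annotation(m.start(), m.end(), "DATE", "toy-ner")
            for m in re.finditer(pattern, text)]


def annotate(text: str,
             annotators: List[Callable[[str], List[Annotation]]]) -> List[Annotation]:
    """Run every annotator on the text and merge the results in text order."""
    merged = [a for annotator in annotators for a in annotator(text)]
    return sorted(merged, key=lambda a: (a.start, a.end))


text = "The defendant was treated for depression in 2015."
for ann in annotate(text, [toy_condition_annotator, toy_date_annotator]):
    print(ann.label, repr(text[ann.start:ann.end]), ann.source)
```

Keeping the producing tool on each annotation makes it possible to prefer one tool over another later, for example when two tools tag the same span differently.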

cTAKES output forms the primary basis for further analytics. It was chosen primarily because of its entity recognition capabilities in the clinical domain, which aligned with the desire to obtain data about PPSO clients’ mental and emotional health and substance use. Entities identified by cTAKES include medical conditions, drugs/medications, medical procedures, and medical symptoms. The entities identified by cTAKES out-of-the-box were supplemented with additional entities frequently encountered by analysts in PSIRs. We worked closely with a PPSO subject matter expert to review the list of conditions and medications that cTAKES recognized, and identify the ones that were of interest in the mental and emotional health and substance use domain. The subject matter expert also identified a more general superclass for each of these specific mental and emotional conditions so that further analysis could be conducted at the appropriate level of granularity. For example, conditions such as depression, chronic depression, and major depressive disorder were all mapped to the more general term depressive disorder.

cTAKES also provides domain-independent NLP capabilities of syntactic parsing, dependency parsing, and semantic role labelling: it can give the base forms of words, their parts of speech, mark up the structure of sentences in terms of phrases and syntactic relations, detect negation in the sentence and identify the role of the entities in a sentence (e.g., agent of event). The results of all these capabilities were used to identify events of interest in a client’s mental and emotional health and substance use history. However, we found it useful to supplement the cTAKES output with other natural language processing systems to achieve the most accurate analysis. The Stanford Named Entity Recognizer was applied to identify people, places, organizations, dates, times, and locations, none of which are identified by cTAKES. Additionally, the FONS system, which also generates entities, syntactic parsing and dependency parsing output, was used to supplement cTAKES’ output to obtain a higher level of accuracy. In particular, FONS was applied to the PSIR text data to tag entities (people, facilities, locations, dates and times), and to categorize all events into conceptual classes by detecting event types (e.g., state, transfer, communication) and different verb meanings (e.g., prescribe can either be the verb denoting the prescription of medication by a medical professional or a communication event meaning ‘to advise’, ‘to recommend’).

3.4 Domain-Specific Entity and Event Identification

The Knowledge Discovery phase of the analytics involves processing the output from the Natural Language Processing systems to perform several steps in knowledge discovery in natural language text:

1.   Identify concepts (entities and events) of interest associated with the client, including mentions of a client suffering from a mental condition, diagnoses, treatments, prescriptions and drug use.
2.   Detect the event description such as the date and location when it occurred, the persons involved, the treatment provider, the nature of treatment (e.g., inpatient or outpatient, anger management, drug rehabilitation) and the medication prescribed.
3.   Detect the source of the information: was it reported by the client, obtained from medical records or a medical professional, or reported by a third party?

As described, cTAKES detects these entities of interest in the mental and emotional health domain. However, to identify whether a client is suffering from a mental condition, it does not suffice to simply retrieve sentences with a mental condition mention. It is also important to detect the subject of the sentence to distinguish cases where a family member is said to suffer from a mental condition (e.g., “the defendant’s mother suffered from Schizophrenia”), and to exclude any negated events (e.g., “the defendant does not suffer from a severe mental disease or defect”). Fortunately, when cTAKES identifies a concept, it also identifies the concept’s polarity (whether the entity appears in a negated context or not), and the event’s subject (whether that event or concept should be ascribed to the client described in the text, a family member of the client, or someone else). Some modifications to the cTAKES source code were made to improve the accuracy of these attribute identifications.

While the cTAKES entities can be counted to obtain statistics on the prevalence of various mental conditions among the defendant population, further processing is necessary to identify more complicated events, such as receiving a diagnosis, attending treatment, being prescribed medication, or using drugs. To identify the events of interest, a small sample of PSIRs was reviewed to identify the verbs commonly associated with these events. An iterative process was used in reviewing the event detection results and updating the predicates for the domain. The verbal predicates associated with each type of event are listed in Table 1.

 Event Type      Predicate
 Diagnosis       diagnose
 Prescription    prescribe, treat (with)
 Treatment       admit, attend, complete, discharge, enroll, enter, hospitalize, meet,
                 participate, place, receive, see, seek, speak, treat, undergo
 Usage           abuse, addict, consume, drink, experiment, ingest, inhale, relapse,
                 smoke, snort, take, try, use

  Table 1: Verbs used to identify events related to mental and       of the source of information. The top verbs identified as
              emotional health and substance use                     Communication events are listed in Table 2.
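
Table 1 amounts to a lookup from verb lemma to event type. A minimal sketch of that lookup follows, assuming the verbs have already been reduced to their base forms by the NLP tools; the dictionary and function names are illustrative and not taken from the prototype. Because "treat" is listed under both Prescription (as "treat (with)") and Treatment, the sketch returns every candidate type and leaves disambiguation to later analysis.

```python
from typing import Dict, List, Set

# Event types and verb lemmas copied from Table 1.
EVENT_PREDICATES: Dict[str, Set[str]] = {
    "Diagnosis": {"diagnose"},
    "Prescription": {"prescribe", "treat"},  # "treat (with)" in Table 1
    "Treatment": {
        "admit", "attend", "complete", "discharge", "enroll", "enter",
        "hospitalize", "meet", "participate", "place", "receive", "see",
        "seek", "speak", "treat", "undergo",
    },
    "Usage": {
        "abuse", "addict", "consume", "drink", "experiment", "ingest",
        "inhale", "relapse", "smoke", "snort", "take", "try", "use",
    },
}


def candidate_event_types(lemma: str) -> List[str]:
    """Return every Table 1 event type whose predicate list contains the lemma."""
    lemma = lemma.lower()
    return sorted(etype for etype, verbs in EVENT_PREDICATES.items()
                  if lemma in verbs)


print(candidate_event_types("diagnose"))  # ['Diagnosis']
print(candidate_event_types("treat"))     # ['Prescription', 'Treatment']
print(candidate_event_types("walk"))      # []
```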

Once the predicates are identified, the semantic roles
                                                                      Event Type                 Predicate
associated with each occurrence of the predicate are
                                                                      Communication              state, indicate, note, explain,
automatically extracted to enable the identification of the
                                                                                                 report, say, acknowledge, discuss,
predicate’s agent, affected entity, and whether the predicate
                                                                                                 identify, confirm, deny, address,
was negated. The sentence in which the predicate appeared
                                                                                                 agree, communicate, question,
was also examined to identify medications, drugs, mental
                                                                                                 suggest, tell, describe, claim,
conditions, medical procedures, and treatments associated
                                                                                                 mention, inform, disclose
with that event.
                                                                      Other formulation          according to
                                                                        Table 2: Terms used to identify the source of information.
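
The terms in Table 2 can likewise be treated as a lexicon for labelling the source of a statement. The sketch below assumes that an upstream parser has already supplied the base form of the communication verb and the text of its subject. The `classify_source` function and its keyword heuristics for telling the defendant from a medical professional are assumptions made for illustration, not the prototype's rules, and the "according to" formulation and medical-record sources are not handled.

```python
from typing import Optional

# Communication verb lemmas copied from Table 2.
COMMUNICATION_VERBS = {
    "state", "indicate", "note", "explain", "report", "say", "acknowledge",
    "discuss", "identify", "confirm", "deny", "address", "agree",
    "communicate", "question", "suggest", "tell", "describe", "claim",
    "mention", "inform", "disclose",
}


def classify_source(verb_lemma: str, subject: str) -> Optional[str]:
    """Label who is reporting; return None if the verb is not in Table 2."""
    if verb_lemma.lower() not in COMMUNICATION_VERBS:
        return None
    s = subject.lower().strip()
    # Hypothetical heuristics; a real system would rely on entity types.
    if s in {"the defendant", "defendant"}:
        return "defendant"
    if s.startswith("dr.") or "physician" in s or "psychiatrist" in s:
        return "medical professional"
    return "third party"


print(classify_source("state", "Dr. Gray"))                 # medical professional
print(classify_source("report", "the defendant's mother"))  # third party
print(classify_source("report", "The defendant"))           # defendant
```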
To detect the source of the information, all sentences with
Communication events identified by the FONS software
                                                                     This linguistically rich event-based narrative analysis
package were analyzed and the subject of the verbs extracted.
                                                                     methodology allows the Language Analytics component to
For example, in “Dr. Gray stated that the defendant has never
                                                                     extract information of interest including the people involved in
been hospitalized for emotional disorders of any kind”, the
                                                                     the event, the time it occurred, and the places mentioned. A
communication verb stated is detected and its subject, Dr. Gray
                                                                     sample analyzed sentence is shown in the following example:
(a medical professional), is identified as the source of the
information. Similarly, in the example “the defendant’s mother           The       defendant      reported
also reported he was diagnosed with Bi-Polar Disorder several            she           was
years ago”, the source of information is identified as the               diagnosed at the age of 14