<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting Body Function from Clinical Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guy Divita</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jessica Lo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunxiao Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kathleen Coale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elizabeth Rasch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rehabilitation Medicine Department, National Institutes of Health Clinical Center</institution>
          ,
          <addr-line>Bethesda, Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>22</fpage>
      <lpage>35</lpage>
      <abstract>
        <p>This paper describes finding Body Function (BF) mentions within clinical text. Body Function is noted in clinical documents to provide information on potential pathologies within underlying body systems or structures. BF mentions are embedded in highly formatted structures where the formats include implied scoping boundaries that confound existing NLP segmentation and document decomposition techniques. We have created two extraction systems: a dictionary lookup rule-based version, and a conditional random field (CRF) approach based on training from manual annotations. Training and test data utilized the NIH Clinical Center Rehabilitation Medicine Department records. Results of these systems provide a baseline for future work to improve document decomposition techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Body Function</kwd>
        <kwd>ICF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Body functions are the physiological or psychological functions of body systems[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Body functions
are mentioned in clinical text when there is concern for or documentation of pathologies around body
function or body function assessment. Body Function information is commonly collected during
physical exams to provide information on potential pathologies within underlying body systems or
structures.
      </p>
      <p>Our motivation came from a request from the Social Security Administration to retrieve BF mentions
within their documents to support existing efforts to enhance their disability claims adjudication
process. While there is a question around the utility of body function information as it relates to
disability adjudications, we are motivated to work on this task as a mechanism to improve the
algorithms that support BF extraction, namely sectionizing, sentence chunking, and context scoping
annotators using BF mentions as the use case. BF mentions are often embedded in complex formatted
text in the form of lists, slot-values, and oddly punctuated sentences in clinical notes. This paper reports
on the systems developed to capture this information before making improvements to the document
decomposition tasks.</p>
      <p>
        Our conceptual framework for BF comes from the International Classification of Functioning,
Disability and Health (ICF) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. While there are many specific kinds of body function, we set out to find
mentions of strength, range of motion (ROM), and reflexes because of their relevance to the current
disability adjudication business process. Within these mentions, we label the body function type
(strength, range of motion, reflex), the body location, and any associated qualifiers.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Prior Work</title>
      <p>
        There is little prior work specifically extracting body function from clinical notes. Some work has been
done extracting other ICF defined areas using traditional rule-based techniques as well as deep learning
methods. Kukafka, Bales, Burkhardt and Friedman report on modifying MedLEE to automatically
identify five ICF codes from Rehab Discharge summaries[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Newman-Griffis and Fosler-Lussier
describe linking physical activity reports to ICF codes using more recent language models and
embeddings[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The NLP platform employed for this work was adapted from the V3NLP Framework[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
Sophia[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] which were used for symptom extraction and finding mentions of sexual trauma in veteran
clinical notes. The framework employed is built upon UIMA[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], so resembles the cTAKES[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] system
closely, but has a pedigree from UMLS concept extraction in biomedical literature (MetaMap)[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Corpus and Manual Annotations</title>
      <p>The NIH Clinical Center Rehabilitation Medicine notes, from which our corpus was drawn, are
representative of any hospital’s rehab notes in that the services provided support patients in need
of rehab. While the document formatting is idiosyncratic, the terminology is in line with what is
seen in SSA claimant data, which is drawn from a national pool of clinical records from an
extremely heterogeneous set of providers.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Manual Annotations</title>
    </sec>
    <sec id="sec-5">
      <title>3.2. Annotation Guidelines</title>
      <p>A body function mention is identified when there is mention of body function type, and a qualifier, and
optionally, one or more body locations within the scope of a phrase or sentence. These mentions are
only annotated from objective (clinician-observed) information.</p>
      <p>Laterality and similar modifiers are lumped with body location, since body function locations are
typically modified with descriptors such as left, right, both, proximal, and distal. Exceptions were made
where a mention lacks one of the necessary components, or where those components are
inferred, but some body function type information is still thought to be indicated. For example, the
mention Neurological: Negative is marked. Neurological here implies both a location and a body function
type.</p>
    </sec>
    <sec id="sec-6">
      <title>3.2.1. Qualifiers and Polarity</title>
      <p>Body Function qualifiers include terms like positive and improved, as well as values from test scores,
such as degrees of range or the results of (clinically applied) manual muscle tests, which
look like 5/5 or 9/10. Each qualification is classified as below-level function (-1), ambiguous (0), or
at- or above-level function (+1). As a side note, strength values between 0/5 and 4+/5 are given a -1
qualifier because they are less than normal on the manual muscle test scale. Only a 5/5 would
be marked as +1.</p>
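      <p>As an illustration only, the polarity rule for manual muscle test scores can be sketched as follows. The function name mmt_polarity and the regular expression are hypothetical and are not part of the system described here; treating unparseable scores and modified grades of 5 (such as 5-/5) as ambiguous is an assumption, since the guidelines above address only 0/5 through 4+/5 and 5/5.</p>
      <preformat>
```python
import re

# Hypothetical pattern for scores on the 0-5 manual muscle test scale, e.g. "4+/5".
MMT_PATTERN = re.compile(r"^\s*([0-5])\s*([+-]?)\s*/\s*5\s*$")


def mmt_polarity(score: str) -> int:
    """Map a manual muscle test score to -1, 0, or +1."""
    match = MMT_PATTERN.match(score)
    if match is None:
        return 0  # assumption: anything that is not an x/5 score is ambiguous
    grade, modifier = match.group(1), match.group(2)
    if grade == "5":
        # Only a plain 5/5 is at-level; 5-/5 or 5+/5 is treated as ambiguous (assumption).
        return 1 if modifier == "" else 0
    return -1  # 0/5 through 4+/5 are below normal
```
      </preformat>
      <p>Under these assumptions, mmt_polarity("4+/5") yields -1 and mmt_polarity("5/5") yields +1.</p>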
    </sec>
    <sec id="sec-7">
      <title>3.2.2. Underspecified and Ambiguous Mentions</title>
      <p>There are body function mentions found in clinical text that say something about body function
incompletely. This comes in at least three varieties. A mention like raise arm overhead is either
strength or range of motion but is not more specific. A mention that is qualified as assessed is not
sufficient to assign a polarity of 1 or -1.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2.3. Implicit Semantics</title>
      <p>There are mentions that contain implicit body function type and implicit body location. For example,
grip strength implicitly indicates the hand as a body location. A mention like a fist can be made with
both hands implicitly indicates strength and/or range of motion.</p>
    </sec>
    <sec id="sec-9">
      <title>3.2.4. Interrater Reliability</title>
      <p>This initial set of records was annotated by a fellow with some medical training, guided by domain
experts (ER, KC). While the body function annotation task included two annotators, only one annotator
worked on this NIH corpus. Late in the process, a second annotator was trained and an interrater reliability
assessment was performed using NIH data. The interrater reliability (F1) scores between the two annotators
across all labels, both macro- and micro-averaged, were .70 on the NIH corpus. For this work, the only useful
take-away is that two annotators can comparably continue to annotate NIH data, and that the task is not
too complex.</p>
    </sec>
    <sec id="sec-10">
      <title>4. Methods</title>
      <p>We have created two systems: a dictionary lookup rule-based version and a conditional random field
(CRF) approach based on training from manual annotations. Neither of these approaches uses an
underlying, pre-built language model. The assumption is that the verbiage and context around body
function mentions differ substantially from the text on which any of the pre-built
language models available to us were trained.</p>
    </sec>
    <sec id="sec-11">
      <title>4.1. Rule Based, Dictionary Lookup System</title>
      <p>This system is constructed from the V3NLP Framework, a UIMA-based NLP suite of annotators, pipeline
connection utilities, and readers and writers. The evolution of the V3NLP Framework is now branded
Framework Legacy. The pipeline stitches together mechanical annotators that decompose the
clinical text into its constituent parts, including sections, sentences, phrases, tokens, and dictionary
looked-up terms. The intelligence of the system lies in a dictionary lookup annotator, which relies upon
lexicons. It is worth sharing the pedigree of each of those lexica.</p>
    </sec>
    <sec id="sec-11-1">
      <title>4.1.1. Lexica</title>
      <p>A separate lexicon was created for each of strength, ROM, reflex, body location, and qualifier terms,
as well as for pain, balance, and coordination.</p>
      <p>
        For most of these, a top level (seed) term was identified in the UMLS. All semantically descendant
UMLS terms were extracted from the UMLS and added to each respective lexicon. The resulting terms
were fed through a lexical variant generation tool (LVG)[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to create fruitful variants[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Each of
these extracted and generated terms was labeled with Strength, ROM, Body Location, and so on.
Metadata including the UMLS identifiers and UMLS semantic types were retained for the sake of pedigree.
When a term is found in the text, a lexicalElement annotation is created and tagged with one or more
categories (Strength, ROM, Body Location, etc.) and the metadata garnered from the UMLS.
      </p>
    </sec>
    <sec id="sec-12">
      <title>4.1.1.1. Body Location Lexicon</title>
      <p>The SNOMED terminology has a modifier (Body Structure) attached to a number of its terms. All
of these body-structure-tagged terms were extracted, then manually culled to remove terms that
would not be relevant. The culled terms included those involving cell and cell structure, cardiac, and vein.
This yielded 53,641 terms with UMLS concept identifiers. Thirty-six body laterality terms, including
left-sided, proximal, and distal, were manually added to cover parts of body location expressions in the
text. An additional 24 terms were added to cover body location expressions found in the training set.
These were mostly abbreviations like r le and a few more colloquial terms like core and quad. There
are 53,704 terms total in the body location lexicon.</p>
    </sec>
    <sec id="sec-13">
      <title>4.1.1.2. Body Strength Lexicon</title>
      <p>The bulk of the Body Strength terms gathered from the UMLS were terms containing the token strength.
There are 4739 such terms. Some of these are admittedly overly broad, for example those having to
do with the strength of contractions or the strength of medication. A number of these were manually
filtered out. Sixty-two terms not otherwise found were added from expressions seen in the text. These
were mostly of the form [body location] [extensor|extensors|extension|extensions|ext]. Note that 34
terms having to do with muscle weakness were included as part of the body strength lexicon. There
were a total of 4802 body strength terms.</p>
    </sec>
    <sec id="sec-14">
      <title>4.1.1.3. Range of Motion Lexicon</title>
      <p>
        Descendant terms of Range of Motion were gathered from SNOMED-CT[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (screen scraping from
the SNOMED CT Browser[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ][
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]). These were augmented from terms in the UMLS with range of
motion, extension, flexion as part of the term. It should be noted that a number of these terms came
from MEDCIN[15] in particular. While most of the terms came from SNOMED, LOINC[
        <xref ref-type="bibr" rid="ref15">16</xref>
        ], the
National Cancer Institute Thesaurus[
        <xref ref-type="bibr" rid="ref16">17</xref>
        ], Ontology of Consumer Health Vocabulary (OCHV)[
        <xref ref-type="bibr" rid="ref17">18</xref>
        ],
along with MeSH[
        <xref ref-type="bibr" rid="ref18">19</xref>
        ], and ICD-10CM[
        <xref ref-type="bibr" rid="ref19">20</xref>
        ] had some coverage. Sixteen additional terms were added
to cover range of motion expressions found in the training text. There are 793 range of motion terms in
the range of motion lexicon.
      </p>
    </sec>
    <sec id="sec-15">
      <title>4.1.1.4. Body Function Qualifiers Lexicon</title>
      <p>Many body function qualifiers are numeric and are covered by regular expression mechanisms to
identify units of measure. To this end, a lexicon of units of measure is being used to identify the units.
That lexicon is derived from the Unified Code for Units of Measure (UCUM) provided by the National
Library of Medicine. This resource was altered for the body function task. All single letter units were
commented out, because they were causing too many false positives. In addition, the terms feet and foot
and field were likewise commented out. The UCUM lexicon includes 946 entries.</p>
      <p>A lexicon was needed to cover the non-numeric qualifiers. 9755 terms with a
semantic type of Qualifier were taken from the UMLS. This was augmented by terms that descended
from the concepts weakness, observation of reflex, and hyperreflexia. In all, 2807 concepts from the UMLS
were gathered. 104 terms were added to this resource to cover terms in the training text that were not
already known as body qualifiers. The body qualifier lexicon had 2910 terms.</p>
    </sec>
    <sec id="sec-16">
      <title>4.1.1.5. Non Body Function Lexicon</title>
      <p>It was useful to gather terms that when identified, would rule out a body function expression. These
were labeled as Confounding Terms. Top among these terms was NIHFA score and MRI. There were
17 such terms added. An additional 34 terms were added, without any semantic label, to chunk up the
text to prevent fallacious qualifiers, mostly around temporal entities. Among these were terms like alert
and oriented x 3 and age-matched norms. There were 32 such terms added. A small lexicon of 62 pain
terms that indicate pain of some sort was gathered. One term in this lexicon turns out to be ambiguous
with a body strength term pinching.</p>
      <p>A number of qualifiers were erroneously identified because they were within expressions that also
included body location and sometimes strength but were referring to balance and coordination. A small
lexicon of 13 balance and coordination terms was gathered to identify such expressions
and block these errant qualifiers from
becoming part of a strength, ROM, or reflex mention.</p>
    </sec>
    <sec id="sec-17">
      <title>4.1.2. Preprocessing Annotators</title>
      <p>The redaction done on this data was overly aggressive. Templated redacted forms were found within
section names and body function mentions as well as other locations. An annotator was created to label
redacted forms so that those spans of text become invisible to, and are temporarily removed for,
downstream annotators.</p>
      <p>A particularity of this data set is that redactions are in Section Names such as History of [First Name
id = XXXXX ] Illness: . Taking out the redaction enabled many sections to be correctly identified.</p>
      <p>The dataset we work with has a particular idiosyncrasy: it contains no newlines. This level of data
quality control and normalization is not usually worth noting, but it is noted here because our other
datasets, which come from a variety of providers from around the world, also occasionally include
documents that have no newlines.</p>
      <p>An annotator was created to infer when there were no newlines in a document and inject newlines
around section names from a rough lookup of section names and simplistic regular expressions. This
aided in subsequent section boundary and section name identification because the existing sectionizing
mechanism requires newlines to be present.</p>
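      <p>The newline-injection idea can be sketched roughly as below. This is a minimal illustration, not the annotator itself: the section-name list, the SECTION_HEADER pattern, and the function name inject_newlines are all hypothetical, and the real lookup of section names is presumably far larger.</p>
      <preformat>
```python
import re

# Hypothetical, abbreviated list of section names (an assumption for illustration).
SECTION_NAMES = ["History of Present Illness", "Strength", "Range of Motion", "Plan"]

# A known section name followed by a colon, with any preceding whitespace.
SECTION_HEADER = re.compile(
    r"\s*\b(" + "|".join(re.escape(name) for name in SECTION_NAMES) + r")\s*:"
)


def inject_newlines(text: str) -> str:
    """Insert a newline before each known 'Section Name:' when the text has none."""
    if "\n" in text:
        return text  # the document already has line structure; leave it alone
    injected = SECTION_HEADER.sub(lambda m: "\n" + m.group(1) + ":", text)
    return injected.lstrip("\n")
```
      </preformat>
      <p>A rough lookup like this will also break before a section name that happens to occur mid-sentence, which is one reason such an approach only approximates true section boundaries.</p>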
    </sec>
    <sec id="sec-18">
      <title>4.1.3. Body Function Pipeline</title>
      <p>The body function pipeline’s purpose is to identify BF mentions, that is, utterances that include a
body location, a body function type (such as strength, range of motion, or reflex), and some kind
of qualifier related to the body function type.</p>
      <p>The body function pipeline has been appended to the pipeline that decomposes the text into sections,
sentences, slot:values, lists, phrases, terms, and tokens (see appendix for details). The body function
pipeline relies upon having terms in the document already looked up and classified prior to the next set
of annotators and knowing what sections those terms occurred in.</p>
      <p>The guidelines and subsequent manual annotations created a Body Function Type label, with a type
attribute which has an enumerated value as one of Body Strength, Range of Motion or Reflex. For the
convenience of building the tool from existing components, those attributes were turned into labels.</p>
      <p>The guidelines indicate some sections to ignore. These include Goals, Plan, Education, Family
History, Medications, Referrals, Interventions, Gait, Balance, Coordination, Mobility, Motor learning,
Motor Function, Follow-up, and Recommendation sections. While Balance and Coordination are body
functions, they are not among the initial ones (strength, ROM, reflex) we are addressing. One oddity:
several mentions in the training set came from a common section labeled Impressions
and Plan. Impressions and Plan sections were not filtered out.</p>
      <p>The Body Function Location, Strength, Range of Motion and Reflex Annotators each create their
respective annotations from terms noted to have those categories as attributes from the upstream term
lookup step. Annotations were not made from sections that were specifically noted to be ignored and
annotations were not made from mentions that were not about the patient. As noted above, all terms
have as an attribute the section they are within and if the term refers to the patient or not.</p>
    </sec>
    <sec id="sec-19">
      <title>4.1.4. Body Function Qualifiers</title>
      <p>Finding BF qualifiers is more complicated. Sometimes there is both a strength and pain mention in the
same sentence where the qualifier is really about the pain, not the strength. Although less frequent,
mentions about coordination and balance were found with strength and range of motion mentions in the
same statement and the qualifiers were not about the body function type we were looking for.</p>
      <p>To thwart these confusions, a mention categorized as pain, coordination, or sensation inhibits the
creation of a qualifier when it is found within a window of six tokens of a body function type
mention. A small lexicon of pain and coordination terms was created to
support this. While this works well, some terms, such as pinch, were found to be about
pain or strength depending upon context that is beyond the scope of this task.</p>
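      <p>A minimal sketch of the six-token inhibition window follows. The confounder set and the function name qualifier_allowed are hypothetical stand-ins for the lexicon and annotator logic described above, and the sketch assumes the window extends six tokens on either side of the body function type token.</p>
      <preformat>
```python
from typing import List, Set

# Hypothetical, abbreviated confounder lexicon (pain, coordination, sensation terms).
CONFOUNDERS: Set[str] = {"pain", "tenderness", "coordination", "sensation"}
WINDOW = 6  # window size in tokens, as stated in the text


def qualifier_allowed(tokens: List[str], bf_index: int) -> bool:
    """Return False when a confounder lies within WINDOW tokens of the BF type token."""
    start = max(0, bf_index - WINDOW)
    end = min(len(tokens), bf_index + WINDOW + 1)
    window = tokens[start:end]
    return not any(tok.lower() in CONFOUNDERS for tok in window)
```
      </preformat>
      <p>For the tokens of shoulder strength limited by pain, with strength at index 1, the function returns False, so no qualifier would be created.</p>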
      <p>There were a number of qualifier candidates that occurred in statements that had mentions of a body
function type and a body location, but the qualifiers were not about the body function type. Common
among these were mentions of patient ages and dates. There were a number of confounding terms also
found to be in the vicinity of BF type mentions that when seen, would indicate the qualifier would not
be attributed to the BF type. A lexicon of such confounding terms was created and used. Such terms
include fine motor activities and MRI.</p>
      <p>Scoping rules are a common theme in NLP, making it important to accurately attribute the scope of
the section where a mention is found as well as the associated sentence or slot:value. There are a
number of cases where the text is not pristine sentences, lists, or slot:values. Occasionally there were
texts with no sentence breaks, or with multiple colons, causing the sentence scoping to go awry.
In a number of these cases, no qualifier was found for a body function type within scope.
For these cases, the scope of where to look for a qualifier candidate was widened when
no qualifier could be found within a sentence. Looking to the right by 266 characters (empirically set)
to find a qualifier for a body function type improved performance.</p>
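      <p>The fallback can be illustrated with the sketch below, which first searches the sentence span and only then looks up to 266 characters to the right. The QUALIFIER pattern is a hypothetical, greatly simplified stand-in for the qualifier lexicon and unit-of-measure expressions; only the 266-character value comes from the text.</p>
      <preformat>
```python
import re
from typing import Optional

LOOKAHEAD_CHARS = 266  # empirically set value reported in the text

# Hypothetical qualifier pattern: scores like 5/5 or 4+/5, or a few qualifier words.
QUALIFIER = re.compile(
    r"\b\d\s*[+-]?\s*/\s*\d+\b|\b(?:normal|intact|weak|limited)\b",
    re.IGNORECASE,
)


def find_qualifier(text: str, sent_start: int, sent_end: int) -> Optional[str]:
    """Search the sentence first; if nothing is found, look right of the sentence."""
    match = QUALIFIER.search(text, sent_start, sent_end)
    if match is None:
        limit = min(len(text), sent_end + LOOKAHEAD_CHARS)
        match = QUALIFIER.search(text, sent_end, limit)
    return match.group(0) if match else None
```
      </preformat>
      <p>With the text Grip strength: Right: 4+/5 and a sentence scope that ends after the first colon, the in-sentence search fails and the widened search returns 4+/5.</p>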
    </sec>
    <sec id="sec-20">
      <title>4.2. CRF Model</title>
      <p>Stanford’s Named Entity Recognizer[21], which relies upon an underlying Conditional Random Field
(CRF)[22] statistical machine learning algorithm, was chosen as our initial machine learning
approach.</p>
      <p>A UIMA based NLP pipeline was created to chunk the text into tokens. Those tokens that came from
manually annotated body function mentions were marked. All tokens were used as input to the
CRF model. The tokens were classified in the BIO fashion: the token that begins a mention was
marked with Begin, the remaining tokens of the mention were marked with Inside, and all other
tokens were marked with an Outside
classification. In addition to the BIO markers, all combinations of the body function classes and Begin/Inside were used,
for example Begin-Body-Function-Type-Strength, Inside-Body-Function-Type-Strength,
Begin-Body-Function-Location, and Inside-Body-Function-Location.</p>
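      <p>The labeling scheme can be sketched as follows. This is an illustration of BIO labeling in general, not the code used to prepare the training data; the Span tuple layout and the function name bio_labels are hypothetical.</p>
      <preformat>
```python
from typing import List, Tuple

# (index of first token, index of last token, class name); hypothetical layout.
Span = Tuple[int, int, str]


def bio_labels(tokens: List[str], spans: List[Span]) -> List[str]:
    """Label each token Begin-CLASS, Inside-CLASS, or Outside."""
    labels = ["Outside"] * len(tokens)
    for first, last, cls in spans:
        labels[first] = "Begin-" + cls
        for i in range(first + 1, last + 1):
            labels[i] = "Inside-" + cls
    return labels
```
      </preformat>
      <p>For the tokens left, hand, grip, strength, 5/5, with a location span over the first two tokens and a strength span over the next two, the final token receives the Outside label.</p>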
    </sec>
    <sec id="sec-21">
      <title>5. Results</title>
    </sec>
    <sec id="sec-22">
      <title>5.1. Rule-based System: Token Based Matching Criteria</title>
    </sec>
    <sec id="sec-23">
      <title>5.2. CRF-based System: Token Based Matching Criteria</title>
    </sec>
    <sec id="sec-24">
      <title>6. Rule-based System: Failure Analysis</title>
      <p>Particular attention is paid here to the qualifiers because, for the most part, they are the linchpin to creating
mentions. The rest of this section deals with qualifier failures.</p>
    </sec>
    <sec id="sec-25">
      <title>6.1.1. False Negatives</title>
      <p>The most prevalent term missed was weakness. There were 34 cases where weakness was correctly
identified, but 93 cases where identifying weakness was a false positive. Of the seven cases where weakness
was missed, five also involved confounding mentions related to balance,
coordination, and pain. One was a scoping case where a list of test results followed no weakness
identified on R side of body. As it turns out, no weakness is in the Body Qualifier lexicon as a qualifier
but is not also tagged as strength, as it should have been. One case of weakness incorrectly attributed to a
patient mention was triggered by the word note in the sentence.</p>
    </sec>
    <sec id="sec-26">
      <title>6.1.2. False Positives</title>
      <p>As mentioned above, there were 93 cases involving weakness. The majority of these turned out to be
within statements the patient made, either within chief complaints, or patient reports weakness, or
within quoted expressions. While an annotator was created to assign attribution of who authored the
statement, and mentions were marked accordingly, the rule to filter these patient-attributed mentions was not
working. Scoping issues also arose, particularly with scoping section names in or out. Thirty-two cases were
caused by scoping, where a series of values delimited by colons, semi-colons, or periods limited
the scope of what those numbers were referring to. For example, in … 3 trials : Right : 60 , 60 , 62
Left: 60, 43, 50 Gauge was measured in …, it was missed that these were grip
strength measurements.</p>
    </sec>
    <sec id="sec-27">
      <title>7. Discussion</title>
      <p>It is noted that the most challenging part of this task is identifying the qualifiers. This is where most of
the formatting and therefore scoping challenges arose.</p>
      <p>This initial work has led to the guidelines being altered for body function type categorization going
forward based on discussions of how to handle ambiguous statements that, for example can refer to
both strength and range of motion.</p>
      <p>The initial guidelines included marking section names as part of a mention, as when the section
name was Strength:. After some discussion, we have come up with a new annotation type: Relevant
Context to cover elements like section names which set the semantic context of mentions to come, but,
themselves are not mentions.</p>
      <p>The timing of this task was such that this set of 500 was annotated by one annotator before a second
annotator came aboard. The inter-rater reliability assessment was done to ensure that, going forward, the annotators
are consistent.</p>
      <p>The CRF results are acknowledged to be better than the rule based results but provide little insight
into the task. The CRF modeling also has provided challenges where there are limitations to how many
labels can be modeled given our current computing resources. The rule-based version will continue to
provide a baseline to benchmark gains to the system due to alterations in the document decomposition
tasks for scoping contexts to sentences, slot:values, tables, and sections.</p>
    </sec>
    <sec id="sec-28">
      <title>8. Future Work</title>
      <p>There are a number of next steps for this work. The first is to model the qualifier attributes (-1, 0, +1).
We will loop back and alter annotations within this set to accommodate changes to the guidelines and
re-run.</p>
      <p>Strength, ROM, and reflexes are not the only body function types that can be retrieved. This work
can be expanded to include balance and coordination information.</p>
      <p>We intend to release the body function software after work to improve document decomposition
techniques has improved the performance of this baseline.</p>
    </sec>
    <sec id="sec-29">
      <title>9. Conclusions</title>
      <p>We describe in this paper a rule based extraction tool developed to find body function mentions that
include strength, range of motion, and reflex. We developed this work learning from NIH Clinical
Center Rehabilitation Medicine notes and are adapting it to find body function mentions in SSA
claimant records.</p>
    </sec>
    <sec id="sec-29-1">
      <title>10. Acknowledgements</title>
      <p>We would like to thank Rafael Jimenez Silva for staging the annotation tasks, ensuring an equitable
distribution of records across training and testing, doing the IRR work and providing corpus and quality
assurance statistics.</p>
      <p>Supported by the Intramural Research Program of the National Institutes of Health and the US Social
Security Administration.</p>
    </sec>
    <sec id="sec-30">
      <title>Appendix: Syntactic Pipeline Defined</title>
      <p>This body of work relies on an NLP UIMA based pipeline that decomposes text into its constituent
parts. This pipeline is substantially similar to what was used in the Sophia pipeline, but is outlined here
because some of the components have been added.</p>
    </sec>
    <sec id="sec-31">
      <title>The Syntactic Pipeline</title>
      <p>For the most part, the annotators listed here do obvious tasks that need no further explanation. There
are exceptions and white lies of course, which will be noted for the seemingly mundane tasks of
Tokenization, Sentence Chunking, and Date and Time identification. As it turns out, within the richly
heterogeneous data we are processing, those tasks are not as straightforward and error-free as is
ultimately needed.</p>
    </sec>
    <sec id="sec-31-1">
      <title>Line Annotator with Blank Lines</title>
      <p>This annotator creates annotations for each line in a document. It does not strip empty lines out. Having
line annotations enables an algorithm to walk through lines of text. Multiple blank lines indicate a topic
shift. Thus, one needs to keep track of those kinds of lines, rather than filter them out, when looking
for paragraph breaks. This annotator does not work well when there are no newlines in text, as is the
case for the BTRIS data we are using. Special ameliorations are needed for such data.</p>
    </sec>
    <sec id="sec-31-2">
      <title>RegEx Shape Annotator</title>
      <p>The regular expression shape annotator creates annotations for email addresses, phone numbers, URLs, zip codes, and
common redaction artifacts found in clinical text. Shapes are pseudo-lexical entities: they have meaning
but would not normally be looked up in a dictionary, which distinguishes them from lexical
entities that come from a dictionary lookup. This annotator identifies the easy things you don’t want,
which makes the task of identifying the things you do want easier. Identifying these entities ensures
that downstream annotators do not erroneously pick them up.</p>
    </sec>
    <sec id="sec-32">
      <title>Date and Time Annotator</title>
      <p>This annotator identifies dates and times via regular expressions.</p>
    </sec>
    <sec id="sec-33">
      <title>Token Annotator</title>
      <p>This annotator chunks the text into space delimited units. It creates word tokens and white space
tokens. The tokenizer used here also creates attributes describing if the token has punctuation, is only
punctuation, has numbers, is only numbers, starts with upper case, is camel case, and ends with
sentence ending punctuation.</p>
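      <p>The token attributes listed above can be approximated with a few string tests, as in the sketch below. This is not the V3NLP Framework Tokenizer; the TokenShape class, the token_shape function, the choice of sentence-ending characters, and the definition of camel case used here are assumptions made for illustration.</p>
      <preformat>
```python
import string
from dataclasses import dataclass

SENTENCE_ENDERS = (".", "?", "!")  # assumed set of sentence-ending punctuation


@dataclass
class TokenShape:
    has_punctuation: bool
    only_punctuation: bool
    has_numbers: bool
    only_numbers: bool
    starts_upper: bool
    camel_case: bool
    ends_sentence: bool


def token_shape(tok: str) -> TokenShape:
    """Compute surface attributes for one non-empty, space-delimited token."""
    punct = [c in string.punctuation for c in tok]
    digits = [c.isdigit() for c in tok]
    # Assumption: camel case means a lower-case start with a later capital letter.
    camel = tok[:1].islower() and any(c.isupper() for c in tok[1:])
    return TokenShape(
        has_punctuation=any(punct),
        only_punctuation=all(punct),
        has_numbers=any(digits),
        only_numbers=all(digits),
        starts_upper=tok[:1].isupper(),
        camel_case=camel,
        ends_sentence=tok.endswith(SENTENCE_ENDERS),
    )
```
      </preformat>
      <p>A token such as 5/5 comes out as having both punctuation and numbers but being neither only punctuation nor only numbers, which is the kind of distinction a downstream qualifier rule can use.</p>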
      <p>A technical note: this tokenizer is the V3NLP Framework Tokenizer, a tokenizer tuned for clinical
text that has a legacy from MMTx and MetaMap.</p>
      <p>Tokenizers play an unspoken but big role in errors downstream, and no tokenizer does a perfect
job with clinical text. This tokenizer was informally compared to the Python-based scispacy language
model driven tokenizer. Both tokenizers had failures with different difficult-to-parse texts, with neither
clearly outperforming the other. As a consequence, this legacy version of the tokenizer
continues to be used, in great part because it is much faster and has a much smaller memory footprint
than the wrapped scispacy tokenizer.</p>
    </sec>
    <sec id="sec-34">
      <title>Sentence Chunker</title>
    </sec>
    <sec id="sec-35">
      <title>Term Annotator</title>
    </sec>
    <sec id="sec-36">
      <title>Date and Time by Token Annotator</title>
      <p>Some unusual dates are missed by the regular expression annotator that runs before tokenization. This
annotator identifies dates that are bounded by each token.</p>
    </sec>
    <sec id="sec-37">
      <title>Checkbox Annotator</title>
      <p>This annotator identifies and analyzes mentions like “Smoking: yes [ ] no [x ]”. The annotator identifies
the heading, each of the options, and which option was marked. It also determines whether the options carry
a positive or negative polarity. If so, it takes the polarity of the marked option and applies that
polarity to the heading. In this example, smoking is negated because the “no” box was marked, and
“no” has a negative polarity.</p>
      <p>[Note: This annotator was turned off for this work, partly because the BTRIS data did not have
checkbox mentions that were relevant to the task.]</p>
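      <p>The checkbox logic described above can be sketched as follows. This is a simplified illustration, not the annotator's implementation: the regular expressions, the two-entry polarity lexicon, and the function name parse_checkbox are assumptions.</p>
      <preformat>
```python
import re

# Hypothetical polarity lexicon; a real one would need many more option labels.
OPTION_POLARITY = {"yes": "positive", "no": "negative"}

HEADING = re.compile(r"^\s*([^:]+):\s*(.*)$")
OPTION = re.compile(r"(\w+)\s*\[\s*([xX]?)\s*\]")


def parse_checkbox(line):
    """Return heading, options, marked options and inherited polarity, or None."""
    heading_match = HEADING.match(line)
    if heading_match is None:
        return None
    heading = heading_match.group(1).strip()
    rest = heading_match.group(2)
    options = [(label.lower(), bool(mark)) for label, mark in OPTION.findall(rest)]
    if not options:
        return None
    marked = [label for label, is_marked in options if is_marked]
    # Only a single, unambiguous mark passes its polarity on to the heading.
    polarity = OPTION_POLARITY.get(marked[0]) if len(marked) == 1 else None
    return {
        "heading": heading,
        "options": [label for label, _ in options],
        "marked": marked,
        "polarity": polarity,
    }


if __name__ == "__main__":
    print(parse_checkbox("Smoking: yes [ ] no [x ]"))
```
      </preformat>
      <p>For the example in the text, the sketch reports the heading “Smoking” with the “no” option marked and a negative polarity, so a downstream step could treat smoking as negated.</p>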
    </sec>
    <sec id="sec-38">
      <title>Date by Lookup Annotator</title>
      <p>This annotator identifies parts of temporal expressions by looking up items listed in a date lexicon.
These include the obvious ones: names of the months, days and holidays.</p>
      <p>The Term Annotator chunks tokens together into terms based on dictionary lookup. Categorization and
syntactic information from the dictionary is attached to the terms created.</p>
      <p>The UMLS SPECIALIST Lexicon is employed by default to chunk general English into terms.
Annotator-specific lexica are also employed, including a date lexicon, a lexicon of section names,
and a lexicon of assertion terms. Most of the pipelines employ 20 lexica of one kind or another.</p>
    </sec>
    <sec id="sec-39">
      <title>Slot:Value Annotator</title>
      <p>The slot:value annotator identifies slot:value entities and decomposes each one into a content heading entity
and an answer entity. Example: “Denies Alcohol: yes”.</p>
      <p>Slot:value entities are telegraphic sentences that lack an explicit verb. They allow quick
data capture and easy comprehension but do not parse syntactically the way sentences in
prose do.</p>
      <p>There are many variations of slot:value formats within clinical text in general, and within the
BTRIS dataset. Getting this structure correct is paramount. However, there are many ambiguous
examples which flummox the current iteration of this annotator.</p>
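      <p>To make the slot:value decomposition concrete, here is a minimal sketch under the assumption that the heading and the answer sit on one line separated by a colon. The pattern and the function name parse_slot_value are hypothetical, and real clinical formats vary far more than this handles.</p>
      <preformat>
```python
import re

# Assumed format: a short heading, a colon, then the answer on the same line.
SLOT_VALUE = re.compile(r"^\s*([A-Za-z][^:\n]{0,60}?)\s*:\s*(\S.*?)\s*$")


def parse_slot_value(line):
    """Return the heading and answer of a one-line slot:value mention, or None."""
    match = SLOT_VALUE.match(line)
    if match is None:
        return None
    return {"heading": match.group(1), "answer": match.group(2)}


if __name__ == "__main__":
    print(parse_slot_value("Denies Alcohol: yes"))
    # A time of day is misread as a slot:value pair by this naive pattern.
    print(parse_slot_value("Seen at 10:30 today"))
```
      </preformat>
      <p>The second call shows one kind of ambiguity: a time such as 10:30 is split at its colon, so a colon alone is not reliable evidence of a slot:value structure.</p>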
      <p>The Sentence Chunker identifies sentences within the text. Embedded within this task is also the
identification of lists and list elements. As with the slot:value annotator, correctly identifying
where a sentence begins and ends is paramount. The variation found in clinical text has
flummoxed all the sentence chunkers tried thus far; none has worked 100% of the time. Many of
the downstream errors are attributed to sentence chunking failures.</p>
    </sec>
    <sec id="sec-40">
      <title>Assertion Evidence Annotator</title>
      <p>This is one of two annotators that work in conjunction with each other. This annotator identifies
evidence for negation, conditional statements, hypothetical statements, whether the mention is about
the patient (subject), whether the mention is historical, and who is making the statement.</p>
      <p>The algorithm employed is a rewrite of Wendy Chapman’s ConText algorithm in Java. The
lexica came from her rules and were greatly augmented with work done by three groups at the University
of Utah, combining each group’s rules.</p>
      <p>Identifying who is making the statement (attribution) is the newest extension to this algorithm and was done for
this project. The annotation guidelines stipulated that patient-authored statements be ignored, hence the need
to identify who is saying what. While it is not completely straightforward to identify patient-reported
mentions, there are clues or evidence, including trigger statements such as “patient reports” and “patient
notes”. Also, any mentions that come from the subjective portion of SOAP notes are ruled a priori to be
patient reported. The rules used for this work were adopted from work done to determine the difference
between a sign and a symptom, and from work done to determine whether a statement is about the patient or
someone else.</p>
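      <p>The evidence-gathering step can be illustrated with a simple trigger lookup. This sketch is not the ConText rewrite described above and omits scoping rules entirely; only “patient reports” and “patient notes” come from the text, and the remaining triggers, category names, and the function name are assumptions.</p>
      <preformat>
```python
import re

# Hypothetical trigger lexicon; only the two attribution phrases come from the text.
TRIGGERS = {
    "attribution:patient": ["patient reports", "patient notes"],
    "negation": ["no", "denies", "without"],
    "historical": ["history of"],
}


def find_assertion_evidence(text):
    """Return sorted (start, end, category, phrase) tuples for every trigger found."""
    evidence = []
    for category, phrases in TRIGGERS.items():
        for phrase in phrases:
            pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
            for match in pattern.finditer(text):
                evidence.append((match.start(), match.end(), category, match.group()))
    return sorted(evidence)


if __name__ == "__main__":
    for item in find_assertion_evidence("Patient reports no dizziness."):
        print(item)
```
      </preformat>
      <p>In a two-stage design like the one described, a later annotator would decide which mentions fall within the scope of each piece of evidence.</p>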
      <p>Spoiler alert: the second annotator, the assertion annotator, is much further downstream in the
pipeline.</p>
    </sec>
    <sec id="sec-41">
      <title>Unit-of-Measure Annotator</title>
      <p>This annotator identifies measured quantities, which behave like terms but cannot simply be looked up in a dictionary.
These include numeric test results, pulse rates, ejection fractions, and degrees of range of motion.</p>
      <p>This annotator employs, for the most part, a combination of dictionary lookup for the unit part
and regular expressions for the numeric part. The dictionary used for this is a snapshot of NLM’s
UCUM resource. It is not perfect, but it is useful.</p>
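      <p>The combination of a regular expression for the number and a dictionary lookup for the unit can be sketched as below. The small UNITS set is a stand-in for a UCUM snapshot, and the pattern and function name are assumptions for illustration.</p>
      <preformat>
```python
import re

# Tiny stand-in for a units dictionary; the text describes a snapshot of UCUM.
UNITS = {"mmhg", "bpm", "%", "degrees", "mg", "kg", "cm"}

# A number, optionally followed by a second number after "/" or "-" (ratio or range).
NUMBER = re.compile(r"\d+(?:\.\d+)?(?:\s*[-/]\s*\d+(?:\.\d+)?)?")


def find_measurements(text):
    """Return (value, unit) pairs: the regex finds the value, the lookup confirms the unit."""
    results = []
    for match in NUMBER.finditer(text):
        following = text[match.end():].split()
        if not following:
            continue
        candidate = following[0].rstrip(".,;")
        if candidate.lower() in UNITS:
            results.append((match.group(), candidate))
    return results


if __name__ == "__main__":
    note = "BP 120/80 mmHg, pulse 72 bpm, EF 55%, flexion 0-90 degrees."
    print(find_measurements(note))
```
      </preformat>
      <p>Numbers that are not followed by a known unit are left alone, which keeps bare counts and list numbers out of the results.</p>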
    </sec>
    <sec id="sec-42">
      <title>Term Shapes Annotator</title>
      <p>This annotator identifies spelled out numbers and units of measure ranges.</p>
    </sec>
    <sec id="sec-43">
      <title>Punctuation Terms Annotator</title>
      <p>This is a corrective annotator: it creates terms that consist only of punctuation, such as “+++”. The current lexical
lookup ignores runs of punctuation, which makes it impossible to create such terms there, yet
many test results are expressed only as punctuation. This annotator was created
specifically for this task to pick up such entities.</p>
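      <p>One simple way to pick up punctuation-only terms is to test each space-delimited chunk against a punctuation-only pattern, as sketched here. The function name and the exact definition of punctuation are assumptions for this sketch.</p>
      <preformat>
```python
import re


def find_punctuation_terms(text):
    """Return (start, end, text) for each space-delimited chunk made only of punctuation."""
    terms = []
    for match in re.finditer(r"\S+", text):
        # Accept the chunk only if every character is neither a word character nor a space.
        if re.fullmatch(r"[^\w\s]+", match.group()):
            terms.append((match.start(), match.end(), match.group()))
    return terms


if __name__ == "__main__":
    print(find_punctuation_terms("Protein: +++ Glucose: -"))
```
      </preformat>
      <p>Punctuation attached to a word, such as the colon in “Protein:”, is not reported because the chunk also contains letters.</p>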
    </sec>
    <sec id="sec-44">
      <title>Person Tokens Annotator</title>
      <p>This annotator identifies persons in the text.</p>
      <p>Note: The BTRIS data already has persons redacted, so this annotator is not currently useful and was
turned off for this work.</p>
      <p>The current slot:value annotator has various failings that corrective annotators fix, using
downstream annotations that are not available to the slot:value annotator at the point where it runs in the
pipeline. This corrective annotator fixes some of the failures that are fixable.</p>
    </sec>
    <sec id="sec-45">
      <title>CCDA Section Header Annotator</title>
      <p>This annotator creates section headers based, for the most part, on dictionary lookup. The annotator
uses an augmented version of HL7’s list of approved section headings. The list was substantially augmented
for this task because OT/PT-specific sections do not (yet) appear within the CCDA domain.</p>
    </sec>
    <sec id="sec-46">
      <title>CCDA Panel Section Header Annotator</title>
      <p>Panels are sections within clinical documents that list test results, primarily for blood tests.
This annotator creates headers for panel sections.</p>
      <p>Note: Panels are ignored for this work, and this annotator is turned off.</p>
    </sec>
    <sec id="sec-47">
      <title>CCDA Section Annotator</title>
      <p>This annotator creates section zones from the end of the section name down to just before the beginning
of the next section name.</p>
    </sec>
    <sec id="sec-48">
      <title>Sentence Section Repair</title>
      <p>This is a corrective annotator. Once section headings are determined, there is a need to adjust
(erroneous) sentence boundaries to exclude section names.</p>
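      <p>Section zones that run from the end of one section name to just before the next, as the CCDA Section Annotator is described as producing, can be computed from header offsets as in this sketch. The tuple layout and the function name section_zones are assumptions for illustration.</p>
      <preformat>
```python
def section_zones(headers, text_length):
    """Compute zones from header spans.

    headers: list of (start, end, name) tuples sorted by start offset.
    Returns (name, zone_start, zone_end) tuples; the last zone runs to the end of the text.
    """
    zones = []
    for index, (_start, end, name) in enumerate(headers):
        if index + 1 == len(headers):
            zone_end = text_length
        else:
            zone_end = headers[index + 1][0]
        zones.append((name, end, zone_end))
    return zones


if __name__ == "__main__":
    text = "HPI: knee pain. PLAN: PT twice weekly."
    headers = [(0, 4, "HPI"), (16, 21, "PLAN")]
    for name, zone_start, zone_end in section_zones(headers, len(text)):
        print(name, repr(text[zone_start:zone_end].strip()))
```
      </preformat>
      <p>Because each zone starts at the end of its header, a sentence boundary that falls inside a header span can be moved to the zone start to exclude the section name.</p>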
    </sec>
    <sec id="sec-49">
      <title>Quoted Utterance Annotator</title>
      <p>This annotator creates annotations for quoted text. Symptoms are typically found in “quoted text”, so it is useful to
find such spans.</p>
      <p>Note: Quoted text does not play a role in the Body Function task; although this annotator is on, its output is not
used downstream.</p>
      <p>This is a corrective annotator. It removes lists that have only one element and
turns them back into sentences. Sentences that end with a number also caused issues because the
numbers look like list delimiters. Lists whose delimiters, such as “1. 2.”, appear
out of order are therefore likely not lists but sentences that end with numbers.</p>
      <p>Sentences that contain tabs are likely to come from multi-column formats where
the OCR software injected tabs to indicate a new column.</p>
      <p>This annotator, part two of the two assertion annotators, adds assertion attributes to all annotations
based on the assertion evidence noted by the assertion evidence annotator.</p>
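      <p>The list repair rule described in this section, where single-element lists and lists with out-of-order delimiters are treated as sentences, can be sketched as a check on the numeric delimiters. The pattern, the function name, and the assumption that genuine lists start at 1 are all hypothetical.</p>
      <preformat>
```python
import re

# A candidate delimiter: digits followed by a period, set off by whitespace or text edges.
DELIMITER = re.compile(r"(?:^|\s)(\d+)\.(?=\s|$)")


def looks_like_numbered_list(text):
    """Return True only if the numeric delimiters count 1, 2, 3, ... with at least two items."""
    numbers = [int(number) for number in DELIMITER.findall(text)]
    return len(numbers) >= 2 and numbers == list(range(1, len(numbers) + 1))


if __name__ == "__main__":
    print(looks_like_numbered_list("1. Stretching 2. Gait training"))
    print(looks_like_numbered_list("Pain was rated at 7. Strength was 4."))
    print(looks_like_numbered_list("1. Only item"))
```
      </preformat>
      <p>Text that fails this check would be turned back into ordinary sentences rather than kept as a list.</p>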
    </sec>
    <sec id="sec-50">
      <title>Section Name in Terms Attribute Annotator</title>
      <p>It is useful to know which section a term is mentioned in, for example to filter out mentions
that come from sections that are not of interest. This annotator adds the section name to each term in
the document. This is done outside the term annotator, which runs before the section zones are
computed.</p>
      <p>[21] Finkel, Jenny Rose, Trond Grenager, and Christopher D. Manning. "Incorporating non-local
information into information extraction systems by Gibbs sampling." Proceedings of the 43rd
Annual Meeting of the Association for Computational Linguistics (ACL’05). 2005.</p>
      <p>[22] Wallach, Hanna M. "Conditional random fields: An introduction." Technical Reports
(CIS) (2004): 22.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>NIH SEER Training Modules</collab>
          ,
          <article-title>Anatomy &amp; Physiology, Intro to the Human Body, Body Functions &amp; Life Processes</article-title>
          : training.seer.cancer.gov/anatomy/body/functions.html (last accessed
          <year>2021</year>
          /05/06)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Üstün</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Bedirhan</surname>
          </string-name>
          , et al.
          <article-title>"The International Classification of Functioning, Disability and Health: a new tool for understanding disability and health</article-title>
          .
          <source>" Disability and rehabilitation 25</source>
          .
          <fpage>11</fpage>
          -
          <lpage>12</lpage>
          (
          <year>2003</year>
          ):
          <fpage>565</fpage>
          -
          <lpage>571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Kukafka</surname>
            ,
            <given-names>Rita</given-names>
          </string-name>
          , et al.
          <article-title>"Human and automated coding of rehabilitation discharge summaries according to the International Classification of Functioning, Disability, and Health."</article-title>
          <source>Journal of the American Medical Informatics Association 13.5</source>
          (
          <year>2006</year>
          ):
          <fpage>508</fpage>
          -
          <lpage>515</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Newman-Griffis</surname>
            ,
            <given-names>Denis</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Fosler-Lussier</surname>
          </string-name>
          .
          <article-title>"Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health."</article-title>
          <source>Frontiers in Digital Health 3</source>
          (
          <year>2021</year>
          ):
          <fpage>24</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Divita</surname>
          </string-name>
          ,
          <string-name>
            <surname>Guy</surname>
          </string-name>
          , et al.
          <article-title>"v3NLP Framework: tools to build applications for extracting concepts from clinical text</article-title>
          .
          <source>" eGEMs 4</source>
          .3 (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Divita</surname>
          </string-name>
          ,
          <string-name>
            <surname>Guy</surname>
          </string-name>
          , et al.
          <article-title>"Sophia: A expedient UMLS concept extraction annotator." AMIA Annual Symposium Proceedings</article-title>
          . Vol.
          <year>2014</year>
          . American Medical Informatics Association,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Ferrucci</surname>
            , David, and
            <given-names>Adam</given-names>
          </string-name>
          <string-name>
            <surname>Lally</surname>
          </string-name>
          .
          <article-title>"UIMA: an architectural approach to unstructured information processing in the corporate research environment." Natural Language Engineering (</article-title>
          <year>2004</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Savova</surname>
            ,
            <given-names>Guergana K.</given-names>
          </string-name>
          , et al.
          <article-title>"Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications</article-title>
          .
          <source>" Journal of the American Medical Informatics Association 17.5</source>
          (
          <year>2010</year>
          ):
          <fpage>507</fpage>
          -
          <lpage>513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Aronson</surname>
          </string-name>
          ,
          <string-name>
            <surname>Alan R</surname>
          </string-name>
          .
          <article-title>"Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</article-title>
          .
          <source>" Proceedings of the AMIA Symposium. American Medical Informatics Association</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>McCray</surname>
          </string-name>
          ,
          <string-name>
            <surname>Alexa</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suresh</surname>
            <given-names>Srinivasan</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Allen</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Browne</surname>
          </string-name>
          .
          <article-title>"Lexical methods for managing variation in biomedical terminologies</article-title>
          .
          <source>" Proceedings of the Annual Symposium on Computer Application in Medical Care. American Medical Informatics Association</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>Lexical Variant Generation Documentation: Fruitful Variants</article-title>
          . lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/flow/fG.html (last accessed
          <year>2021</year>
          /05/06)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Donnelly</surname>
            ,
            <given-names>Kevin.</given-names>
          </string-name>
          <article-title>"SNOMED-CT: The advanced terminology and coding system for eHealth." Studies in health technology</article-title>
          and
          <source>informatics 121</source>
          (
          <year>2006</year>
          ):
          <fpage>279</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Richesson</surname>
          </string-name>
          ,
          <string-name>
            <surname>Rachel</surname>
          </string-name>
          , et al.
          <article-title>"A web-based SNOMED CT browser: distributed and real-time use of SNOMED CT during the clinical research process." Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems</article-title>
          . IOS Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <source>SNOMED-CT Online Browser</source>
          , browser.ihtsdotools.org/ (last accessed
          <year>2021</year>
          /05/06) [15]
          <string-name>
            <surname>Goltra</surname>
            ,
            <given-names>Peter S.</given-names>
          </string-name>
          <article-title>MEDCIN: a new nomenclature for clinical medicine</article-title>
          .
          <source>Springer Science &amp; Business Media</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>Clement J.</given-names>
          </string-name>
          , et al.
          <article-title>"LOINC, a universal standard for identifying laboratory observations: a 5-year update</article-title>
          .
          <source>" Clinical chemistry 49.4</source>
          (
          <year>2003</year>
          ):
          <fpage>624</fpage>
          -
          <lpage>633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Golbeck</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jennifer</surname>
          </string-name>
          , et al.
          <article-title>"The National Cancer Institute's thesaurus and ontology</article-title>
          .
          <source>" Journal of Web Semantics First Look</source>
          <volume>1</volume>
          _
          <issue>1</issue>
          _
          <issue>4</issue>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Amith</surname>
          </string-name>
          ,
          <string-name>
            <surname>Muhammad</surname>
          </string-name>
          , et al.
          <article-title>"Ontology of Consumer Health Vocabulary: providing a formal and interoperable semantic resource for linking lay language and medical terminology." 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</article-title>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Lipscomb</surname>
            ,
            <given-names>Carolyn E.</given-names>
          </string-name>
          <article-title>"Medical subject headings (MeSH)."</article-title>
          <source>Bulletin of the Medical Library Association 88.3</source>
          (
          <year>2000</year>
          ):
          <fpage>265</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Cartwright</surname>
            ,
            <given-names>Donna J</given-names>
          </string-name>
          .
          <article-title>"ICD-9-CM to ICD-10-CM codes: what? why? how?</article-title>
          .
          <source>"</source>
          (
          <year>2013</year>
          ):
          <fpage>588</fpage>
          -
          <lpage>592</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>