<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Semantic Analysis of Pathology Reports</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philip E. Whalen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aditya Trilok Muralidharan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan R. Kiddy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William D. Duncan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Bioinformatics &amp; Biostatistics, Roswell Park Comprehensive Cancer Center, Buffalo</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Figure 1: Architecture of the Document Content Ontology</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>7</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Pathology reports play an essential role in cancer treatment and research. They contain vital findings about a patient's cancer, such as cell histology and molecular markers, that are used to diagnose the type of cancer, determine treatment options, and enhance our understanding of the nature of the disease. At Roswell Park Comprehensive Cancer Center,<sup>1</sup> pathology reports are stored mainly as unstructured text in a relational database.<sup>2</sup> To find information within pathology reports, we use either string matching methods (e.g., regular expressions) or the TIES<sup>3</sup> natural language processing (NLP) program. The drawback of string matching is that every string variation must be accounted for in order to find information. For instance, a search for patients whose tumors lack estrogen receptor (ER) proteins must also match strings such as 'ER negative', 'estrogen receptor negative', 'hormone status negative', and the like.</p>
      </abstract>
      <kwd-group>
        <kwd>ontology</kwd>
        <kwd>natural language processing</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>pathology report</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>
        TIES addresses some of this variability by mapping
multiple strings to the same ontology class, but some searches
consistently perform poorly.<sup>4</sup> Moreover, a class identified
by TIES is not linked to the formal axioms that define the
class, which prevents researchers from fully leveraging the
formal relations that hold between classes within an ontology.
For example, the formal definition of Medullary Breast
Carcinoma (C17965<sup>5</sup>) in the NCI Thesaurus (NCIt) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
includes the axiom:
      </p>
      <p>To address these shortcomings, we are developing an
ontology that we currently call the ‘Document Content
Ontology’<sup>6</sup> (DCO) to represent the terms, words, word
contexts, and their positions (i.e., indexes) within the
documents. That is, we are using the DCO to represent the
content of the document and where the content is located. It is
important to make clear that while we are using an NLP
program for named entity recognition (described in what
follows), we are not developing an NLP program. Rather, we
are augmenting the output of the NLP program so that we can
more fully leverage the axioms contained within an ontology.</p>
      <p><sup>1</sup> https://www.roswellpark.org</p>
      <p><sup>2</sup> The database does contain some structured fields, but we find that most
researchers are interested in the information contained in the unstructured
text.</p>
      <p><sup>3</sup> http://ties.dbmi.pitt.edu</p>
      <p><sup>4</sup> TIES consistently fails to identify findings that the cells in the tumor lack
estrogen-receptor proteins.</p>
      <p><sup>5</sup> IRI: http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C17965</p>
      <p><sup>6</sup> The name of the ontology may change!</p>
      <p>A high-level summary of the DCO is illustrated in Figure
1. Documents contain (i.e., has part) terms. Terms, which
are composed of one or more words, have meanings that are
specified using the semantic type, semantic label, and semantic
source annotation properties to reference an ontology class.
The literal value data property associates the actual data (e.g.,
strings) with the term, and the polarity annotation represents
whether the term has a positive or negative connotation (e.g.,
negative when the patient does not have breast carcinoma). In some cases, we
also represent the word context: the group of words
surrounding the word or words that constitute a term. Word
contexts are useful in aiding NLP programs to disambiguate
the sense in which a word is being used. For brevity, not all
properties and classes are discussed here. Full details are available at
https://github.com/RoswellParkResearch/document-content-ontology.</p>
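      <p>The structure described above can be sketched as a pair of plain Python data classes. This is only an illustration under stated assumptions: the class and field names mirror the property names used in the text but are not the DCO's actual identifiers, and the example sentence is invented.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional

NCIT = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#"


@dataclass
class Term:
    """A term found in a document (field names are illustrative only)."""
    literal_value: str       # the actual string found in the text
    start: int               # index of the term's first character
    end: int                 # index one past the term's last character
    semantic_type: str       # IRI of the ontology class giving the meaning
    semantic_label: str      # human-readable label of that class
    semantic_source: str     # ontology the class comes from
    polarity: str = "positive"           # "positive" or "negative"
    word_context: Optional[str] = None   # surrounding words, if recorded

    @property
    def words(self) -> List[str]:
        # A term is composed of one or more words.
        return self.literal_value.split()


@dataclass
class Document:
    """A document has terms as parts."""
    doc_id: str
    text: str
    terms: List[Term] = field(default_factory=list)

    def add_term(self, start: int, end: int, **meaning) -> Term:
        # The literal value is read from the text at the recorded position.
        term = Term(literal_value=self.text[start:end], start=start, end=end, **meaning)
        self.terms.append(term)
        return term


# Invented example sentence; the polarity is assigned by hand here.
report = Document("report-1", "No evidence of ductal breast carcinoma.")
phrase = "ductal breast carcinoma"
start = report.text.index(phrase)
term = report.add_term(
    start,
    start + len(phrase),
    semantic_type=NCIT + "C4017",
    semantic_label="Ductal Breast Carcinoma",
    semantic_source="NCIt",
    polarity="negative",
)
print(term.literal_value, term.words, term.polarity)
```
]]></preformat>
      <p>Storing the start and end indexes alongside the literal value keeps both what was found and where it was found, which is the distinction the DCO is meant to capture.</p>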
      <p>We are aware that a number of other ontologies (such as the
Information Artifact Ontology and Semanticscience Integrated
Ontology) have terms similar to ours. However, these
ontologies carry with them metaphysical commitments, such as
a document being a type of generically dependent continuant.
Since we are just beginning to develop the DCO, we wish (for
now) to remain agnostic concerning such commitments.</p>
      <p>
        At present, we are using the DCO to structure output from
the Noble Coder Named Entity Recognition Engine [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Noble takes a
document as input and outputs a file containing information
about (named) entities identified within the document as well
as the associated ontology classes that specify the meanings of
the named entities. For example, if the Noble program
determines that some text within a document refers to ductal
breast carcinoma, Noble associates this text with the NCIt class
‘Ductal Breast Carcinoma’ (C4017<sup>7</sup>).
      </p>
      <p>We translate the output of Noble into OWL and load it,
along with the full ontology of the associated classes (each of which we
call a named entity’s semantic type), into a GraphDB<sup>8</sup> semantic
triple store. This allows a single query to retrieve both the
pathology reports that contain a specified named entity and the
other entities that the ontology relates to that named entity.</p>
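      <p>The benefit of loading the named entities and the ontology into one store can be sketched with a small in-memory set of triples. This is a minimal sketch, not the GraphDB setup itself: the property names <monospace>ex:hasPart</monospace> and <monospace>ex:semanticType</monospace>, the report and term identifiers, and the parent class <monospace>ex:BreastCarcinoma</monospace> are hypothetical placeholders, while the two NCIt codes are the ones mentioned in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

NCIT = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#"
SUBCLASS_OF = "rdfs:subClassOf"
HAS_PART = "ex:hasPart"            # hypothetical property name
SEMANTIC_TYPE = "ex:semanticType"  # hypothetical property name

triples: Set[Triple] = {
    # Toy class hierarchy; ex:BreastCarcinoma is a placeholder parent class.
    (NCIT + "C4017", SUBCLASS_OF, "ex:BreastCarcinoma"),
    (NCIT + "C17965", SUBCLASS_OF, "ex:BreastCarcinoma"),
    # Named entity output, already translated into triples.
    ("ex:report1", HAS_PART, "ex:term1"),
    ("ex:term1", SEMANTIC_TYPE, NCIT + "C4017"),
    ("ex:report2", HAS_PART, "ex:term2"),
    ("ex:term2", SEMANTIC_TYPE, NCIT + "C17965"),
    ("ex:report3", HAS_PART, "ex:term3"),
    ("ex:term3", SEMANTIC_TYPE, "ex:LungCarcinoma"),
}


def subclasses_of(cls: str, graph: Set[Triple]) -> Set[str]:
    """Return cls plus every class below it in the subclass hierarchy."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for s, p, o in graph:
            if p == SUBCLASS_OF and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def reports_mentioning(cls: str, graph: Set[Triple]) -> Set[str]:
    """Find documents having a term typed as cls or as any subclass of cls."""
    wanted = subclasses_of(cls, graph)
    terms = {s for s, p, o in graph if p == SEMANTIC_TYPE and o in wanted}
    return {s for s, p, o in graph if p == HAS_PART and o in terms}


print(sorted(reports_mentioning("ex:BreastCarcinoma", triples)))
```
]]></preformat>
      <p>A query for the parent class returns reports annotated with either of its subclasses, which plain string matching on the class label would miss. In a SPARQL endpoint, the same traversal is typically written with the property path <monospace>rdfs:subClassOf*</monospace>.</p>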
      <p>In addition to leveraging the ontology’s axioms, we can
also examine the word context surrounding a term. This is
useful, for instance, for addressing the aforementioned issue
of finding pathology reports in which the cells are found to
be ER negative.</p>
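      <p>One way to use a word context for the ER-negative case is sketched below. It is a deliberately naive illustration: the window size, the list of negation cues, and the example sentences are assumptions made for this sketch and do not describe how the DCO or Noble determine polarity.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import List

# Illustrative cue list only; a real system would need a richer lexicon.
NEGATION_CUES = {"negative", "neg", "absent", "no", "not", "lack", "lacks"}


def word_context(text: str, start: int, end: int, window: int = 3) -> List[str]:
    """Return up to `window` words on each side of the span text[start:end]."""
    before_words = re.findall(r"\w+", text[:start])
    after_words = re.findall(r"\w+", text[end:])
    before = before_words[max(0, len(before_words) - window):]
    after = after_words[:window]
    return before + after


def is_negated(text: str, start: int, end: int, window: int = 3) -> bool:
    """Report whether any negation cue occurs in the term's word context."""
    context = {w.lower() for w in word_context(text, start, end, window)}
    return bool(context & NEGATION_CUES)


phrase = "estrogen receptor"
negative_text = "Tumor cells are estrogen receptor negative and HER2 positive."
positive_text = "Tumor cells are estrogen receptor positive."

neg_start = negative_text.index(phrase)
pos_start = positive_text.index(phrase)

print(word_context(negative_text, neg_start, neg_start + len(phrase)))
print(is_negated(negative_text, neg_start, neg_start + len(phrase)))
print(is_negated(positive_text, pos_start, pos_start + len(phrase)))
```
]]></preformat>
      <p>Because the context is located by the term's indexes rather than by the term's spelling, the same check applies whichever surface string (for example 'ER' or 'estrogen receptor') was mapped to the class.</p>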
      <p><sup>7</sup> IRI: http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C4017</p>
      <p><sup>8</sup> http://graphdb.ontotext.com</p>
    </sec>
    <sec id="sec-2">
      <title>ACKNOWLEDGMENT</title>
      <p>We gratefully acknowledge support from the Roswell Park
Comprehensive Cancer Center Biomedical Data Science
Shared Resource, which is funded in part by Cancer Center
Support Grant NCI P30CA16056.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Wright</surname>
            <given-names>LW.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>40</volume>
          (
          <issue>1</issue>
          ):
          <fpage>30</fpage>
          -
          <lpage>43</lpage>
          . doi: 10.1016/j.jbi.2006.02.013. PMID: 16697710.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Savova</surname>
            <given-names>GK</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masanz</surname>
            <given-names>JJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogren</surname>
            <given-names>PV</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohn</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kipper-Schuler</surname>
            <given-names>KC</given-names>
          </string-name>
          , et al.
          <article-title>Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          .
          <year>2010</year>
          ;
          <volume>17</volume>
          (
          <issue>5</issue>
          ):
          <fpage>507</fpage>
          -
          <lpage>13</lpage>
          . doi: 10.1136/jamia.2009.001560.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ergin</given-names>
            <surname>Soysal</surname>
          </string-name>
          , Jingqi Wang, Min Jiang, Yonghui Wu, Serguei Pakhomov, Hongfang Liu, Hua Xu.
          <article-title>CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines</article-title>
          .
          <source>JAMIA</source>
          . doi: 10.1093/jamia/ocx132.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Tseytlin</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitchell</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Legowski</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrigan</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chavan</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jacobson</surname>
            <given-names>RS</given-names>
          </string-name>
          .
          <article-title>NOBLE - Flexible concept recognition for large-scale biomedical natural language processing</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <year>2016</year>
          Jan 14;
          <volume>17</volume>
          :
          <fpage>32</fpage>
          . doi: 10.1186/s12859-015-0871-y.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>