<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology-Based Text Mining of Concept Definitions in Biomedical Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saeed Hassanpour</string-name>
          <email>saeedhp@stanford.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amar K. Das</string-name>
          <email>amar.das@stanford.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stanford Center for Biomedical Informatics Research</institution>
          ,
          <addr-line>Stanford, CA 94305</addr-line>
          ,
          <country country="US">U.S.A</country>
        </aff>
      </contrib-group>
      <fpage>40</fpage>
      <lpage>45</lpage>
      <abstract>
        <p>Many developers of biomedical knowledge bases validate and update formalized knowledge based on reviews of full-text scientific articles, but finding text relevant to domain concepts can be tedious and error-prone. Prior methods have automated this process by matching term-based patterns within a single sentence. In our work developing a knowledge base of autism phenotypes, specified using Semantic Web standards, we are interested in finding multi-sentence sections of text that contain complex phenotype definitions. In this paper, we present a text-mining method that incorporates both ontology- and rule-based semantics to determine which sections are relevant. We evaluated our method by undertaking text extraction on the set of full-text articles used to create the knowledge base. We show that our method has higher precision and recall than a term-based approach in identifying definitions that contain complex patterns and occur across sentence boundaries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>Information Extraction, Text Analysis, Semantic Web, Ontology, Rule Base, OWL, SWRL</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>
        Biomedical knowledge resources, such as terminologies and ontologies, are important
for community-based annotation and sharing of data. Creating and maintaining these
resources is challenging given the rapid growth of scientific knowledge. Generally,
scientists, annotators and developers try to keep up by using search engines that find
publications relevant to given concepts in the knowledge resource. However, users
still need to review the publications and find sections within the documents that relate
to the concept being searched. One solution to this challenge is to automatically
identify the relevant parts of a full-text document. Prior methods, such as Textpresso
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], have focused on finding individual sentences that match the terms of biomedical
concepts and of properties that connect concepts. Such approaches do not find
sections of an article—including multiple sentences—that are semantically and
implicitly relevant to the definition of a concept. In our work, we present a novel text
mining method that retrieves the most semantically informative text in a document
using definitions of concepts modeled as rules in a domain ontology, and we compare
the precision and recall of our method against a term-based approach.
      </p>
      <p>
        Our work is motivated by the needs of developers of an ontology of autism
phenotypes [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. As part of these efforts, experts want to easily find text within a
publication that relates to the definition of a phenotype concept, both to find new
definitions of that concept and to annotate the document section as the relevant text to
the concept. For example, in a paper on autism genetics, Hus et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] define
Savant-positive and Savant-negative phenotype concepts as:
      </p>
      <p>The Savant Skills Factor was based on … current and ever scores of four ADI-R
items: visuospatial ability, memory skill, musical ability, and computational
ability. Item scores were summed and divided by total number of items to
generate a score between 0 and 1. … Participants were then divided into two
groups: Savant-positive and Savant-negative … .</p>
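      <p>As a sketch, the quoted score computation in Python; the item values and the group-assignment cutoff are hypothetical, since the excerpt does not state them:</p>

```python
# Sketch of the savant-score computation quoted above. The 0/1 item
# values and the 0.5 cutoff are illustrative assumptions; the excerpt
# does not state the actual group-assignment threshold.

def savant_score(item_scores):
    """Average the ADI-R item scores to a value between 0 and 1."""
    return sum(item_scores) / len(item_scores)

def savant_group(item_scores, cutoff=0.5):
    """Assign a participant to a group based on the averaged score."""
    if savant_score(item_scores) >= cutoff:
        return "Savant-positive"
    return "Savant-negative"

savant_score([1, 1, 0, 1])  # → 0.75
```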
      <p>
        The autism ontology uses the Web Ontology Language (OWL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to model concepts
and hierarchical relationships and the Semantic Web Rule Language (SWRL) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to
define phenotype concepts as value restrictions on data collected through standardized
instruments, such as the Autism Diagnostic Instrument-Revised (ADI-R).
      </p>
    </sec>
    <sec id="sec-3">
      <title>2 Related Work</title>
      <p>
        Finding text relevant to a search term is undertaken by some web search engines,
which provide a few lines of site description, or snippet, for each search result to indicate
the relevance of a web page to the search query. Google, for example, uses the
description provided by meta tags, references to the web pages, Open Directory [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
and the text around the query keywords on web pages to provide informative search
result descriptions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We argue that structured domain knowledge can likewise be used to
enhance the relevance of snippets to queries, by surfacing the most
semantically relevant parts of web page contents in the result snippets.
      </p>
      <p>
        Another related line of work is question-answering systems, which return a
part of a text from a corpus as the answer to a specified question. These techniques
rank the snippets from the relevant documents by criteria such as the presence of expected
types of named entities, the percentage of overlap with question terms, matches to
lexical patterns, and information from lexical dictionaries [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9-11</xref>
        ]. Other work
has tried to retrieve descriptive phrases from free text by using pattern matching,
word counting, and sentence location without using domain knowledge [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In our
work, we address the broader problem of extracting text that is semantically relevant
to domain concepts. Our approach leverages the structured and axiomatic forms of
knowledge in ontologies and rules, which contain richer semantic relationships than
lexical databases.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3 Methods</title>
      <p>In our work, we find the most relevant parts of science publications to domain
concepts using existing OWL ontologies and SWRL rules. As noted, both provide
formal definitions of domain concepts and their relationships to other concepts.</p>
      <sec id="sec-4-1">
        <title>3.1 Semantic Concept Modeling</title>
        <p>
          As the first step, we need a formal representation of domain concepts. In this work,
we use vector space modeling, a common method in web search engines for
indexing web pages [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and a structured knowledge base as the basis of the concept
modeling. The concepts in the knowledge base may be formally defined in the logical
form of SWRL rules and saved as part of an OWL ontology, as in the case of the
autism ontology. We thus consider a rule's components as relevant concepts and
incorporate them in our modeling for a better representation of the main concept.
Therefore, we have one dimension for each ontology class and property mentioned in
the rule.
        </p>
        <p>Besides the classes and properties mentioned in the rule, we use the ontology
hierarchy to extract more related concepts and incorporate them in the concept
representation. We consider the parents and grandparents of the main concept, and of the
related concepts extracted from the corresponding rule, as potentially related concepts
that can strengthen our concept vector model. However, the relevance of these
concepts from the ontology hierarchy decreases with their distance from the main
concept in the hierarchy graph. Therefore, we weight these related terms in the vector
representation less than the main class and the related concepts explicitly mentioned in
the rule that defines the concept. As a heuristic choice to capture these differences,
we count the frequencies of parent classes or properties as half of their actual
frequencies, and the frequencies of grandparent classes or properties as one-quarter of
their actual frequencies.</p>
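        <p>A minimal sketch of the concept-vector construction described in this subsection; the ontology fragment and the class and property names (SavantSkill, hasScore, and so on) are hypothetical:</p>

```python
# Sketch of concept-vector construction (Sec. 3.1). Terms from the SWRL
# rule get full weight; their ontology parents get half weight and
# grandparents one-quarter weight, per the heuristic in the text.

def build_concept_vector(rule_terms, parent_of):
    """rule_terms: classes/properties mentioned in the rule.
    parent_of:  dict mapping each class/property to its hierarchy parent."""
    vector = {}
    for term in rule_terms:
        vector[term] = vector.get(term, 0.0) + 1.0           # rule term: full weight
        parent = parent_of.get(term)
        if parent:
            vector[parent] = vector.get(parent, 0.0) + 0.5   # parent: half weight
            grandparent = parent_of.get(parent)
            if grandparent:
                vector[grandparent] = vector.get(grandparent, 0.0) + 0.25

    return vector

# Hypothetical ontology fragment:
hierarchy = {"SavantSkill": "CognitivePhenotype",
             "CognitivePhenotype": "Phenotype"}
vec = build_concept_vector(["SavantSkill", "hasScore"], hierarchy)
# vec["SavantSkill"] == 1.0, vec["CognitivePhenotype"] == 0.5, vec["Phenotype"] == 0.25
```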
      </sec>
      <sec id="sec-4-2">
        <title>3.2 Relevant Text Finding</title>
        <p>
          After we model the concept, we go through a publication to find the most relevant
parts of the text for a particular concept. As the first step, we look at the vector
representation of the concept and find all the terms associated with that concept, the
concept terms. Concept terms are the terms that have weights greater than zero in
the concept vector representation. We then go through the publication and mark all
occurrences of the concept terms in the text. We cover occurrences of different forms
of a concept term by applying the Porter stemming algorithm, a common stemming
method for English terms [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], on both concept terms and publication terms.
        </p>
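        <p>The occurrence-marking step can be sketched as follows; the paper uses the Porter stemmer, but to keep the sketch self-contained we substitute a crude suffix-stripper, which is an assumption, not the actual algorithm:</p>

```python
# Marking concept-term occurrences in a publication (Sec. 3.2). The
# crude_stem function is a deliberately simple stand-in for the Porter
# stemmer, applied to both concept terms and publication terms.

def crude_stem(word):
    """Strip a few common English suffixes (rough Porter substitute)."""
    word = word.lower()
    for suffix in ("ational", "ation", "ities", "ity", "ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def mark_occurrences(text, concept_terms):
    """Return word positions where a (stemmed) concept term occurs."""
    stems = {crude_stem(t) for t in concept_terms}
    positions = []
    for i, token in enumerate(text.split()):
        if crude_stem(token.strip(".,;:")) in stems:
            positions.append(i)
    return positions

pos = mark_occurrences("Savant skills include musical abilities and memory skill.",
                       ["skill", "ability"])
# → [1, 4, 7]: "skills", "abilities", and "skill" all match after stemming
```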
        <p>Given the occurrences of concept terms in a publication, we treat them as
indicators of relevant parts of the text and use single-linkage hierarchical clustering to
find the candidates for the most relevant parts of the publication. The average
sentence length in our corpus is 20 words. In the single-linkage clustering, we use 30
words as a heuristic threshold: in every step, we merge the closest clusters that are
separated by less than 30 words. Thus, we ensure that a continuous section of text
without any concept term is limited to a few sentences and the whole cluster is
continuously correlated to the concept. We consider these clusters as the candidates
for the most relevant parts of the text.</p>
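        <p>On word positions along a single document, single-linkage clustering with a 30-word threshold reduces to a sequential merge, sketched here:</p>

```python
# Single-linkage clustering of concept-term positions (Sec. 3.2):
# occurrences separated by fewer than 30 words are merged into one
# candidate cluster. Positions are word indices into the article, so
# a left-to-right greedy merge is equivalent to single linkage in 1-D.

def cluster_positions(positions, gap=30):
    """Group sorted word positions into clusters of nearby occurrences."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < gap:
            clusters[-1].append(pos)   # close enough: merge into current cluster
        else:
            clusters.append([pos])     # too far: start a new cluster
    return clusters

cluster_positions([5, 12, 31, 120, 140, 500])
# → [[5, 12, 31], [120, 140], [500]]
```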
      </sec>
      <sec id="sec-4-3">
        <title>3.3 Text Modeling and Correlation Computation</title>
        <p>
          In this work, our goal is to quantify the relevance between concepts and pieces of text.
Therefore, we need a mathematical model of the text. We use vector space modeling
again to provide a common basis for comparison. Vector space modeling for
documents’ text is based on term frequencies. To model a part of a text as a vector,
we first remove the stop words, the most common English words that are not
informative about the context. We use a common list of stop words in English [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Then we apply the Porter stemming algorithm to replace different derivations of a word
with their root. Finally, we build a vector with one dimension for each term in the text
and assign the frequency of that term in the text as the value of that dimension in the
vector.
        </p>
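        <p>The text-modeling steps above can be sketched as follows; the stop-word list here is a tiny illustrative subset of the cited list, and stemming is reduced to lowercasing for brevity:</p>

```python
# Building a term-frequency vector for a span of text (Sec. 3.3):
# remove stop words, normalize tokens, then count. The STOP_WORDS set
# is a small illustrative sample, not the full list used in the paper.

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "were"}

def text_vector(text):
    """Map each remaining term to its frequency in the text."""
    vector = {}
    for token in text.lower().split():
        token = token.strip(".,;:()")
        if token and token not in STOP_WORDS:
            vector[token] = vector.get(token, 0) + 1
    return vector

v = text_vector("Item scores were summed and divided by the total number of items.")
# e.g. v["scores"] == 1, and stop words like "the" are absent
```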
      <p>After we represent both texts and domain concepts as vectors, we need to
compute the correlations between them in order to find the most relevant parts of a
publication for a concept. To do that, we use cosine similarity as the measure of
correlation between texts and concepts. The cosine similarity for two vectors is the
cosine of the angle between them. Similarity values range from 0 for orthogonal
vectors to 1 for parallel vectors.</p>
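        <p>Cosine similarity over the sparse term vectors, as a minimal sketch:</p>

```python
# Cosine similarity between a concept vector and a text vector
# (Sec. 3.3), with vectors stored as sparse {term: weight} dicts.
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0 if either is empty)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

cosine_similarity({"savant": 1.0, "skill": 1.0}, {"savant": 2.0, "skill": 2.0})  # parallel → 1.0
cosine_similarity({"savant": 1.0}, {"memory": 1.0})                              # orthogonal → 0.0
```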
      </sec>
      <sec id="sec-4-4">
        <title>3.4 Evaluation Strategy</title>
        <p>In this work, we applied our method on the autism phenotype ontology and the papers
used to derive those concepts as mentioned in Section 1. We examined only the top
five most relevant parts of the publication for each concept and had an autism
ontology expert review these text sections to determine whether each section was
related to the definition of the concept. To investigate
the significance of using ontological hierarchies and rule bases, we compared our
method to a baseline, which is a term-only method. The baseline method is a variation
of our method that only uses the terms in the semantic concept-modeling step. That is,
our baseline approach does not include concepts from the ontology or rules that are
related to the term. To eliminate bias in the assessment of the performance of the two
approaches, the expert was blind to which method produced the extracted text.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Results</title>
      <p>
        The autism ontology contains 1726 classes and properties, and it includes 156
SWRL rules that correspond to 145 phenotype definitions. The ontology and rules
were based on a review of 26 publications that had been undertaken by one of the
authors (AKD) and other domain experts in autism [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For this study, we selected 49
domain concepts that had rules using multiple criteria to define a phenotype (such as
the example concept of Savant positive given in Section 1). We excluded phenotype
definitions where the concept directly corresponded to the value of a single item on a
clinical assessment. We applied both our ontology-based text extraction method and
the term-only method on each of the 49 concepts, and we returned the top five most
relevant parts of the publication for review by the domain expert. Altogether 338
sections of text were reviewed and evaluated by the autism ontology expert as to
whether they were relevant to the corresponding phenotype concept. Table 1 shows
the precision of our ontology-based method and the term-based method—that is,
the percentage of returned sections that refer to the concepts.
      </p>
      <p>In our evaluation strategy, we knew that every concept had been defined in the
corresponding publication. For further investigation of the relevance strength in our
results, we asked the reviewer to identify which of the five most relevant parts of the
publications for a concept contained a clear definition. We used this to calculate the
recall for each method, which is the percentage of concepts whose definitions were
found. Table 2 shows the recall of the term-based and ontology-based methods in
finding the definitions of the concepts in the corresponding publication text.</p>
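      <p>The precision and recall measures used in this evaluation, as a sketch; the judgment values below are illustrative, not the study's actual data:</p>

```python
# Evaluation measures from Sec. 4, computed over the expert's
# relevance judgments. The flag lists below are hypothetical examples,
# not the paper's reported results.

def precision(relevant_flags):
    """Fraction of returned sections judged relevant to the concept."""
    return sum(relevant_flags) / len(relevant_flags)

def recall(definition_found_flags):
    """Fraction of concepts whose definition was found in the top five."""
    return sum(definition_found_flags) / len(definition_found_flags)

# Hypothetical judgments: four returned sections, and 49 concepts of
# which 40 had a clear definition located.
p = precision([True, True, False, True])   # → 0.75
r = recall([True] * 40 + [False] * 9)      # → 40/49
```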
    </sec>
    <sec id="sec-6">
      <title>5 Discussion</title>
      <p>In this paper, we present a novel method to find parts of text in scientific publications
that relate to definitions of biomedical concepts. In comparison to methods that do
term matching to find individual sentences that contain a single concept or pairwise
sets of concepts, our ontology-based approach addresses the challenge of finding a
concept definition that occurs across multiple sentences or that is semantically similar
to predefined concepts. Our approach was particularly driven by the need to identify
text related to complex domain concepts like autism phenotypes, in which
different terms and terminologies refer to similar concepts. Our evaluation shows that
ontology hierarchies and rules have a large impact on identifying the relevant parts of
the text. This is because of the informative nature of ontological hierarchies and the
inter-relationship of concepts maintained in rule bases.</p>
      <p>As future work, we are planning to improve upon our method by using the text’s
syntactic structures through constituent and dependency parsing methods. The
syntactic and dependency information can be used in the text modeling to improve the
concept relevance detection. We will also consider adding named entity
recognition methods, which can extract information about biomedical concepts
that appear in the text but lie outside the ontologies. We plan to use this
information to develop a richer representation of the text and find
relationships between the publication text and the
queried biomedical concept.</p>
      <p>Acknowledgments. The authors would like to acknowledge Martin O’Connor and
Siddharth Taduri for their comments on the approach. This research was supported in
part by grant R01 MH87756 from the National Institutes of Health.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Muller</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kenny</surname>
            ,
            <given-names>E.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sternberg</surname>
            ,
            <given-names>P.W.</given-names>
          </string-name>
          :
          <article-title>Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature</article-title>
          .
          <source>PLoS Biol</source>
          .
          <volume>2</volume>
          (
          <issue>11</issue>
          ):e309. doi:10.1371/journal.pbio.0020309 (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tennakoon</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vismer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Astakhov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grethe</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martone</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAuliffe</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <source>Ontology Driven Data Integration for Autism Research. 22nd IEEE International Symposium on Computer Based Medical Systems</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          , Albuquerque, NM
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tennakoon</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Using an Integrated Ontology and Information Model for Querying and Reasoning about Phenotypes: The Case of Autism</article-title>
          .
          <source>AMIA Annual Symposium</source>
          , pp.
          <fpage>727</fpage>
          -
          <lpage>731</lpage>
          , Washington, DC (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hus</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pickles</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cook</surname>
            ,
            <given-names>E.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Risi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lord</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Using the Autism Diagnostic Interview-Revised to Increase Phenotypic Homogeneity in Genetic Studies of Autism</article-title>
          .
          <source>Biol Psychiatry</source>
          .
          <volume>61</volume>
          (
          <issue>4</issue>
          ),
          <fpage>438</fpage>
          -
          <lpage>448</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>OWL Web Ontology Language Overview</article-title>
          . W3C Recommendation, &lt;http://www.w3.org/TR/2004/REC-owl-features-20040210/&gt; (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boley</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tabet</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grosof</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>SWRL: A Semantic Web Rule Language Combining OWL and RuleML</article-title>
          . &lt;http://www.w3.org/Submission/SWRL/&gt; (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Open Directory Project, http://www.dmoz.org/</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Google support on snippets, http://www.google.com/support/webmasters/bin/answer.py?hl=en&amp;answer=35624
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>W.S.</given-names>
          </string-name>
          :
          <article-title>Fact Retrieval and Deductive Question-Answering Information Retrieval Systems</article-title>
          .
          <source>J. ACM</source>
          .
          <volume>11</volume>
          (
          <issue>2</issue>
          ),
          <fpage>117</fpage>
          -
          <lpage>137</lpage>
          (
          <year>1964</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Miliaraki</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Learning to Identify Single-snippet Answers to Definition Questions</article-title>
          .
          <source>20th International Conference on Computational Linguistics</source>
          , Geneva, Switzerland (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prager</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Ranking Suspected Answers to Natural Language Questions Using Predictive Annotation</article-title>
          .
          <source>6th Conference on Applied Natural Language Processing</source>
          , pp.
          <fpage>150</fpage>
          -
          <lpage>157</lpage>
          , Seattle, WA (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <source>Retrieving Descriptive Phrases from Large Amounts of Free Text. 9th ACM Conference on Information and Knowledge Management</source>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>186</lpage>
          , McLean, VA
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          :
          <article-title>A Vector Space Model for Automatic Indexing</article-title>
          .
          <source>Commun ACM</source>
          .
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Porter stemmer, http://tartarus.org/~martin/PorterStemmer</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <article-title>List of English stop words</article-title>
          , http://members.unine.ch/jacques.savoy/clef
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>