<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Hybrid Approach to Learn Description Logic based Biomedical Ontology from Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yue Ma</string-name>
          <email>mayue@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alifah Syamsiyah</string-name>
          <email>alifah.syamsiyah@stud-inf.unibz.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Theoretical Computer Science, Technische Universität Dresden</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Augmenting formal medical knowledge is straightforward neither manually nor automatically. However, this process can benefit from the rich information in narrative texts, such as scientific publications. Snomed-supervised relation extraction has been proposed as an approach for mining knowledge from texts in an unsupervised way. It can capture not only superclass/subclass relations but also existential restrictions, and hence produces more precise concept definitions. Based on this approach, the present work aims to develop a system that takes biomedical texts as input and outputs the corresponding EL++ concept definitions. Several extra features are introduced in the system, such as generating general concept inclusions (GCIs) and negative concept names. Moreover, the system allows users to trace the textual causes of a generated definition, and to give feedback (i.e., a correction of the definition) that the system uses to retrain its internal model, a mechanism for improving the system via interaction with domain experts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
Biomedicine is a discipline that involves a large number of terminologies, concepts, and
complex definitions that need to be modeled in a comprehensive knowledge base to be
shared and processed distributively and automatically. The National Library of Medicine
(NLM) has maintained the world’s largest biomedical library since 1836 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of the
medical terminologies preserved by NLM is the Systematized Nomenclature of Medicine
Clinical Terms (SNOMED CT). It is a comprehensive clinical vocabulary structured in a
well-defined form that has the lightweight Description Logic EL++ [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as the underlying
logic, which can support automatic checking of modeling consistency.
      </p>
      <p>
        However, creating, maintaining, and extending a formal ontology is an expensive
process [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In contrast, narrative texts, such as medical records, health news, and
scientific publications, contain rich information that is useful to augment a medical
knowledge base. In this paper, we propose a hybrid system that can generate EL TBoxes
from texts. It extends the formal definition candidates learned by the Snomed-supervised
relation extraction process [
        <xref ref-type="bibr" rid="ref3 ref4">4, 3</xref>
        ] with linguistic patterns to give a finer-grained translation
of the learned candidates. Besides generating a concept name hierarchy, which has been
widely studied, the system can also generate definitions with existential restrictions to
exploit the expressivity of EL. Moreover, the implemented Graphical User Interface
helps a user to visualize the flow of this framework, trace the textual sentences from
which a formal definition is generated, and give feedback to enhance the system
interactively. The implementation of the system is available at
https://github.com/alifahsyamsiyah/learningDL.
      </p>
      <p>[Figure 1: workflow of the system. Step labels recovered from the figure: Ontology Annotation, Concept Pair Extraction, Feature Extraction, Multiclass Classifying, Syntactic Parsing, Generate Axioms, Ontology, Concept Relationships, Relation Alignment, Building Probabilistic Model, User Verification.]</p>
      <p>
Our task is to generate EL definitions from textual sentences. For example, from the
sentence “Baritosis is a pneumoconiosis caused by barium dust”, it is desirable to have an
automatic way to generate the corresponding formal EL axiom (together with a confidence value),
as shown in the red frame of Figure 2. Moreover, to help users understand the origin of
a generated definition and give feedback, the system should be able to trace
the textual sources from which a definition is generated (implemented with the question
mark in our system), and allow users to correct automatically learned definitions (the “V”
mark in Figure 2).</p>
      <p>Below we describe our hybrid system that has two components, as shown in Figure 1:
one for extracting definition candidates by machine learning techniques, and the other
for formulating final definitions from the definition candidates by linguistic patterns.
The first part of the system is to generate definition candidates via the steps given in
the upper block of Figure 1. It again contains two components: learning a model from
the training data (training texts and ontology) and generating candidates from new texts.
Each step in the two components is described below.</p>
      <p>
        Common steps in processing training and test texts. The first is to recognize SNOMED CT
concept names in a given textual sentence, called ontology annotation in Figure 1.
In our implementation, this is done by invoking the tool Metamap [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since we are only
interested in the most specific and precise concepts, we filter the Metamap annotations
by keeping only those that refer to the head of a phrase and not to a verb. The other common
processing step for training and test sentences is to extract textual features for a pair of
concepts occurring in a sentence, called feature extraction. Currently, the system uses
classical lexical features of n-grams over both characters and words as in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Processing specific to training texts generates labelled training data
and learns a multi-class classification model for the predefined relations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]:
– Automatic generation of training data is realized by the step named Relation
Alignment, which matches a sentence annotated by Metamap with relationships
between concept names from the ontology: if a sentence contains a pair of concepts
that has a relation R according to the ontology, this sentence is considered a textual
representation of R and is thus labelled with R. Furthermore, we also consider the
inverse roles, which often appear in texts via active and passive sentences. Hence, if
there are n predefined relations, there will be 2n possible labels for a sentence.
– Building the probabilistic model means learning a probabilistic multi-class classification
model based on the textual features of the labelled sentences from the previous step. For
this, the current system uses the maximum entropy Stanford Classifier<sup>1</sup>.
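The relation alignment step above can be illustrated with a minimal sketch. It assumes the ontology is available as a set of (A, R, B) triples and that a sentence is given as the list of concept names annotated in it; the function names and the "-1" suffix for inverse labels are illustrative choices, not part of the described system.
<preformat>
```python
def align_relations(sentence_concepts, ontology_relations):
    """Label one annotated sentence with the relations its concept pairs have in the ontology.

    sentence_concepts: concept names found in the sentence, in order of appearance.
    ontology_relations: set of (A, R, B) triples taken from the ontology (assumed format).
    Returns a list of (first, label, second), where label is R or R + "-1" (inverse role).
    """
    # Index the ontology triples by their ordered concept pair.
    by_pair = {}
    for a, r, b in ontology_relations:
        by_pair.setdefault((a, b), []).append(r)

    labelled = []
    for i, first in enumerate(sentence_concepts):
        for second in sentence_concepts[i + 1:]:
            # The pair appears in the same order as in the ontology: label R.
            for r in by_pair.get((first, second), []):
                labelled.append((first, r, second))
            # The pair appears in reversed order: label with the inverse role.
            for r in by_pair.get((second, first), []):
                labelled.append((first, r + "-1", second))
    return labelled


def label_set(relations):
    """With n predefined relations there are 2n labels: each relation and its inverse."""
    return sorted(relations) + sorted(r + "-1" for r in relations)
```
</preformat>
Pairs with no relation in the ontology simply receive no label in this sketch.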
Processing specific to test texts extracts definition candidates from a new test
sentence. A definition candidate is a triple (A, R, B), where A and B are concept names and
R is a relation, meaning that A and B stand in relation R according to the test sentence.
– Concept pair extraction gets pairs of concepts from an annotated test sentence.
– Multiclass classification answers whether a pair of concept names has a
relation and, if so, which relation it is. This is done with the model
learned from the training data by the Stanford Classifier. A positive answer returned by the
classifier gives a definition candidate (A, R, B). Slightly abusing notation,
we also call ∃R.B a definition candidate for A.
      </p>
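      <p>As a rough illustration of the test-time steps described above (feature extraction, concept pair extraction, and classification into definition candidates), the following sketch may help. It assumes that n-gram features are computed over the whole lower-cased sentence and that the trained model is wrapped in a callable named classify; both are assumptions made for this example, and the real system relies on the Stanford Classifier rather than on the toy rule used in the tests.</p>
      <preformat>
```python
from itertools import combinations


def char_ngrams(text, n):
    """All contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All contiguous word n-grams of text (whitespace tokenization)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def extract_features(sentence, n_char=3, n_word=2):
    """Lexical features: character and word n-grams, prefixed to keep them apart."""
    lowered = sentence.lower()
    feats = ["c:" + g for g in char_ngrams(lowered, n_char)]
    feats += ["w:" + g for g in word_ngrams(lowered, n_word)]
    return feats


def definition_candidates(sentence, concepts, classify):
    """Return (A, R, B) triples for every concept pair the classifier accepts.

    classify(features, a, b) is a hypothetical wrapper around the trained model;
    it returns a relation label, or None when the pair has no relation.
    """
    features = extract_features(sentence)
    candidates = []
    for a, b in combinations(concepts, 2):
        relation = classify(features, a, b)
        if relation is not None:
            candidates.append((a, relation, b))
    return candidates
```
      </preformat>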
    </sec>
    <sec id="sec-2">
      <title>Pattern based Transformation of Definition Candidates</title>
      <p>
        Once we have the definition candidates, we first reverse each inverse role so that it
always appears as an active role. Next, unlike [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we distinguish two ways
to formalize a candidate: (1) as a subsumption (A ⊑ ∃R.B) or (2) as a conjunction (A ⊓
∃R.B). For example, the sentence “Baritosis is caused by barium dust” stands for
the subsumption Baritosis(disorder) ⊑ ∃Causative_agent.Barium_dust, whilst
“Chest pain from anxiety ...” corresponds to the conjunction Chestpain(disorder) ⊓
∃Causative_agent.Anxiety(disorder). To decide which transformation to apply to a
definition candidate, we follow the intuition observable from the above examples:
– A subsumption A ⊑ ∃R.B should be generated from a candidate (A, R, B) if A
and B are connected in the sentence in a subject-object relation, called S-form.
– A conjunction A ⊓ ∃R.B should be formed if A and B appear in a noun phrase
structure, called NP-form.
      </p>
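      <p>The transformation rules above can be sketched as follows. The sketch assumes that inverse roles are marked with a "-1" suffix and that the form ("S" or "NP") has already been determined; the function names are illustrative, and the axioms are rendered as plain strings for readability only.</p>
      <preformat>
```python
def normalize_inverse(candidate):
    """Rewrite (A, R-1, B) as (B, R, A) so that only active roles remain."""
    a, role, b = candidate
    if role.endswith("-1"):
        return (b, role[:-2], a)
    return candidate


def to_axiom(candidate, form):
    """Render a definition candidate as a subsumption (S-form) or conjunction (NP-form)."""
    a, role, b = normalize_inverse(candidate)
    restriction = "∃" + role + "." + b
    if form == "S":
        return a + " ⊑ " + restriction
    if form == "NP":
        return a + " ⊓ " + restriction
    raise ValueError("unsupported form: " + form)


# Usage with the two examples from the text.
s_axiom = to_axiom(("Baritosis(disorder)", "Causative_agent", "Barium_dust"), "S")
np_axiom = to_axiom(("Chestpain(disorder)", "Causative_agent", "Anxiety(disorder)"), "NP")
```
      </preformat>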
      <sec id="sec-2-1">
        <title>1 http://nlp.stanford.edu/software/classifier.shtml</title>
        <p>To implement this linguistic pattern based strategy, we use the Stanford Parser<sup>2</sup> to
get the syntactic parse tree of a test sentence. The S-form and NP-form are detected in
the following way: first, the phrases corresponding to A and B are recognized in the
sentence, and then the least common node of these two phrases is searched for in the
syntactic parse tree of the whole sentence. If the least common node has type S (resp.
NP)<sup>3</sup>, then A and B are in S-form (resp. NP-form). Otherwise, a parsing error is returned.</p>
        <p>Negation Concept Names. In natural language, a negative formulation is sometimes used to express
the opposite meaning. For example, the sentence “The disease from foot is not relative to
heart attack” will be translated to Disease(disorder) ⊓ ∃FS.Foot(body structure) ⊑
¬Heart_disease(disorder). This is achieved in the system based on negated atomic
concept names detectable by Metamap version 2013.</p>
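        <p>The least-common-node test used for S-form and NP-form detection can be sketched without any parser dependency. The sketch assumes a parse tree encoded as nested (label, children) tuples whose leaves are token strings, and phrases given as token lists; this encoding and the function names are assumptions for illustration and do not reflect the Stanford Parser API.</p>
        <preformat>
```python
def leaves(tree):
    """Tokens under a node, left to right."""
    _, children = tree
    out = []
    for child in children:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(leaves(child))
    return out


def contains_phrase(tokens, phrase):
    """True if phrase occurs as a contiguous token sequence in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def least_common_label(tree, phrase_a, phrase_b):
    """Label of the lowest node whose leaves cover both phrases, or None."""
    label, children = tree
    tokens = leaves(tree)
    if not (contains_phrase(tokens, phrase_a) and contains_phrase(tokens, phrase_b)):
        return None
    # Prefer a deeper node that still covers both phrases.
    for child in children:
        if not isinstance(child, str):
            lower = least_common_label(child, phrase_a, phrase_b)
            if lower is not None:
                return lower
    return label


def detect_form(tree, phrase_a, phrase_b):
    """Return "S" or "NP"; any other least common node is reported as a parsing error."""
    label = least_common_label(tree, phrase_a, phrase_b)
    if label in ("S", "NP"):
        return label
    raise ValueError("parsing error: least common node is " + str(label))


# Hand-written trees for the two example sentences (not actual parser output).
s_tree = ("S", [("NP", [("NN", ["Baritosis"])]),
                ("VP", [("VBZ", ["is"]),
                        ("VP", [("VBN", ["caused"]),
                                ("PP", [("IN", ["by"]),
                                        ("NP", [("NN", ["barium"]), ("NN", ["dust"])])])])])])
np_tree = ("NP", [("NP", [("NN", ["Chest"]), ("NN", ["pain"])]),
                  ("PP", [("IN", ["from"]), ("NP", [("NN", ["anxiety"])])])])
```
        </preformat>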
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Tracing Source Sentence and Classifier Model Retraining</title>
      <p>The system provides two extra functions, namely tracing to source sentences and
classifier model retraining. As shown in Figure 2, if a user clicks the “?” mark, the system
provides the sentences from which the formal definition was extracted. Note that the system
uses a machine learning approach to acquire definition candidates, which may be wrong.
Therefore, we provide a mechanism for the user to validate the answer by clicking the “V”
symbol and then giving the correct relation to link the two concept names. As shown in Figure
3, the user changes the role from the inverse of Finding Site (FS-1) to Causative
Agent (CA).</p>
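      <p>One possible way to feed such a correction back into training is sketched below. The text does not specify how feedback is stored or how retraining is triggered, so the data layout (a list of (sentence, pair, relation) examples) and both function names are assumptions made purely for illustration.</p>
      <preformat>
```python
def apply_feedback(training_data, sentence, pair, corrected_relation):
    """Return a new training set in which (sentence, pair) carries the corrected label.

    training_data: list of (sentence, (A, B), relation) examples (assumed layout).
    Any earlier label for the same sentence and pair is dropped before adding the correction.
    """
    updated = [ex for ex in training_data if (ex[0], ex[1]) != (sentence, pair)]
    updated.append((sentence, pair, corrected_relation))
    return updated


def retrain(train_fn, training_data):
    """Rebuild the model from scratch; train_fn stands in for the real training routine."""
    return train_fn(training_data)
```
      </preformat>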
      <sec id="sec-3-1">
        <title>2 http://nlp.stanford.edu/software/lex-parser.shtml</title>
        <p>3 “S” stands for sentence, and “NP” for noun phrase.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aronson</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lang</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          :
          <article-title>An overview of MetaMap: historical perspective and recent advances</article-title>
          .
          <source>JAMIA</source>
          <volume>17</volume>
          (
          <issue>3</issue>
          ) (
          <year>2010</year>
          )
          <fpage>229</fpage>
          -
          <lpage>236</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baader</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Pushing the EL envelope</article-title>
          .
          <source>In: Proceedings of IJCAI'05</source>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Distel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Concept adjustment for description logics</article-title>
          .
          <source>In: Proceedings of K-Cap'13</source>
          . (
          <year>2013</year>
          )
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Distel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Learning formal definitions for Snomed CT from text</article-title>
          .
          <source>In: Proceedings of AIME'13</source>
          . (
          <year>2013</year>
          )
          <fpage>73</fpage>
          -
          <lpage>77</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <collab>National Library of Medicine</collab>
          :
          <article-title>NLM overview</article-title>
          . http://www.nlm.nih.gov/about/index.html (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bürger</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hangl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wörgl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Ontocom: A reliable cost estimation method for ontology development projects</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>16</volume>
          (
          <issue>5</issue>
          ) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>