<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-Automated Data-Driven Methods to Support Ontology Development</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Systems, Umm Al-Qura University</institution>
          ,
          <country country="SA">Saudi Arabia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Neuroscience, Newcastle University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computing, Newcastle University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Ontology development is expensive and requires significant effort from both domain experts and ontologists. Automating the process usually produces unsatisfactory results and involves knowledge acquisition, which is intrinsically hard. In this abstract, we investigate semi-automated techniques for bootstrapping and supporting data-driven ontology development. Rehabilitation therapies are hard to describe, measure and compare; unlike pharmacologic therapies, they are not precisely defined. This poses an interesting ontological challenge, because rehabilitation treatments are practice-based, diverse and involve interactions between a therapist, a patient and their environment. We therefore use the domain of rehabilitation as a case study and build a rehabilitation therapy ontology (RTO). We propose a pipeline for building semantic knowledge structures that support the development of ontologies from biomedical literature. The pipeline starts with a small initial set of articles provided by experts in the domain. This requires relatively little from the domain expert beyond a set of references to appropriate papers, something that most researchers will already have through their normal bibliography management facilities.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The initial set of articles does not cover the whole domain; we therefore expand
it to a corpus of PubMed records that are relevant to, and cover the scope of, the
initial set. To do so, we use PubMed's live similar articles functionality together with our
relative similarity measure [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which retrieves articles related to the initial set as a whole.
In our case study, this technique expanded an initial set of 200 references,
provided by two experts in the domain of rehabilitation, to around 28,000
references.
      </p>
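      <p>The idea of keeping only articles that relate to the seed set as a whole can be illustrated with a minimal sketch. It is an assumed stand-in, not the relative similarity measure of [1]: it simply counts how many distinct seed articles list a candidate as similar. The function name expand_corpus, the similar_to mapping and the min_support cut-off are all hypothetical, and the similar-article lists are assumed to have been fetched beforehand.</p>
      <preformat>
```python
from collections import Counter


def expand_corpus(seed_ids, similar_to, min_support=2):
    """Rank candidate IDs by how many seed articles list them as similar.

    seed_ids: article IDs supplied by the domain experts.
    similar_to: dict mapping a seed ID to a list of similar-article IDs
        (assumed to be fetched beforehand from a similar-articles lookup).
    min_support: hypothetical cut-off; keep candidates that are linked
        to at least this many distinct seed articles.
    """
    seeds = set(seed_ids)
    support = Counter()
    for seed in seeds:
        # Count each candidate once per seed and skip the seeds themselves.
        for candidate in set(similar_to.get(seed, [])) - seeds:
            support[candidate] += 1
    kept = [(cid, n) for cid, n in support.items() if n >= min_support]
    # Highest support first; ties are broken by ID for a stable order.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```
      </preformat>
      <p>Raising min_support favours articles that sit close to the centre of the seed set, while lowering it widens the corpus at the cost of more marginal records.</p>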
      <p>The full texts of the identified records in the corpus are then retrieved and passed
through several text pre-processing and cleaning steps. For phrase detection,
we then apply word2phrase, which is based on word co-occurrences. The words
and phrases in the text are the terms of the corpus, but they are not necessarily
representative of the domain. To determine semantically meaningful, domain-related
representative terminology, we apply the term frequency-inverse document
frequency (tf-idf) technique. The result is a list of terms and phrases
ranked by how well they represent the domain. Domain experts can then
freely choose a threshold on the tf-idf scores to identify and extract the top-ranked
representative terms.</p>
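      <p>The ranking and thresholding step can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: documents are assumed to be already tokenised, with detected phrases joined by underscores, and each term is scored by its highest tf-idf value over all documents, which is only one of several possible ways to aggregate. The names rank_terms and top_terms are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def rank_terms(documents):
    """Rank terms by their highest tf-idf score over tokenised documents."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    best = {}
    for tokens in documents:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                 # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # 0 for ubiquitous terms
            best[term] = max(best.get(term, 0.0), tf * idf)
    # Highest score first; ties are broken alphabetically.
    return sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))


def top_terms(ranked, threshold):
    """Keep the terms whose score reaches the expert-chosen threshold."""
    return [term for term, score in ranked if score >= threshold]
```
      </preformat>
      <p>Terms that occur in every document receive a score of zero, so common words fall to the bottom of the list and are removed by any positive threshold.</p>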
      <p>
        On its own, the list of extracted terms represents neither the semantics of the terms
nor the relationships amongst them. We therefore develop a semantic knowledge
structure that represents both. To develop the knowledge structure, we use
the list of extracted terms, their word embeddings from a trained word2vec [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
model, and a Directed Acyclic Graph (DAG) based on their lexical similarities,
i.e. string-substring relationships. Semantic "subclass" relationships were found
amongst the terms using the word2vec analogy technique and were then confirmed
via the lexical DAG. The result is a taxonomy-like knowledge structure based
on word2vec semantic relationships. To add relationships
other than "subclass" to the structure, we can modify the word2vec
analogy questions.
      </p>
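      <p>A minimal sketch of how an analogy answer might be checked against the lexical DAG is given below. It assumes the common vector-offset form of the analogy question ("child is to parent as query is to ?") and uses a plain dictionary of toy vectors in place of a trained word2vec model. The function names and the example terms are hypothetical, and the exact questions and confirmation rule used in the pipeline may differ.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, child, parent, query):
    """Answer 'child is to parent as query is to ?' by vector offset."""
    target = [p - c + q for c, p, q in
              zip(vectors[child], vectors[parent], vectors[query])]
    # The three terms of the question are excluded from the candidates.
    candidates = [t for t in vectors if t not in (child, parent, query)]
    return max(candidates, key=lambda t: cosine(vectors[t], target))


def lexical_edges(terms):
    """Edges (longer, shorter) where shorter is a proper substring of longer.

    A proper substring is strictly shorter, so the edges cannot form a cycle.
    """
    return {(a, b) for a in terms for b in terms if a != b and b in a}


def confirmed_subclass(vectors, example, query):
    """Return the proposed parent of query, or None if the DAG disagrees."""
    child, parent = example
    guess = analogy(vectors, child, parent, query)
    return guess if (query, guess) in lexical_edges(vectors) else None
```
      </preformat>
      <p>With a known pair such as ("balance_training", "training") as the example, a query term like "speech_therapy" would be accepted as a subclass of "therapy" only if the analogy proposes "therapy" and the substring edge exists. Swapping the example pair changes the relationship that the question asks about.</p>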
      <p>
        We hope that domain experts and curators can use the final structure to bootstrap an ontology
rather than starting from scratch. This is similar to
scaffolding the mitochondrial disease ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; however, rather than taking
scaffolds from existing knowledge sources, here we have generated the scaffolds
with a data-driven method. These scaffolds are initially linked by easily discovered
semantic relations, and come with a "todo" list ranked by importance (i.e. the
ranked list of terms), so that curators can bootstrap the ontology in order.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Halawani</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forsyth</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lord</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A literature based approach to define the scope of biomedical ontologies: A case study on a rehabilitation therapy ontology</article-title>
          .
          <source>arXiv preprint arXiv:1709.09450</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>3111</volume>
          –
          <issue>3119</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Warrender</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lord</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Scaffolding the mitochondrial disease ontology from extant knowledge sources</article-title>
          .
          <source>arXiv preprint arXiv:1505.04114</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>