<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Semantic Data Mining</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer and Information Science</institution>
          ,
          <addr-line>University of Oregon, Eugene, OR, 97401</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Incorporating domain knowledge is one of the most challenging problems in data mining. The Semantic Web technologies are promising to offer solutions to formally capture and efficiently use the domain knowledge. We call data mining technologies powered by the Semantic Web, capable of systematically incorporating domain knowledge, the semantic data mining. In this paper, we identify the importance of semantic annotation-a crucial step towards realizing semantic data mining by bringing meaning to data, and propose a learning-based semantic search algorithm for annotating (semi-) structured data.</p>
      </abstract>
      <kwd-group>
        <kwd>Haishan Liu</kwd>
        <kwd>Ontology</kwd>
        <kwd>Data mining</kwd>
        <kwd>Semantic annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        It has been widely stated that one of the most important and challenging
problems in data mining is incorporation of domain knowledge. Fayyad et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
contended that the use of domain knowledge is important in all stages of the
knowledge discovery process. When both data and domain knowledge are
available, it is worthwhile to explore the fusion of them. In practice, however, users
of data mining systems are mostly encouraged to express domain knowledge in
a application-specific form, scope and granularity. The ad hoc manner of the
representation hinders efficient usage of the codified knowledge.
      </p>
      <p>
        At the same time, research in the area of the Semantic Web has led to quite
mature standards for modeling and codifying knowledge. Today, Semantic Web
ontologies become a key technology for intelligent knowledge processing,
providing a framework for sharing conceptual models about a domain. Moreover, the
large and continuously growing amount of interlinked Semantic Web data have
provided perfect applications for novel data mining methods. Such methods
focus on relations between objects in addition to features/attributes of objects [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
We propose to exploit the advances of the Semantic Web technologies to formally
represent domain knowledge including structured collection of prior information,
inference rules, knowledge enriched datasets etc, and thus develop frameworks
for systematic incorporation of domain knowledge in an intelligent data mining
environment. We call this technology the semantic data mining. Our ongoing
NEMO project has made a first step to formally representing knowledge in the
ERP domain [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and produced considerable amount of data in RDF. How to
utilize the knowledge and linked data and perform efficient mining becomes the
major research question that motivates our research on semantic data mining.
      </p>
      <p>The majority of data underpinning a wide spectrum of data mining
applications are stored in structured sources such as relational databases (RDB)
with their proven track record of scalability and reliability, or in semi-structured
sources such as spreadsheets with their advantage of low maintenance and cheaper
overheads. Hence the problem of how to impart knowledge encoded in Semantic
Web ontologies to (semi-)structured data becomes a major challenge in realizing
the semantic data mining. Semantic annotation aims at addressing this
challenge by assigning semantic descriptions to elements of data. To ease the burden
of common users that are not familiar with the Semantic Web, we propose, in
this paper, a learning-based semantic search algorithm to automatically suggest
appropriate semantic descriptions for annotation.</p>
      <p>In the next section, we first examine the field of semantic annotation and
then propose the learning-based solution.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Automatic Annotation by Semantic Search</title>
      <p>
        Semantic Annotation aims at assigning to the basic element of information links
to formal semantic descriptions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Such elements should constitute the
semantics of their source. Semantic annotation is crucial in realizing semantic data
mining by bringing meanings to data. Annotating unstructured data (e.g., text)
has been studied more extensively than annotating (semi-)structured data due to
the proliferation of information extraction techniques that facilitate automatic
entity recognition from text. Since large amount of data for knowledge
discovery applications are stored in (semi-)structured sources, we focus on studying
annotating (semi-)structured data in this paper.
      </p>
      <p>The annotation process can be generally divided into two steps. The first is to
establish mappings between existing Semantic Web terms and those need to be
annotated in data. The second step is to come up with a local ontological
structure constituting the semantic web terms to model the data. Most of previous
work in annotating (semi-)structured data focus on the second step. Some skip
the first step and bootstrap the ontological terms and structure from the local
data itself. For example, a number of systems that map data in RDB to RDF
format leverage a set of rules such as “table to class and column to predicate”.</p>
      <p>We argue that such syntactical translation alone without referencing to
existing semantic descriptions does not lend itself well to aiding semantic data mining.
The automatically constructed self-contained local ontology may be applicable
to describe a specific dataset but is most likely too rough to capture the full
domain semantics that is necessary to express meaningful domain knowledge.
Moreover, with the advent of the Semantic Web and pervasive connectivity, an
increasing number of ontologies has been made widely available for reuse. These
ontologies are created by thorough knowledge engineering process and should
serve as better models for annotation. However, on the other hand, the sheer
number of Semantic Web ontologies and lack of effective search functionality
can lead to a huge hidden barrier for common users. Choosing proper
Semantic Web ontologies and terms (classes and properties) requires familiarity with
appropriate ontologies and the terms they define. There is very few system that
is able to provide automatic suggestions. To solve this problem, we propose a
learning-based semantic search algorithm to suggest proper Semantic Web terms
and ontologies for annotation given semantically related words and general
domain and context information.
2.1</p>
      <p>Proposed Learning-based Semantic Search Algorithm
In order to suggest suitable Semantic Web terms and ontologies for users to
annotate their data, we propose a learning-based semantic search algorithm. We
first submit a list of terms appeared in the schema of (semi-)structured data to
our semantic search algorithm and then use the returned results for annotation.
In a fully automatic setting, the search algorithm is configured to return the
top1 hit; while in an interactive setting, the search algorithm returns ordered top-k
search results for users to decide. Previous semantic search algorithms leverage
a variety of measures, including lexical and structural similarities (see details
below) to rank Semantic Web documents according to how likely they can be
semantically matched to the search terms. However, using any single measure
alone may not be sufficient to achieve the optimal result. We propose to combine
various measures to a weighted feature-based search model, where the weights are
learned from training data. We believe the incorporation of learning techniques
will improve the semantic search result.</p>
      <p>
        Feature-based Semantic Search Model. Consider a set of ontologies O =
{O1...Om} returned as the search result for a specific search term. Let =
{ϕ1(Oi)...ϕm(Om)} be a vector of real-valued feature functions ϕ : O 7→ R that
compute rank indicating how ontologies should be ordered in the search result.
The one with the highest rank is the top-hit for a specific search. Let W =
{w1...wm} be a vector of real-valued weights associated with each feature. A
score is computed for each ontology Oi by taking the dot product of the features
and weights: τ (Oi, W) = · W . We can leverage a variety of ranking methods
proposed in the literature as the feature functions . For example, Alan et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
proposed four types of measurements to evaluate the ranks for ontologies given
a list of search terms; The Swoogle search engine [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] weighs different types of
links between Semantic Web data and rank them using link-based algorithms at
three levels of granularity: documents, terms and RDF graphs; Maedche et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
described a two level similarity measure considering both lexical and conceptual
comparisons, etc.
      </p>
      <p>Training Set. Our algorithm to determine the weight vector W requires a
training set of known top hit in the search result (chosen by human): T = {&lt;
O1, l1 &gt; ... &lt; On, ln &gt;} , where each set of ontology namespaces Oi = {O1...Ok}
is associated with label li ∈ {1...k}, indicating which of the ontologies should
be selected for annotating the specific term ti (i.e., Oli ∈ O is the true ontology
selected by human as the best choice for annotating the term). There are several
ways to estimate W from the training set T as described below.
Subgradient Descent. We can view the weight learning as maximum margin
structured learning problem. Given a training set and loss function, the learned
W should score each known top-hit result Oli higher than all other O by at least
L(Oli , O), where L is the loss function. Mathematically, this constraint is
∀i, O ∈ {Oi\Oli }, W · (Oli ) ≥ W · (O) + L(Oli , O) ,
where {Oi\Oli } is the set of possible ontologies returned by a specific search
query excluding the gold standard ontology Oli chosen by human. We can express
this constraint as the following convex program:
min
W; i 2
λ</p>
      <p>
        d
∥W∥2 + 1 ∑ ζi . s.t.∀i, O ∈ O, W · (Oli ) + ζi ≥ W · (O) + L(Oli , O) ,
d i=1
where λ is a regularization term that prevents overfitting. We can rearrange the
convex program to show that the optimal W minimizes
(1)
(2)
c(W) = 1 ∑d ri(W) + λ
d 2
i=1
∥W∥2 ,
where ri(w) = maxO2fOli g(W · (O) + L(Oli , O)) − W · (Oli ) . This objective
function is convex but nondifferentiable. We can therefore minimize it with
subgradient descent, an extension of gradient descent to nondifferentiable objective
functions. The subgradient of equation 1 can be computed iteratively [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] .
Logistic Regression. The second method is based on logistic regression
(sometimes called maximum entropy classification). We modify the traditional logistic
regression loss function to rank, rather than classify, instances. Let the binary
random variable Ci be 1 if and only if ontology Oi is the gold standard chosen
by human. Given W and , we can compute the probability of Ci as follows:
p (Ci = 1|O, W) = ∑Oj2O e (Oj;W)
where the score for ontology Oi is normalized by the scores for every other
ontologies. We can estimate W from the training set T by minimizing the negative
log-likelihood of the data given W:
      </p>
      <p>L (W, T ) = − ∑ log p (Cli |O, W) .</p>
      <p>
        Oi2T
We can find the setting of W that minimizes Equation 2 using limited-memory
BFGS, a gradient ascent method with a second-order approximation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we introduce semantic data mining, an area we envision emerging
as the solution to systematic incorporation of domain knowledge in data mining
with the help of the Semantic Web technologies. We also recognize that vast
amount of information stored in (semi-)structured sources is calling for
attention to develop innovative approaches to solve following challenges: 1) how to
impart knowledge encoded in ontologies into the (semi-) structured data, and 2)
exploration of more meaningful ways to utilize the knowledge. We believe
semantic annotation is the solution to the first challenge. In this paper, we propose a
learning-based semantic search algorithm to suggest appropriate Semantic Web
terms and ontologies.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>This work is supported by the NIH funded NEMO project (Grant No. R01EB007684
NIH/NIBIB). I am grateful to Dr. Dejing Dou and Dr. Daniel Lowd for helpful
discussion and comments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>H.</given-names>
            <surname>Alani</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Brewster</surname>
          </string-name>
          .
          <article-title>Ontology ranking based on the analysis of concept structures</article-title>
          .
          <source>In K-CAP '05: Proceedings of the 3rd international conference on Knowledge capture</source>
          , pages
          <fpage>51</fpage>
          -
          <lpage>58</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Reddivari</surname>
          </string-name>
          .
          <article-title>Search on the semantic web</article-title>
          .
          <source>IEEE Computer</source>
          ,
          <volume>10</volume>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Kolari</surname>
          </string-name>
          .
          <article-title>Finding and ranking knowledge on the semantic web</article-title>
          .
          <source>In In Proceedings of the 4th International Semantic Web Conference</source>
          , pages
          <fpage>156</fpage>
          -
          <lpage>170</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>U.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Piatetsky-shapiro, and</article-title>
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          .
          <article-title>From data mining to knowledge discovery in databases</article-title>
          .
          <source>AI Magazine</source>
          ,
          <volume>17</volume>
          :
          <fpage>37</fpage>
          -
          <lpage>54</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G.</given-names>
            <surname>Frishkoff</surname>
          </string-name>
          , P. LePendu, R. Frank, H. Liu, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Dou</surname>
          </string-name>
          .
          <article-title>Development of Neural Electromagnetic Ontologies (NEMO): Ontology-based Tools for Representation and Integration of Event-related Brain Potentials</article-title>
          .
          <source>In Proceedings of the International Conference on Biomedical Ontology (ICBO)</source>
          , pages
          <fpage>31</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C.</given-names>
            <surname>Kiefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Locher</surname>
          </string-name>
          .
          <article-title>Adding data mining support to sparql via statistical relational learning methods</article-title>
          . pages
          <fpage>478</fpage>
          -
          <lpage>492</lpage>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Kiryakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Popov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Terziev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Manov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Ognyanoff</surname>
          </string-name>
          .
          <article-title>Semantic annotation, indexing, and retrieval</article-title>
          .
          <source>J. Web Sem</source>
          .,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>79</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nocedal</surname>
          </string-name>
          , and D. C.
          <article-title>On the limited memory bfgs method for large scale optimization</article-title>
          .
          <source>Mathematical Programming</source>
          ,
          <volume>45</volume>
          :
          <fpage>503</fpage>
          -
          <lpage>528</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Maedche</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          .
          <article-title>Measuring similarity between ontologies</article-title>
          .
          <source>In EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>263</lpage>
          , London, UK,
          <year>2002</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Ratliff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bagnell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinkevich</surname>
          </string-name>
          .
          <article-title>Subgradient methods for structured prediction</article-title>
          .
          <source>In Eleventh International Conference on Arti cial Intelligence and Statistics (AIStats)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>