<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>XML Schema and Topic Map Ontology for Background Knowledge in Data Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomas Kliegr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojtech Svatek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milan Simunek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Stastny</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrej Hazucha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Economics, Prague, Dept. Information and Knowledge Engineering</institution>
          ,
          <addr-line>Nam. Winstona Churchilla 4, 130 67 Praha 3</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Background knowledge (sometimes referred to as domain knowledge) is extensively used in data mining for data pre-processing and for nugget-oriented data mining tasks: it is essential for constraining the search space and pruning the results. Despite the costs of eliciting background knowledge from domain experts, there has so far been little effort to devise a common exchange standard for its representation. This paper proposes the Background Knowledge Exchange Format (BKEF), a lightweight XML Schema for storing information on features and patterns, and the Background Knowledge Ontology (BKOn) as its semantic abstraction. The purpose of BKOn is to allow reasoning over analysed data and its integration with existing domain ontologies. We show an elicitation interface producing BKEF and discuss the possibilities for integration of such background knowledge with domain ontologies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Elicitation of knowledge from experts has long been known as a crucial research
topic in the field of expert systems, and its importance is now rising
in data mining applications, too. Background knowledge (sometimes referred to as
domain knowledge) is extensively used in preprocessing of data for most mining
algorithms. It has special importance in association rule mining, where it is used
to separate the nuggets from rules conveying uninteresting information.</p>
      <p>Despite the potential of expert-provided background knowledge for
improving the quality of data mining results, there has so far been little research effort
on selecting the pieces of information that should be collected, and little
standardization effort on devising a common format for the representation of background
knowledge. This paper presents one of the first attempts to address these
problems by introducing the Background Knowledge Exchange Format (BKEF) XML
Schema. Simultaneously, to allow reasoning and integration of analysed data with
existing domain ontologies, we propose a semantic abstraction over BKEF: the
Background Knowledge Ontology (BKOn).</p>
      <p>This paper is organized as follows. Section 2 gives an account of the proposed
design objectives of a background knowledge specification. Section 3 introduces
its elementary building blocks and Section 4 gives an account of the specifics of
association rules. The proposed BK specification, consisting of the BKEF XML Schema
and the BKOn ontology, is described in Sections 5 and 6 respectively. The new
possibilities that BKEF and BKOn open in the areas of automating data
mining tasks and result postprocessing are sketched in Section 7. The conclusion
presents an outlook for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Design Objectives</title>
      <p>The work presented here reacts to the pressing need for an industry standard
that would provide a common way of conveying pieces of background knowledge
that express expertise related to features and patterns relevant to datasets in a
given domain. Hence, although in the common case the knowledge acquisition is
driven by the need for knowledge pertaining to a specific mining task and a specific
dataset, the standard should impose principles that foster reuse of
the knowledge in a different task-dataset scenario. While the work presented
here has an experimental character, it follows some of the design guidelines that,
we believe, should be addressed by any serious attempt at an industry standard
specification.</p>
      <p>We will use the term background knowledge producer to denote a computer
program, such as a specialized elicitation interface, used by the domain expert
to input his/her background knowledge related to the data mining task.</p>
      <p>The background knowledge consumer, in turn, denotes a computer program
that uses background knowledge (BK). We consider the following types of BK
consumers: data preprocessing algorithms, data mining algorithms,
postprocessing algorithms and semantic knowledge bases.</p>
      <sec id="sec-2-1">
        <title>One size does not fit all</title>
        <p>The standard should consist of an XML Schema and an ontology to
accommodate the different needs of background knowledge producers and
consumers.</p>
        <p>
          It may seem natural to define the specification in a language
whose expressivity is at least that required by the most
demanding consumer type, which is the semantic knowledge base. The semantic
knowledge base [11] interlinks mining models, background knowledge and
domain ontologies, and as such it would take advantage of background knowledge
coming directly in a semantic format such as RDF/OWL [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] or the Topic
Maps' XTM [7]. However, there are reasons for not using a semantic format as
the primary standard used by data mining and knowledge elicitation software.
The main ones include:
          <list list-type="bullet">
            <list-item><p>poor readability due to structural complexity,</p></list-item>
            <list-item><p>verbosity,</p></list-item>
            <list-item><p>the need for specialized, not widely available APIs.</p></list-item>
          </list>
        </p>
        <p>Therefore, we propose using an XML Schema as an interchange format
between background knowledge consumers and background knowledge producers.
To foster interoperability on the semantic level, the specification should also
define a semantic version of the XML Schema (an ontology) and a
transformation between the schema and the ontology. This transformation is to be executed
on the side of the BK consumer.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Background Knowledge Consumer Requirements</title>
        <p>The primary goal of the specification is to provide pieces of information that
background knowledge consumers can process automatically and thereby
enhance their functioning.</p>
        <p>An overview of the requirements that the individual
consumers place on the specification is given in Table 1. This table was constructed based on an analysis
of the requirements of the LISp-Miner mining suite<sup>1</sup> and the SEWEBAR framework<sup>2</sup>
as a Semantic KB for association rules, but the authors conjecture that the table
should be, with some changes, applicable to other mining tasks and algorithms.</p>
        <p>Requirements on storing information of types 1 to 3 inherently require
no semantics and can be met by the XML Schema specification. Since
one of the consumers of background knowledge is indisputably the human data
analyst, the specification should also allow the domain expert
to complement the machine-readable values with a free-text annotation.</p>
        <p>The requirements of the Semantic KB consumer type are addressed in
subsection 2.3. While closely linked to background knowledge and essential for the
Semantic KB, machine-readable annotations fall outside the scope of the
background knowledge specification.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Integration with Other Specifications</title>
        <p>The background knowledge specification discussed here has strong links with
PMML, the widely adopted standard for data mining model interchange<sup>3</sup>. The
proposed specification plays the same role for background knowledge as PMML
does for mining models. For background knowledge consumers to be able to
apply this knowledge together with knowledge gained from PMML, the need for
alignment with PMML arises.</p>
        <p><sup>1</sup> http://lispminer.vse.cz</p>
        <p><sup>2</sup> http://sewebar.vse.cz/</p>
        <p><sup>3</sup> http://dmg.org</p>
        <p>While one of the key design objectives is the independence of the BK specification
from a specific dataset/task scenario, the bond between the BK specification and a
concrete dataset or mining model should be established in a separate mapping
specification. Further, we briefly introduce an attempt at such a specification,
dubbed FML (Field Mapping Language).</p>
        <p>PMML is backed by an XML Schema, which eases the design of the mapping.
A more complex problem arises with the requirements imposed by the
Semantic KB consumer type. The purpose of Semantic KBs is to perform reasoning,
integration and search over the data. From this arises the necessity to annotate
the entities that emerged during the background knowledge elicitation process
(such as features, values and patterns) with an association to relevant concepts
in other ontologies or with unstructured sources. Since this annotation
information transcends the scope of a single dataset, we suggest supporting it with a
standalone specification (an XML Schema or an ontology) so that it is not a
direct part of BKEF, but is only linked with it. Since the only BKEF consumer
in our framework that has direct use for this kind of information is the Semantic
KB, a semantic format such as RDF/OWL could be more convenient for storing
the annotations than XML Schema. Additionally, this annotation can aid the
process of automatically mapping BKEF onto a specific dataset, resulting in an
FML specification.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Basic Concepts</title>
      <sec id="sec-3-1">
        <title>Metaattribute</title>
        <p>The basic building block of a background knowledge specification is a
metaattribute [14], which is an abstraction representing the underlying property of a
data field. Metaattributes form a hierarchical structure. A
metaattribute on the finest granularity level is referred to as an atomic metaattribute. Other
metaattributes are called group metaattributes.</p>
        <p>Since a property can sometimes be measured in different ways, most
commonly using different units, each metaattribute can have multiple formats. Actually,
most pieces of information relating to a metaattribute are format-dependent.
Specifically, a format can contain:
          <list list-type="bullet">
            <list-item><p>a value range,</p></list-item>
            <list-item><p>standard value binning(s),</p></list-item>
            <list-item><p>a collation.</p></list-item>
          </list>
        </p>
        <p>Since the specification is intended to be used in conjunction with a dataset,
where a data field always conforms to one metaattribute format, it is
advantageous to introduce a common term, Meta-field, for an atomic
metaattribute-format pair.</p>
        <p>Similarly, a Meta-field Value is an abstraction of a possible 'value' of a meta-field:
a value or interval falling within the scope given by the value range or by one of the
groupings.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Patterns</title>
        <p>Known relationships between metaattributes are captured using patterns. Since
a pattern often applies only to a specific format or involves a value, the notions
of meta-field and meta-field value are central to their definition.</p>
        <p>Patterns are intended to be used in conjunction with the data mining
algorithm, most commonly either in the algorithm itself or in the further
processing of results. As such, it is difficult to introduce a unified framework for
pattern representation that would be equally usable for all types of data mining
tasks and algorithms. Therefore the specification should propose suitable types
of patterns for the main data mining algorithms (such as classification, clustering
or association rule mining).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Background Knowledge for Association Rule Mining</title>
      <p>We introduce two types of patterns that were designed to aid association
mining algorithms; their prospective utilization for other types of mining
algorithms is a matter for further research. These two types are Mutual Influences
and Background Knowledge Association Rules.</p>
      <p>A Background Association Rule (BAR) has the form of
        <disp-formula>φ ≈ ψ / χ</disp-formula>
      </p>
      <p>Here the Antecedent φ, Consequent ψ and Condition χ are Boolean
Metaattributes and ≈ is a type of 4ft-quantifier. The quantifier can optionally be accompanied
by explicit value(s) of the Interest Measures associated with it. The BAR is
Conditional if the Condition is present.</p>
      <p>A 4ft-quantifier corresponds to a set of conditions (interest measures) defined
on the four-field contingency table, which is a quadruple of natural numbers ⟨a,
b, c, d⟩ such that a is the number of objects (rows) from the data matrix satisfying
φ and ψ, b the number satisfying φ and ¬ψ, c the number satisfying ¬φ and ψ, and d the number of
objects satisfying ¬φ and ¬ψ. A Boolean Meta-attribute is a recursive structure
comprising conjunctions, disjunctions and negations of combinations of
individual items (Metafield-Value pairs). A Boolean Meta-attribute is Basic or Derived.
A Basic Boolean Meta-attribute has the form b(α), where the Coefficient α is
a subset of the possible Values of Meta-field b. A Derived Boolean Meta-attribute is a
conjunction or disjunction of Boolean Meta-attributes, or a negation of a Boolean
Meta-attribute.</p>
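      <p>The counting behind the four-field contingency table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (basic, negation, conjunction, contingency_4ft, founded_implication), the patient rows and the thresholds are hypothetical and are not part of BKEF or LISp-Miner. The quantifier shown is written in the style of a founded implication, which requires a minimum confidence a / (a + b) and a minimum count a.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Iterable, Mapping, Set, Tuple

Row = Mapping[str, str]
BoolAttr = Callable[[Row], bool]


def basic(field: str, coefficient: Set[str]) -> BoolAttr:
    """Basic Boolean attribute: true when the field value lies in the coefficient."""
    return lambda row: row[field] in coefficient


def negation(attr: BoolAttr) -> BoolAttr:
    """Derived Boolean attribute: negation of another attribute."""
    return lambda row: not attr(row)


def conjunction(*attrs: BoolAttr) -> BoolAttr:
    """Derived Boolean attribute: conjunction of other attributes."""
    return lambda row: all(attr(row) for attr in attrs)


def contingency_4ft(rows: Iterable[Row], antecedent: BoolAttr,
                    consequent: BoolAttr) -> Tuple[int, int, int, int]:
    """Count the quadruple (a, b, c, d) over the rows of the data matrix."""
    a = b = c = d = 0
    for row in rows:
        ante, cons = antecedent(row), consequent(row)
        if ante and cons:
            a += 1
        elif ante:
            b += 1
        elif cons:
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(table: Tuple[int, int, int, int], p: float, base: int) -> bool:
    """Example quantifier: confidence a / (a + b) >= p and count a >= base."""
    a, b, _, _ = table
    return a >= base and a + b > 0 and a / (a + b) >= p


# Made-up rows, for illustration only.
patients = [
    {"bmi": "high", "smoker": "yes", "hypertension": "yes"},
    {"bmi": "high", "smoker": "yes", "hypertension": "yes"},
    {"bmi": "high", "smoker": "no", "hypertension": "no"},
    {"bmi": "normal", "smoker": "no", "hypertension": "yes"},
    {"bmi": "normal", "smoker": "no", "hypertension": "no"},
]
table = contingency_4ft(patients, basic("bmi", {"high"}), basic("hypertension", {"yes"}))
```
]]></preformat>
      <p>Derived attributes compose in the same way: for instance, conjunction(basic("bmi", {"high"}), negation(basic("smoker", {"yes"}))) can be passed as the antecedent.</p>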
      <p>The Background Association Rule can be input independently into the
Pattern component of a BKEF document, or as an Atomic Consequences element
within a Mutual Influences element. The notion of Mutual Influence comes from
research by Rauch &amp; Simunek [14], who proposed to use it as a knowledge
elicitation aid.</p>
    </sec>
    <sec id="sec-5">
      <title>Background Knowledge Exchange Format</title>
      <p>The Background Knowledge Exchange Format (BKEF) is defined by an XML
Schema and used for storing mining models of a particular knowledge domain.
The BKEF XML Schema consists of two main building blocks: definitions of
meta-attributes and definitions of patterns. A metaattribute is understood as
an abstraction of the ultimate property of the mining model [14] with all the
characteristics explained so far; metaattributes are included in
the BKEF XML Schema together with these characteristics. Mutual influences among the metaattributes together
form a pattern. A simplified schema is shown in Fig. 1.</p>
      <p>Fig. 1. Schema of BKEF. The figure has three panels. "BKEF Schema Overview" shows the elements Meta-Attributes [1..*], Meta-Attribute [1..*], Annotation [0..*], Format [0..*], Patterns [0..1], Association Rules, Mutual Influences [0..1], Mutual Influence [0..*] and Background A. Rules, together with the relations "transformed to" and "has child". "Meta-Attributes [example]" shows the group meta-attribute Blood Pressure with Diastolic blood pressure (formats mmHg and kPa) and Systolic blood pressure (variability: stable), and the group meta-attribute Risk Factors with the child meta-attributes Smoking and Weight. "Format [example]" shows the format mmHg (author: MUDr. Plesny; data type: Float; allowed range: 50;300; collation: numerical/ascending) with a discretization hint for a patient without diabetes whose interval enumeration contains the bins normal (50;90) and increased (90;140).</p>
      <sec id="sec-5-1">
        <sec id="sec-5-1-1">
          <title>Metaattribute Definitions in BKEF</title>
          <p>The XML Schema restricts meta-attributes to a two-level hierarchy. The base
level (level = 0) encompasses indivisible MetaAttribute<sup>4</sup> elements, i.e.
atomic metaattributes. The upper level (level = 1) comprises groups of MetaAttribute
elements; each group can contain an unlimited number of
MetaAttribute elements.</p>
          <p><sup>4</sup> Typewriter text labels particular elements of the BKEF XML Schema where it is
necessary to refer to XML elements for proper understanding.</p>
          <p>Groups of meta-attributes. A group is a general collection of MetaAttribute elements. The
group should have a name, a unique identification and at least one link to a
MetaAttribute of level = 0 (which is called a ChildMetaAttribute from this
point of view).</p>
          <p>Meta-Attribute. The main focus of the MetaAttribute is the multiple
definition of the Format, as the property could be expressed using different ways of
measurement. The Annotation, together with the author's name, carries
additional information; annotations can come from different authors. See an example:</p>
          <preformat>&lt;Annotation&gt;
  &lt;Text&gt;Measured in 2009&lt;/Text&gt;
  &lt;Author&gt;MUDr. Plesny&lt;/Author&gt;
&lt;/Annotation&gt;</preformat>
          <p>The Variability of the MetaAttribute is expressed either as stable or
actionable; properties that cannot be changed in the mining model are stable.
E.g. the date of birth cannot be changed, thus this metaattribute is referred to
as stable. If we expect, for example, that the systolic blood pressure can be
influenced by some other property, we refer to its Variability as actionable [17];
otherwise it can also be a stable MetaAttribute. This depends on the mining
model and its research targets. An atomic MetaAttribute element contains at
least one Format.</p>
          <p>Format. The Format is identified by a unique name (within the collection)
and encompasses the following elements: Author, Annotations (which is a
collection of particular annotations), DataType, ValueType, ValueAnnotations,
AllowedRange, Collation, PreprocessingHints and ValueDescriptions.</p>
          <p>Each Annotation consists of the name of an author and a commentary;
each format can be commented on through the Annotations element (a collection of
Annotation elements). The Author of the Format is self-explanatory. The value
of DataType is one of the common data types readable by the
intended BK consumer (string, integer, boolean etc.). The ValueType content
distinguishes between cardinal, nominal, ordinal and real-number values. The values
nominal and ordinal are commonly used for qualitative meta-attributes and
cardinal (which means an interval or a rational number) for quantitative
metaattributes [13].</p>
          <p>The ValueAnnotations element is defined for commentary on particular
values: each value can be commented on separately and more than once. Each such
annotation has the same format as the Annotation.</p>
          <p>The AllowedRange element denotes the value boundary of a particular format
of the MetaAttribute. Thus the allowed ranges of different formats of the same MetaAttribute can differ. The range
can be defined by an Interval for quantitative values (minimum and maximum) or
by an Enumeration for qualitative values. See an example of an allowed range defined
by an interval:</p>
          <preformat>&lt;Interval&gt;
  &lt;LeftBound type="closed" value="2"/&gt;
  &lt;RightBound type="closed" value="15"/&gt;
&lt;/Interval&gt;</preformat>
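          <p>A BK consumer could check raw values against such an allowed range as sketched below. The sketch assumes that a bound whose type is not "closed" excludes the bound value itself; the text only shows closed bounds, so this treatment of other bound types, like the function name in_interval, is an assumption.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

# The allowed range from the example above.
ALLOWED_RANGE = """
<Interval>
  <LeftBound type="closed" value="2"/>
  <RightBound type="closed" value="15"/>
</Interval>
"""


def in_interval(interval: ET.Element, x: float) -> bool:
    """Return True when x lies inside the interval described by the element."""
    left, right = interval.find("LeftBound"), interval.find("RightBound")
    low, high = float(left.get("value")), float(right.get("value"))
    # Assumption: any bound type other than "closed" excludes the bound itself.
    above = x >= low if left.get("type") == "closed" else x > low
    below = x <= high if right.get("type") == "closed" else x < high
    return above and below


interval = ET.fromstring(ALLOWED_RANGE)
accepted = [x for x in (1.5, 2, 8.0, 15, 15.1) if in_interval(interval, x)]
```
]]></preformat>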
          <p>The Collation expresses a commonly accepted ordering of
format values, if such an ordering exists. This is
essential for interpreting the "is greater than" relationship between values [14].
The BKEF XML Schema differentiates between easily sortable numerical
values and qualitative values whose sequence is expressed by an enumeration, as
shown in the following examples:</p>
          <preformat>&lt;Collation type="Numerical" sense="Ascending" /&gt;</preformat>
          <p>and, respectively,</p>
          <preformat>&lt;Collation type="Enumeration" sense="Ascending"&gt;
  &lt;Value&gt;elementary&lt;/Value&gt;
  &lt;Value&gt;secondary&lt;/Value&gt;
  &lt;Value&gt;university&lt;/Value&gt;
&lt;/Collation&gt;</preformat>
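          <p>The following sketch shows one way a consumer might turn a Collation into a sort key so that qualitative values can be compared. It handles only the two collation types and the Ascending sense that appear in the examples; the function name sort_key is hypothetical, and other values of the sense attribute are rejected rather than guessed.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

EDUCATION = """
<Collation type="Enumeration" sense="Ascending">
  <Value>elementary</Value>
  <Value>secondary</Value>
  <Value>university</Value>
</Collation>
"""


def sort_key(collation: ET.Element):
    """Build a key function that ranks format values in ascending collation order."""
    if collation.get("sense") != "Ascending":
        raise ValueError("only the Ascending sense shown in the text is handled")
    if collation.get("type") == "Numerical":
        return float
    # Enumeration: the position of a value in the list is its rank.
    rank = {value.text: i for i, value in enumerate(collation.findall("Value"))}
    return rank.__getitem__


key = sort_key(ET.fromstring(EDUCATION))
ordered = sorted(["university", "elementary", "secondary"], key=key)
is_greater = key("university") > key("secondary")
```
]]></preformat>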
          <p>The PreprocessingHints element conveys to a BK consumer
information on how to prepare the data. The current version of the BKEF XML Schema
allows one or more DiscretizationHint elements as the only possible child
elements of PreprocessingHints. A DiscretizationHint
sorts the values into discrete bins. There can be more than one
preprocessing hint, for example depending on the desired granularity of the
metaattribute values. The way of discretization is set by ExhaustiveEnumeration
or IntervalEnumeration, which lists all values of the metaattribute
intended for the BK consumer and the subsequent mining tasks. The element
IntervalEnumeration is used for numerical values, as seen in this example:</p>
          <preformat>&lt;IntervalEnumeration&gt;
  &lt;IntervalBin name="normal"&gt;
    &lt;Annotation&gt;...&lt;/Annotation&gt;
    &lt;Interval&gt;
      &lt;LeftBound type="closed" value="60"/&gt;
      &lt;RightBound type="closed" value="88"/&gt;
    &lt;/Interval&gt;
  &lt;/IntervalBin&gt;
  &lt;IntervalBin name="overweight indicator"&gt;
    &lt;Annotation&gt;...&lt;/Annotation&gt;
    &lt;Interval&gt;
      &lt;LeftBound type="closed" value="88"/&gt;
      &lt;RightBound type="closed" value="140"/&gt;
    &lt;/Interval&gt;
  &lt;/IntervalBin&gt;
&lt;/IntervalEnumeration&gt;</preformat>
          <p>An example of ExhaustiveEnumeration for non-numerical values is:</p>
          <preformat>&lt;ExhaustiveEnumeration&gt;
  &lt;Bin name="yes"&gt;
    &lt;Annotation&gt;...&lt;/Annotation&gt;
    &lt;Value&gt;yes&lt;/Value&gt;
  &lt;/Bin&gt;
  &lt;Bin name="no"&gt;
    &lt;Annotation&gt;...&lt;/Annotation&gt;
    &lt;Value&gt;no&lt;/Value&gt;
  &lt;/Bin&gt;
&lt;/ExhaustiveEnumeration&gt;</preformat>
          <p>
            The exhaustive enumeration corresponds to the Map Values (where the values
are defined as a table) of PMML 3.2 [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
          </p>
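          <p>Applying an IntervalEnumeration hint amounts to finding the bin whose interval contains a value. The sketch below uses the interval bins from the example (annotations omitted). The helper names contains and bin_name are hypothetical, and two behaviours are assumptions: a value that falls into two bins (here 88, since both bounds are closed) is assigned to the first one listed, and a value outside all bins yields None.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET
from typing import Optional

HINT = """
<IntervalEnumeration>
  <IntervalBin name="normal">
    <Interval>
      <LeftBound type="closed" value="60"/>
      <RightBound type="closed" value="88"/>
    </Interval>
  </IntervalBin>
  <IntervalBin name="overweight indicator">
    <Interval>
      <LeftBound type="closed" value="88"/>
      <RightBound type="closed" value="140"/>
    </Interval>
  </IntervalBin>
</IntervalEnumeration>
"""


def contains(interval: ET.Element, x: float) -> bool:
    """True when x lies in the interval; non-closed bounds exclude the bound."""
    left, right = interval.find("LeftBound"), interval.find("RightBound")
    low, high = float(left.get("value")), float(right.get("value"))
    above = x >= low if left.get("type") == "closed" else x > low
    below = x <= high if right.get("type") == "closed" else x < high
    return above and below


def bin_name(enumeration: ET.Element, x: float) -> Optional[str]:
    """Name of the first bin containing x, or None when no bin matches."""
    for interval_bin in enumeration.findall("IntervalBin"):
        if contains(interval_bin.find("Interval"), x):
            return interval_bin.get("name")
    return None


hint = ET.fromstring(HINT)
labels = [bin_name(hint, x) for x in (70, 88, 120, 200)]
```
]]></preformat>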
          <p>
            There are two further variations of interval enumeration: Equifrequent (the
number of intervals is given and the interval boundaries are determined
automatically so that the frequency of values falling into each interval is roughly identical)
and Equidistant (the exact length of an interval is given). The DiscretizationHint
element does not include value set aggregation (known from PMML [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]);
otherwise, the clear and expressive discretization hint structure is one of the
strengths of the BKEF XML Schema.
          </p>
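          <p>How a consumer derives the interval boundaries for these two variations is not prescribed in the text. As a minimal sketch, assuming boundaries are simple cut points, the two strategies could look as follows; the function names and the sample values are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import List, Sequence


def equidistant_cuts(low: float, high: float, width: float) -> List[float]:
    """Inner boundaries of intervals of a fixed width covering [low, high]."""
    count = math.ceil((high - low) / width)
    return [low + i * width for i in range(1, count)]


def equifrequent_cuts(values: Sequence[float], bins: int) -> List[float]:
    """Inner boundaries splitting the sorted values into groups of similar size."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // bins] for i in range(1, bins)]


# Made-up measurements, for illustration only.
pressures = [62, 95, 70, 110, 81, 88, 130, 77]
by_width = equidistant_cuts(50, 300, 50)
by_count = equifrequent_cuts(pressures, 4)
```
]]></preformat>
          <p>With repeated values the equifrequent groups are only roughly equal in size, which matches the wording above.</p>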
          <p>The ValueDescriptions element is used for characteristics of particular
values. It uses the Interval or Value elements for numerical and non-numerical
values, respectively.</p>
          <preformat>&lt;ValueDescriptions&gt;
  &lt;ValueDescription type="Significant"&gt;
    &lt;Annotation&gt;...&lt;/Annotation&gt;
    &lt;Interval&gt;
      &lt;LeftBound type="closed" value="100"/&gt;
      &lt;RightBound type="closed" value="150"/&gt;
    &lt;/Interval&gt;
  &lt;/ValueDescription&gt;
&lt;/ValueDescriptions&gt;</preformat>
          <p>In general, setting of the Collation, PreprocessingHints and
ValueDescriptions is not a question of an exact method, as their determination is fully
dependent on the domain expert and a particular mining task.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Patterns in BKEF</title>
          <p>The current BKEF XML Schema allows defining MutualInfluences, which are
a base for the BAR.</p>
          <p>A MutualInfluences element contains at least one MutualInfluence, which forms
a relation between two metaattributes A → B.</p>
          <preformat>&lt;Influence type="Positive-bool-growth" id="20" arity="2"&gt;
  &lt;KnowledgeValidity&gt;Unknown&lt;/KnowledgeValidity&gt;
  &lt;MetaAttribute role="A" name="weight"&gt;
    &lt;RestrictedTo&gt;&lt;Format name="kg"/&gt;&lt;/RestrictedTo&gt;
  &lt;/MetaAttribute&gt;
  &lt;MetaAttribute role="B" name="Hyperlipoproteinemy"&gt;
    &lt;RestrictedTo&gt;
      &lt;Format name="boolean value"&gt;
        &lt;Value format="boolean value"&gt;yes&lt;/Value&gt;
      &lt;/Format&gt;
    &lt;/RestrictedTo&gt;
  &lt;/MetaAttribute&gt;
&lt;/Influence&gt;</preformat>
          <p>KnowledgeValidity can have three values (Unknown, Proven or Rejected)
regarding the mining task result. A metaattribute appearing in the influence
might be restricted to a Format or even to a particular value (which should be
linked with the corresponding Format of the atomic MetaAttribute).</p>
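          <p>A BK consumer could read such an influence with a standard XML parser. The sketch below parses the example element shown above; the function name read_influence and the dictionary layout it returns are hypothetical conveniences, not part of the BKEF specification.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

INFLUENCE = """
<Influence type="Positive-bool-growth" id="20" arity="2">
  <KnowledgeValidity>Unknown</KnowledgeValidity>
  <MetaAttribute role="A" name="weight">
    <RestrictedTo><Format name="kg"/></RestrictedTo>
  </MetaAttribute>
  <MetaAttribute role="B" name="Hyperlipoproteinemy">
    <RestrictedTo>
      <Format name="boolean value">
        <Value format="boolean value">yes</Value>
      </Format>
    </RestrictedTo>
  </MetaAttribute>
</Influence>
"""


def read_influence(xml_text: str) -> dict:
    """Collect the influence type, its validity and both restricted metaattributes."""
    root = ET.fromstring(xml_text)
    sides = {}
    for meta in root.findall("MetaAttribute"):
        fmt = meta.find("RestrictedTo/Format")  # the restriction is optional
        sides[meta.get("role")] = {
            "name": meta.get("name"),
            "format": fmt.get("name") if fmt is not None else None,
            "values": [v.text for v in fmt.findall("Value")] if fmt is not None else [],
        }
    return {
        "type": root.get("type"),
        "validity": root.findtext("KnowledgeValidity"),
        "A": sides.get("A"),
        "B": sides.get("B"),
    }


influence = read_influence(INFLUENCE)
```
]]></preformat>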
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Background Knowledge Ontology</title>
      <p>The Background Knowledge Ontology is a semantic abstraction of the BKEF
XML Schema introduced in section 5. The purpose of the BKEF XML Schema
is to rigidly enumerate what types of background knowledge are acceptable and
in what format. To this, BKOn adds information on relations between the pieces
of background knowledge by explicitly linking them through typed associations,
thus adding machine-readable semantics for background knowledge consumers.
The most prominent consumer is the Semantic KB, which utilizes these relations
for reasoning.</p>
      <p>Adding semantics in BKOn results in a reshuffling of the BKEF content.
The design guidelines that were followed when translating BKEF nodes to BKOn
ontology topics are the same as those followed when creating the Association
Rule Mining Ontology from PMML, as described in [10]. Re-enumerating the
guidelines is out of the scope of this paper; nevertheless, the main principle is
simple: allow for automatic transformation of BKEF XML documents into
instances of the ontology concepts while making the resulting ontology as clean
as possible.</p>
      <p>To achieve this, the following prominent changes were made in BKOn compared to
BKEF:
        <list list-type="bullet">
          <list-item><p>some concepts that were only implicitly present in the BKEF XML Schema
are explicitly present in BKOn,</p></list-item>
          <list-item><p>some BKEF XML nodes do not have a corresponding concept in the ontology,
as they are contained in the newly created concepts,</p></list-item>
          <list-item><p>explicit superclasses for closely related topics are introduced.</p></list-item>
        </list>
      </p>
      <p>Some concrete examples of these changes are as follows: Metafield
becomes an explicit ontology concept, and a concept directly corresponding to
the Format BKEF element is no longer explicitly present in the ontology. One
instance of the Metafield concept is created from each pair of a Format element
and its containing Metaattribute element.</p>
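      <p>The rule that one Metafield instance arises from each pair of a Format and its containing metaattribute can be illustrated as follows. This is not the reference XSLT transformation, only a sketch of that single rule in Python. The nesting of the input (MetaAttributes, MetaAttribute, Formats, Format) is an assumption modelled on the schema overview in Fig. 1, and the function name metafield_instances is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed BKEF fragment; the exact element nesting is a guess based on Fig. 1.
BKEF = """
<MetaAttributes>
  <MetaAttribute name="Diastolic blood pressure" level="0">
    <Formats>
      <Format name="mmHg"/>
      <Format name="kPa"/>
    </Formats>
  </MetaAttribute>
  <MetaAttribute name="Weight" level="0">
    <Formats>
      <Format name="kg"/>
    </Formats>
  </MetaAttribute>
</MetaAttributes>
"""


def metafield_instances(bkef_root: ET.Element) -> List[Tuple[str, str]]:
    """One (metaattribute, format) pair per Format element, in document order."""
    return [
        (meta.get("name"), fmt.get("name"))
        for meta in bkef_root.iter("MetaAttribute")
        for fmt in meta.findall("Formats/Format")
    ]


metafields = metafield_instances(ET.fromstring(BKEF))
```
]]></preformat>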
      <p>The Metafield Binned Content is used as a superclass for EnumerationBin
and IntervalBin, and Metafield Raw Content as a superclass for Interval
and Value. Both of these newly introduced concepts have the Metafield Content
superclass.</p>
      <p>We make a reference transformation, implemented as an XSLT stylesheet,
available<sup>5</sup>. The gist of BKOn is depicted in Figure 2.</p>
    </sec>
    <sec id="sec-7">
      <title>Exploiting BKEF and BKOn in the Data Mining Loop</title>
      <p>
        This section demonstrates a possible use case of BKEF and BKOn, in
conjunction with the academic data mining system LISp-Miner and the SEWEBAR
framework. LISp-Miner is an academic system for KDD developed at the University
of Economics, Prague [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for teaching and research in the area of KDD. It
consists of several procedures covering the entire process of KDD as described in
the CRISP-DM methodology<sup>6</sup>. The SEWEBAR (for: Semantic Web -
Analytical Reports) framework involves a content management system and a semantic
knowledge base for creating and sharing knowledge relating to data mining tasks.
It is based on the Joomla! CMS and the Ontopia Topic-Map-based Knowledge
Base<sup>7</sup>.
      </p>
      <p>This section goes through the elicitation of background knowledge within
SEWEBAR-CMS, its linking with the mined data using the FML, its use to localize search
and prune results within the LISp-Miner system, and finally its
semantic postprocessing, again in SEWEBAR-SKB. The description of the workflow
is illustrated on a data mining task whose purpose is to find novel knowledge in
a cardiological dataset.</p>
      <sec id="sec-7-1">
        <title>Background Knowledge Elicitation</title>
        <p>
          The first implementation of background knowledge elicitation was integrated into
the LM KnowledgeSource and LM DataSource modules [19] of the LISp-Miner
system. However, it emerged later that a web-based system is more suitable for domain experts.
This prompted the development of the BKEF Editor
(see [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) as one of the modules of SEWEBAR-CMS.
        </p>
        <p><sup>5</sup> At http://sewebar.vse.cz</p>
        <p><sup>6</sup> www.crisp-dm.org</p>
        <p><sup>7</sup> See ontopia.net and joomla.org for more info</p>
        <p>Example. Starting the aforementioned data mining use case, consider a medical
expert, a cardiologist, who initiates the data mining process. The cardiologist
uses the BKEF Editor to convey her knowledge of the characteristics that are
recorded about cardiological patients and indicates known and interesting
relationships among these characteristics.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Linking Background Knowledge with Mined Data</title>
        <p>The main challenge is how to properly match the data fields used in
the current data mining task with the semantically equivalent metaattributes.
This problem can be divided into two steps: choosing the right BKEF file for the
domain being mined, and matching metaattributes and their values with data
fields and data field values. While this problem is a unique one, it bears significant
resemblance to problems addressed in ontology alignment and schema
mapping research [6]. Since fully automated construction of a reliable mapping
seems to be infeasible given the state of the art in ontology matching and schema
mapping, a semi-automated mapping approach is proposed. Work is ongoing
on a web-based system that would propose such a mapping based on a
mixture of schema mapping and ontology alignment techniques and
then have the user confirm the proposed mappings. The result of this mapping
is a Field Mapping Language (FML) document. The data mining system will
use a web service to locate and retrieve the correct FML and BKEF files.</p>
        <p>Example. The data analyst working with the cardiological dataset searches
for BKEF files related to the dataset. Two such files are found. The first one
is a BKEF file created by the cardiologist; the second is from a different
domain, but it contains general medical fields such as Age or Blood pressure. Once
the metaattributes are mapped to data fields through the semi-automatic process
outlined above, the data mining software can use the preprocessing hints
associated with the mapped metaattributes to automatically perform discretization
and outlier treatment.</p>
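        <p>The proposal step of such a semi-automated mapping can be illustrated with plain string similarity. This is a deliberately simple stand-in, not the mixture of schema mapping and ontology alignment techniques mentioned above: it only compares names using Python's standard difflib module. The function name propose_mappings, the threshold of 0.6 and the field names are hypothetical, and every proposal would still need to be confirmed by the user.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional, Sequence


def propose_mappings(data_fields: Iterable[str], metaattributes: Sequence[str],
                     threshold: float = 0.6) -> Dict[str, Optional[str]]:
    """Propose, for each data field, the most similarly named metaattribute."""
    proposals = {}
    for field in data_fields:
        scored = [
            (SequenceMatcher(None, field.lower(), meta.lower()).ratio(), meta)
            for meta in metaattributes
        ]
        score, best = max(scored)
        # Below the threshold no proposal is made; the user maps the field by hand.
        proposals[field] = best if score >= threshold else None
    return proposals


proposals = propose_mappings(
    ["age", "blood_pressure", "xyz"],
    ["Age", "Blood pressure", "Weight"],
)
```
]]></preformat>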
      </sec>
      <sec id="sec-7-3">
        <title>Background Knowledge for Localizing Search</title>
        <p>In LISp-Miner, the first implemented use of background knowledge was to guide
users in the process of defining Local Analytical Questions (LAQs), that is, to
properly define what kind of patterns we are looking for in the analyzed data.
LAQs are based on pre-defined patterns that lead to different types of questions
asked and therefore to different data mining procedures used for answering them.
LAQs were first proposed in [18].</p>
        <p>Based on the actual background knowledge, the first type of LAQ pattern could
be to mine for yet unknown influences between two groups of attributes (e.g.
social status attributes and health status attributes). Another LAQ pattern
could be used to pinpoint a condition under which some relationship stored
in the ontology does not hold (e.g. "Concerning men above 50 living in Prague it IS
NOT TRUE that..."). Solving such a LAQ could lead to updates of the background
knowledge.</p>
        <p>Example The data analyst is looking for guidance in designing the
parameters of the data mining task. Based on the information contained in the
BKEF pattern section, the data mining system shows that experts already know
that a high waist-hip ratio is associated with hypertension. Based
on this piece of information, the data analyst instructs the system to look for
exceptions to this rule, i.e. to find subsets of data (circumstances) where a
high waist-hip ratio is NOT associated with hypertension.</p>
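        <p>The search for exceptions to a known rule can be sketched naively in Python. This is not the procedure implemented in LISp-Miner; it only splits the data by each candidate condition attribute and reports the subsets in which the confidence of the known rule is low. The function names, the threshold max_conf, the attribute sport and the rows in TOY_ROWS are all invented for illustration.</p>
        <preformat>
```python
from collections import defaultdict

# Invented toy data, not taken from any real dataset.
TOY_ROWS = [
    {"whr": "high", "hypertension": True, "sport": "no"},
    {"whr": "high", "hypertension": True, "sport": "no"},
    {"whr": "high", "hypertension": False, "sport": "yes"},
    {"whr": "high", "hypertension": False, "sport": "yes"},
    {"whr": "low", "hypertension": False, "sport": "yes"},
]


def rule_confidence(rows, antecedent, succedent):
    """Confidence of antecedent => succedent, or None if no row matches."""
    matching = [r for r in rows if r[antecedent[0]] == antecedent[1]]
    if not matching:
        return None
    hits = sum(1 for r in matching if r[succedent[0]] == succedent[1])
    return hits / len(matching)


def find_exceptions(rows, antecedent, succedent, condition_attrs, max_conf=0.5):
    """Return (attribute, value, confidence) for subsets where the rule is weak."""
    exceptions = []
    for attr in condition_attrs:
        groups = defaultdict(list)
        for r in rows:
            groups[r[attr]].append(r)  # one subset per value of the condition
        for value, subset in sorted(groups.items()):
            conf = rule_confidence(subset, antecedent, succedent)
            if conf is not None and max_conf >= conf:
                exceptions.append((attr, value, conf))
    return exceptions
```
        </preformat>
        <p>Calling find_exceptions(TOY_ROWS, ("whr", "high"), ("hypertension", True), ["sport"]) on the toy rows reports the subset with sport equal to "yes" as a circumstance in which the known association does not hold.</p>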
      </sec>
      <sec id="sec-7-4">
        <title>Background Knowledge for Result Pruning</title>
        <p>Another prospective use of background knowledge is pruning results of
data mining that are of no value to experts (e.g. that at least 99 % of patients
giving birth to a child are women). If such a relationship is stored in BKEF, no
implicational<sup>8</sup> association rule with the attribute concerning the ability to give birth
to a child on the left side (antecedent) and gender on the right side (succedent)
will be placed in the results.</p>
        <p>Pruning is even more useful in the case of a function-like dependency between
two attributes, e.g. Age and Height. In general, there is a clear dependency
between the age of people and their height. When this dependency is described by association rules,
many specific rules emerge in the results, which is undesirable. Instead, the
better-suited KL-Miner procedure (see e.g. [16]) could be used (automatically):
the many association rules related to this dependency could be pruned from the
results and replaced by a single KxL-fold contingency table that describes the
function-like dependency as a single pattern.</p>
        <p>Example The cardiologist is not interested in obvious facts in the results. All
patterns expressing the already known relationship between a high waist-hip ratio
and hypertension are therefore automatically pruned from the results (unless explicitly
overruled by the data analyst). This covers all derived patterns as well, i.e.
extended patterns that logically follow from the simple implication
of the form waist-hip ratio(high) =&gt; hypertension(true).</p>
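        <p>One simple reading of this pruning step can be sketched in Python. A mined rule is treated as following from a known pattern when it has the same succedent and its antecedent contains all literals of the known antecedent. This interpretation, the rule representation as a pair of a frozenset and a literal, and the names follows_from, prune and keep_known are assumptions of the sketch, not part of BKEF or LISp-Miner.</p>
        <preformat>
```python
def follows_from(rule, known):
    """True if rule has the known succedent and extends the known antecedent."""
    antecedent, succedent = rule
    known_antecedent, known_succedent = known
    # For frozensets, a >= b tests whether a is a superset of b.
    return succedent == known_succedent and antecedent >= known_antecedent


def prune(rules, known_patterns, keep_known=False):
    """Drop rules that follow from any known pattern.

    keep_known=True models the data analyst overruling the pruning.
    """
    if keep_known:
        return list(rules)
    return [
        rule
        for rule in rules
        if not any(follows_from(rule, known) for known in known_patterns)
    ]
```
        </preformat>
        <p>With the known pattern (frozenset({("whr", "high")}), ("hypertension", True)), both the rule itself and an extended rule that adds a literal such as ("sex", "male") to the antecedent are removed, while rules with unrelated antecedents are kept.</p>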
      </sec>
      <sec id="sec-7-5">
        <title>Background Knowledge for Postprocessing</title>
        <p>SEWEBAR-CMS [11] accepts mining models in PMML sent through a web
service by the data mining system. The BKEF XML files are already present in
the system, as they originate there. Combining these pieces of information, the
analyst conveys the results to the domain expert through a textual analytical
report using special report-authoring tools within the CMS [20]. PMML and
BKEF documents are semantized according to the Data Mining Ontology [10]
and the BKOn ontology. They are interlinked and stored in the SEWEBAR-SKB,
which answers queries issued from the CMS. The queries are issued in the tolog
query language, which combines features of Prolog and SQL. The results of the
queries are returned by the Semantic KB in XML, converted to HTML using an XSLT
transformation, and returned to the user.</p>
        <p><sup>8</sup> A subclass of association rules [12].</p>
        <p>Example To communicate the results to medical specialists, the data analyst
creates a textual analytical report summarizing his/her findings. In the report
s/he also includes a semantic query against the Semantic KB for related
association rules that were found in previous tasks, including those executed over
different datasets.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusions</title>
      <p>The main purpose of this paper was to discuss the requirements for a standard
for the exchange of background knowledge in data mining. The paper also details
an attempt at such a specification, consisting of the BKEF Schema and the BKOn
ontology. Practical experience with these formats has already been described
in [11], including the interlinking of BKOn with a data mining ontology for
association rules introduced in [10] and examples of semantic queries over the
merged ontologies.</p>
      <p>Future work will primarily address the issue of 'smart' interlinking to domain
ontologies, presumably using ontology patterns<sup>9</sup>. This will make it possible to explicitly
disambiguate vague notions, e.g. that of hypertension, which can equally be a
summarization of several measurements or a permanent characteristic of a
patient. In relation to that, a version of BKOn based on the RDF/OWL formalism
(in addition to the Topic Map one) will be built.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgment</title>
      <p>This work has been partly supported by grant no. IGA 15/2010 of UEP and
by grant GAČR 201/08/0802 of the Czech Grant Agency.</p>
      <p><sup>9</sup> http://www.ontologydesignpatterns.org</p>
      <p>6. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer-Verlag, 2007. ISBN 3-540-49611-4.</p>
      <p>7. Garshol, L.M., Moore, G.: Topic Maps – XML Syntax. ISO/IEC JTC1/SC34, http://www.isotopicmaps.org/sam/sam-xtm/.</p>
      <p>8. Garshol, L.M.: TMRAP – Topic Maps Remote Access Protocol. In: Maicher, L., Sigel, A., Garshol, L.M. (eds.) TMRA 2006. LNCS (LNAI), vol. 4438. Springer, Heidelberg (2007)</p>
      <p>9. Garshol, L.M.: Towards a Methodology for Developing Topic Maps Ontologies. In: Maicher, L., Sigel, A., Garshol, L.M. (eds.) TMRA 2006. LNCS (LNAI), vol. 4438. Springer, Heidelberg (2007)</p>
      <p>10. Kliegr, T., Ovecka, M., Zemanek, J.: Topic Maps for Association Rule Mining. In: Proc. TMRA 2009. University of Leipzig, 2009.</p>
      <p>11. Kliegr, T., Ralbovsky, M., Svatek, V., Simunek, M., Jirkovsky, V., Nemrava, J., Zemanek, J.: Semantic Analytical Reports: A Framework for Post-Processing Data Mining Results. In: Foundations of Intelligent Systems (ISMIS'09). Springer-Verlag, LNCS, 2009, pp. 88–98.</p>
      <p>12. Rauch, J.: Classes of Association Rules: An Overview. In: Studies in Computational Intelligence. Springer, 2008.</p>
      <p>13. Rauch, J.: Considerations on Logical Calculi for Dealing with Knowledge in Data Mining. In: Advances in Data Management. Studies in Computational Intelligence, Volume 223/2009, Springer, 2009.</p>
      <p>14. Rauch, J., Simunek, M.: Dealing with Background Knowledge in the SEWEBAR Project. In: Knowledge Discovery Enhanced with Semantic and Social Information. Studies in Computational Intelligence, Volume 220/2009, Springer, 2009.</p>
      <p>15. Rauch, J., Simunek, M.: Alternative Approach to Mining Association Rules. In: Lin, T.Y., Ohsuga, S., Liau, C.J., Tsumoto, S. (eds.): Data Mining: Foundations, Methods, and Applications. Springer-Verlag, 2005.</p>
      <p>16. Rauch, J., Simunek, M., Lín, V.: Mining for Patterns Based on Contingency Tables by KL-Miner – First Experience. In: Foundations and Novel Approaches in Data Mining. Berlin: Springer-Verlag, 2005, pp. 155–167. ISBN 3-540-28315-3. ISSN 1860-949X.</p>
      <p>17. Rauch, J., Simunek, M.: Action Rules and the GUHA Method: Preliminary Considerations and Results. ISMIS 2009: 76–87.</p>
      <p>18. Rauch, J., Simunek, M.: LAREDAM – Considerations on System of Local Analytical Reports from Data Mining. Toronto 20.05.2008 – 23.05.2008. In: Foundations of Intelligent Systems. Berlin: Springer-Verlag, 2008, pp. 143–149.</p>
      <p>19. Simunek, M.: Academic KDD Project LISp-Miner. In: Advances in Soft Computing - Intelligent Systems Design and Applications. Heidelberg: Springer-Verlag, 2003, pp. 263–272. ISBN 3-540-40426-0.</p>
      <p>20. Vojir, S.: SEWEBAR - gInclude - Analytical Report Design using gInclude. In: Znalosti 2010, Czech Republic, in Czech, February 2010.</p>
      <p>Fig. 2. Background Knowledge Ontology Overview</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>LISp-Miner: academic system for KDD</source>
          [online]. [cit. 2010-03-20]. Available from WWW: &lt;http://lispminer.vse.cz&gt;
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <article-title>OWL Web Ontology Language Overview</article-title>
          .
          <source>W3C Recommendation</source>
          , 10 February
          <year>2004</year>
          . http://www.w3.org/TR/owl-features/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. W3C:
          <article-title>XSL Transformations (XSLT)</article-title>
          . Online: www.w3.org/TR/xslt.
          <year>1999</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. DMG:
          <article-title>PMML 3.2 Specification</article-title>
          . Online: http://www.dmg.org/pmml-v3-2.html
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Balhar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kliegr</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stastny</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vojir</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Elicitation of Background Knowledge for Data Mining</article-title>
          . In:
          <source>Znalosti 2010</source>
          , Czech Republic, February
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>