<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data-Driven Logical Reasoning</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Data &amp; Knowledge Management Unit - Fondazione Bruno Kessler</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science - University of Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The co-existence of heterogeneous but complementary data sources, such as ontologies and databases describing the same domain, is the reality of the Web today. In this paper we argue that this complementarity could be exploited both for discovering the knowledge not captured in the ontology but learnable from the data, and for enhancing the process of ontological reasoning by relying on the combination of formal domain models and evidence coming from data. We build upon our previous work on knowledge discovery from heterogeneous sources of information via association rules mining, and propose a method for automated reasoning on grounded knowledge bases (i.e. knowledge bases linked to data) based on the standard Tableaux algorithm. The proposed approach combines logical reasoning and statistical inference thus making sense of heterogeneous data sources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
Since the introduction of the Semantic Web vision [
        <xref ref-type="bibr" rid="ref3">3</xref>
], many domain ontologies have
been developed and stored in open-access repositories. However, huge amounts
of data are still stored in relational databases (DBs) and managed by relational
database management systems (RDBMSs). The seamless integration of these two knowledge
representation paradigms is becoming a crucial research challenge. Most of the work in this
area concerns what is known as ontology-based data access (OBDA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
]. In OBDA
the ontology “replicates” at a higher conceptual level the physical schema of the DBMS
and provides a “lens” through which the data can be viewed, possibly adding
semantic knowledge about the data. The connection between the ontology and the data
is represented by conjunctive queries. Roughly speaking, every concept/relation of the
ontology is associated with a conjunctive query which retrieves from the DB all and only
the instances of that concept/relation.
      </p>
      <p>
        Another common situation is when existing ontologies describe domain aspects that
(partially) complement data in a database. In this case the concepts of the ontologies are
linkable to views of the DB. In this case too it would be very useful to be able to
combine the knowledge contained in the two information sources, for example to enrich
the existing ontologies. Due to the heterogeneity of the information, a crisp
representation of the correspondence between the DB data and the classes and relations of the
ontologies (such as the one adopted in OBDA) is not possible; a more flexible connection
between the two sources of knowledge should be adopted. An option is to exploit
rules that are able to express statistical evidence of the connection between the data in
a DB and the knowledge in the ontology. To convey the intuition behind this solution, let us
consider the following scenario. Given an existing ontology describing people’s gender,
family status and their interrelations with Italian urban areas1, and a demographic DB
describing Italian occupations, average salaries, etc., a possible connection between the
two information sources can be described with rules such as:
“clerks between 35 and 45 years old living in a big city are male and
earn between 40 and 50 €” with a confidence value of 0.75. (1)
where bold-face terms correspond to classes and relations in the ontology, and non-bold-face
terms correspond to data in the DB. The confidence value can be interpreted as the
probability that the specified connection between the two sources occurs. Rules of the
form (1) are called semantically enriched association rules. They have been introduced
in our previous work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] where an inductive approach for discovering new knowledge
in the form of association rules [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] from heterogeneous data sources is proposed.
      </p>
      <p>
        In this paper, we revise the approach introduced in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] by taking into account the
Open World Assumption adopted in description logics (DLs), while in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] association
rules are extracted from the hybrid data sources by adopting an implicit Closed World
Assumption that is not fully compliant with the theory of ontological representation.
We also take a further step towards a framework for knowledge representation and
reasoning in which knowledge can be represented by a mix of logical formulas and sets
of data, linked together. Specifically, we introduce the notion of a grounded knowledge
base and the notion of a mixed model, which integrates logical knowledge expressed in terms
of a description logic knowledge base with a statistical data-mining model that expresses
the statistical regularities of the properties associated with a set of individuals. Finally, and
most importantly, we propose a method for automated reasoning on grounded
knowledge bases, which is the result of combining logical reasoning with statistical inductive
inference and learning. In particular, we propose an extension of the standard Tableaux
algorithm based on the adoption of a heuristic to be used when non-deterministic choices
(i.e. the processing of a disjunction, namely deciding whether an object
x belongs to concept C or to concept D) have to be made during the reasoning process.
The heuristic exploits the evidence coming from the data. Assume, for example, that
a given object x is a Person and a high-school student, has the property that x is 15
years old, and we need to decide whether x is a Parent or not, while there are no statements
in the knowledge base from which it is possible to infer either Parent(x) or
¬Parent(x). The following association rule, learned from the data with a high degree of
confidence,
      </p>
<p>AGE = [0, 16] ⇒ ¬Parent (confidence 0.99)
can be exploited to conclude that, with high probability, x is not a Parent.</p>
      <p>
The rest of the paper is structured as follows. In Section 2 we give the basic
definitions necessary to set up the framework. In Section 3 we summarize and extend the
approach for learning association rules from heterogeneous sources of information
presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Section 4 presents the data-driven Tableaux reasoning algorithm, followed
by discussion and conclusions in Section 5.
1 The concepts “male”, “parent”, “big city”, “medium-sized town”, and the relation
“lives in” are used in the ontology.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Basic definitions</title>
<p>Let D be a non-empty set of objects and f1, . . . , fn be n feature functions defined on
every element of D, with fi : D → Di. D is called the set of observed objects and
fi(d), for every d ∈ D, is the i-th feature observed on d. Notationally, we use d for the
elements of D and d1, . . . , dn to denote the values of f1(d), . . . , fn(d).</p>
<p>Let Σ be a DL alphabet composed of three disjoint sets of symbols, ΣC, ΣR and ΣI:
the set of concept symbols, the set of role symbols and the set of individual symbols.
A knowledge base on Σ is a set K of DL inclusion axioms and DL assertions (we
assume ALC as the DL language here). The elements of K are called axioms of K. An
axiom can be of the form X ⊑ Y, where X and Y are (complex) ALC concepts, or
X(a), where X is a (complex) concept and a an individual symbol, or R(a, b), or
a = b, where R is a role symbol and a and b are individual symbols. We call X ⊑ Y a
subsumption, and X(a), R(a, b), and a = b assertions. K is formally defined as a pair
K = ⟨T, A⟩ where T contains the inclusion axioms (T stands for Terminological part)
and A contains the assertional axioms (A stands for Assertional part).</p>
      <p>
An interpretation of a DL alphabet Σ is a pair I = ⟨ΔI, ·I⟩ such that ΔI is a
non-empty set, and ·I is a function that assigns to each concept name a subset of ΔI,
to each role name a binary relation on ΔI, and to each individual an element of ΔI.
The interpretation function can be extended to complex concepts in the usual way [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Satisfiability |= of statements is also defined as usual [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. An interpretation I satisfies a knowledge base K (in symbols I |= K) if
I |= φ for every axiom φ of K.
      </p>
<p>The “glue” between a dataset and a knowledge base is the so-called grounding,
which is a relation that connects the objects of the knowledge base with the data of the
database. More formally: a grounding g of Σ on D is a total function g : D → ΣI.
This implies that for every d ∈ D there is exactly one element a ∈ ΣI with g(d) = a.
Intuitively, g(d) = a represents the fact that the data d are about/correspond to the object
a of the knowledge base. Please note that the grounding g refers to objects that are
explicitly mentioned in D and K respectively. In our framework (see Sect. 3) we assume
that the grounding between D and K is already given.</p>
    </sec>
    <sec id="sec-3">
      <title>Semantically enriched association rules</title>
      <p>
        Association rules (ARs), originally introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
], make it possible to represent in
a rule-based form some statistical regularities of the tuples in a relational database.
Roughly speaking, ARs allow one to state conditional probabilities among the values
of the attributes of the tuples of a database. Learning ARs is one of the fundamental
tasks in data mining.
      </p>
      <p>
        In this section we recall how ARs can be extended to include information coming
from an ontological knowledge base and how they can be used to bridge the knowledge
contained in an ontology with that contained in a relational database. These rules are
called semantically enriched ARs [
        <xref ref-type="bibr" rid="ref5">5</xref>
]. Here, we revise the approach introduced in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
by taking into account the Open World Assumption adopted in description logics (DL),
while in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] association rules are extracted from the hybrid data sources by adopting
an implicit Closed World Assumption that is not fully compliant with the theory of
ontological representation. At the end of the section, the process of learning semantically
enriched ARs [
        <xref ref-type="bibr" rid="ref5">5</xref>
] is also briefly recalled.
      </p>
      <sec id="sec-3-1">
        <title>Association rules: an overview</title>
        <p>
          Association rules [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] provide a form of rule patterns for data mining. Let D be a dataset
made by a set of attributes {A1, . . . , An} with domains Di : i ∈ {1, . . . , n}. The basic
components of an AR for the dataset D are itemsets. An itemset φ is a finite set of
assignments of the form A = a with a ∈ D(A). An itemset {Ai1 = a1, . . . , Aim = am}
can be denoted by the expression Ai1 = a1 ∧ . . . ∧ Aim = am.
        </p>
        <p>An AR has the general form
θ ⇒ ϕ
(2)
where θ and ϕ are itemsets. The frequency of an itemset θ, denoted by freq(θ), is the
number of cases in D that match θ, i.e.</p>
        <p>freq (θ) = |{d ∈ D | ∀(A = a) ∈ θ : fA(d) = a}|
where fA is the feature function for d w.r.t. the attribute A (see beginning of Sect.2).</p>
<p>The support of a rule θ ⇒ ϕ is equal to freq(θ ∧ ϕ). The confidence of a rule θ ⇒ ϕ
is the fraction of cases in D that match ϕ among those matching θ:
conf(θ ⇒ ϕ) = freq(θ ∧ ϕ) / freq(θ)
A frequent itemset expresses the variables and the corresponding values that occur
reasonably often together.</p>
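The definitions above can be sketched in a few lines of Python (an illustrative sketch, not part of the paper; the dataset is assumed to be a list of attribute-value rows):

```python
# Minimal sketch (not from the paper): frequency, support and confidence
# of association rules over a tabular dataset represented as dicts.

def freq(itemset, dataset):
    """Number of rows matching every assignment A = a in the itemset."""
    return sum(all(row.get(attr) == val for attr, val in itemset.items())
               for row in dataset)

def support(lhs, rhs, dataset):
    """Support of lhs => rhs: frequency of the combined itemset."""
    return freq({**lhs, **rhs}, dataset)

def confidence(lhs, rhs, dataset):
    """conf(lhs => rhs) = freq(lhs & rhs) / freq(lhs)."""
    f = freq(lhs, dataset)
    return support(lhs, rhs, dataset) / f if f else 0.0

rows = [
    {"JOB": "Student", "AGE": "[16,25]", "Parent": 0},
    {"JOB": "Student", "AGE": "[16,25]", "Parent": 0},
    {"JOB": "Student", "AGE": "[16,25]", "Parent": 1},
    {"JOB": "Clerk",   "AGE": "[26,35]", "Parent": 1},
]
# conf((JOB = Student) => Parent = 0) = 2/3
print(confidence({"JOB": "Student"}, {"Parent": 0}, rows))
```

With three Student rows of which two are non-parents, the confidence is the maximum-likelihood estimate 2/3 of the conditional probability discussed next.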
        <p>
In terms of conditional probability, the confidence of a rule θ ⇒ ϕ can be seen as
the maximum likelihood (frequency-based) estimate of the conditional probability that
ϕ is true given that θ is true [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Semantically enriched association rules</title>
        <p>Let K be a knowledge base on Σ, D a dataset and g a grounding of Σ on D.</p>
<p>A semantically enriched itemset is a set containing statements of the form fi = a,
C = tv, or R = tv, where fi is an attribute of D, a is a value in the range of fi, C
is a concept name of ΣC, R is a role name of ΣR, and tv is a truth value in
{true, false, unknown}. The elements of the itemset of the form fi = a are called
data items; the elements of the form C = tv and R = tv are called semantic items.</p>
<p>A semantically enriched AR is an association rule made of semantically enriched
itemsets. This means that for a certain set of individuals, both knowledge coming from
the ontology and information coming from the database are available (see Sect. 3.3 for
more details).</p>
        <p>Coherently with ARs, it is possible to define the frequency of a semantically
enriched itemset and the support of a semantically enriched AR. Given a grounding g of
Σ on D, the frequency of a semantically enriched itemset θ = θd ∧ θk (in the following
also called mixed itemset) is the following generalization of the definition of frequency
given for a standard itemset.</p>
<p>freq(θd ∧ θk) = |F|
where F is the set of all d ∈ D such that:
∀(fi = a) ∈ θd : fi(d) = a
∀(C = true) ∈ θk : K |= C(g(d))
∀(C = false) ∈ θk : K |= ¬C(g(d))
∀(C = unknown) ∈ θk : K ⊭ C(g(d)) and K ⊭ ¬C(g(d))
∀(R = true) ∈ θk : K |= (∃R.⊤)(g(d))
∀(R = false) ∈ θk : K |= (¬∃R.⊤)(g(d))
∀(R = unknown) ∈ θk : K ⊭ (∃R.⊤)(g(d)) and K ⊭ (¬∃R.⊤)(g(d))</p>
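The three-valued matching condition above can be sketched as follows (a hedged sketch; the `entails` oracle standing in for K |= C(g(d)), and the toy data, are assumptions of this illustration, not from the paper):

```python
# Hedged sketch: frequency of a mixed itemset theta_d & theta_k under the
# three-valued (open-world) semantics above. `entails(C, a)` returns True
# (K |= C(a)), False (K |= ~C(a)) or None (unknown) -- here a lookup table.

def matches_semantic(entails, item, individual):
    concept, tv = item
    status = entails(concept, individual)        # True / False / None
    if tv == "true":
        return status is True
    if tv == "false":
        return status is False
    return status is None                        # tv == "unknown"

def mixed_freq(theta_d, theta_k, dataset, grounding, entails):
    count = 0
    for d in dataset:
        data_ok = all(d.get(f) == a for f, a in theta_d.items())
        sem_ok = all(matches_semantic(entails, item, grounding[d["id"]])
                     for item in theta_k)
        if data_ok and sem_ok:
            count += 1
    return count

# toy knowledge: (concept, individual) -> True/False; missing pairs are unknown
KB = {("Parent", "a1"): False, ("Parent", "a2"): True}
entails = lambda c, x: KB.get((c, x))
data = [{"id": "d1", "AGE": "[0,16]"}, {"id": "d2", "AGE": "[26,35]"}]
g = {"d1": "a1", "d2": "a2"}
print(mixed_freq({"AGE": "[0,16]"}, [("Parent", "false")], data, g, entails))
```

Note that a semantic item C = unknown matches only when neither C nor ¬C is entailed, which is what distinguishes this definition from a closed-world count.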
<p>The support and confidence of a semantically enriched AR can be defined similarly.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Learning semantically enriched association rules</title>
        <p>
          In [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] we proposed a framework for learning semantically enriched ARs from
heterogeneous sources of information (namely an ontology and a relational database) grounded
on the underlying Closed World Assumption that is not fully compliant with the theory
of ontological representation. Here we extend the framework by taking into account the
Open World Assumption usually made in DLs, which are the theoretical framework
underlying the OWL2 language, namely the standard representation language of the Semantic
Web [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. With regard to this aspect it is important to note that the notions of frequency,
confidence and support given in Sect. 3.2 are compliant with the Open World Semantics.
        </p>
        <p>The approach for learning semantically enriched ARs is grounded on the
assumption that a dataset D and an ontological knowledge base K share (a subset of) common
individuals, and a grounding g of Σ on D is already available (see the end of Sect. 2
for more details on the grounding function). This assumption is reasonable in practice
since, in the real world, there are several cases in which different information aspects
concerning the same entities come from different data sources. An example is given by
the public administration domain, where different administrative organizations have
information about the same persons but concerning complementary aspects, such as personal
data, income data, and ownership data. Another example is given by the biological domain
where research organizations have their own databases that could be complemented
with existing domain ontologies.</p>
<p>The proposed framework is sketched in the following. To learn semantically
enriched ARs from a dataset D and a knowledge base K grounded by g to D, all the
information about the common domain of K and D is summarized
(propositionalized) in a tabular representation constructed as follows:
1. choose the primary entity of interest in D or K for extracting association rules and
set this entity as the first attribute A1 in the table T to be built; A1 will be the
primary key of the table
2. choose (a subset of) the attributes in D that are of interest for A1 and set them as
additional attributes in T; the corresponding values are obtained as a result of
an SQL query involving the selected attributes and A1
3. choose (a subset of) the concept names {C1, . . . , Cm} in K that are of interest for A1
and set their names as additional attribute names in T
4. for each Ck ∈ {C1, . . . , Cm} and for each value ai of A1, if K |= Ck(ai) then set
the corresponding value of Ck in T to 1; else if K |= ¬Ck(ai) then set the value to
0; otherwise set the corresponding value of Ck in T to 1/2
5. choose (a subset of) the role names {R1, . . . , Rt} in K that are of interest for A1 and
set their names as additional attribute names in T
6. for each Rl ∈ {R1, . . . , Rt} and for each value ai of A1, if there exists y ∈ K s.t. K |=
Rl(ai, y) then set the value of Rl in T to 1; else if for all y ∈ K, K |= ¬Rl(ai, y) then
set the value of Rl in T to 0; otherwise set the value of Rl in T to 1/2
7. choose (a subset of) the datatype property names {T1, . . . , Tv} in K that are of
interest for A1 and set their names as additional attribute names in T
8. for each Tj ∈ {T1, . . . , Tv} and for each value ai of A1, if K |= Tj(ai, dataValuej)
then set the corresponding value of Tj in T to dataValuej; otherwise set it to 0.
2 http://en.wikipedia.org/wiki/Web_Ontology_Language</p>
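Steps 1–8 above can be sketched as follows (an illustrative sketch; the `entails` lookup and helper names are assumptions, and only the DB attributes of step 2 and the concept encoding of steps 3–4 are shown):

```python
# Illustrative sketch of the propositionalization above: build the table T,
# encoding each concept C for each individual a as 1 (K |= C(a)),
# 0 (K |= ~C(a)) or 1/2 (unknown, per the Open World Assumption).

from fractions import Fraction

def encode_concept(entails, concept, individual):
    status = entails(concept, individual)  # True / False / None, as K |= C(a)
    if status is True:
        return 1
    if status is False:
        return 0
    return Fraction(1, 2)                  # undecided, not closed-world 0

def build_table(individuals, db_rows, concepts, entails):
    table = []
    for a in individuals:
        row = {"A1": a}                    # step 1: primary entity
        row.update(db_rows.get(a, {}))     # step 2: DB attributes
        for c in concepts:                 # steps 3-4: concept columns
            row[c] = encode_concept(entails, c, a)
        table.append(row)
    return table

KB = {("Male", "x1"): True, ("Parent", "x1"): False}
entails = lambda c, x: KB.get((c, x))
db = {"x1": {"JOB": "Student", "AGE": "[16,25]"}}
T = build_table(["x1"], db, ["Male", "Parent", "Female"], entails)
print(T[0])  # Female is unknown -> 1/2
```

Role and datatype-property columns (steps 5–8) would be encoded analogously, with the role case testing existential entailment.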
        <p>
          It is straightforward to note that for all but the datatype properties, the Open World
Assumption is considered during the process for building the tabular representation.
Numeric attributes are processed (as usual in data mining) for performing data
discretization [
          <xref ref-type="bibr" rid="ref10">10</xref>
], namely for transforming numerical values into corresponding ranges of values
(categorical values). An example of such a tabular representation in the demographic
domain is reported in Tab. 1, where Person, Parent, Male and Female are concepts
of an ontological knowledge base K, and JOB and AGE are attributes of a relational
dataset D. The numeric attribute (AGE) has been discretized.
        </p>
        <p>
The choice of representing the integrated source of information within tables allows
for directly applying state-of-the-art algorithms for learning association rules. Indeed,
once a unique tabular representation is obtained, the well-known APRIORI algorithm [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
is applied for discovering semantically enriched ARs from the integrated source of
information3 (see [
          <xref ref-type="bibr" rid="ref5">5</xref>
] for additional details and examples). Specifically, given a certain
confidence threshold, only ARs having a confidence value equal to or greater than the fixed
confidence threshold are learnt. This ensures that only significant ARs are considered
while the others are discarded. As highlighted in Sect. 3.1, the confidence value of the
extracted semantically enriched ARs is interpreted as the conditional probability of the
values of the items in the consequent of the rule given that the left-hand side of the rule
is satisfied in (a model of) the available knowledge. Examples of semantically enriched
ARs that could be learned from a table like Tab. 1 are reported in Tab. 2.
3 Since a state-of-the-art algorithm is adopted, it is not reported in the paper. The novelty of the
proposed approach lies in the way the integrated source of knowledge is built. Once this
is obtained, the state-of-the-art APRIORI algorithm is straightforwardly applied.
# Rule
1 (AGE = [16, 25]) ∧ (JOB = Student) ⇒ ¬Parent
2 (JOB = Policeman) ⇒ Male
3 (AGE = [16, 25]) ∧ Parent ⇒ Female
4 (JOB = Primary school teacher) ⇒ Female
5 (JOB = Housewife) ∧ (AGE = [26, 35]) ⇒ Parent ∧ Female
We want to exploit the semantically enriched ARs (see Sect. 3.3) when performing
deductive reasoning on DL (namely ontological) representations. Since almost all DL
inferences can be reduced to concept satisfiability [
          <xref ref-type="bibr" rid="ref2">2</xref>
], we focus on this inference
procedure. For expressive DLs (such as ALC) the Tableaux algorithm is employed. Its
goal is to build a possible model, namely an interpretation, for the concept whose
satisfiability has to be shown. If, while building such a model, only clashes (namely contradictions)
are found, no model exists and the concept is declared unsatisfiable.
        </p>
        <p>
Our goal is to set up a modified version of the Tableaux algorithm whose output, if
any, is the most plausible model, namely the model that best fits the available data. This
means setting up a data-driven heuristic that should reduce the computational
effort in finding a model for a given concept and should also be able to supply the model
that is most coherent with the available knowledge. In this way the variance
due to intended diversity and incomplete knowledge, namely the number
of possible models that could be built, is reduced (see [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for formal definitions). The inference
problem we want to solve is formally defined as follows:
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Definition 1 (Inference Problem).</title>
<p>Given: D, K, the set R of ARs, a (possibly complex) concept E of K, the individuals
x1, . . . , xk ∈ K that are instances of E, and the grounding g of Σ on D.
Determine: the model Ir for E representing the most plausible model given K, D,
g and R.</p>
        <p>Intuitively, the most plausible model for E is the one on top of the ranking of the
possible models Ii for E. The ranking of the possible models is built according to the
degree up to which the models respect the ARs. The detailed procedure for building the
most plausible model is illustrated in the following.</p>
        <p>
In order to find a model (or determine that none exists), the standard Tableaux algorithm exploits a
set of transformation rules that are applied to the considered concept. A
transformation rule exists for each constructor of the considered language. In the following, the
transformation rules for ALC logic are briefly recalled (see [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for more details).
⊓-rule: IF the ABox A contains (C1 ⊓ C2)(x), but it does not contain both C1(x) and C2(x), THEN A = A ∪ {C1(x), C2(x)}
      </p>
      <p>⊔-rule: IF A contains (C1 ⊔ C2)(x), but it contains neither C1(x) nor C2(x), THEN A1 = A ∪ {C1(x)}, A2 = A ∪ {C2(x)}</p>
      <p>∃-rule: IF A contains (∃R.C)(x), but there is no individual name z s.t. C(z) and
R(x, z) are in A, THEN A = A ∪ {C(y), R(x, y)}, where y is an individual name
not occurring in A.
∀-rule: IF A contains (∀R.C)(x) and R(x, y), but it does not contain C(y), THEN
A = A ∪ {C(y)}</p>
<p>To test the satisfiability of a concept E, the algorithm starts with the ABox A =
{E(x0)} (with x0 being a new individual) and applies to the ABox the consistency-preserving
transformation rules reported above until no more rules apply. The result is either
that all resulting ABoxes contain clashes, which means the concept is unsatisfiable, or an
ABox providing a model for the concept E, which means the concept is satisfiable.</p>
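This expansion process can be sketched for the propositional fragment of ALC (only ⊓ and ⊔, concepts in negation normal form; the tuple encoding is an assumption of this sketch, not the paper's):

```python
# Minimal sketch of Tableaux expansion for the propositional fragment of ALC.
# Concepts are tuples: ("atom", "C"), ("not", "C"), ("and", l, r), ("or", l, r).

def satisfiable(abox):
    """abox: set of (concept, individual) assertions. Returns True iff some
    fully expanded ABox is clash-free."""
    abox = set(abox)
    for concept, x in list(abox):
        tag = concept[0]
        if tag == "and":                   # deterministic conjunction rule
            c1, c2 = concept[1], concept[2]
            if not {(c1, x), (c2, x)} <= abox:
                return satisfiable(abox | {(c1, x), (c2, x)})
        elif tag == "or":                  # non-deterministic disjunction: branch
            c1, c2 = concept[1], concept[2]
            if (c1, x) not in abox and (c2, x) not in abox:
                return (satisfiable(abox | {(c1, x)})
                        or satisfiable(abox | {(c2, x)}))
    # no rule applies: check for clashes A(x) and ~A(x)
    return not any((("not", c[1]), x) in abox
                   for c, x in abox if c[0] == "atom")

# (Male or Female) and not Male is satisfiable (choose Female)
E = ("and", ("or", ("atom", "Male"), ("atom", "Female")), ("not", "Male"))
print(satisfiable({(E, "x0")}))  # True
```

The blind `or` in the disjunction branch is exactly the non-deterministic choice that the data-driven heuristic of the next paragraphs aims to replace.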
<p>The transformation rule for the disjunction (⊔-rule) is non-deterministic, that is,
a given ABox is transformed into finitely many new ABoxes. The original ABox is
consistent if and only if one of the new ABoxes is. To save computational
effort, the ideal solution (for the case of a consistent concept) would be to choose
an ABox containing a model directly. Starting from this observation, in the following
we propose an alternative version of the Tableaux algorithm. The main differences with
respect to the standard Tableaux algorithm summarized above are:
1. the starting model for the inference process is given by the set of all attributes
(and corresponding values) of D that are related to the individuals x1, . . . , xk that are
instances of E, differently from the standard Tableaux algorithm, where the initial
model is simply given by the assertion concerning the concept whose
satisfiability (or unsatisfiability) has to be shown;
2. a heuristic is adopted in performing the ⊔-rule, differently from the standard case,
where no heuristic is given;
3. the most plausible model for the concept E and individuals x1, . . . , xk is built with
respect to the available knowledge K, D and R. The obtained model is a mixed
model, namely a model containing information from both R and K. Differently, in
the standard Tableaux algorithm the model that is built only refers to K and does
not take into account the (assertional) available knowledge.</p>
<p>In the following these three characteristics are analyzed and the way to accomplish
each of them is illustrated. First of all, the way in which the starting model Ir is built is
illustrated. For each xi ∈ {x1, . . . , xk}, all attribute names Ai related to xi are selected4,
jointly with the corresponding attribute values ai. The assertions Ai(ai) are added to Ir.
For simplicity and without loss of generality, a single individual x will be considered in
the following. The generalization to multiple individuals is straightforward: simply
apply the same procedure to all individuals that are (or are assumed to be) instances of
the considered concept.</p>
<p>4 As an example, the following query may be performed: SELECT * FROM ⟨TABLE NAME⟩
WHERE Ai = xi. Alternatively, a subset of the attributes in D may be considered.</p>
<p>Once the initial model Ir is built, all deterministic expansion rules, namely all but the
⊔-rule, are applied following the standard Tableaux algorithm as reported above.
For the case of the ⊔-rule, instead, a heuristic is adopted. The goal of such a heuristic is
twofold: a) choosing a new consistent ABox almost in one step, to save computational
effort if E(x) is consistent (see the discussion above concerning the ⊔-rule); b)
driving the construction of the most plausible model given K and R. The approach for
assessing the heuristic is illustrated in the following.</p>
<p>Let C ⊔ D be the disjunctive concept to be processed by the ⊔-rule. The choice of C
rather than D (or vice versa) will be driven by the following process.</p>
<p>– The ARs (see Sect. 3.2) containing C (resp. D) or its negation in the semantic items
of the right-hand side of the rule are selected.
– Given the model under construction Ir, the left-hand side of each selected rule
is considered and the degree of match is computed. This is done by counting the
number of (both data and semantic) items in the left-hand side of a rule that are
contained in Ir, and averaging this number w.r.t. the length of the left-hand side
of the rule. Items with uncertain (unknown) values are not taken into account. The
degree of match for the rules whose left-hand side is (partly) contradictory
w.r.t. the model is set to 0.
– After the degree of match is computed, the rules having a degree of match equal
to 0 are discarded.
– For each of the remaining rules the weighted confidence value is computed as
weightedConf = ruleConfidence ∗ degreeOfMatch.
– Rules that have a degree of match below a given threshold (e.g. 0.75) are
discarded.
– The rule having the highest weighted confidence value is selected; in case of equal
weighted confidence values of different rules, a random choice is performed.
– If the chosen rule contains C = 1 (resp. D = 1) in the right-hand side, the model
under construction Ir is enriched with C(x) (resp. D(x)), where x is the individual
under consideration.
– If the chosen rule contains C = 0 (resp. D = 0) in the right-hand side, the model
under construction Ir is enriched with D(x) (resp. C(x)).
– In the general case the right-hand side of the selected AR may contain additional
items besides those involving C or D. Assertions concerning such additional items
will also be added to Ir accordingly5.</p>
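The selection procedure above can be sketched as follows (a hedged sketch; the rule representation and the `contradicts` test are assumptions of this illustration, not the paper's data structures):

```python
# Sketch of the disjunction heuristic: each rule is (lhs, rhs, confidence);
# the model under construction Ir is a set of (name, value) items;
# degreeOfMatch = fraction of known lhs items already in Ir.

def degree_of_match(lhs, model, contradicts):
    if any(contradicts(item, model) for item in lhs):
        return 0.0                           # contradictory lhs -> 0
    known = [item for item in lhs if item[1] != "unknown"]
    if not known:
        return 0.0
    return sum(item in model for item in known) / len(known)

def select_rule(rules, model, contradicts, threshold=0.75):
    """Rule with the highest weightedConf among those passing the degree-of-
    match threshold, or None."""
    best, best_score = None, 0.0
    for lhs, rhs, conf in rules:
        dom = degree_of_match(lhs, model, contradicts)
        if dom == 0.0 or dom < threshold:    # the two discard steps above
            continue
        score = conf * dom                   # weightedConf
        if score > best_score:
            best, best_score = (lhs, rhs, conf), score
    return best

# toy items: (name, value); a contradiction is the same name, another value
contradicts = lambda item, model: any(n == item[0] and v != item[1]
                                      for n, v in model)
Ir = {("AGE", "[16,25]"), ("JOB", "Student")}
rules = [([("AGE", "[16,25]"), ("JOB", "Student")], [("Parent", 0)], 0.9),
         ([("JOB", "Policeman")], [("Male", 1)], 0.95)]
chosen = select_rule(rules, Ir, contradicts)
print(chosen[1])
```

The second rule contradicts JOB = Student, so its degree of match is 0 and the first rule wins despite its lower raw confidence.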
<p>If there are no extracted ARs (satisfying a fixed confidence threshold)
containing either C or D in the right-hand side, the following approach may be adopted.
5 If a more conservative behavior of the heuristic is desired, only the assertion
concerning the disjunct C (resp. D) will be added to Ir, while the additional items in the right-hand
side of the selected rule are not taken into account.</p>
<p>Given Ir, a corresponding itemset is created by transforming each assertion Ai(ai)
referring to an attribute in D into a data item Ai = ai, and each concept and role
assertion into a semantic item. Specifically, each positive (non-negated) assertion is
transformed into concept/role name = 1, and each negative assertion is transformed into
concept/role name = 0. Let θ be the conventional name of the itemset so built. Four
rules, θ ⇒ C = 1, θ ⇒ C = 0, θ ⇒ D = 1 and θ ⇒ D = 0, are created and their
confidence values are computed (see Sect. 3.2). Then, the rule having the highest confidence
value (satisfying a given confidence threshold) is selected and the corresponding right-hand
side is used as a guideline for expanding Ir.</p>
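This fallback can be sketched as follows (illustrative; the tabular representation and helper names are assumptions of this sketch):

```python
# Sketch of the no-rules fallback: from the itemset theta built out of Ir,
# score the four candidate rules theta => C=1, C=0, D=1, D=0 by their
# confidence over the propositionalized table.

def conf_over_table(lhs, rhs_attr, rhs_val, table):
    matching = [row for row in table
                if all(row.get(a) == v for a, v in lhs.items())]
    if not matching:
        return 0.0
    return sum(row.get(rhs_attr) == rhs_val for row in matching) / len(matching)

def best_expansion(theta, c, d, table, threshold=0.5):
    """Pick the (concept, value) pair with the highest confidence, if any
    candidate passes the threshold."""
    candidates = [(attr, val) for attr in (c, d) for val in (1, 0)]
    scored = [(conf_over_table(theta, attr, val, table), attr, val)
              for attr, val in candidates]
    conf, attr, val = max(scored)
    return (attr, val) if conf >= threshold else None

table = [{"AGE": "[0,16]", "Parent": 0, "Female": 0},
         {"AGE": "[0,16]", "Parent": 0, "Female": 1},
         {"AGE": "[26,35]", "Parent": 1, "Female": 1}]
# theta built from Ir: AGE = [0,16]; disjuncts C = Parent, D = Female
print(best_expansion({"AGE": "[0,16]"}, "Parent", "Female", table))
```

Here θ ⇒ Parent = 0 has confidence 1.0 over the two matching rows, so the model would be expanded with ¬Parent(x).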
<p>The presented approach for the case in which no rules are available could turn out to
be computationally expensive. As an alternative, the following criterion, grounded in
the exploitation of the prior probability of C (resp. D), could be used. Specifically, the
prior probability is computed, by adopting a frequency-based approach, as P(C) =
|ext(C)|/|A|, where ext(C) is the extension of C, namely the set of individuals
that are instances (asserted or derived) of C, and | · | returns the cardinality of a set.
Similarly, P(D) can be defined for D. The concept to be chosen for extending
Ir will be the one having the highest prior probability.</p>
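The prior-probability criterion can be sketched as follows (illustrative; `is_instance` stands in for checking K |= C(x)):

```python
# Sketch of the prior-probability fallback: P(C) is estimated as
# |ext(C)| / |A|, the fraction of ABox individuals that are (asserted or
# derived) instances of C; the disjunct with the higher prior is chosen.

def prior(concept, individuals, is_instance):
    """Frequency-based estimate of P(concept) over the ABox individuals."""
    ext = [x for x in individuals if is_instance(concept, x)]
    return len(ext) / len(individuals)

def choose_disjunct(c, d, individuals, is_instance):
    pc, pd = prior(c, individuals, is_instance), prior(d, individuals, is_instance)
    return c if pc >= pd else d

# toy ABox: 4 individuals, 3 Female, 1 Male
facts = {("Female", "x1"), ("Female", "x2"), ("Female", "x3"), ("Male", "x4")}
is_instance = lambda c, x: (c, x) in facts
inds = ["x1", "x2", "x3", "x4"]
print(choose_disjunct("Male", "Female", inds, is_instance))  # Female
```

With P(Female) = 0.75 against P(Male) = 0.25, the heuristic expands the model with the Female disjunct.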
        <p>In the cases discussed above, the disjunctive expression is assumed to be made of
atomic concept names. However, in ALC, more complex expressions may occur as part
of a disjunction, such as: existential concept restrictions (e.g. ∃R.A ⊔ ∃R.B),
universal concept restrictions (e.g. ∀R.A ⊔ ∀S.B), and nested concept expressions (e.g. ∃R.∃S.A
or ∃R.(A ⊓ B)). To cope with these cases a straightforward solution is envisioned: new
concept names are created for naming the complex expressions listed above. In this way, a
disjunction of atomic concept names is finally obtained. These new artificial concept names
have to be added to the table representing the heterogeneous source of information (see
Sect. 3.2), and the process for discovering ARs has to be run again (see Sect. 3.2), since
potentially useful ARs for treating the disjuncts may be found. It is important
to note that the artificial concept names are not used for the process of discovering
new knowledge in itself (as illustrated in Sect. 3.2) but only for the reasoning purpose
presented in this section.</p>
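<p>The naming trick can be sketched as below. This is a hedged sketch, not the paper's procedure: the encoding of complex concept expressions as nested tuples, the fresh-name scheme C_k and the function name are all assumptions made for illustration.</p>

```python
# Sketch of flattening a disjunction: every non-atomic disjunct (e.g. an
# existential restriction) is replaced by a fresh artificial concept name,
# so the disjunction becomes one over atomic names only. The definitions
# dict records the new names as columns to add to the heterogeneous table.

import itertools

_fresh = itertools.count(1)

def name_complex_disjuncts(disjuncts, definitions):
    """Replace each non-atomic disjunct with a fresh name C_k, recording
    C_k -> expression in `definitions`."""
    named = []
    for d in disjuncts:
        if isinstance(d, str):          # already an atomic concept name
            named.append(d)
        else:                           # complex expression, e.g. ("exists", "R", "A")
            fresh = f"C_{next(_fresh)}"
            definitions[fresh] = d
            named.append(fresh)
    return named

defs = {}
# (exists R.A) ⊔ B  becomes  C_1 ⊔ B, with C_1 defined as exists R.A
print(name_complex_disjuncts([("exists", "R", "A"), "B"], defs))  # ['C_1', 'B']
print(defs)
```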
        <p>Now let us consider the following example concerning the demographic domain,
where the starting point for the inference is given in Tab. 3. Note that it is assumed that
Parent is true for x2. In the following, the expansion of (Male ⊔ Female)(x) is illustrated
for degreeOfMatch(rule2) = 0.</p>
        <p>Since, when processing a disjunct expansion, we always add assertions supported by the
evidence in the available knowledge, the proposed approach ensures that the model
built is the one most compliant with the statistical regularities learned from the data.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Discussion and concluding remarks</title>
      <p>
        To summarize, in this paper we take a step towards a framework for knowledge
representation and reasoning in which knowledge is specified by a mix of logical formulas
and data linked together. We revise the preliminary results we presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] by
explicitly taking into account the Open World Assumption made in DLs. Differently from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
where federated DBs are considered with the goal of automatically removing structural
conflicts while leaving the views of the different DBs unchanged, we focus on
building a new knowledge base able to collect the complementary knowledge
contained in heterogeneous sources of information. Finally, we propose a method
for data-driven logical reasoning, obtained by combining logical reasoning and
data mining methods embedded in a Tableaux algorithm. Differently from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], where
an integrated system (AL-log) for knowledge representation based on DL and the
deductive database language Datalog is presented, here purely relational databases are
considered. Additionally, while in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] a method for query answering based
on constrained resolution is proposed, in which the usual deduction procedure defined for
Datalog is integrated with a method for reasoning on the structural knowledge, here a
more expressive DL is considered and semantically enriched ARs are also exploited.
      </p>
      <p>Our proposed mixed inference imitates, to some extent, the cognitive reasoning process
performed by humans. A human typically applies logical reasoning when his/her knowledge
of a certain domain, for example the medical domain, can be assumed to be complete. This
step is represented in our case by the standard deductive approach. If some degree of
uncertainty occurs, for instance strange symptoms that do not allow for a straightforward
diagnosis, then existing cases are analyzed to support the given diagnosis or an
alternative one. These existing cases correspond to our external source of information
(DBs and/or ARs). The process for determining a diagnosis is then driven by the
integration of the logical reasoning process with an inductive reasoning process that
takes the additional cases into account and tries to produce a reasonable diagnosis
given the additional available evidence.</p>
      <p>The most plausible model that we build may be enriched with additional knowledge
coming from the external source of information. Specifically, given the rule selected for
resolving a disjunction (see Sect. 3.3), the information on its left-hand side that
concerns only the external source of information could be added to the model under
construction, thus applying a sort of abductive inference. Alternatively, this additional
knowledge may be exploited during the matching process for preferring one rule over
another (besides the condition concerning the confidence of the rule). In particular,
the degree of match between a rule and the model under construction may be considered
as an additional criterion. The same could be done for the additional information on the
right-hand side of a rule, even if this case appears to be somewhat more problematic.</p>
      <p>In the future we aim to implement the extended Tableaux algorithm for
experimental evaluation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Imielinski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Swami</surname>
          </string-name>
          .
          <article-title>Mining association rules between sets of items in large databases</article-title>
          . In P. Buneman and S. Jajodia, editors,
          <source>Proc. of the 1993 ACM SIGMOD Int. Conf. on Management of Data</source>
          , pages
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          . ACM Press,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Baader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          , editors.
          <source>The Description Logic Handbook: Theory, Implementation, and Applications</source>
          . Cambridge University Press,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Lassila</surname>
          </string-name>
          .
          <article-title>The semantic web</article-title>
          .
          <source>Scientific American</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, Riccardo Rosati, Marco Ruzzi, and Domenico Fabio Savo.
          <article-title>The mastro system for ontology-based data access</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>43</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C.</given-names>
            <surname>d'Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bryl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          .
          <article-title>Semantic knowledge discovery from heterogeneous data sources</article-title>
          . In H. Stuckenschmidt et al., editor,
          <source>Proc. of the 18th Int. Conf. on Knowledge Engineering and Knowledge Management (EKAW 2012)</source>
          , to appear,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Donini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Schaerf</surname>
          </string-name>
          .
          <article-title>AL-log: Integrating datalog and description logics</article-title>
          .
          <source>Journal of Intelligent Information Systems</source>
          ,
          <volume>10</volume>
          :
          <fpage>227</fpage>
          -
          <lpage>252</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Grimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Motik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Preist</surname>
          </string-name>
          .
          <article-title>Variance in e-business service discovery</article-title>
          .
          In
          <source>Proceedings of the ISWC Workshop on Semantic Web Services</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>David J.</given-names>
            <surname>Hand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Padhraic</given-names>
            <surname>Smyth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Heikki</given-names>
            <surname>Mannila</surname>
          </string-name>
          .
          <source>Principles of Data Mining</source>
          . MIT Press, Cambridge, MA, USA,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Spaccapietra</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Parent</surname>
          </string-name>
          .
          <article-title>View integration: A step forward in solving structural conflicts</article-title>
          .
          <source>IEEE TRANS. On Knowledge and Data Engineering</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <fpage>258</fpage>
          -
          <lpage>274</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Ian</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eibe</given-names>
            <surname>Frank</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mark A.</given-names>
            <surname>Hall</surname>
          </string-name>
          .
          <source>Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition</source>
          . Morgan Kaufmann,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>