<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explaining Clusters with Inductive Logic Programming and Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilaria Tiddi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathieu d'Aquin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Motta</string-name>
          <email>enrico.motta@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Media Institute, The Open University</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge Discovery consists in discovering hidden regularities in large amounts of data using data mining techniques. The obtained patterns require an interpretation that is usually achieved using background knowledge provided by experts from several domains. Meanwhile, the rise of Linked Data has increased the amount of connected cross-disciplinary knowledge available in the form of RDF datasets, classes and relationships. Here we show how Linked Data can be used in an Inductive Logic Programming process, where it provides background knowledge for finding hypotheses regarding the unrevealed connections between items of a cluster. Using an example with clusters of books, we show how different Linked Data sources can be used to automatically generate rules that give an underlying explanation for such clusters.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        the ones of [
        <xref ref-type="bibr" rid="ref4 ref6 ref8">4, 6, 8</xref>
        ]), few works have considered Linked Data for results
interpretation (some preliminary attempts are to be found in [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ]). However, the
former uses Linked Data only to support the user's navigation, and the latter
does not take into account the whole knowledge discovery process and focuses
on the interpretation of statistical data. For this reason, we aim to exploit the
interconnected knowledge from Linked Data to explain patterns resulting from a
clustering process, by combining the existing semantic technologies with a
Machine Learning technique, i.e. Inductive Logic Programming [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to automatically
produce underlying explanations for the formation of such patterns.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <sec id="sec-2-1">
        <title>2.1 On Inductive Logic Programming</title>
        <p>
          Inductive Logic Programming (ILP) is a research field at the intersection of
Machine Learning and Logic Programming, investigating the inductive construction
of first-order clausal theories starting from a set of examples E = E+ ∪ E− [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
While E+ represents the relation to be learnt, E− contains the facts where the relation
does not hold. The distinguishing feature of ILP is the use of some additional
background knowledge B about the examples in E. Believing B, and faced with
the facts in E, the induction process derives a hypothesis space H. The success
of the induction requires that H covers all the positive examples (H is complete)
and none of the negative ones (H is consistent), with respect to B (i.e., there is
no contradiction with the facts written in B).
        </p>
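        <p>The completeness and consistency conditions can be illustrated with a minimal Python sketch. It is not the ILP system used in the experiments: the facts, the book identifiers and the helper names (covers, is_complete, is_consistent) are assumptions made for illustration, and a hypothesis is simplified to a conjunction of conditions on a single variable.</p>
        <preformat>
```python
# Minimal sketch: completeness and consistency of one candidate hypothesis.
# All facts and names below are illustrative assumptions.

# Background knowledge B: ground facts as (predicate, subject, object) tuples.
B = {
    ("subject", "book 1", "electronic music"),
    ("subject", "book 2", "electronic music"),
    ("subject", "book 4", "political science"),
}

E_pos = {"book 1", "book 2"}  # examples where the target relation holds
E_neg = {"book 4"}            # examples where it does not hold


def covers(hypothesis, example, background):
    """A hypothesis is a list of (predicate, object) conditions; it covers an
    example when every condition is a fact about that example in B."""
    return all((pred, example, obj) in background for pred, obj in hypothesis)


def is_complete(hypothesis, positives, background):
    """True when every positive example is covered."""
    return all(covers(hypothesis, e, background) for e in positives)


def is_consistent(hypothesis, negatives, background):
    """True when no negative example is covered."""
    return not any(covers(hypothesis, e, background) for e in negatives)


# Candidate rule: cl(A) :- subject(A, 'electronic music').
h = [("subject", "electronic music")]
print(is_complete(h, E_pos, B), is_consistent(h, E_neg, B))
```
</preformat>
        <p>In this toy setting an empty hypothesis covers every example, so it is complete but not consistent; adding conditions trades coverage for consistency.</p>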
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Proposed approach</title>
        <p>
          Assuming that we have retrieved some clusters, our approach is articulated as
follows (see Fig. 1):
1. Linked Data Selection. We retrieve information about the data contained
in each cluster from the Linked Data cloud, across several datasets.
2. Hypotheses Generation. We generate some hypotheses using ILP. A
hypothesis is an explanation (“why those items are part of that particular cluster”).
3. Hypotheses Evaluation. We validate the hypotheses using two rule
evaluation measures: the Weighted Relative Accuracy (WRacc, as described in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]),
which provides a trade-off between coverage and relative accuracy and which we exploit to
obtain explanations for small clusters, and the well-known Information
Retrieval F-measure (F).
        </p>
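        <p>The two evaluation measures can be sketched in Python as follows. The sketch assumes the usual definition of Weighted Relative Accuracy for a rule Cond → Class, WRacc = p(Cond) · (p(Class | Cond) − p(Class)), and the balanced F-measure (harmonic mean of precision and recall). The function names and the confusion-matrix counts in the example are hypothetical.</p>
        <preformat>
```python
# Sketch of the two rule evaluation measures, computed from confusion-matrix
# counts. Function names and example counts are illustrative assumptions.

def wracc(tp, fp, fn, tn):
    """Weighted Relative Accuracy: coverage * (precision - class prior)."""
    n = tp + fp + fn + tn
    covered = tp + fp              # examples satisfying the rule body
    if n == 0 or covered == 0:
        return 0.0
    coverage = covered / n         # p(Cond)
    precision = tp / covered       # p(Class | Cond)
    prior = (tp + fn) / n          # p(Class)
    return coverage * (precision - prior)


def f_measure(tp, fp, fn):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rule covering 10 of 100 books, 8 of them in the target cluster.
print(wracc(8, 2, 2, 88), f_measure(8, 2, 2))
```
</preformat>
        <p>Because WRacc weights the gain in accuracy by the coverage of the rule, its values stay small for rules describing small clusters even when the rule is precise.</p>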
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We ran our experiments on the Huddersfield books usage dataset introduced
in the first section. Our target problem is defined as: considering some clustered
books borrowed by students from the Humanities faculty, explain what those books
have in common and why they belong to a particular cluster. The manual analysis
of each cluster's centroid shows that each cluster represents the books borrowed
by students from the same course, such as Music Technologies, Politics or English
Literature.</p>
      <p>For each book, we retrieve some information from the Linked Data cloud. We
first use bibo:isbn10 as an equivalence property to navigate from the
Huddersfield dataset to the British National Bibliography one<sup>3</sup>. From there, we retrieve
information about the book using existing Linked Data vocabularies: Dublin
Core<sup>4</sup> for topic and author, and the Event Ontology<sup>5</sup> for the publication time, place
and publisher. Finally, we exploit the owl:sameAs property to navigate to the
Library of Congress Subject Headings<sup>6</sup> and retrieve the broader concepts of each
topic using the skos:broader property.</p>
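      <p>The navigation path described above can be sketched over a small in-memory set of triples. This is an illustration only: every URI, ISBN and topic below is a hypothetical placeholder, the helper names are assumptions, and a real implementation would query the remote datasets rather than a Python list.</p>
      <preformat>
```python
# Toy triple store; every URI and literal is a hypothetical placeholder.
TRIPLES = [
    ("hud:book_1", "bibo:isbn10", "0000000001"),
    ("bnb:resource_1", "bibo:isbn10", "0000000001"),
    ("bnb:resource_1", "dc:subject", "bnb:topic_electronic_music"),
    ("bnb:topic_electronic_music", "owl:sameAs", "lcsh:electronic_music"),
    ("lcsh:electronic_music", "skos:broader", "lcsh:music"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a known triple."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def broader_topics(book):
    """Follow isbn10 -> dc:subject -> owl:sameAs -> skos:broader for a book."""
    found = []
    for isbn in objects(book, "bibo:isbn10"):
        # Resources in another dataset sharing the same ISBN.
        for twin in subjects("bibo:isbn10", isbn):
            if twin == book:
                continue
            for topic in objects(twin, "dc:subject"):
                for same in objects(topic, "owl:sameAs"):
                    found.extend(objects(same, "skos:broader"))
    return found


print(broader_topics("hud:book_1"))
```
</preformat>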
      <p>Clusters and the Linked Data extracted knowledge are encoded as Prolog
clauses as follows:</p>
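      <table-wrap>
        <table>
          <tbody>
            <tr><td>E+</td><td>clusters</td><td>clMT('book 1').</td></tr>
            <tr><td>E−</td><td>clusters</td><td>clMT('book 4'). clMT('book 5').</td></tr>
            <tr><td>B</td><td>RDF predicates</td><td>subject('book 1','electronic music').</td></tr>
            <tr><td>B</td><td>RDF is-a relations</td><td>book('book 1'). topic('electronic music').</td></tr>
          </tbody>
        </table>
      </table-wrap>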
      <p>Here we search the hypothesis space H specific to the Music Technologies
cluster (clMT). E+ is composed of books in clMT (such as book 1), while books
in other clusters (such as book 4 and book 5) form E−. The process is
repeated for each cluster. Both the RDF binary relations (hud:book 1 dc:subject
'electronic music') and the unary ones (hud:book 1 a bibo:book) are also
transformed into Prolog clauses and then added to B.</p>
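      <p>The transformation of triples into Prolog clauses can be sketched as follows. The sketch assumes that the predicate or class name is taken from the local part of the prefixed term and that atoms are single-quoted; the helper names (local_name, quote, to_clause) are hypothetical, and the two example triples are the ones mentioned in the text.</p>
      <preformat>
```python
# Sketch: turning RDF triples into Prolog facts for the background knowledge B.
# Helper names are illustrative assumptions.

def local_name(term):
    """Drop a prefix such as 'dc:' and lower-case the remainder."""
    return term.split(":", 1)[-1].lower()


def quote(atom):
    """Single-quote a Prolog atom, doubling any embedded quote."""
    return "'" + atom.replace("'", "''") + "'"


def to_clause(subject, predicate, obj):
    """Unary clause for is-a triples, binary clause for everything else."""
    if predicate in ("a", "rdf:type"):
        return f"{local_name(obj)}({quote(local_name(subject))})."
    return f"{local_name(predicate)}({quote(local_name(subject))},{quote(obj)})."


triples = [
    ("hud:book 1", "dc:subject", "electronic music"),
    ("hud:book 1", "a", "bibo:book"),
]
for t in triples:
    print(to_clause(*t))
```
</preformat>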
      <p>We ran several experiments combining different properties (in different Bs),
in order to see the properties' impact on the hypotheses generation. The results are
shown in Table 1. Other hypotheses demonstrated the relations between different
predicates, such as the relation between a publisher and a specific topic (see
Table 2).</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and future work</title>
      <p>We showed how ILP can be used to generate hypotheses that explain patterns,
e.g. “books borrowed by students of Music Technologies are clustered together
because they talk about music”. Although this is a trivial example, automating
such a process is not an easy task. We demonstrated that Linked Data
is important for generating such hypotheses, and that combining different sources
of background knowledge (i.e., different datasets) produces better explanations
of patterns of data. Future work concerns the automatic selection of the
datasets from Linked Data, the use of a more appropriate evaluation measure
and the generalisation of the approach to other data mining techniques.</p>
      <p>3 http://bnb.data.bl.uk/
4 http://dublincore.org/documents/dcmi-terms/
5 http://motools.sourceforge.net/event/event.html
6 http://id.loc.gov/authorities/subjects.html</p>
      <p>[Table fragment; columns: cluster, rule, F (%), WRacc]
cl(A) :- broader(A,'psychology') ∧ pubPlace(A,'oxford')
English Literature: cl(A) :- publisher(A,'routledge') ∧ broader(A,'literature') ∧ broader(A,'philology')
Politics: cl(A) :- publisher(A,'macmillan') ∧ broader(A,'political science') ∧ broader(A,'social sciences')
F (%): 10.3; WRacc: 0.004</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>d'Aquin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jay</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Interpreting Data Mining Results with Linked Data for Learning Analytics: Motivation, Case Study and Directions</article-title>
          .
          <source>In Third Conference in Learning Analytics and Knowledge (LAK)</source>
          , Leuven, Belgium.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fayyad</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piatetsky-Shapiro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>From data mining to knowledge discovery in databases</article-title>
          .
          <source>AI magazine</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <fpage>37</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Muggleton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>De Raedt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Inductive logic programming: Theory and methods</article-title>
          .
          <source>The Journal of Logic Programming</source>
          ,
          <volume>19</volume>
          ,
          <fpage>629</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Narasimha</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kappara</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ichise</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vyas</surname>
            ,
            <given-names>O. P.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>LiDDM: A Data Mining System for Linked Data</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lavrač</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flach</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zupan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Rule Evaluation Measures: A Unifying View</article-title>
          .
          <source>In Proceedings of the 9th International Workshop on Inductive Logic Programming (ILP '99)</source>
          . Springer-Verlag, London, UK,
          <fpage>174</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp; Fumkranz, J. (
          <year>2012</year>
          , June).
          <article-title>Unsupervised generation of data mining features from linked open data</article-title>
          .
          <source>In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics</source>
          (p.
          <fpage>31</fpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Generating possible interpretations for statistics from Linked Open Data</article-title>
          .
          <source>In The Semantic Web: Research and Applications</source>
          . Springer, pp.
          <fpage>560</fpage>
          -
          <lpage>574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Deursen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Enabling advanced context-based multimedia interpretation using linked data.</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>