<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph Kernels for Task 1 and 2 of the Linked Data Data-Mining Challenge 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gerben Klaas Dirk de Vries</string-name>
          <email>g.k.d.devries@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>System and Network Engineering Group, Informatics Institute, University of Amsterdam</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>In this paper we present the application of two RDF graph kernels to tasks 1 and 2 of the linked data data-mining challenge. Both graph kernels use term vectors to handle RDF literals. Based on experiments with the task data, we use the Weisfeiler-Lehman RDF graph kernel for task 1 and the intersection tree path kernel for task 2 in our final classifiers for the challenge. Applying these graph kernels is very straightforward and requires (almost) no preprocessing of the data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Kernel methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are a popular solution to learning from structured data.
Graph kernels provide an interesting approach to learning from RDF. Currently
there exists some research on this topic [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. The main advantage of using graph
kernels on RDF data is that the technique is very generic and can be widely
applied without much preprocessing and domain knowledge of the datasets
involved.
      </p>
      <p>In this paper we present the application of two RDF graph kernel methods
to two data-mining tasks on RDF data. Both tasks are part of the linked data
data-mining challenge (http://keg.vse.cz/dmold2013/data-description.html), which is part of the 2013 Data-Mining on Linked Data
(DMoLD) workshop. We apply these techniques with very little preprocessing
of the task data.</p>
      <p>In the rest of this paper we first briefly describe the algorithms used, which
are both described in more detail in other papers. Then we discuss some
experiments with the data, which lead to the classifiers used for the two tasks. We
end with some conclusions.</p>
      <sec id="sec-1-1">
        <title>Graph Kernels</title>
        <p>The graph kernel approach to learning from RDF is based on the idea that RDF
instances are represented by their subgraphs. By computing a kernel function
on the subgraphs and then training a classifier, e.g. a support vector machine,
we can predict properties for the instances.</p>
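        <p>As a rough illustration of this representation, the sketch below extracts the subgraph around an instance node up to a fixed depth from a plain list of triples. It is a generic, hypothetical helper, assuming triples are given as string arrays, and not the Sesame-based extraction used in the actual code.</p>
        <preformat><![CDATA[
import java.util.*;

// Sketch: collect the triples reachable from an instance node within a given
// number of hops. Triples are plain {subject, predicate, object} arrays.
public class SubgraphExtraction {

    static List<String[]> extract(List<String[]> triples, String instance, int depth) {
        Set<String> frontier = new HashSet<>(Set.of(instance));
        Set<String[]> subgraph = new LinkedHashSet<>();
        for (int hop = 0; hop < depth; hop++) {
            Set<String> next = new HashSet<>();
            for (String[] t : triples) {
                if (frontier.contains(t[0])) { // subject lies on the current frontier
                    subgraph.add(t);
                    next.add(t[2]);            // follow the edge to the object
                }
            }
            frontier = next;
        }
        return new ArrayList<>(subgraph);
    }
}
]]></preformat>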
        <p>
          For tasks 1 and 2 of the challenge we used two graph kernel algorithms. The
first algorithm is an extension of the Weisfeiler-Lehman RDF (WL RDF) kernel
presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The second kernel is called the Intersection Tree Path (ITP)
kernel and is presented in an accompanying paper in the DMoLD workshop [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>The WL RDF kernel computes the number of subtrees shared between two
graphs by using the Weisfeiler-Lehman test of graph isomorphism. The rewriting
procedure underlying the Weisfeiler-Lehman kernel creates a multiset label for
each vertex/edge, based on the labels of the neighbors of that vertex/edge. This
multiset is sorted and, together with the original label, concatenated into a string,
which is the new label. For each unique string a new (shorter) label is introduced
and this replaces the original vertex label. Based on the counts of the different
labels a feature vector is constructed for each instance.</p>
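        <p>The sketch below illustrates one iteration of the plain Weisfeiler-Lehman relabeling on vertex labels only; the WL RDF kernel of [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] additionally rewrites edge labels and takes the subgraph depth into account, so this is a simplified, hypothetical illustration rather than the actual implementation.</p>
        <preformat><![CDATA[
import java.util.*;

// Sketch: one Weisfeiler-Lehman relabeling iteration on a labeled graph,
// followed by counting labels into a feature vector. Illustrative only.
public class WLRelabeling {

    // graph: adjacency lists; labels: current label per vertex;
    // dictionary: maps each unique signature to a new, shorter label.
    static Map<Integer, String> relabel(Map<Integer, List<Integer>> graph,
                                        Map<Integer, String> labels,
                                        Map<String, String> dictionary) {
        Map<Integer, String> newLabels = new HashMap<>();
        for (Integer v : graph.keySet()) {
            // Collect the neighbor labels into a multiset and sort it.
            List<String> neighborLabels = new ArrayList<>();
            for (Integer u : graph.get(v)) {
                neighborLabels.add(labels.get(u));
            }
            Collections.sort(neighborLabels);
            // Concatenate the original label with the sorted multiset.
            String signature = labels.get(v) + "_" + String.join("_", neighborLabels);
            // Replace the signature by a compact new label.
            String compact = dictionary.computeIfAbsent(signature,
                    s -> "wl" + dictionary.size());
            newLabels.put(v, compact);
        }
        return newLabels;
    }

    // Count label occurrences to build the feature vector of one instance.
    static Map<String, Integer> countLabels(Map<Integer, String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels.values()) {
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }
}
]]></preformat>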
        <p>
          The intersection tree path kernel counts all the paths starting from the
instance vertex up to depth d and creates a feature vector with these counts. It is
similar to the intersection subtree kernel in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], but is a lot faster to compute.
        </p>
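        <p>A minimal sketch of the path counting idea follows, assuming a generic directed graph with labeled edges; for brevity only predicate (edge) labels are recorded on the paths, whereas the ITP kernel of [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] also uses vertex labels.</p>
        <preformat><![CDATA[
import java.util.*;

// Sketch: count all label paths of length <= maxDepth starting from an
// instance vertex. Illustrative stand-in, not the ITP implementation.
public class PathCounting {

    record Edge(String label, int target) {}

    static Map<String, Integer> countPaths(Map<Integer, List<Edge>> graph,
                                           int instanceVertex, int maxDepth) {
        Map<String, Integer> counts = new HashMap<>();
        collect(graph, instanceVertex, "", maxDepth, counts);
        return counts;
    }

    private static void collect(Map<Integer, List<Edge>> graph, int vertex,
                                String path, int depth,
                                Map<String, Integer> counts) {
        if (depth == 0) {
            return;
        }
        for (Edge e : graph.getOrDefault(vertex, List.of())) {
            // Extend the current path with the edge label and count it.
            String extended = path + "->" + e.label();
            counts.merge(extended, 1, Integer::sum);
            collect(graph, e.target(), extended, depth - 1, counts);
        }
    }
}
]]></preformat>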
        <p>
          Both algorithms that we use are extended to handle RDF literals using a
bag-of-words approach, by creating a term vector for each literal. Two term vectors
are compared when the vertex label, for WL RDF, or the path to the vertex,
for ITP, are equal. More details can be found in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. When working with text
and term vectors it is standard to normalize term vectors by converting them to
Term Frequency-Inverse Document Frequency (TF-IDF) vectors. We also apply
this normalization to our computed feature vectors, for both the WL RDF kernel
and the intersection tree path kernel.
        </p>
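        <p>A minimal sketch of the TF-IDF reweighting of the count vectors, using the standard term-frequency times inverse-document-frequency formulation; the exact weighting and normalization used in the actual code may differ.</p>
        <preformat><![CDATA[
import java.util.*;

// Sketch: convert raw feature counts per instance into TF-IDF weighted
// vectors. Standard formulation; illustrative only.
public class TfIdf {

    static List<Map<String, Double>> tfIdf(List<Map<String, Integer>> counts) {
        int n = counts.size();
        // Document frequency: in how many instances does each feature occur?
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> instance : counts) {
            for (String feature : instance.keySet()) {
                df.merge(feature, 1, Integer::sum);
            }
        }
        // Reweight each count by log(n / document frequency).
        List<Map<String, Double>> weighted = new ArrayList<>();
        for (Map<String, Integer> instance : counts) {
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : instance.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                vector.put(e.getKey(), e.getValue() * idf);
            }
            weighted.add(vector);
        }
        return weighted;
    }
}
]]></preformat>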
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>In the following sections we present our experiments with the Weisfeiler-Lehman
RDF with bag-of-words kernel and the intersection tree path with bag-of-words
kernel on tasks 1 and 2 of the linked data data-mining challenge. First we discuss
some preprocessing that we did and then we present the results on the two
tasks. During our investigation, some other graph kernels were also applied to
the data, but we do not cover those here, since the two kernels discussed in this
paper showed the best results.</p>
      <p>
        All of the code for the experiments was written in Java and is available
online at https://github.com/Data2Semantics/d2s-tools. For our classification algorithm we used the Java version of LibLINEAR
        [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and we used Sesame (http://www.openrdf.org/) to handle RDF data and to add extra inferred triples
using its RDFS reasoner.
      </p>
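      <p>For reference, a minimal sketch of training a linear SVC on sparse feature vectors with recent versions of the Java port of LibLINEAR (the de.bwaldvogel.liblinear package); the tiny data set and the parameter values are made up for illustration and are not the settings used for the challenge.</p>
      <preformat><![CDATA[
import de.bwaldvogel.liblinear.*;

// Sketch: train a linear SVC on sparse feature vectors with the Java port
// of LibLINEAR. The two-instance data set below is made up for illustration.
public class TrainExample {

    public static void main(String[] args) {
        // Instances with 1-based, sorted feature indices.
        Feature[][] x = {
            { new FeatureNode(1, 0.5), new FeatureNode(3, 1.0) },
            { new FeatureNode(2, 2.0) }
        };
        double[] y = { 1.0, -1.0 };  // class labels

        Problem problem = new Problem();
        problem.l = x.length;   // number of training instances
        problem.n = 3;          // number of features
        problem.x = x;
        problem.y = y;

        // L2-regularized L2-loss SVC, C = 1.0, stopping tolerance 0.01.
        Parameter parameter = new Parameter(SolverType.L2R_L2LOSS_SVC_DUAL, 1.0, 0.01);
        Model model = Linear.train(problem, parameter);

        double prediction = Linear.predict(model, x[0]);
        System.out.println("predicted class: " + prediction);
    }
}
]]></preformat>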
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>We have applied relatively little preprocessing to the provided train and test
sets. In all the datasets we had to correct the literals of the type xsd:gYear,
since Sesame did not parse them properly. This correction was simple and only
involved changing the full date string to just the year.</p>
        <p>For task 1 we removed instances for which the label occurred less than 5 times,
to remove outliers. The numberOfTenders values for the remaining instances
were then binned into the following bins: [0.5, 1.5), [1.5, 2.5), [2.5, 3.5), [3.5, 4.5),
[4.5, 5.5), [5.5, 6.5), [6.5, 7.5), [7.5, 8.5), [8.5, 9.5), [9.5, 12.5), [12.5, 15.5), [15.5, 18.5),
[18.5, 23.5]. The size of the bins follows the distribution of the values of the
numberOfTenders property. We chose to introduce binning so that next to a
regression algorithm we can also use a classification algorithm.</p>
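        <p>A small sketch of how this binning can be implemented, using the bin boundaries listed above; the helper below is hypothetical and simply returns the index of the bin a value falls into.</p>
        <preformat><![CDATA[
// Sketch: bin a numberOfTenders value using the boundaries listed above.
// Hypothetical helper, not taken from the actual code.
public class TenderBinning {

    // Upper boundaries of the 13 bins: [0.5,1.5), [1.5,2.5), ..., [18.5,23.5].
    private static final double[] UPPER = {
        1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 12.5, 15.5, 18.5, 23.5
    };

    static int bin(double numberOfTenders) {
        for (int i = 0; i < UPPER.length; i++) {
            if (numberOfTenders < UPPER[i]) {
                return i;
            }
        }
        return UPPER.length - 1; // values at the upper edge fall in the last bin
    }
}
]]></preformat>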
        <p>We experimented with loading additional data from DBpedia via the sameAs
relations provided in the datasets. However, this did not influence performance.
Considering the nature of our graph kernel approach, this is somewhat to be
expected, since the links to DBpedia only add more general knowledge to the
graph instead of more specific knowledge.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Task 1</title>
        <p>For the first task we used the default SVC classification and SVR regression
algorithms in LibLINEAR. The classification algorithm was used with the binned
version of the numberOfTenders property as classes. For the regression algorithm we
used the numberOfTenders value directly. The training dataset was split into 80%
data for training and 20% for testing. The part for training was again split into 80%
for training and 20% for validation, in order to optimize the parameters of the
algorithms. These splits were repeated 10 times. For classification and regression
we optimized the C parameter from 10<sup>-4</sup>, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10<sup>4</sup>,
and in regression we also optimized the p parameter from 10<sup>-6</sup>, 10<sup>-5</sup>, 10<sup>-4</sup>,
0.001, 0.01. During optimization the evaluation function given for this task (T1)
was used to test the trained model. In the classification case we used the average
of a bin as the prediction for the numberOfTenders. Before training, the
numberOfTenders property was removed from the data. We test three extraction depths
for the subgraphs: 1, 2, 3.</p>
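        <p>This parameter optimization can be sketched as a simple grid search, where a hypothetical validationScore hook stands in for training a model with a given C on the 80% training part and returning the T1 score on the 20% validation part.</p>
        <preformat><![CDATA[
import java.util.function.ToDoubleFunction;

// Sketch: choose the C parameter from a grid by validation score.
// validationScore is a hypothetical hook that trains with the given C and
// returns the T1 evaluation score on the validation split.
public class SelectC {

    static double bestC(double[] grid, ToDoubleFunction<Double> validationScore) {
        double best = grid[0];
        double bestScore = Double.NEGATIVE_INFINITY;
        for (double c : grid) {
            double score = validationScore.applyAsDouble(c);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // C grid used in the paper for both classification and regression.
        double[] grid = { 1e-4, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 1e4 };
        System.out.println(grid.length + " candidate values for C");
    }
}
]]></preformat>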
        <p>Results for this experiment are given in Table 1. The T1 binned scores are
the evaluation function scores for the classification algorithm. The other three
scores are for the regression algorithm, which also include the Mean Squared
Error (MSE) and Mean Absolute Error (MAE). The scores in bold are the scores
for the kernel and settings that we used in the final classifier.</p>
        <p>[Table 1: T1 scores for the binned classification setup and for the regression setup (including MSE and MAE), per kernel, extraction depth and number of WL iterations h.]</p>
        <p>The scores for the different kernels and settings do not differ much. For our
final classifier we chose the WL RDF with bag-of-words kernel, with depth 2
and h = 2, since this showed good results for all 4 scores. For our final classifier,
which we used on the challenge test set, we used the binned version of the
numberOfTenders property.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4 http://keg.vse.cz/dmold2013/data-description.html</title>
        <p>depth</p>
        <p>T1
binned
Task 2 is a binary classi cation problem and we used the default SVC classi
cation algorithm in LibLINEAR to train a classi er. Since this dataset is smaller
than the task 2 dataset we used a cross-validation setup for the experiments. Per
kernel we did a 10-fold cross-validation which was repeated 10 times. Within each
fold the C parameter was optimized by doing 10-fold cross-validation. Before
training the multicontracts property was removed from the data.</p>
        <p>Table 2 presents the results for this experiment. We provide the accuracy and
F1 scores. Again, the bold scores indicate the kernel and settings that we used
for the final classifier. The intersection tree path kernel achieves better scores
than the WL RDF kernel (especially F1). The baseline accuracy is 0.81 and the F1
is 0.50. Given the scores that we achieved, we can conclude that this task is
difficult.</p>
        <p>[Table 2: accuracy and F1 scores per extraction depth for the Weisfeiler-Lehman RDF with bag-of-words kernel (h = 0, 2, 4) and the intersection tree path with bag-of-words kernel.]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We have presented the application of the Weisfeiler-Lehman RDF and intersection
tree path kernels, both extended with bag-of-words term vectors for the literals,
to tasks 1 and 2 of the linked data data-mining challenge. For task 1 the final
classifier was trained using the WL RDF with bag-of-words kernel with the
values of the numberOfTenders property binned into 13 bins. The final classifier
for task 2 was trained with the intersection tree path with bag-of-words kernel.
The application of both kernels in the two tasks was very straightforward. The
most complicated preprocessing step was the binning of the numberOfTenders
values.</p>
      <p>Acknowledgments. This publication was supported by the Dutch national
program COMMIT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Cristianini</surname>
          </string-name>
          , N.:
          <article-title>Kernel Methods for Pattern Analysis</article-title>
          . Cambridge University Press, New York, NY, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Losch, U.,
          <string-name>
            <surname>Bloehdorn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rettinger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Graph kernels for RDF data</article-title>
          . In Simperl, E.,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Presutti</surname>
          </string-name>
          , V., eds.
          <source>: ESWC</source>
          . Volume
          <volume>7295</volume>
          of Lecture Notes in Computer Science, Springer (
          <year>2012</year>
          )
          <fpage>134</fpage>
          -
          <lpage>148</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. de Vries,
          <string-name>
            <surname>G.K.D.</surname>
          </string-name>
          :
          <article-title>A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data</article-title>
          . In Blockeel, H.,
          <string-name>
            <surname>Kersting</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nijssen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zelezny</surname>
          </string-name>
          , F., eds.: ECML/PKDD. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. de Vries, G.K.D., de Rooij,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>A fast and simple graph kernel for RDF</article-title>
          .
          <source>In: DMoLD</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          )
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>