<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Binary Vector based Propositionalization Strategy for Multivalued Relations in Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Florian Jakobs</string-name>
          <email>florian.jakobs@stud.uni-due.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yordan Terziev</string-name>
          <email>yordan.terziev@paluno.uni-due.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volker Gruhn</string-name>
          <email>volker.gruhn@paluno.uni-due.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Duisburg-Essen</institution>
          ,
          <addr-line>Essen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Machine learning on linked data is strongly dependent on the selection of high quality data features to achieve good results and build reusable and generalizable models. In this work, we explore the problem of representing multivalued relations in a suitable form for machine learning while keeping the human comprehensibility of the resulting model. Specifically, we propose the use of a binary vector representation and compare it to two state of the art approaches. Our evaluation shows that the binary vector representation achieves mostly higher accuracy in comparison to standard propositionalization techniques. It also achieves comparable accuracy to a recently presented graph embeddings approach, while retaining the human comprehensibility.</p>
      </abstract>
      <kwd-group>
        <kwd>Propositionalization</kwd>
        <kwd>Linked open data</kwd>
        <kwd>Data mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Linked Open Data (LOD) has become one of the main approaches for combining
structured and unstructured data from different domains. It provides context and relations
between entities and connects them to other domains. In this way, LOD can be used as
source for querying all kind of properties for entities of interest. The queried properties
can be used as features for training machine learning (ML) models that subsequently
can be applied to multiple real-world problems like classifying the entities or validating
their relations to other entities.</p>
      <p>
        However, a typical problem in ML on LOD is the generation and selection of
features. To extract features from a LOD graph different transformations have to be
performed in a process called propositionalization [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The different transformations may
produce features for every outgoing/ingoing relation, binary features that represent the
existence of a given relation or numerical features that count the number of relations of
a certain type [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Furthermore it is also possible to generate features using graph
embedding techniques and graph sub-structures [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ]
      </p>
      <p>
        In this work, we complement the proposed approaches for generating features with
a previously unexplored technique for representing multivalued properties as binary
vector. We evaluate this technique and compare it to the approaches published in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        There are multiple ways to perform a propositionalization of multivalued features. On
the one hand Ristoski and Paulheim [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] use the Weisfeiler-Lehman algorithm (WL)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]in order to generate sub structures in the RDF graph. On the other hand, they take
into consideration neural language modelling techniques [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] based on random graph
walks for latent feature generation (e.g. W2V CBOW 500 in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
      <p>While in these approaches the problem of representing multivalued features is
nonexistent, the generated features (latent or compounded) are not comprehensible for
humans. Therefore, the resulting models cannot be used in domains where one needs to
understand how the model makes its decisions.</p>
      <p>
        Another approach for generating features is presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
most recently evaluated in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this approach known as rel-vals out [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the features
are derived from generic relations-values for outgoing relations of an entity including
the value of the relation. The multivalued relations on the other hand, are handled by
counting every same named relation and using the sum as the attribute’s value [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>
        We propose a binary vector representation for propositionalization of multivalued
relations that keeps the connected values. The implementation can be found on our GitHub
account [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].These type of relations have been only indirectly covered by the counting
approach in [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ]. The main goal of our approach is, to generate features that are
comprehensible for humans and decisions made by the generated models are retraceable.
This is crucial for critical domains like medicine, where ethical and legal concerns are
at stake. To present our approach, we use the example RDF Graph in Fig. 1 and the
resulting representation in Table 1.
      </p>
      <p>P1
worksAtProject</p>
      <p>Person1
financedBy</p>
      <p>O1
worksAtProject</p>
      <p>O2
financedBy</p>
      <p>P2
financedBy</p>
      <p>P3
worksAtProject</p>
      <p>Person2</p>
      <p>
        Let’s say the feature generation starts at Person1. The approach performs a Breadth
First Search (BFS) with a predefined depth given by the user. For the first iteration, we
generate a feature for every entity and property reached by the BFS. We consider only
outgoing relations and their target’s value. Because of the relations presented in Fig. 1,
we generate a feature for the projects a person is working at. This feature is multivalued,
because a person can participate in multiple projects. For each subsequent iteration, we
take the values of the previously generated features and perform a BFS on them until
the predefined depth is reached. This results in new features like the financing
organization (O1 or O2) of a project shown in Fig. 1 that indirectly finances the person. After
the final step of generating features, we transform the data into binary vectors. The
example from Fig. 1 results in the binary vectors (one per row) shown in Table 1.
Our final vectors consist of nominal values for single valued relations and of the values
0 and 1, which result from the transformation of each value of every multivalued
relation. This means we produce a feature that for example links a specific project with its
financier (e.g. P3:::O2), as shown in Table 1
We compare the here presented approach to rel-vals out and W2V CBOW 500 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on
the datasets presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We used the same small datasets and the DBPedia datasets
as presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and also recreated the presented evaluation environment. Currently
we measured the performance accuracy using only C4.5 ML algorithm, because
decision trees are closest to the human understanding of decision making thus keeping the
model comprehensible. For building the decision trees we used Weka Version 3.6.13.
      </p>
      <p>
        Table 2 and Table 3 show the measured accuracy of the binary vector representation
in comparison to the results published in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The results show that our approach
executed with BFS depth 1 outperforms rel-vals out in all datasets, except MUTAG. For
the Albums and Movies dataset Ristoski et. al. state that their approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] does not
terminate within 10 days or has run out of memory. We compare these two approaches,
because both only take the directly connected values in consideration.
      </p>
      <p>
        W2V CBOW 500 and DB2vec CBOW 500 8 on the other hand, use paths with length
8 (node hop distance of 4) to generate sentences. Thus, we compare these techniques to
Binary-Vector-BFS-Depth4. On the small datasets, our approach outperforms the W2V
CBOW 500 except on MUTAG. On the large datasets similarly to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] our approach
failed to terminate after 3 days on the Albums, Movies and Forbes datasets. On the
other two datasets (Cities and AAUP) our approach achieved slightly better accuracy.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        Having the goal of comprehensibility in mind the proposed representation as binary
vectors provides a better understanding of the resulting model in comparison to latent
features generated by RDF2Vec [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The approach outperforms the standard
propositionalization strategy on almost all datasets. Comparing the binary vector representation
to the word embeddings approach showed that on the small datasets, it achieved mostly
higher accuracy. On the large datasets the accuracy was comparable, however, our
approach didn’t terminate within 3 days on some of them. Our assumption is that for the
smaller datasets the neural network couldn’t be trained sufficiently, and therefore
performed with lower accuracy. However, this assumption has to be further evaluated in
our future work. Furthermore, we would like to experiment with other machine learning
models, improve accuracy by restricting the relations to semantic correct ones and
experiment with Depth First Search as graph traversal strategy.
      </p>
      <p>Acknowledgments. The presented work has been partly funded by the German Federal
Ministry of Education and Research (BMBF, FKZ 16SV7120K) SIM4BGM</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Džeroski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavrač</surname>
          </string-name>
          , N. (eds.):
          <source>Relational Data Mining</source>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Rdf2vec. Rdf graph embeddings for data mining</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          , pp.
          <fpage>498</fpage>
          -
          <lpage>514</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>A comparison of propositionalization strategies for creating features from linked open data</article-title>
          .
          <source>Linked Data for Knowledge Discovery</source>
          <volume>6</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Shervashidze</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schweitzer</surname>
          </string-name>
          , P.,
          <string-name>
            <surname>van Leeuwen</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehlhorn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borgwardt</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          :
          <article-title>Weisfeiler-lehman graph kernels</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2539</fpage>
          -
          <lpage>2561</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fümkranz</surname>
          </string-name>
          , J.:
          <article-title>Unsupervised generation of data mining features from linked open data</article-title>
          .
          <source>In: Proceedings of the 2nd international conference on web intelligence, mining and semantics</source>
          , p.
          <volume>31</volume>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jakobs</surname>
          </string-name>
          , F.: ba_edtEvaluation, https://github.com/Floishy/ba_edtEvaluation
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vries</surname>
          </string-name>
          , G.K.D. de, Paulheim, H.:
          <article-title>A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          , pp.
          <fpage>186</fpage>
          -
          <lpage>194</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>