<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RDF Knowledge Graph Mining over Data Cubes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Petr Novak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojtech Svatek</string-name>
          <email>svatekg@vse.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Prague University of Economics and Business</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Existing e orts in RDF knowledge graph (KG) mining solely addressed fact-level KGs. Many KGs, especially those published by government entities, however contain aggregated data modeled using data cubes. We applied a state-of-the-art KG mining tool on RDF datasets of two government institutions and on interlinked fact-level KGs (Wikidata and YAGO), in multiple settings. Some of the results are meaningful, the main challenges being the low level of support of individual rules and their di cult readability.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge graph</kwd>
        <kwd>relational mining</kwd>
        <kwd>Data Cube</kwd>
        <kwd>linked government data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        in such a scenario and the potential usability of the discovered rules. A secondary
aim was to provide additional feedback to an RDF mining tool, RDFRules [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
as a novel reimplementation of AMIE developed by our group. Compared to the
original AMIE, RDFRules provides, among other, ample pre-processing options,
a pattern language for specifying the shape of rules, mining over multiple graphs,
or rule post-processing (clustering and pruning).
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>RDF Mining Speci cs</title>
      <p>
        While rule mining sometimes lags behind other (statistical or neural) inductive
learning paradigms in predictive accuracy, its advantage is the immediate
interpretability of the models. However, rule mining algorithms, such as Apriori [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
are limited by the shape of the analyzed data in the form of a single or
several related tables. Therefore, these algorithms are not applicable to the
graphstructured RDF model. The application of Inductive Logical Programming to
rule mining does not have this limitation, but requires negative examples. These
are mostly unavailable in Linked Data. An algorithm speci cally designed to
mine rules from data operating under OWA and consisting of binary predicates
(just as Linked Data) is AMIE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. AMIE (and its novel implementation
RDFRules [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) mines rules consisting of a head as a single triple and a body as a
conjunction of triples, with triples containing variables, e.g.
      </p>
      <p>(?a ex:worksFor ?b) ^ (?b ex:hasHeadquartersIn ?c) ) (?a ex:livesIn ?c)
The rules are ltered by interestingness measures, some of which are adapted
from propositional mining (e.g., support) and some are new (e.g., head coverage).
In the context of the Apriori algorithm, the support of a rule corresponds to the
number of transactions (rows in the table) that conform to both the assumption
and the conclusion of the rule. As classical support cannot be straightforwardly
transferred to the case of rules with variables, for the AMIE algorithm the
support measure is rede ned as the number of triples that are correctly predicted
by the rule. Head coverage, in turn, is a novel measure introduced by AMIE. It
is de ned as the ratio of the support of a rule and the head size: the number of
triples having the same predicate as the rule head.</p>
      <p>Both support and head coverage describe a quantitative signi cance of the
rule in relation to the examined data. They quantify the true predictions of the
rule but do not take into account the false predictions, as e.g. the con dence
measure does. Generally speaking, con dence is a ratio of true predictions of a
rule to the sum of true predictions and the counterexamples. While mining rules
from the KG, however, one cannot simply consider the absence of a triple as
a counterexample. AMIE treats this problem by de ning a con dence measure
based on so-called Partial Completeness Assumption (PCA), which states, that
if the KG contains triple with a subject s a predicate p, the KG already
contains a complete set of possible triples with the subject s and the predicate p.
Therefore, a triple predicted by the rule missing in the KG is only considered a
counterexample if its subject appears in the KG with the triple's predicate and
with another object o0.</p>
    </sec>
    <sec id="sec-3">
      <title>RDF Data Cube Mining Speci cs</title>
      <p>Let us now brie y compare the problem addressed in our research with both
traditional analytics over data cubes and with rule mining from fact KGs.</p>
      <p>The analysis of multi-dimensional cubes by statistical or OLAP techniques
operates on continuous values of measures. Conversely, for rule mining the
measures have to be discretized into intervals, to ensure both su cient frequencies
of values and a manageable search space. Since it entails some information loss,
mining a single data cube after discretization would probably be meaningless.
The strong side of RDF rule mining may however be the possibility to mine
across multiple interlinked cubes, and further, to connect fact KGs to them.</p>
      <p>The rules must respect the structure of the cubes. All rule atoms (triple
patterns) referring to a data cube will have a variable representing the
observation as their subject, and their predicates will correspond to dimensions and
measures; additionally, the observation variable will be linked to the dataset IRI
through the DCV predicate qb:dataSet. A further data cube can be linked to the
rst one through a common dimensional value; similarly, any triple from a fact
KG can be linked to a dimensional value.</p>
      <p>The presence of data cube atoms also impacts the computation of rules'
interestingness measures. Namely, the total number of instantiations of a set of
rule atoms whose predicates are dimensions, e.g.,</p>
      <p>
        (?o ex:refArea ?a) ^ (?o ex:refPeriod ?b)
equals to the product of the number of areas and the number of periods, since
each combination of dimensions is present in the data cube. Cubes with more
values for their dimensions thus arti cially increase the support of rules in which
they appear. However, as we veri ed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], another interestingness measure of
AMIE/RDFRules, head coverage, is neutral wrt. the data cube size.
      </p>
      <p>
        The greatest problem in rule mining from aggregated data (identi ed in
propositional rule mining [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], but valid for relational mining as well) is the
assurance of commensurability of the dimensional values. For example, if some region
is much bigger than others, it will tend to trivially appear in many rules.
Decomposition of large cubes to multiple smaller ones (and then running separate
mining tasks over them) is necessary in such cases; this however leads to lowered
values of interestingness measures. The same discretization of measure values
may also be only shared between observations with a commensurable context.
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We carried out a series of experiments involving datasets published in RDF by
two government institutions: the pensions dataset by the Czech Social Security
Administration, surveying the average amount of pension, the average age and
the number of persons, and several datasets by the Czech Statistical O ce, such
as the `jobs applicant and unemployment rate' (further, JAUR), or the dataset
of average salaries. The shared dimensions of the datasets are the sex, the year
and the region. Additionally, we extracted relevant portions of public fact KGs:
Wikidata and YAGO. The triples extracted from Wikidata deal with heads of
state and regional governments in particular years, their political parties, and
the political alignment of these parties (left, center, far-right, etc.). Similarly,
the triples extracted from YAGO relate to the regions.</p>
      <p>The rules were required to have a measure in the head. The body pattern
varied from task to task. Representative examples of analytical questions expressed
through the RDFRules pattern functionality are: 1) Is there a relation between
the pension expenses and the political alignment of the state government? 2)
Is there a relation between the number of job applicants and the features of a
region? 3) For which groups does a relation between salaries and pensions hold?</p>
      <p>RDFRules was called through its API. The source code of the experiments
is stored in a repository on github2 in the form of jupyter notebooks.</p>
      <p>In the rst task, \political alignment vs. pensions", the pensions dataset was
mined in combination with Wikidata. The data set was sliced into subcubes
for each pension kind, and the discretization was performed separately on each.
The rule pattern can be translated as: `If in any year the current prime minister
belongs to a political party that has a certain political alignment then the annual
expenses of a certain pension kind t into a certain interval'. The uniform size
of the subcubes allowed to use the support as the interestingness measure; 29
rules had a support higher than 1. The following example rule has the support
of 3 and the con dence value of 1, and states that all of the three years of the
center-right government featured old-age pension expenses in the upper third
for the period observed in the data set.
(?headRole x:appliesToRefPeriod ?refPeriod) ^
(wd:Q213 p:P6 ?headRole) ^
(?headRole ps:P6 ?person) ^
(?person p:P102 ?partyMembership) ^
(?alignment rdfs:label "centre-right") ^
(?party wdt:P1387 ?alignment) ^
(?partyMembership ps:P102 ?party) ^
(?partyMembership x:appliesToRefPeriod ?refPeriod) ^
(?observation cssz-dimension:refPeriod ?refPeriod) ^
(?observation cssz-dimension:druh-duchodu
cssz-pensionkind:PK_old_age_total_S_SI_SRN_ST_SD_SR_2010) ^
(?observation qb:dataSet cssz-dataset:vydaje-na-duchody-v-cr)
-&gt;
(?observation cssz-measure:vydaje-na-duchody-opravene-o-zalohy-v-tis-kc
&lt;&lt;3.1797095155E8__3.8222294828E8)_ef3_3/3&gt;)</p>
      <p>In the second task, the JAUR dataset was mined in combination with YAGO.
The rule pattern was more open here: any features from YAGO could appear in
the body as long as they were linked to a region. The algorithm then resorted to
nding common features of areas with a certain measure in a certain interval.</p>
      <p>In the third task the two data cubes from di erent providers were analyzed,
and their measures were required to appear one in the body and one in the head.</p>
      <p>Across the tasks, the intuitive plausibility of the rules varied. The rules with
a `monetary' measure in the head (as in the rst task) may su er from a bias
introduced by the in ation; it makes the pensions partially grow regardless the
2 https://github:com/nvkp/diploma-thesis-code
government alignment. When we allowed for measures from the same cube to
appear in the body and in the head, plausible but trivial rules were found (in the
second task), e.g., describing the relationship between the two JAUR measures:
the unemployment rate and the number of job applicants. Some of the
associations found via fact KG links also featured chains re ecting rather indirect if
not dubious relationships; for example, the birth regions of players of a
particular sports club were associated with a high number of job applicants. The rules
for the third, cross-cube task were often quite meaningful, relating, e.g., lower
pensions of some kind (e.g., widow/er ones) to lower salaries, although, similar
ndings would presumably be discovered using statistical (e.g., regression)
techniques over non-discretized data, i.e., the added value of relational mining is a
bit questionable here (since no fact KG was linked in this case).</p>
      <p>Setting the rule patterns to connect the aggregated values to the instance
data of the KGs just as in the rst and the second mining task, however,
undoubtedly has a potential to yield new insights to the cubes' data, that cannot
be obtained from performing the statistical techniques alone. And the RDFRules
framework proved to be well suited for this type of analysis, thanks to its native
recognition of the owl : sameAs predicates, discretization functionality, and the
ability to control the shape of the generated rules with its rule pattern syntax.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Future Work</title>
      <p>Evaluation of the rule quality by domain experts would be desirable in the next
phase. Some smart visualization of the rules should also be considered, since the
number of atoms required to express the cube structures is often high and some
are not completely intuitive. Beyond the government data, the approach could
also be tried on industrial data warehouses, which contain similarly structured
multi-dimensional data, and their integration with both private and public KGs
and subsequent discovery of cross-graph patterns might be attractive.
This work has been partially supported by CHIST-ERA within the CIMPLE
project (CHIST-ERA-19-XAI-003).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Capadisli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedl</surname>
          </string-name>
          , R.:
          <article-title>Linked Statistical Data Analysis</article-title>
          .
          <source>In: Semantic Statistics</source>
          <year>2013</year>
          , http://ceur-ws:org/Vol-
          <volume>1549</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Galarraga</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Te ioudi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>AMIE: Association rule mining under incomplete evidence in ontological knowledge bases</article-title>
          .
          <source>In: WWW</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chudan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Association rule mining as a support for OLAP</article-title>
          .
          <source>PhD Thesis</source>
          . Univ. of Econ., Prague,
          <year>2015</year>
          . https://insis:vse:cz/zp/portal zp:
          <source>pl?podrobnosti zp=25910</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Imielinski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swami</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Mining association rules between sets of items in large databases</article-title>
          . In: SIGMOD'
          <volume>93</volume>
          :
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Novak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Experiment with rule mining from linked government data</article-title>
          .
          <source>MSc. Thesis</source>
          . Prague University of Economics and Business,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zeman</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kliegr</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svatek</surname>
          </string-name>
          , V.:
          <article-title>RDFRules: Making RDF rule mining easier and even more e cient</article-title>
          .
          <source>Semantic web</source>
          ,
          <year>2021</year>
          , Vol.
          <volume>12</volume>
          , No.
          <volume>4</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>