<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Constructing demand-driven Wikidata Subsets</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Alexander University Erlangen-Nürnberg</institution>
          ,
          <addr-line>Nurnberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer Institute for Integrated Circuits IIS</institution>
          ,
          <addr-line>Nurnberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A Wikidata subset is a selected subset from the set of all triples included in Wikidata. We approach the task of creating such a subset with a demand-driven approach that satisfies a present use case. The introduced algorithm constructs triples in multiple steps starting from a seed of URIs. Different input options for the seed, the sequence of construction steps, and a filter enable adaptations to the use case. A formal description and matching SPARQL queries complement the algorithm. Hospital data provides a running example.</p>
      </abstract>
      <kwd-group>
        <kwd>Wikidata</kwd>
        <kwd>Subset</kwd>
        <kwd>Subgraph</kwd>
        <kwd>Neighbourhood</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Reducing Wikidata to a smaller demand-driven subset brings multiple advantages. These include faster results (answers) to queries (questions) of machines (humans), an easier overview of available information, and the possibility of copies on regular-sized machines. Beghaeiraveri et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Mimouni et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] list extended advantages of subsets/subgraphs.
      </p>
      <p>Challenges in building a subset include the volume of Wikidata's data and the resulting possibilities for subsets. A total number of triples N results in 2^N possible subsets, as any number of up to N triples can be included in the subset in any combination. The task is to find a subset out of these based on a use case. As it is not feasible to create all subsets and choose afterwards, the task changes to finding a starting point (the seed) in the data and from there drawing a line separating included and excluded data. The seed thus could be any set of Wikidata items, including none and all.</p>
      <p>The subset must fit the demands of a use case. The subset mustn't exclude relevant data, so as not to miss any information, which may lead to wrong conclusions. The subset also mustn't include unnecessary information, to reduce computation costs when working with the subset and to increase the clarity of its content. In summary, we look for a minimal subset that still covers our demand (a minimal full-covering subset).</p>
      <p>Another challenge is how to describe the need of the use case for the subset. We need enough detail to fulfill the criteria of the last paragraph while offering a feasible expression. Furthermore, the algorithm that creates the subset has to process the expression in a way that the subset fulfills it.</p>
      <p>Currently available subsets of Wikidata don't contain the domain data our use cases demand. Existing approaches to subsetting don't offer the depth of subsets we need for our use cases. Therefore, a new solution for demand-driven Wikidata subsets is required.</p>
      <p>The main contribution of this paper is an algorithm to construct a demand-driven Wikidata subset including formal descriptions (see section 6.1). The necessary steps are explained using a running example of hospital data (see section 3). Additionally, we show how the SPARQL query language (see https://www.w3.org/TR/sparql11-query/, last accessed 29 July 2021) can be used to implement the steps (see section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Wikidata has a webpage on subsetting (available at https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas/Subsetting, last accessed 29 July 2021) that collects efforts and ideas on the subject. The content focuses on stability and reusability of subsets, a focus we don't share (see section 4).</p>
      <p>Downloadable subsets from Wikidata itself exist in the form of dumps (available at https://www.wikidata.org/wiki/Wikidata:Database_download, last accessed 29 July 2021) but aren't domain-specific. Instead, Wikidata refers to WDumper (available at https://wdumps.toolforge.org/, last accessed 29 July 2021) for custom dumps that can be domain-specific.</p>
      <p>
        WDumper is a service that provides a subset of Wikidata according to various filters. However, it doesn't support variables as SPARQL queries do, for example. Therefore, we can only query the direct neighbourhood of known Wikidata items selected by filters. This doesn't comply with the demands of our use cases, as we also look for data we are not yet aware of and thus can't select with filters. An evaluation of WDumper by Beghaeiraveri et al. confirms that the service can be used in some use cases but also has limitations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>DBpedia is similar to Wikidata and offers various download options (available at https://www.dbpedia.org/resources/, last accessed 29 July 2021), but none of them is domain-specific.</p>
      <p>
        The Concise Bounded Description (CBD) of a resource "is a subgraph consisting of those statements which together constitute a focused body of knowledge about the resource" [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We pursue the concept of a focused body of knowledge, but not regarding a single resource but rather a set of resources of a domain. This includes triples in a flexible depth from the starting node.
      </p>
      <p>
        Beghaeiraveri et al. introduce the Topical Subset, "a set of entities related to a particular topic and the relationships between them" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Similar to the subsets we want to create, a Topical Subset includes data in the neighbourhood of a selected seed with a focus on a domain. However, a Topical Subset doesn't include triples in a flexible depth around the seed and thus doesn't construct a subset around the seed but rather gathers data about the seed. Therefore, the subsets our algorithm creates include Topical Subsets but not vice versa.
      </p>
      <p>
        Mimouni et al. present an algorithm to build a Context Graph [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Their algorithm has the same basic structure as our algorithm shown in section 6.1; nevertheless, differences exist. The Context Graph algorithm has the neighbourhood depth as an input parameter, while we propose a more flexible construction tuple. Furthermore, their algorithm filters entities provided in the input, while our approach includes a more general filter function that may exclude entities, properties, and classes that are discovered by the algorithm at runtime. These differences enable our algorithm to be adjustable to the demands of a use case.
      </p>
      <p>
        A research area with similarities to our approach to Wikidata subsets is focused Web crawlers. Olston and Najork describe the basic algorithm: "Given a set of seed Uniform Resource Locators (URLs), a crawler downloads all the Web pages addressed by the URLs, extracts the hyperlinks contained in the pages, and iteratively downloads the Web pages addressed by these hyperlinks." [9, p. 178]. Our approach starts with a set of seed Uniform Resource Identifiers (URIs), downloads all Wikidata triples containing the URIs, extracts newly found URIs, and iteratively downloads the triples containing these URIs. Focused Web crawlers look to minimize the number of accessed Web pages and maximize the percentage of relevant pages among those [
        <xref ref-type="bibr" rid="ref1 ref9">9, 1</xref>
        ]. We share these goals for subsets, as we want minimal but relevant data. To accomplish this, focused Web crawlers filter and rate every URL using various metrics and then add them to a set of yet-to-be-downloaded URLs called the frontier F. Each iteration, the top-rated URL in F is downloaded and processed, which again adds new URLs to F [
        <xref ref-type="bibr" rid="ref4 ref9">9, 4</xref>
        ]. We use filters to exclude URIs we found. However, we don't iterate over single URLs in F but rather the complete set Fi, which results in a completely new set of URIs Fi+1 for the next iteration. Therefore, we also don't rate the URIs.
      </p>
      <p>
        Another related research area is link traversal query execution. The main idea is to follow URIs found when querying distributed RDF data while executing said query [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Similar to focused Web crawlers and our approach, a frontier of yet-to-be-processed URIs exists.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Running Example</title>
      <p>We use the graph in figure 1 as an example of a Wikidata subset throughout this paper. It shows simplified data from Wikidata on the hospital Charité. The rounded vertices represent Wikidata items (e.g. Q162684) that we present by their (sometimes shortened) labels (e.g. Charité) instead of their URIs (e.g. https://www.wikidata.org/wiki/Q162684). We use labels to increase readability; the real identifier of an item remains its URI, which the original triples contain. The angular vertices represent literals that are concrete data values (e.g. 1958). Charité is the seed of the example subset.</p>
      <p>The displayed subset shows some concepts of this paper: From the seed, inbound and outbound triples in varying distances contain data relevant for the use case and are thus included in the subset. When we iterate to construct the subset, we have to keep an eye on already visited URIs because multiple paths may exist from the seed to an URI (see Berlin in fig. 1). Not all data in the adjacency of the seed is relevant to the use case (see (CBF namedAfter BF) in fig. 1). Therefore, we filter the triples before we add them to the subset.</p>
    </sec>
    <sec id="sec-4">
      <title>Approaches</title>
      <p>Two general approaches come to mind when thinking about how and why to create a Wikidata subset: (1) stable defined subsets, (2) subsets by acute demand. We shortly discuss both.</p>
      <p>(1) An organization or group creates stable defined subsets and provides them to the public, maybe on the Wikidata website. Advantages include: The user of the subset (whoever that might be) does not necessarily need to know how to build a subset or be familiar with Wikidata content, as the creation is done and the download easy. The subset can also be used to normalize the contained Wikidata classes by analyzing instances and developing a recommendation for minimal properties. Disadvantages include: The available subsets don't comply with the demands of individual use cases. Open questions include: Must all data belong to a subset? Must data belong to only one subset? What is a fair number of subsets considering relevance and minimalism? Effort: A large one-time effort to create all subsets (ignoring updates).</p>
      <p>(2) An algorithm that creates a subset by acute demand is available. Every time a subset is required, it is created as needed. Advantages include: With the demand known, the subset can be minimal. Additionally, the subsets are always relevant because they only exist if there is an acute demand. Disadvantages include: The users have to create a subset themselves. Therefore, they need time as well as knowledge of their demand, the domain of the subset, and Wikidata. Open questions include: How to express the demand of a use case? How to fit the subset to the demand? Effort: Pay as you use (after a tool is available).</p>
      <p>When considering the use cases from our project work, the second approach is the better fit. In this way, we receive useful (minimal and full-covering) results in a quicker fashion. Reusability of created subsets is thus not the focus of our work.</p>
    </sec>
    <sec id="sec-5">
      <title>Preliminaries</title>
      <p>Let G be a set of RDF triples T, U be the set of URIs, B be the set of blank nodes, and L be the set of literals. Each triple T = (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) consists of a subject s, a predicate p, and an object o according to the RDF specification (see https://www.w3.org/TR/rdf11-concepts/, last accessed 30 July 2021). The subject s ∈ (U ∪ B) may be an URI or a blank node. The predicate p ∈ U is always an URI. The object o ∈ (U ∪ B ∪ L) may be an URI, a blank node, or a literal. G also represents a labeled directed multigraph, but we continue to view it as a triple set in this text.</p>
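      <p>The triple typing above can be illustrated with a small sketch (illustrative Python, not part of the paper; the explicit sets and the helper name is_valid_triple are assumptions made for the example):</p>

```python
# Illustrative check of the triple typing T = (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L),
# with U (URIs), B (blank nodes), and L (literals) given as explicit sets.

def is_valid_triple(t, U, B, L):
    s, p, o = t
    # subject: URI or blank node; predicate: URI; object: URI, blank node, or literal
    return s in (U | B) and p in U and o in (U | B | L)

U = {"wd:Q162684", "wdt:P571", "wd:Q64"}
B = {"_:b0"}
L = {"1958"}
```

      <p>With these sets, ("wd:Q162684", "wdt:P571", "1958") is a valid triple, while a triple with the literal "1958" in the subject position is not.</p>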
      <p>Let W ⊆ G be the set of all triples in Wikidata. Let S ⊆ W be a Wikidata subset. Let Q ⊆ U be the set of all Wikidata items. Let F ⊆ Q be the set of URIs that are part of the frontier in an algorithm.</p>
    </sec>
    <sec id="sec-6">
      <title>Solution Concept</title>
      <p>In RDF, data is described through its relations. Therefore, all triples containing a certain URI describe the semantics of that URI. Additionally, data of interest for one domain is usually connected. Consequently, selected major URIs of interest and their surroundings build domain data. This fits our goal of a demand-driven subset, as the demand usually covers one domain. The surroundings of a seed of URIs can be constructed through multiple iterations over adjacent URIs (neighbours) with increasing distance to the seed. In summary, a seed of selected URIs and data connected to it in varying distance in the form of triples build a subset (see fig. 1). Different options regarding the seed, the neighbourhood, and filters enable us to adjust the subset to the demand of a use case, as shown in the following sections.</p>
      <sec id="sec-6-1">
        <title>Basic Algorithm</title>
        <p>We define the function that constructs a Wikidata subset S as</p>
        <p>S = subset(W, C, seed(), filter())   (1)</p>
        <p>with the set of all triples in Wikidata W, the construction tuple C = (c0, ..., cn), the seed function, and the filter function as inputs. The basic algorithm is shown in the following.</p>
        <p>1. Select a set of URIs F0 ⊆ Q as the seed (F0 = seed(W)).
2. For each construction step ci ∈ C,
(a) get all triples Si ⊆ W that contain an URI f ∈ Fi of the frontier Fi according to the construction step ci (Si = construct(W, Fi, ci)).
(b) filter the triples from set Si to a set S'i ⊆ Si (S'i = filter(Si)).
(c) add S'i to the desired subset S ⊆ W (S = S ∪ S'i).
(d) add Fi to the set of visited URIs V (V = V ∪ Fi).
(e) if ci+1 exists: get the set of URIs Fi+1 ⊆ Q that are adjacent to the URIs Fi ⊆ Q according to the construction step ci (Fi+1 = frontier(S'i, ci, Fi, V)).
3. Export the complete subset S.</p>
        <p>How these steps look in detail and how they combine is the focus of the following sections.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Seed Considerations</title>
        <p>The seed acts as a starting point for the subset construction. It consists of a set of URIs F0 ⊆ Q. F0 will be the frontier of the first construction step that gets triples from Wikidata that contain an URI u ∈ F0. The function F0 = seed(W) is provided as an input to the algorithm to enable maximal flexibility.</p>
        <p>One possibility to express seed() is a SPARQL SELECT query that runs on the Wikidata set W and returns F0. A SPARQL query provides suitable possibilities to define the seed. In the example of figure 1, the SPARQL query for German hospitals shown in listing 1.1 returns Charité ∈ F0 as one URI of the seed.</p>
        <p>Listing 1.1: SPARQL query for German hospitals that returns the seed F0
SELECT ?hospital
FROM W
WHERE {
  ?hospital wdt:P31 wd:Q16917 ;  # instanceOf hospital
            wdt:P17 wd:Q183 .   # country Germany
}</p>
        <p>A basic possibility would be to provide F0 in seed() by hand. This is a feasible option if only a few known URIs are relevant for the use case. In our example, the domain could include not all German hospitals but rather only the Charité, so F0 = {Charité}.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Construction/Triple Considerations</title>
        <p>In terms of RDF triples, we define the neighbourhood of a URI u in a set of triples G as the subset N^RDF_G(u) ⊆ G containing all triples T(u, p, o) ∈ G and T(s, p, u) ∈ G that have u as subject or object. In these triples, all URIs adjacent to the URI u in G are included. The neighbourhood N^RDF_G(U) of a set of URIs U is defined as the union of the neighbourhoods of each URI u ∈ U. An example from figure 1 is provided at the end of this section.</p>
        <p>
          Because triples build a directed graph, we can further distinguish the adjacent URIs of an URI u between predecessors and successors. The triples that contain u as object and a predecessor are inbound triples. They provide "secondary knowledge" about u [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The triples that contain u as subject and a successor are outbound triples. They provide "primary knowledge" about u [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We define the subset of inbound triples of a URI u from a set of triples G as the subset N^-RDF_G(u) ⊆ G containing all triples T(s, p, u) ∈ G that have u as object. We define the subset of outbound triples of a URI u from a set of triples G as the subset N^+RDF_G(u) ⊆ G containing all triples T(u, p, o) ∈ G that have u as subject. The sets of inbound triples N^-RDF_G(U) ⊆ G and outbound triples N^+RDF_G(U) ⊆ G of a set of URIs U are defined as the union of the appropriate sets of each URI u ∈ U.
        </p>
        <p>Over multiple iterations, we look at the neighbourhood of the given frontier Fi to construct the subset S. The input of the subset must express a sequence of construction steps adequately. Let C = (c0, c1, ..., cn) be a tuple with ci ∈ {↔, ←, →} that represents the sequence of construction steps. For every iteration i, the expansion ci is executed on the frontier Fi, resulting in the partial subset Si ⊆ W that adds to the complete subset S. The construction step ci may construct the whole neighbourhood (↔), the set of inbound triples (←), or the set of outbound triples (→). The size n of the tuple C expresses the number of iterations and the maximum depth of triples around the seed.</p>
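        <p>The three construction options can be sketched directly from the definitions above (illustrative Python over (s, p, o) tuples; the function names are assumptions for the example):</p>

```python
# Sketch of the construction options from the tuple C: inbound (←),
# outbound (→), and the full neighbourhood (↔).

def inbound(G, uris):
    """N^-RDF_G: all triples with an URI u ∈ uris as object."""
    return {t for t in G if t[2] in uris}

def outbound(G, uris):
    """N^+RDF_G: all triples with an URI u ∈ uris as subject."""
    return {t for t in G if t[0] in uris}

def neighbourhood(G, uris):
    """N^RDF_G = N^-RDF_G ∪ N^+RDF_G."""
    return inbound(G, uris) | outbound(G, uris)

G = {
    ("KfP", "partOf", "Charité"),
    ("Charité", "locatedIn", "Berlin"),
    ("Berlin", "contains", "Mitte"),
}
S0 = neighbourhood(G, {"Charité"})  # the two triples touching Charité
```

        <p>For a set of URIs, the union over its members falls out of the set comprehension for free, matching the definition of N^RDF_G(U).</p>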
        <p>The dedicated function Si = construct(W, Fi, ci) specifies to:</p>
        <p>Si = N^RDF_W(Fi) if ci = ↔, N^-RDF_W(Fi) if ci = ←, N^+RDF_W(Fi) if ci = →   (2)</p>
        <p>Regarding an implementation, several options to access and query Wikidata exist (see https://www.wikidata.org/wiki/Wikidata:Data_access, last accessed 2 August 2021). Available are per-item access with dereferenceable URIs, a SPARQL endpoint, and an endpoint for Linked Data Fragments. We won't evaluate these options, as the implementation is not a focus of this paper. However, to give another expression of the constructed triples, the listings 1.2 and 1.3 show the SPARQL CONSTRUCT queries to construct the inbound triples (ci = ←) and the outbound triples (ci = →) respectively. Both queries together construct the neighbourhood (ci = ↔).</p>
        <p>Listing 1.2: SPARQL query that constructs the inbound triples of an URI u
CONSTRUCT
WHERE { GRAPH W { ?s ?p $u } . }</p>
        <p>Listing 1.3: SPARQL query that constructs the outbound triples of an URI u
CONSTRUCT
WHERE { GRAPH W { $u ?p ?o } . }</p>
        <p>In the example of figure 1, the first construction step c0 = ↔ returned the inbound and outbound triples of the seed F0 = {Charité}. Therefore, the first partial subset is S0 = N^RDF_W(Charité) and returns the triples:
{(KfP partOf Charité), (KfN partOf Charité),
(Charité subsidiary CBF), (Charité locatedIn Berlin),
(Charité subsidiary VK)}</p>
      </sec>
      <sec id="sec-6-4">
        <title>Filter Considerations</title>
        <p>When we construct the neighbourhood of a URI, we receive all triples connected with that URI. This data quickly accumulates to a sizable subset. However, one of our goals for the subset is a smaller size. Additionally, we look for a minimal and full-covering subset. Thus, the algorithm has a filter function that can exclude triples that don't contain relevant information. It reduces a subset Si to a filtered subset S'i ⊆ Si.</p>
        <p>Because there are plenty of options to filter triples and the focus of filters varies between different use cases, the filter function S'i = filter(Si) is provided as an input to the algorithm to enable maximal flexibility.</p>
        <p>Some basic filter options to exclude triples are language-tagged literals for selected languages and triples using wikibase:Statements, as they model duplicate information. Further possibilities are filters by Wikidata's property hierarchy or a filter for designated classes.</p>
        <p>In this paper, however, we propose the idea of a filter for rarely used properties. We argue that triples with rarely used properties provide relatively little value for later analysis (like the namedAfter property in fig. 1). If we analyse hundreds or thousands of instances of one class, we can exclude properties that are only used by one or two instances. This may not sound like much, but a few properties filtered for many URIs add up. Additionally, the filter function also reduces data volume in future iterations because every removed triple also means one less adjacent URI for the next construction, so the frontier Fi+1 shrinks (see section 6.5).</p>
        <p>Let A^+px ⊆ Si be the set of triples A^+px = {T = (u, px, o) | T ∈ Si and u ∈ Fi} with the URIs u ∈ Fi and a property px. Let A^-px ⊆ Si be the set of triples A^-px = {T = (s, px, u) | T ∈ Si and u ∈ Fi} with the URIs u ∈ Fi and a property px. Let σ(px) = |Apx| / |F| be the fraction of URIs with the property px from all URIs in the frontier F. Let θ0 be the threshold value any property px must reach (σ(px) ≥ θ0) so that Apx is not removed from Si in order to build S'i. For the construction of inbound triples (ci = ←), only A^-px is relevant, as |A^+px| = 0. For the construction of outbound triples (ci = →), only A^+px is relevant, as |A^-px| = 0. For the construction of the neighbourhood (ci = ↔), A^+px and A^-px are relevant and calculated separately.</p>
        <p>The seed F0 of a subset probably contains Wikidata items (URIs) of only a few classes. URIs of the same class likely have similar properties, so σ(px) tends to be relatively large for most properties. However, when we iteratively construct the neighbourhood, the classes of the URIs u ∈ Fi will quickly differentiate and σ(px) decrease. Therefore, a constant value for θ0 might not be the best solution. Instead, there are arguments to both decrease and increase it with each construction step. We might decrease θi to not exclude too many triples that could provide relevant information. We might also increase θi to remove even more triples that could provide irrelevant information. Both can be done in an additive (θi+1 = θi ± δ) or multiplicative (θi+1 = θi · δ) way. The decision to increase or decrease θi is in its consequence a decision for a denser, maybe minimal subset or a full-covering subset that doesn't want to miss out on relevant data. The values θ0 and δ are set by input parameters of the filter function. It is also possible to select θ0 = δ = 0 to disable the property filter.</p>
        <p>With the proposed filter for rare properties, the function S'i = filter(Si) specifies to:</p>
        <p>S'i = Si \ {T = (s, px, o) | σ(px) &lt; θi(θ0, δ)}   (3)</p>
      </sec>
      <sec id="sec-6-5">
        <title>Frontier Considerations</title>
        <p>
          With the data that describes the URIs of the frontier Fi collected and filtered (subset S'i), the next construction step is due. Therefore, we need to select a new set of URIs Fi+1. Because we build a subset from a seed through the construction of the surroundings of the seed, the new frontier Fi+1 consists of URIs found in the last construction step. These new URIs are adjacent to the last frontier Fi and can therefore be described using the set of adjacent URIs N_{S'i}(Fi) from graph theory [
          <xref ref-type="bibr" rid="ref3 ref5">5, 3</xref>
          ].
        </p>
        <p>However, not all adjacent URIs are proper for Fi+1. Let's say we expand through the complete neighbourhood. In the example of figure 1, we reach the URI Berlin with the first construction step. Therefore, the URI is part of the following frontier (Berlin ∈ Fi+1). The URI CBF is also included in that frontier (CBF ∈ Fi+1). After another construction step, we reach Berlin again in the outbound triple from CBF. This would mean that we process the neighbourhood of Berlin twice. Thus, we must only select unvisited URIs for the new set of URIs Fi+1. The set of visited URIs V collects all newly visited URIs after every construction step. V can then be excluded from Fi+1. In consequence, the sets Fi and Fi+1 are disjoint: Fi ∩ Fi+1 = ∅.</p>
        <p>We can also omit literals from the set Fi+1 because they are never the subject of a triple and we are not interested in URIs with identical literal values as objects (e.g. the year 1958 in fig. 1). Additionally, we can omit blank nodes because they can't be queried.</p>
        <p>The example in figure 1 shows another scenario we have to consider for Fi+1. The displayed graph represents the subset we would like to receive as a result of our algorithm. The URI Charité serves as seed (F0 = {Charité}). Initially, we are interested in all data related to the seed, so we construct the complete neighbourhood in the first step (c0 = ↔). However, out of all the URIs we discover with this construction, we know we only want to continue the construction in the next step with the successors, because the outbound relations of the seed are more important for our use case. In this case, the frontier must only consider the URIs N^+_{S'i}(Fi) succeeding the set of URIs Fi. The preceding URIs N^-_{S'i}(Fi) are the opposite option. To include this information in the input, we add two options for the sequence of expansion steps C: only select predecessors (↔-) or successors (↔+) after constructing the complete neighbourhood. Therefore, the expansion options are ci ∈ {↔, ↔-, ↔+, ←, →}.</p>
        <p>For a construction of inbound (ci = ←) or outbound (ci = →) triples, the frontier is also defined by N^-_{S'i}(Fi) or N^+_{S'i}(Fi) respectively, because only those adjacent URIs exist.</p>
        <p>In conclusion, the function Fi+1 = frontier(S'i, ci, Fi, V) specifies to:</p>
        <p>Fi+1 = (N_{S'i}(Fi) ∩ U) \ V if ci ∈ {↔}; (N^-_{S'i}(Fi) ∩ U) \ V if ci ∈ {↔-, ←}; (N^+_{S'i}(Fi) ∩ U) \ V if ci ∈ {↔+, →}   (4)</p>
        <p>The listings 1.4, 1.5, and 1.6 show SPARQL SELECT queries that select all URIs u that are adjacent (ci ∈ {↔}) / predecessors (ci ∈ {↔-, ←}) / successors (ci ∈ {↔+, →}) to an URI f ∈ Fi respectively. The SPARQL queries don't exclude visited URIs u ∈ V.</p>
        <p>Listing 1.4: SPARQL query that selects URIs u adjacent to f ∈ Fi for the frontier Fi+1 from S'i
SELECT ?u
FROM S'i
WHERE { { $f ?p ?u . } UNION { ?u ?p $f . }
  FILTER ( isURI(?u) ) }</p>
        <p>Listing 1.5: SPARQL query that selects URIs u that are predecessors of f ∈ Fi for the frontier Fi+1 from S'i
SELECT ?u
FROM S'i
WHERE { ?u ?p $f .
  FILTER ( isURI(?u) ) }</p>
        <p>Listing 1.6: SPARQL query that selects URIs u that are successors of f ∈ Fi for the frontier Fi+1 from S'i
SELECT ?u
FROM S'i
WHERE { $f ?p ?u .
  FILTER ( isURI(?u) ) }</p>
        <p>In the example of figure 1, the first construction step c0 = ↔+ results in the frontier F1 = {CBF, Berlin, VK} (see section 6.3 for S0). The following step only constructs outbound triples (c1 = →). The corresponding subset S1 = N^+RDF_W(F1) returns the triples:
{(CBF locatedIn Berlin), (CBF inception 1958),
(CBF namedAfter BF), (Berlin population 3644826),
(Berlin contains Mitte), (VK locatedIn Mitte),
(VK inception 1906)}</p>
        <p>The resulting frontier would be F2 = {BF, Mitte}; however, the subset from figure 1 is already completed with C = (↔+, →).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <sec id="sec-7-1">
        <title>Summary</title>
        <p>The paper introduces an algorithm S = subset(W, C, seed(), filter()) to create a demand-driven Wikidata subset. To do so, we construct triples in multiple steps starting from a seed. Available options ci ∈ {↔, ↔-, ↔+, ←, →} that sequence the construction offer flexibility. A filter reduces the number of triples to exclude irrelevant data.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Future Work</title>
        <p>We are currently in the process of implementing the introduced algorithm. With
a working implementation, we plan to construct a Wikidata subset of the more
challenging domain of microelectronics and their supply chains.</p>
        <p>Once we have created multiple subsets, we need an evaluation to compare them. The clustering coefficient from graph theory seems like one possible parameter.</p>
        <p>One advantage of Wikidata is its links to other data sources on the web. With techniques from link traversal query execution, we could follow the links and include data we find in the subset.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Batsakis, S., Petrakis, E.G., Milios, E.: Improving the performance of focused web crawlers. Data &amp; Knowledge Engineering 68(10), 1001-1013 (Oct 2009). https://doi.org/10.1016/j.datak.2009.04.002</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Beghaeiraveri, S.A.H., Gray, A.J.G., McNeill, F.J.: Experiences of Using WDumper to Create Topical Subsets from Wikidata. In: Proceedings of the 2nd International Workshop on Knowledge Graph Construction co-located with the 18th Extended Semantic Web Conference (ESWC 2021). p. 15 (Jun 2021)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Bondy</surname>, <given-names>J.A.</given-names></string-name>,
          <string-name><surname>Murty</surname>, <given-names>U.S.R.</given-names></string-name>:
          <source>Graph Theory with Applications</source>.
          Elsevier, New York, 5th edn. (<year>1982</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Chakrabarti</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>van den Berg</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Dom</surname>, <given-names>B.</given-names></string-name>:
          <article-title>Focused crawling: a new approach to topic-specific Web resource discovery</article-title>.
          <source>Computer Networks</source>
          <volume>31</volume>(<issue>11-16</issue>),
          <fpage>1623</fpage>-<lpage>1640</lpage> (May <year>1999</year>).
          https://doi.org/10.1016/S1389-1286(99)00052-3
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Dundar</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Aytac</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Kilic</surname>, <given-names>E.</given-names></string-name>:
          <article-title>The common-neighbourhood of a graph</article-title>.
          <source>Boletim da Sociedade Paranaense de Matematica</source>
          <volume>35</volume>(<issue>1</issue>),
          <fpage>23</fpage>-<lpage>32</lpage> (<year>2017</year>).
          https://doi.org/10.5269/bspm.v35i1.22464
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Hartig</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Freytag</surname>, <given-names>J.C.</given-names></string-name>:
          <article-title>Executing SPARQL Queries over the Web of Linked Data</article-title>.
          In:
          <string-name><surname>Bernstein</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Karger</surname>, <given-names>D.R.</given-names></string-name>,
          <string-name><surname>Heath</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Feigenbaum</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Maynard</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Motta</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Thirunarayan</surname>, <given-names>K.</given-names></string-name>
          (eds.)
          <source>The Semantic Web - ISWC 2009. Lecture Notes in Computer Science</source>,
          vol. <volume>5823</volume>,
          pp. <fpage>293</fpage>-<lpage>309</lpage>.
          Springer, Berlin, Heidelberg (<year>2009</year>).
          https://doi.org/10.1007/978-3-642-04930-9_19
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Hartig</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Freytag</surname>, <given-names>J.C.</given-names></string-name>:
          <article-title>Foundations of Traversal Based Query Execution over Linked Data (Extended Version)</article-title>.
          <source>In: HT '12: Proceedings of the 23rd ACM conference on Hypertext and social media</source>.
          pp. <fpage>43</fpage>-<lpage>52</lpage> (Jun <year>2012</year>).
          https://doi.org/10.1145/2309996.2310005
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Mimouni</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Moissinac</surname>, <given-names>J.C.</given-names></string-name>,
          <string-name><surname>Vu</surname>, <given-names>A.T.</given-names></string-name>:
          <article-title>Domain Specific Knowledge Graph Embedding for Analogical Link Discovery</article-title>.
          <source>International Journal On Advances in Intelligent Systems</source>
          <volume>13</volume>(<issue>1&amp;2</issue>),
          <fpage>140</fpage>-<lpage>150</lpage> (Jun <year>2020</year>),
          https://hal-cnrs.archives-ouvertes.fr/hal-03052226
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Olston</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Najork</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Web Crawling</article-title>.
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>4</volume>(<issue>3</issue>),
          <fpage>175</fpage>-<lpage>246</lpage> (<year>2010</year>).
          https://doi.org/10.1561/1500000017
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Stickler</surname>, <given-names>P.</given-names></string-name>:
          <article-title>CBD - Concise Bounded Description</article-title>
          (Jun <year>2005</year>), https://www.w3.org/Submission/CBD/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>