<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Constructing demand-driven Wikidata Subsets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Daniel</forename><surname>Henselmann</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Fraunhofer Institute for Integrated Circuits IIS</orgName>
								<address>
									<settlement>Nürnberg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Harth</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Fraunhofer Institute for Integrated Circuits IIS</orgName>
								<address>
									<settlement>Nürnberg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Friedrich-Alexander University Erlangen-Nürnberg</orgName>
								<address>
									<settlement>Nürnberg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Constructing demand-driven Wikidata Subsets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D3A3C1E5E5F8D0924071527B5CA91979</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Wikidata</term>
					<term>Subset</term>
					<term>Subgraph</term>
					<term>Neighbourhood</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A Wikidata subset is a selected subset from the set of all triples included in Wikidata. We approach the task to create a subset with a demand-based subset that satisfies a present use case. The introduced algorithm constructs triples in multiple steps starting from a seed of URIs. Different input options for the seed, the sequence of construction steps, and a filter enable adaptations to the use case. A formal description and matching SPARQL queries complement the algorithm. Hospital data provides a running example.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Wikidata itself describes a Wikidata subset (Q96051494) as "part of the data contained in Wikidata" 3 . It is a selected subset from the set of all Resource Description Framework (RDF) 4 statements (triples) included in Wikidata. This paper proposes a way to construct a Wikidata subset.</p><p>Within our work in research and industry projects, we came upon three use cases that we considered for this paper: (1) data on life science institutions, <ref type="bibr" target="#b1">(2)</ref> data on executives of listed corporations, (3) data on microelectronics and their supply chains. In all three cases, we want to extract fitting domain data from Wikidata according to our demands in the project.</p><p>Wikidata is one of the first addresses to link own RDF data to open data on the web or simply receive additional information to enhance own data. Wikidata is constantly growing in data volume and thus brings limitations to the practical interactions with the whole dataset. Reducing the Wikidata content to a smaller demand-driven subset brings multiple advantages. These include faster results (answers) to queries (questions) of machines (humans), an easier overview of available information, and the possibility of copies on regular-sized machines. Beghaeiraveri et al. <ref type="bibr" target="#b1">[2]</ref> and Mimouni et al. <ref type="bibr" target="#b7">[8]</ref> list extended advantages of subsets/subgraphs.</p><p>Challenges in building a subset include the volume of Wikidata's data and the resulting possibilities for subsets. A total number of triples N results in 2 N possible subsets, as any number up to N triples can be included in the subset in any variation. The task is to find a subset out of these based on a use case. As it is not feasible to create all subsets and choose afterwards, the task changes to find a starting point (the seed) in the data and from there draw a line separating included and excluded data. The seed thus would be any set of Wikidata items, including none and all.</p><p>The subset must fit the demands of a use case. The subset mustn't exclude relevant data to not miss any information which may lead to wrong conclusions. The subset also mustn't include unnecessary information to reduce computation costs when working with the subset and to increase the clarity of its content. In summary, we look for a minimal subset that still covers our demand (a minimal full-covering subset).</p><p>Another challenge is how to describe the need of the use case for the subset. We need enough detail to fulfill the criteria of the last paragraph while offering a feasible expression. Furthermore, the algorithm that creates the subset has to process the expression in a way that the subset fulfills it.</p><p>Currently available subsets of Wikidata don't contain domain data our use cases demand. Existing approaches to subsetting don't offer the depth of subsets we need for our use cases. Therefore, a new solution for demand-driven Wikidata subsets is required.</p><p>The main contribution of this paper is an algorithm to construct a demanddriven Wikidata subset including formal descriptions (see section 6.1). The necessary steps are explained using a running example of hospital data (see section 3). Additionally, we show how the SPARQL query language<ref type="foot" target="#foot_0">5</ref> can be used to implement the steps (see section 6).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Wikidata has a webpage 6 on subsetting that collects efforts and ideas on the subject. The content focuses on stability and reusability of subsets, a focus we don't share (see section 4). Downloadable subsets from Wikidata itself exist in the form of dumps<ref type="foot" target="#foot_2">7</ref> but aren't domain-specific. Instead, Wikidata refers to WDumper<ref type="foot" target="#foot_3">8</ref> for custom dumps that can be domain-specific.</p><p>WDumper is a service that provides a subset of Wikidata according to various filters. However, it doesn't support variables as SPARQL queries do for example. Therefore, we can only query the direct neighbourhood of known Wikidata items selected by filters. This doesn't comply the demand of our use cases as we also look for data we are not yet aware of and thus can't select with filters. An evaluation of WDumper by Beghaeiraveri et al. confirms that the service can be used in some use cases but also has limitations <ref type="bibr" target="#b1">[2]</ref>.</p><p>DBpedia is similar to Wikidata and offers various download options<ref type="foot" target="#foot_4">9</ref> but neither of them is domain-specific.</p><p>The Concise Bounded Description (CBD) of a resource "is a subgraph consisting of those statements which together constitute a focused body of knowledge about the resource" <ref type="bibr" target="#b9">[10]</ref>. We pursue the concept of a focused body of knowledge, but not regarding a single resource but rather a set of resources of a domain. This includes triples in a flexible depth from the starting node.</p><p>Beghaeiraveri et al. introduce the Topical Subset, "a set of entities related to a particular topic and the relationships between them" <ref type="bibr" target="#b1">[2]</ref>. Similarly to the subsets we want to create, a Topical Subset includes data in the neighbourhood of a selected seed with the focus on a domain. However, a Topical Subset doesn't include triples in a flexible depth around the seed and thus doesn't construct a subset around the seed but rather gathers data about the seed. Therefore, the subsets our algorithm creates includes Topical Subsets but not vice versa.</p><p>Mimouni et al. present an algorithm to build a Context Graph <ref type="bibr" target="#b7">[8]</ref>. Their algorithm has the same basic structure as our algorithm shown in section 6.1, nevertheless differences exist. The Context Graph algorithm has the neighborhood depth as an input parameter while we propose a more flexible construction tuple. Furthermore, their algorithm filters entities provided in the input while our approach includes a more general filter function that may exclude entities, properties, and classes that are discovered by the algorithm at runtime. These differences enable our algorithm to be adjustable to the demands of a use case.</p><p>A research area with similarities to our approach to Wikidata subsets is focused Web crawlers. Olston and Najork describe the basic algorithm: "Given a set of seed Uniform Resource Locators (URLs), a crawler downloads all the Web pages addressed by the URLs, extracts the hyperlinks contained in the pages, and iteratively downloads the Web pages addressed by these hyperlinks." <ref type="bibr">[9, p. 178</ref>]. Our approach starts with a set of seed Uniform Resource Identifiers (URIs), downloads all Wikidata triples containing the URIs, extracts newly found URIs, and iteratively downloads the triples containing these URIs. Focused Web crawlers look to minimize the number of accessed Web pages and maximize the percentage of relevant pages among those <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b0">1]</ref>. We share these goals for subsets as we want minimal but relevant data. To accomplish this, focused Web crawlers filter and rate every URL using various metrics and then add them to a set of yet to be downloaded URLs called frontier F . Each iteration, the top-rated URL in F is downloaded and processed which again adds new URLs to F <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b3">4]</ref>. We use filters to exclude URIs we found. However, we don't iterate over single URLs in F but rather the complete set F i which results in a completely new set of URIs F i+1 for the next iteration. Therefore, we also don't rate the URIs.</p><p>Another related research area is link traversal query execution. The main idea is to follow URIs found when querying distributed RDF data while executing said query <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. Similar to focused Web crawlers and our approach, a frontier of to be processed URIs exists.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Running Example</head><p>We use the graph in figure <ref type="figure" target="#fig_0">1</ref> as an example of a Wikidata subset throughout this paper. It shows simplified data from Wikidata on the hospital Charité. The rounded vertices represent Wikidata items (e.g. Q162684) that we present by their (sometimes shortened) labels (e.g. Charité) instead of their URIs (e.g. https://www.wikidata.org/wiki/Q162684). We use labels to increase readability, the real identifier of an item remains its URI which the original triples contain. The angular vertices represent literals that are concrete data values (e.g. 1958). Charité is the seed of the example subset. The displayed subset shows some concepts of this paper: From the seed, inbound and outbound triples in varying distances contain data relevant for the use case and are thus included in the subset. When we iterate to construct the subset, we have to keep an eye on already visited URIs because multiple paths may exist from the seed to an URI (see Berlin in fig. <ref type="figure" target="#fig_0">1</ref>). Not all data in the adjacency of the seed is relevant to the use case (see (CBF namedAfter BF ) in fig. <ref type="figure" target="#fig_0">1</ref>). Therefore, we filter the triples before we add them to the subset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Approaches</head><p>Two general approaches come to mind when thinking about how and why to create a Wikidata subset: (1) stable defined subsets, (2) subsets by acute demand. We shortly discuss both.</p><p>(1) An organization or group creates stable defined subsets and provides them to the public, maybe on the Wikidata website. Advantages include: The user of the subset -whoever that might be -does not necessarily need to know how to build a subset or be familiar with Wikidata content, as the creation is done and the download easy. The subset can also be used to normalize the contained Wikidata classes by analyzing instances and developing a recommendation for minimal properties. Disadvantages include: The available subsets don't comply the demand of individual use cases. Open questions include: Must all data belong to a subset? Must data belong to only one subset? What is a fair number of subsets considering relevance and minimalism? Effort: Large onetime effort to create all subsets (ignoring updates).</p><p>(2) An algorithm that creates a subset by acute demand is available. Every time a subset is required, it is created as needed. Advantages include: With the demand known, the subset can be minimal. Additionally, the subsets are always relevant because they only exist if there is an acute demand. Disadvantages include: The user has to create a subset himself. Therefore they needs time as well as knowledge of their demand, the domain of the subset, and Wikidata. Open questions include: How to express the demand of a use case? How to fit the subset to the demand? Effort: Pay as you use (after a tool is available).</p><p>When considering the use cases from our project work, the second approach is better fitting. In this way, we receive useful (minimal and full-covering) results in a quicker fashion. Reusability of created subsets is thus not the focus of our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Preliminaries</head><p>Let G be a set of RDF triples T , U be the set of URIs, B be the set of blank nodes, and L be the set of literals. Each triple</p><formula xml:id="formula_0">T = (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L)</formula><p>consists of a subject s, predicate p and object o according to the RDF specification<ref type="foot" target="#foot_5">10</ref> . The subject s ∈ (U ∪ B) may be an URI or blank node. The predicate p ∈ U is always an URI. The object o ∈ (U ∪ B ∪ L) may be an URI, a blank node, or a literal. G also represents a labeled directed multigraph, but we continue the view as a triple set in this text.</p><p>Let W ⊂ G be the set of all triples in Wikidata. Let S ⊆ W be a Wikidata subset. Let Q ⊂ U be the set of all Wikidata items. Let F ⊆ Q be the set of URIs that are part of the frontier in an algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Solution Concept</head><p>In RDF, data is described through its relations. Therefore, all triples containing a certain URI describe the semantics of that URI. Additionally, data of interest for one domain is usually connected. Consequently, selected major URIs of interest and their surroundings build domain data. This fits our goal of a demand-driven subset as the demand usually covers one domain. The surroundings of a seed of URIs can be constructed through multiple iterations over adjacent URIs (neighbours) with increasing distance to the seed. In summary, a seed of selected URIs and data connected to it in varying distance in the form of triples build a subset (see fig. <ref type="figure" target="#fig_0">1</ref>). Different options regarding the seed, the neighbourhood, and filters enable us to adjust the subset to the demand of a use case as shown in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Basic Algorithm</head><p>We define the function that constructs a Wikidata subset S as</p><formula xml:id="formula_1">S = subset(W, C, seed(), f ilter())<label>(1)</label></formula><p>with the set of all triples in Wikidata W , the construction tuple C = (c 0 , ..., c n ), the seed function, and the f ilter function as inputs. The basic algorithm is shown in the following.</p><p>1. Select a set of URIs F 0 ⊆ Q as the seed (F 0 = seed(W )).</p><p>2. For each construction step c i ∈ C, (a) get all triples S i ⊆ W that contain an URI f ∈ F i of the frontier F i according to the construction step c i (S i = construct(W, F i , c i )). (b) filter the triples from set S i to a set</p><formula xml:id="formula_2">S i ⊆ S i (S i = f ilter(S i )). (c) add S i to the desired subset S ⊆ W (S = S ∪ S i ). (d) add F i to the set of visited URIs V (V = V ∪ F i ). (e) if c i+1 exists: get the set of URIs F i+1 ⊆ Q that are adjacent to the URIs F i ⊆ Q according to the construction step c i (F i+1 = f rontier(S i , c i , F i , V )). 3. Export the complete subset S.</formula><p>How these steps look in detail and combine is the focus of the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Seed Considerations</head><p>The seed acts as a starting point for the subset construction. It consists of a set of URIs F 0 ⊆ Q. F 0 will be the frontier of the first construction step that gets triples from Wikidata that contain an URI u ∈ F 0 . The function F 0 = seed(W ) is provided as an input to the algorithm to enable maximal flexibility.</p><p>One possibility to express seed() is a SPARQL SELECT query that runs on the Wikidata set W and returns F 0 . A SPARQL query provides suitable possibilities to define the seed. In the example of figure <ref type="figure" target="#fig_0">1</ref>, the SPARQL query for German hospitals shown in listing 1.1 returns Charité ∈ F 0 as one URI of the seed.</p><p>Listing 1.1: SPARQL query for German hospitals that returns the seed F 0 SELECT ? h o s p i t a l FROM W WHERE { ? h o s p i t a l wdt : P31 wd : Q16917 ; #i n s t a n c e O f h o s p i t a l wdt : P17 wd : Q183 . #c o u n t r y Germany } A basic possibility would be to provide F 0 in seed() by hand. This is a feasible option if only a few known URIs are relevant for the use case. In our example, the domain could not include all German hospitals but rather only the Charité, so F 0 = {Charité}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Construction/Triple Considerations</head><p>In terms of RDF triples, we define the neighbourhood of a URI u in a set of triples G as the subset N RDF G (u) ⊆ G containing all triples T (u, p, o) ∈ G and T (s, p, u) ∈ G that have u as subject or object. In these triples, all URIs adjacent to the URI u in G are included. The neighbourhood N RDF G (U ) of a set of URIs U defines the union of the neighbourhood of each URI u ∈ U . An example from figure <ref type="figure" target="#fig_0">1</ref> is provided at the end of this section.</p><p>Because triples build a directed graph, we can further distinguish adjacent URIs of an URI u between predecessors and successors. The triples that contain u as object and a predecessor are inbound triples. They provide "secondary knowledge" about u <ref type="bibr" target="#b9">[10]</ref>. The triples that contain u as subject and a successor are outbound triples. They provide "primary knowledge" about u <ref type="bibr" target="#b9">[10]</ref>. We define the subset of inbound triples of a URI u from a set of triples G as the subset N −RDF G (u) ⊆ G containing all triples T (s, p, u) ∈ G that have u as object. We define the subset of outbound triples of a URI u from a set of triples G as the subset N +RDF Over multiple iterations, we look at the neighbourhood of the given frontier F i to construct the subset S. The input of the subset must express a sequence of construction steps adequately. Let C = (c 0 , c 1 , ..., c n ) be a tuple with c i ∈ {↔, ← , →} that represents the sequence of construction steps. For every iteration i, the expansion c i is executed on the frontier F i resulting in the partial subset S i ⊆ W that adds to the complete subset S. The construction step c i may construct the whole neighbourhood (↔), the set of inbound triples (←), or the set of outbound triples (→). The size n of the tuple C expresses the number of iterations and the maximum depth of triples around the seed.</p><p>The dedicated function S i = construct(W, F i , c i ) specifies to:</p><formula xml:id="formula_3">S i =      N RDF W (F i ) if c i = ↔ N +RDF W (F i ) if c i = → N −RDF W (F i ) if c i = ← (2)</formula><p>Regarding an implementation, several options to access and query Wikidata exist<ref type="foot" target="#foot_6">11</ref> . Available are per-item access with dereferenceable URIs, a SPARQL endpoint, and an endpoint for Linked Data Fragments. We won't evaluate these options as the implementation is not a focus of this paper. However, to give another expression of the constructed triples, the listings 1.2, and 1.3 show the SPARQL CONSTRUCT queries to construct the inbound triples (c i = ←) and the outbound triples (c i = →) respectively. Both queries together construct the neighbourhood (c i = ↔). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Filter Considerations</head><p>When we construct the neighbourhood of a URI, we receive all triples connected with that URI. This data quickly accumulates to a sizable subset. However, one of our goals for the subset is a smaller size. Additionally, we look for a minimal and full-covering subset. Thus, the algorithm has a filter function that can exclude triples that don't contain relevant information. It reduces a subset S i to a filtered subset S i ⊆ S i .</p><p>Because there are plenty of options to filter triples and the focus of filters varies between different use cases, the filter function S i = f ilter(S i ) is provided as an input to the algorithm to enable maximal flexibility. Some basic filter options to exclude triples are language-tagged literals for selected languages and triples using wikibase:Statements as they model duplicate information. Further possibilities are filters by Wikidata's property hierarchy or a filter for designated classes.</p><p>In this paper, however, we propose the idea of a filter for rarely used properties. We argue that triples with rarely used properties provide relatively little value for later analysis (like the namedAfter property in fig. <ref type="figure" target="#fig_0">1</ref>). If we analyse hundreds or thousands of instances of one class, we can exclude properties that are only used by one or two instances. This may not sound much, but a few properties filtered for many URIs add up. Additionally, the filter function also reduces data volume in future iterations because every removed triple also means one less adjacent URI for the next construction, so the frontier F i+1 shrinks (see section 6.5).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Let A +</head><p>px ⊆ S i be the set of triples</p><formula xml:id="formula_4">A + px = {T = (u, p x , o) | T ∈ S i and u ∈ F i } with the URIs u ∈ F i and a property p x . Let A − px ⊆ S i be the set of triples A − px = {T = (s, p x , u) | T ∈ S i and u ∈ F i } with the URIs u ∈ F i and a property p x . Let α(p x ) = |Apx| |F |</formula><p>be the fraction of URIs with the property p x from all URIs in the frontier F . Let α 0 be the threshold value any property p x must reach (α(p x ) ≥ α 0 ) so that A px is not removed from S i in order to build S i . For the construction of inbound triples (c i = ←), only A − px is relevant, as |A + px | = 0. For the construction of outbound triples (c i = →), only A + px is relevant, as |A − px | = 0. For the construction of the neighbourhood (c i = ↔), A + px and A − px are relevant and calculated separately. The seed F 0 of a subset probably contains Wikidata items (URIs) of only a few classes. URIs of the same class likely have similar properties, so α(p x ) tends to be relatively large for most properties. However, when we iteratively construct the neighbourhood, the classes of the URIs u ∈ F i will quickly differentiate and α(p x ) decrease. Therefore, a constant value for α 0 might not be the best solution. Instead, there are arguments to both decrease and increase it with each construction step. We might decrease α i to not exclude too many triples that could provide relevant information. We might also increase α i to remove even more triples that could provide irrelevant information. Both can be done in a additive (α i+1 = α i ± β) or multiplicative (α i+1 = α i * β) way. The decision to increase or decrease α i is in its consequence a decision for a more dense, maybe minimal subset or a full-covering subset that doesn't want to miss out relevant data. The values α 0 and β are set by input parameters of the filter function. It is also possible to select α 0 = β = 0 to disable the property filter.</p><p>With the proposed filter for rare properties, the function S i = f ilter(S i ) specifies to:</p><formula xml:id="formula_5">S i = S i \ {T = (s, p x , o) | α(p x ) &lt; α i (α 0 , β)}</formula><p>(3)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">Frontier Considerations</head><p>With the data that describes the URIs of the frontier F i collected and filtered (subset S i ), the next construction step is due. Therefore, we need to select a new set of URIs F i+1 . Because we build a subset from a seed through the construction of the surroundings of the seed, the new frontier F i+1 consists of URIs found in the last construction step. These new URIs are adjacent to the last frontier F i and can therefore be described using the set of adjacent URIs N Si (F i ) from graph theory <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b2">3]</ref>. However, not all adjacent URIs are proper for F i+1 . Let's say we expand through the complete neighbourhood. In the example of figure <ref type="figure" target="#fig_0">1</ref>, we reach the URI Berlin with the first construction step. Therefore, the URI is part of the following frontier (Berlin ∈ F i+1 ). The URI CBF is also included in that frontier (CBF ∈ F i+1 ). After another construction step, we reach Berlin again in the outbound triple from CBF. This would mean that we process the neighbourhood of Berlin twice. Thus, we must only select unvisited URIs for the new set of URIs F i+1 . The set of visited URIs V collects all newly visited URIs after every construction step. V can then be excluded from F i+1 . In consequence, the sets F i and F i+1 are disjoint:</p><formula xml:id="formula_6">F i ∩ F i+1 = ∅.</formula><p>We can also omit literals from the set F i+1 because they are never the subject of a triple and we are not interested in URIs with identical literal values as object (e.g. the year 1958 in fig. <ref type="figure" target="#fig_0">1</ref>). Additionally, we can omit blank nodes because they can't be queried.</p><p>The example in figure <ref type="figure" target="#fig_0">1</ref> shows another scenario we have to consider for F i+1 . The displayed graph represents the subset we would like to receive as a result of our algorithm. The URI Charité serves as seed (F 0 = {Charité}). Initially, we are interested in all data related to the seed, so we construct the complete neighbourhood in the first step (c 0 = ↔). However, out of all the URIs we discover with this construction, we know we only want to continue the construction in the next step with the successors, because the outbound relations of the seed are more important for our use case. In this case, the frontier must only consider the URIs N + Si (F i ) succeeding the set of URIs F i . The preceding URIs N − Si (F i ) are the opposite option. To include this information in the input, we add two options for the sequence of expansion steps C: Only select predecessors (↔ − ) or successors (↔ + ) after constructing the complete neighbourhood. Therefore, the expansion options are c i ∈ {↔, ↔ − , ↔ + , ←, →}.</p><p>For a construction of inbound (c i = ←) or outbound (c i = →) triples, the frontier also defines with N − Si (F i ) or N + Si (F i ) respectively because only those adjacent URIs exist.</p><p>In conclusion, the function F i+1 = f rontier(S i , c i , F i , V ) specifies to:</p><formula xml:id="formula_7">F i+1 =      (N Si (F i ) ∩ U ) \ V if c i ∈ {↔} (N − Si (F i ) ∩ U ) \ V if c i ∈ {↔ − , ←} (N + Si (F i ) ∩ U ) \ V if c i ∈ {↔ + , →}<label>(4)</label></formula><p>The listings 1.4, 1.  The resulting frontier would be F 2 = {BF, M itte}, however, the subset from figure <ref type="figure" target="#fig_0">1</ref> is already completed with C = (↔ + , →).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Summary</head><p>The paper introduces an algorithm S = subset(W, C, seed(), f ilter()) to create a demand-driven Wikidata subset. Therefore, we construct triples in multiple steps starting from a seed. Available options c i ∈ {↔, ↔ − , ↔ + , ←, →} that sequence the construction offer flexibility. A filter reduces the number of triples to exclude irrelevant data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Future Work</head><p>We are currently in the process of implementing the introduced algorithm. With a working implementation, we plan to construct a Wikidata subset of the more challenging domain of microelectronics and their supply chains.</p><p>Once we created multiple subsets, we need an evaluation to compare them. The clustering coefficient from graph theory seems like one possible parameter.</p><p>One advantage of Wikidata is its links to other data sources on the web. With techniques from link traversal query execution, we could follow the links and include data we find in the subset.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Example of simplified data from Wikidata on the hospital Charité</figDesc><graphic coords="4,134.77,430.73,345.84,142.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>G</head><label></label><figDesc>(u) ⊆ G containing all triples T (u, p, o) ∈ G that have u as subject. The sets of inbound triples N −RDF G (U ) ⊆ G and outbound triples N +RDF G (U ) ⊆ G of a set of URIs U define as the union of the appropriate sets of each URI u ∈ U .</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Listing 1 . 2 :</head><label>12</label><figDesc>SPARQL query that constructs the inbound triples of an URI u CONSTRUCT WHERE { GRAPH W { ? s ?p $u } . } Listing 1.3: SPARQL query that constructs the outbound triples of an URI u CONSTRUCT WHERE { GRAPH W { $u ?p ? o } . } In the example of figure 1, the first construction step c 0 = ↔ returned the inbound and outbound triples of the seed F 0 = {Charité}. Therefore, the first partial subset is S 0 = N RDF W (Charité) and returns the triples: {(KfP partOf Charité), (KfN partOf Charité), (Charité subsidiary CBF), (Charité locatedIn Berlin), (Charité subsidiary VK)}</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>FILTER ( isURI ( ? u ) ) } In the example of figure 1, the first construction step c 0 =↔ + results in the frontier F 1 = {CBF, Berlin, V K} (see section 6.3 for S 0 ). The following step only constructs outbound triples (c 1 = →). The corresponding subset S 1 = N +RDF W (F 1 ) returns the triples: {(CBF locatedIn Berlin), (CBF inception 1958), (CBF namedAfter BF), (Berlin population 3644826), (Berlin contains Mitte), (VK locatedIn Mitte), (VK inception 1906)}</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>5, and 1.6 show SPARQL SELECT queries that select all URIs u that are adjacent (c i ∈ {↔}) / predecessors (c i ∈ {↔ − , ←}) / successors (c i ∈ {↔ + , →}) to an URI f ∈ F i respectively. The SPARQL queries don't exclude visited URIs u ∈ V . Listing 1.4: SPARQL query that selects URIs u adjacent to f ∈ F i for the frontier F i+1 from S i $ f ?p ?u . } UNION {?u ?p $ f . } FILTER ( isURI ( ? u ) ) } Listing 1.5: SPARQL query that selects URIs u that are predecessors of f ∈ F i for the frontier F i+1 from S i Listing 1.6: SPARQL query that selects URIs u that are successors of f ∈ F i for the frontier F i+1 from S</figDesc><table><row><cell>SELECT ?u</cell></row><row><cell>FROM Si '</cell></row><row><cell>WHERE { { SELECT ?u</cell></row><row><cell>FROM Si '</cell></row><row><cell>WHERE { ?u ?p $ f .</cell></row><row><cell>FILTER ( isURI ( ? u ) ) }</cell></row></table><note>i SELECT ?u FROM Si ' WHERE { $ f ?p ?u .</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_0">See https://www.w3.org/TR/sparql11-query/ (Last accessed 29 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">Available at https://www.wikidata.org/wiki/Wikidata:WikiProject Schemas/Subsetting (Last accessed 29 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_2">Available at https://www.wikidata.org/wiki/Wikidata:Database download (Last accessed 29 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_3">Available at https://wdumps.toolforge.org/ (Last accessed 29 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_4">Available at https://www.dbpedia.org/resources/ (Last accessed 29 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_5">See https://www.w3.org/TR/rdf11-concepts/ (Last accessed 30 July 2021).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_6">See https://www.wikidata.org/wiki/Wikidata:Data access (Last accessed 2 August 2021).</note>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Supported by the German Federal Ministry of Education and Research (FKZ 16ME0224). <ref type="bibr" target="#b2">3</ref> See https://www.wikidata.org/wiki/Q96051494 (Last accessed 20 Mai 2021). <ref type="bibr" target="#b3">4</ref> See https://www.w3.org/TR/rdf11-concepts/ (Last accessed 4 August 2021).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Improving the performance of focused web crawlers</title>
		<author>
			<persName><forename type="first">S</forename><surname>Batsakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">G</forename><surname>Petrakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Milios</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.datak.2009.04.002</idno>
		<ptr target="https://doi.org/10.1016/j.datak.2009.04.002" />
	</analytic>
	<monogr>
		<title level="j">Data &amp; Knowledge Engineering</title>
		<imprint>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="1001" to="1013" />
			<date type="published" when="2009-10">Oct 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Experiences of Using WDumper to Create Topical Subsets from Wikidata</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A H</forename><surname>Beghaeiraveri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J G</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">J</forename><surname>Mcneill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd International Workshop on Knowledge Graph Construction co-located with 18th Extended Semantic Web Conference</title>
				<meeting>the 2nd International Workshop on Knowledge Graph Construction co-located with 18th Extended Semantic Web Conference<address><addrLine>ESWC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021-06">2021. Jun 2021</date>
			<biblScope unit="page">15</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Graph Theory with Applications</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Bondy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><forename type="middle">S R</forename><surname>Murty</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1982">1982</date>
			<publisher>Elsevier</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
	<note>5 edn</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Focused crawling: a new approach to topic-specific Web resource discovery</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Den Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dom</surname></persName>
		</author>
		<idno type="DOI">10.1016/S1389-1286(99)00052-3</idno>
		<ptr target="https://doi.org/10.1016/S1389-1286(99)00052-3" />
	</analytic>
	<monogr>
		<title level="j">Computer Networks</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">11-16</biblScope>
			<biblScope unit="page" from="1623" to="1640" />
			<date type="published" when="1999-05">May 1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The common-neighbourhood of a graph</title>
		<author>
			<persName><forename type="first">P</forename><surname>Dundar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aytac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kilic</surname></persName>
		</author>
		<idno type="DOI">10.5269/bspm.v35i1.22464</idno>
		<ptr target="https://doi.org/10.5269/bspm.v35i1.22464" />
	</analytic>
	<monogr>
		<title level="j">Boletim da Sociedade Paranaense de Matemática</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">23</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Executing SPARQL Queries over the Web of Linked Data</title>
		<author>
			<persName><forename type="first">O</forename><surname>Hartig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Freytag</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3</idno>
		<idno>-642-04930-9 19</idno>
		<ptr target="https://doi.org/10.1007/978-3" />
	</analytic>
	<monogr>
		<title level="m">The Semantic Web -ISWC 2009</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Bernstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Karger</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Heath</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Feigenbaum</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Motta</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Thirunarayan</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">5823</biblScope>
			<biblScope unit="page" from="293" to="309" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Foundations of Traversal Based Query Execution over Linked Data (Extended Version)</title>
		<author>
			<persName><forename type="first">O</forename><surname>Hartig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Freytag</surname></persName>
		</author>
		<idno type="DOI">10.1145/2309996.2310005</idno>
		<ptr target="https://doi.org/10.1145/2309996.2310005" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd ACM conference on Hypertext and social media</title>
				<meeting>the 23rd ACM conference on Hypertext and social media</meeting>
		<imprint>
			<date type="published" when="2012-06">Jun 2012</date>
			<biblScope unit="page" from="43" to="52" />
		</imprint>
	</monogr>
	<note>HT &apos;12</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Domain Specific Knowledge Graph Embedding for Analogical Link Discovery</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mimouni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Moissinac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Vu</surname></persName>
		</author>
		<ptr target="https://hal-cnrs.archives-ouvertes.fr/hal-03052226" />
	</analytic>
	<monogr>
		<title level="j">International Journal On Advances in Intelligent Systems</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">1&amp;2</biblScope>
			<biblScope unit="page" from="140" to="150" />
			<date type="published" when="2020-06">Jun 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Web Crawling</title>
		<author>
			<persName><forename type="first">C</forename><surname>Olston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Najork</surname></persName>
		</author>
		<idno type="DOI">10.1561/1500000017</idno>
		<ptr target="https://doi.org/10.1561/1500000017" />
	</analytic>
	<monogr>
		<title level="j">Foundations and Trends in Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="175" to="246" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Stickler</surname></persName>
		</author>
		<ptr target="https://www.w3.org/Submission/CBD/" />
		<title level="m">CBD -Concise Bounded Description</title>
				<imprint>
			<date type="published" when="2005-06">Jun 2005</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
