<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experiences of Using WDumper to Create Topical Subsets from Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seyed Amir Hosseini Beghaeiraveri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>@hw.ac.uk</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alasdair J.G. Gray</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>]a.j.g.gray@hw.ac.uk</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fiona J. McNeill</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>]f.j.mcneill@ed.ac.uk</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Informatics, The University of Edinburgh</institution>
          ,
          <addr-line>Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Mathematical and Computer Sciences, Heriot-Watt University</institution>
          ,
          <addr-line>Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Wikidata is a general-purpose knowledge graph covering a wide variety of topics with content being crowd-sourced through an open wiki. There are now over 90M interrelated data items in Wikidata which are accessible through a public query endpoint and data dumps. However, execution timeout limits and the size of data dumps make it difficult to use the data. The creation of arbitrary topical subsets of Wikidata, where only the relevant data is kept, would enable reuse of that data with the benefits of cost reduction, ease of access, and flexibility. In this paper, we provide a working definition for topical subsets over the Wikidata Knowledge Graph and evaluate a third-party tool (WDumper) to extract these topical subsets from Wikidata.</p>
      </abstract>
      <kwd-group>
        <kwd>wikidata</kwd>
        <kwd>knowledge graph subsetting</kwd>
        <kwd>topical subset</kwd>
        <kwd>wdumper</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A Knowledge Graph (KG) is defined as representing real-world entities as
nodes in a graph with the relationships between them captured as edges [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
In recent years, there have been a growing number of publicly available KGs,
ranging from focused topic-specific ones such as GeoLinkedData [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], EventMedia
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and UniProt [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], to more general knowledge ones such as Freebase [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
DBpedia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and Wikidata [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. These general-purpose KGs cover a variety of
topics, from sports to geography and literature to life science, with varying degrees
of granularity.
      </p>
      <p>
        Wikidata [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is a collaborative open KG created by the Wikimedia
Foundation. The main purpose of Wikidata is to provide reliable structured data to
feed other Wikimedia projects such as Wikipedia, and it is actively used in over
800 Wikimedia projects (https://www.wikidata.org/wiki/Wikidata:Statistics,
accessed 4 February 2021). It contains over 90 million data items covering over
75 thousand topics. Regular dumps of the data are published in JSON and
RDF under the Creative Commons CC0 public license. However, the size of the
gzipped download files has grown from 3GB in 2015 to 85GB in 2020, and keeps
increasing as more data is added. The size of these files makes it increasingly
difficult and costly for others to download and reuse, particularly if only focused
on a particular topic within the data, e.g. life sciences data or politicians. While
Wikidata can be queried directly through an open SPARQL endpoint
(https://query.wikidata.org/sparql, accessed February 2021), it is
subject to usage limits (as are other public endpoints of KGs) which limit the
scale and complexity of queries that can be issued [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Thus there is a need for
topical subsets of large KGs that contain all data on a specific topic, to enable
complex analysis queries to be conducted in a cost-effective and time-efficient
way. (Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).)
      </p>
      <p>Our motivation for research on Wikidata subsetting is a combination of
research goals, flexibility, and ease of use. From the flexibility and ease of use
perspective, we are looking for Wikidata subsets that allow users to run
smaller versions of Wikidata on commonly available platforms such as laptops and PCs.
Wikidata, as a knowledge graph with an interesting data model, has significant
features for inspiration and improvement, but the speed of research and the
diversity of researchers will be reduced if any experiment on it requires powerful
servers, processing clusters, and hard disk arrays. From the research point of
view, our motivation is creating a type of subset we call a Topical Subset, which is
a set of entities related to a particular topic and the relationships between them.
Having topical subsets of Wikidata, for example in the fields of art, life sciences,
or sports, not only helps us achieve the first goal (flexibility and ease of use)
but also provides a platform for comparing and evaluating Wikidata features in
different topics. Such subsets also make research experiments reproducible,
as the data used can be archived and shared more easily.</p>
      <p>
        One can envision such subsets being generated through SPARQL
CONSTRUCT queries. While this is straightforward for small subsets focused
on a single entity type, e.g. politicians, it does not scale to interrelated topics
that make up a larger domain, e.g. the life sciences subset defined by Gene Wiki
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In this paper, we present our experiences of defining and creating topical
subsets over Wikidata using WDumper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a third-party tool provided by
Wikidata to create custom RDF dumps of Wikidata. We define topical subsets over
a KG (Section 3.2) and evaluate WDumper as a practical tool to extract such
subsets (Section 4).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        KG subsetting, particularly in the context of Wikidata [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], has been gaining
attention in the last couple of years, with use cases and potential approaches
4 Query to count all data items https://w.wiki/yVY accessed 9 February 2021. Note
that the query execution timesout if you try to return the count query SELECT
(COUNT(DISTINCT ?0) AS ?numTopics) WHERE f ?s wdt:P31 ?o g.
5 https://query.wikidata.org/sparql accessed February 2021
being explored [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Although there have been several approaches to subsetting
proposed, to the best of our knowledge there is no agreed de nition for a topical
subset nor a uni ed and evaluated way to create such subsets. The Graph to
Graph Mapping Language (G2GML) enables the conversion of RDF graphs into
property graphs using an RDF file or SPARQL endpoint and a mapping file [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
This could be exploited to create a topical subset based on the definition in the
mapping file. However, the output would be a property graph. Context Graph
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] can be used to generate subsets of an RDF KG, and was developed to create
subsets for testing knowledge base completion techniques. The approach captures
all nodes and edges within a given radius of a set of seed nodes. While the
generated subsets are suitable for testing knowledge base completion techniques,
there is no guarantee of the topical cohesion of the subset, and thus it does not
meet the needs identified for this work. YAGO4 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a collection of Wikidata
instances under the type hierarchy of schema.org [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It proposes a new, logically
consistent knowledge base over the Wikidata A-Box; however, there is no choice
as to which topics appear in the final KG.
      </p>
      <p>
        WDumper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a third-party tool for creating custom and partial RDF
dumps of Wikidata, suggested on the Wikidata database download page
(https://www.wikidata.org/wiki/Wikidata:Database_download). The
WDumper backend uses the Wikidata Toolkit (WDTK) Java library to apply
filters on the Wikidata entities and statements, based on a specified configuration
that is created by its Python frontend. The tool needs a complete JSON dump
of Wikidata and creates an N-Triples file as output, based on the filters that the
configuration file describes. The tool can be used to create custom subsets of
Wikidata. In Section 4.5 we will investigate whether it can generate topical subsets.
      </p>
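      <p>The flow described above (stream the JSON dump, keep entities matching the
filters, and emit their statements) can be sketched in Python. This is a toy
illustration of the idea, not WDumper's actual code; the entity records and key
names below are invented for the example.</p>
```python
import json

# Toy stand-in for a few lines of a Wikidata JSON dump: one JSON
# document per entity (invented, heavily simplified records).
DUMP_LINES = [
    '{"id": "Q1", "claims": {"P106": ["Q82955"], "P19": ["Q84"]}}',
    '{"id": "Q2", "claims": {"P106": ["Q937857"]}}',
    '{"id": "Q3", "claims": {"P106": ["Q82955"]}}',
]

def entity_filter(entity, prop, value):
    """Keep an entity if any claim for `prop` has the given value."""
    return value in entity.get("claims", {}).get(prop, [])

def extract_subset(lines, prop, value):
    """Stream the dump, keep matching entities, emit (s, p, o) triples."""
    triples = []
    for line in lines:
        entity = json.loads(line)
        if entity_filter(entity, prop, value):
            for p, objs in entity["claims"].items():
                for o in objs:
                    triples.append((entity["id"], p, o))
    return triples

# A Politicians-style filter: occupation (P106) = politician (Q82955).
subset = extract_subset(DUMP_LINES, "P106", "Q82955")
```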
      <p>
        Shape Expressions (ShEx) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a structural schema language allowing
validation, traversal and transformation of RDF graphs. By caching triples that
match the constraints, referred to as "slurping", a subset of the dataset can be
created that conforms to the ShEx schema. Therefore, to create a topical subset,
all that is required is the definition of the ShEx schema. However, this approach
is not yet available at scale and thus cannot be applied to Wikidata.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Knowledge Graph Subsetting</title>
      <p>General purpose KGs like Wikidata are valuable sources of facts on a wide
variety of topics. However, their increasing size makes them costly and slow to
use locally. Additionally, the large volume of data in Wikidata increases the time
required to run complex queries. This often restricts the types of queries that
can be posed over the public endpoint since it has a strict 60-second limit on the
execution time of queries. Downloading and using a local version of Wikidata is a
way of circumventing the timeout limit. However, this is not a cheap option due
to the size of the data. A suggested system for hosting a personal copy of Wikidata
includes 16 vCPUs, 128GB memory, and 800GB of raided SSD space (see
https://addshore.com/2019/10/your-own-wikidata-queryservice-with-no-limits/).</p>
      <p>There are a large number of use case scenarios where users do not need access
to all topics in a large general-purpose KG. A small but sufficiently complete subset
can be more suitable for many purposes. With a small subset, inference strategies
can be applied to the data and completed in reasonable time. Topical subsets
could also be published along with papers, which provides better reproducibility
of experiments. Therefore, having a topical subset that is smaller but has the
required data can enable complex query processing on cheap servers or personal
computers, reducing the overall cost, whilst also providing an improvement
in query execution times.</p>
      <sec id="sec-3-1">
        <title>Topical Subset Use Cases</title>
        <p>We now define four use cases for topical subsets in Wikidata that we will use
to review WDumper, and that can also be used in other reviews as a comparison
platform. Note that the use cases are defined in terms of English language
statements. A subsetting approach, method, or tool would need to formalise these,
as appropriate for its configuration, to extract the relevant data.
Politicians: This subset should contain all entities that are an instance of the
class politician, or any of its subclasses. In the case of Wikidata, this would
be the class Q82955, while for DBpedia it would be the class Politician. The
subset should contain all facts pertaining to these entities, i.e. in Wikidata
all statements and properties.</p>
        <p>General (military) Politicians: The subset should contain all entities that
are an instance of the class politician (Q82955) or any of its subclasses, who
are also a military officer (Q189290) and have the rank of general (Q83460),
i.e. politico-military individuals. The main goal of this use case is to see the
effect of having more conditions in the English definition on the run-time
and the volume of the output of subset extraction tools.</p>
        <p>UK Universities: The subset should contain all instances of the class
university (Q3918), or any of its subclasses, that are located in the UK. The subset
should contain all statements and properties pertaining to these entities. This
use case extends the complexity of the subset by having alternative
properties and values to satisfy, e.g. the location can be captured in Wikidata with
the properties country (P17), located in territory (P131), or location (P276).
Likewise, the country could be stated as one of the component parts of the
UK, e.g. Scotland.</p>
        <p>
          Gene Wiki: This case is based on the class-level diagram of the Wikidata
knowledge graph for biomedical entities given in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The class-level
diagram specifies 17 different item types from Wikidata mentioned in the Gene
Wiki project. The subset should contain all instances of these 17 classes and
their subclasses.
        </p>
        <p>The selection of these use cases is a combination of research and experimental
goals. The Gene Wiki and Politicians use cases have been selected for future
research purposes because of their hypothetical richness in references. The other
two use cases have been chosen to explore the expressiveness of a topical subset
definition, and then to explore the runtime execution of these.</p>
        <p>Listing 1.1. An example of a function R which is a query to return all entities with
type city:
SELECT ?entity WHERE {
  ?entity wdt:P31 wd:Q515 . # instance of (P31) city (Q515)
}</p>
      </sec>
      <sec id="sec-3-2">
        <title>Topical Subset Definition</title>
        <p>We now provide a definition for topical subsets based on the Wikidata data
model. Wikidata consists of the following collections:
– E: the set of Wikidata entities (their IDs start with a Q).
– P: the set of Wikidata properties (their IDs start with a P).</p>
        <p>– S: the set of Wikidata statements.</p>
        <p>Now we define the filter function R : E → E as a black-box that can be applied
to E and selects a finite number of its members related to a specific topic. Let
E_R ⊆ E be the output of the function R. For an entity e ∈ E, let S_e ⊆ S be
all simple and complex Wikidata statements in which e is the subject. Note
that in Wikidata, a simple statement is a regular RDF triple, while a complex
statement is a triple that has references and/or qualifiers attached to it. Also, let P_e
be all properties which are used in S_e triples, either for the statement itself or for
its qualifiers/references. With these assumptions, we define the dump D_R as a topical
subset of Wikidata with respect to R:</p>
        <p>D_R := (E_R, ∪_{e ∈ E_R} P_e, ∪_{e ∈ E_R} S_e)</p>
        <p>From the definitions of P_e and S_e we can conclude that
∪_{e ∈ E_R} P_e ⊆ P and ∪_{e ∈ E_R} S_e ⊆ S, and subsequently D_R is a subset
of Wikidata. We consider R as a black-box; the input of R is the set of all Wikidata
entities and its output is a subset of Wikidata entities related to a specific topic.
The function R can be any set of definitions, rules, or filters that describe a
related group of entities. The definition of R depends on the topic that is being
described. One example of R is a simple SELECT query that describes all entities
that have type city (Q515) (Listing 1.1).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>WDumper</title>
      <p>WDumper is a tool provided by Wikidata for producing custom subsets from
Wikidata (https://www.wikidata.org/wiki/Wikidata:Database_download#Partial_RDF_dumps,
accessed 9 February 2021). The R function can be seen as a filter approach on entities. For each
topic, the appropriate filters on entities must be defined. Once the filters are
defined and the subset E_R is extracted, WDumper extracts all statements with
origin e, where e ∈ E_R, along with their qualifiers and references. A component
overview of WDumper is given in Figure 1.</p>
      <p>WDumper requires two inputs. The first is the complete dump of
Wikidata in JSON. The second is a JSON specification file that contains rules and
filters for determining which entities, properties and statements to extract from
the full Wikidata dump. This is the definition of the function R. The output of
WDumper is an N-Triples (.nt) file that contains the entities and statements
specified in the second input. There is also a GUI for creating the input
specification file. We review WDumper through the following steps:
1. Writing WDumper specifications for the use cases in Section 3.1.
2. Running WDumper with the above specifications on two complete Wikidata
dumps belonging to two different time points and comparing the run-time
and the volume of the extracted output.
3. Evaluating the extracted output by performing different queries both on the
output and the input full dump.
4. Summarizing results and expressing strengths and weaknesses of WDumper.</p>
      <sec id="sec-4-1">
        <title>WDumper Specification Files</title>
        <p>
          Section 3.1 introduced use cases to evaluate subset extraction tools. In this
section, we describe the corresponding WDumper specification files [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] generated
using the GUI depicted in Figure 2. The GUI provides several controls as to
what to include in the subset. The first is to select whether you are interested
in items or properties. Using properties allows you to extract a subset of the
Wikidata ontology, while items returns data instances. Items can be further
filtered by giving a property and value. Other options then permit you to select
whether all statements are returned ("any") or just the top-ranked statements
("best rank"). (Note that in Wikidata, each statement can have a rank that can
be used to identify the preferred value if the statement has more than one value.
Ranks are available in the Wikidata RDF data model, like qualifiers and
references.) These filters allow WDumper to extract the intermediate
nodes of statements, references, and qualifiers. In all use cases, we created a
specification with and without the additional references and qualifiers. This
allows us to investigate the effect of statement filters on execution time and output
volume. In the specification file, these filters can be seen in the statement
sub-array, as the "qualifiers", "references", "simple", and "full" keys, which are
false or true respectively.
        </p>
        <p>Politicians: We define a filter on the occupation property (P106) of the entities
to be politician (Q82955).</p>
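        <p>For illustration, such a specification has roughly the following shape when
written by hand. The key names below follow the description above, but this is a
sketch, not an authoritative rendering of WDumper's schema.</p>
```python
import json

# Sketch of a WDumper-style spec for the Politicians use case: filter items
# by occupation (P106) = politician (Q82955), and include full statements
# with qualifiers and references ("withRQFS"). Key names are illustrative.
spec = {
    "entities": [{
        "type": "item",
        "properties": [
            {"property": "P106", "value": "Q82955", "rank": "all"},
        ],
    }],
    "statements": [{
        "rank": "all",
        "simple": True,
        "full": True,
        "qualifiers": True,
        "references": True,
    }],
}

spec_json = json.dumps(spec, indent=2)  # what would be fed to the tool
```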
        <p>General (military) Politicians: This extends the Politicians definition with
two more conditions. The first is the occupation property (P106) to be
military officer (Q189290). The second is the military rank property (P410) to be
general (Q83460).</p>
        <p>UK Universities: We define a filter on entities with two conditions: the
instance of property (P31) to be university (Q3918), and the country
property (P17) to be United Kingdom (Q145).</p>
        <p>
          Gene Wiki: For each item type of the class-diagram in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], we create a filter
on the corresponding entities in WDumper via the instance of property (P31)
to be gene (Q7187), protein (Q8054), etc. In this case, no filters are defined
on the types, as we require all statements associated with the types to be in
the subset.
        
      </sec>
      <sec id="sec-4-2">
        <title>Experimental Setup</title>
        <p>We now give details of our experimental environment in which we will evaluate
WDumper.</p>
        <p>Input Dumps. We use two full dumps of the Wikidata database. The first full
dump is from 27 April 2015 (downloaded from
https://archive.org/download/wikidata-json-20150427, accessed 11 November 2020),
and the second is from 13 November 2020 (downloaded from
https://dumps.wikimedia.org/wikidatawiki/entities/, accessed 15 November 2020).
The selected 2020 dump was the latest JSON dump available when conducting our
evaluation. The selected 2015 dump is the first archive date for which both JSON
and Turtle files are available (we need the JSON file to run WDumper, while
the Turtle file is needed to import the full dump into a triplestore and evaluate the
output of WDumper against the input). Table 1 provides summary information
about these two dumps. The 2015 dump is smaller, can be stored and processed
locally even on PCs, and takes a much shorter time to generate output. For
this reason, it is very suitable for initial tests. The 2020 dump, on the other
hand, is much richer and can provide insights on how WDumper deals with
large datasets of the size that Wikidata now produces.</p>
        <p>Experimental Environment. Experiments were performed on a multi-core server
with 64 8-core AMD 6380 2.5GHz 64bit CPUs, 512GB of memory and 2.7TB
disk space. Java OpenJDK version 11 (build 11+28) and Gradle 6.7.1 were used
to compile and run WDumper.</p>
        <p>Experimental Run. The reported times were extracted from the elapsed
time recorded in the WDumper output log. For each of the execution cases, three
independent runs were performed. The average and standard deviation of
these times were calculated.</p>
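        <p>The aggregation of the three runs is straightforward; with Python's statistics
module it amounts to the following (the run times here are invented for
illustration).</p>
```python
import statistics

# Elapsed times (seconds) of three independent runs for one use case.
runs = [3605.0, 3642.0, 3580.0]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)  # sample standard deviation (n - 1)
```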
      </sec>
      <sec id="sec-4-3">
        <title>Evaluating WDumper</title>
        <p>
          WDumper was run with the specification files described in Section 4.1 and the
two Wikidata dumps. The generated subsets are included in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and the results
are stated in Table 2. For each use case, we generated subsets of simple statements
and with the inclusion of statement nodes, references, and qualifiers (labelled
"withRQFS" in the table).
        </p>
      <p>Initial observations from Table 2 show that the run-time on the 2020 dump is
significantly longer than on the 2015 dump. This can be justified by the larger
volume of data that must be processed to produce the subset. In all cases, generating
the subset with additional statements took longer than generating the simple
statements, and produced more volume in the output, indicating the addition
of references, qualifiers, and statement nodes in the output. For example, this
change is very significant in the case of Gene Wiki in the 2020 dump. The added
filters, as well as the conditions added to the filters, also have a direct effect on the
run-time and an inverse effect on the output volume, which is to be expected. Of
course, within run-times, the amount of data that must be written to the output
must also be considered. This is evident in comparisons between UK Universities
and the military politicians, in which the volume of data has a greater impact
than the number of conditions. Overall, considering the high volume of data,
the time required to extract a topical subset by WDumper seems appropriate.
Adding more filters does not have a huge effect on runtime, which is dominated
by data volume.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Topical Subset Validation</title>
        <p>The previous section considered the runtime performance of the WDumper tool
and the size of the generated subsets. We now consider validating the content
of the subsets. That is, we consider whether the produced output has the information
that it was supposed to have according to the definition in Section 3.2. Our
assessment is based on the following conditions:
Condition 1: The number of filtered entities in the output should be equal
to the number of matching entities in the input dump. For example, in the
Politicians use case, the number of persons with the occupation of politician
in the output should be equal to the number of persons with the occupation of
politician in the corresponding input dump. This condition can be tested
with COUNT queries on the input and output datasets.</p>
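        <p>The check itself is mechanical: run the same count on both datasets and
compare. A minimal sketch over plain tuples standing in for the input dump and
the WDumper output (toy data; the real checks run as SPARQL COUNT queries
against Fuseki and WDQS):</p>
```python
def count_matching(triples, prop, value):
    """COUNT(DISTINCT ?item) WHERE { ?item prop value }, over plain tuples."""
    return len({s for s, p, o in triples if p == prop and o == value})

# Toy input dump and WDumper output for the Politicians filter.
input_dump = [
    ("Q23", "P106", "Q82955"),  # George Washington: occupation politician
    ("Q23", "P26", "Q42"),      # some other statement about Q23
    ("Q76", "P106", "Q82955"),
]
output_dump = list(input_dump)  # a correct subset preserves the count

condition1_holds = (count_matching(input_dump, "P106", "Q82955")
                    == count_matching(output_dump, "P106", "Q82955"))
```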
        <p>Listing 1.2. Commands used for fixing syntax errors of the 2015 dump.
sed -i -E 's/(&lt;.*)}(.*&gt;)/\1\2/' &lt;dump_file&gt;
sed -i -E 's/(&lt;.*)\\n(.*&gt;)/\1\2/' &lt;dump_file&gt;
sed -i -E 's/(&lt;.*)\|(.*&gt;)/\1\2/' &lt;dump_file&gt;
Condition 2: For each entity that is supposed to be in the output, the number of
its related statements must be equal in both the input and output datasets. For
example, in the Politicians use case, if the main dump has 50 statements
about George Washington, we expect to see the same number of statements
about this politician in the output too. This condition can be tested using
DESCRIBE queries.</p>
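        <p>The same clean-up can be expressed with Python's re module, which makes
the intent of the substitutions in Listing 1.2 easier to see: each one deletes an
illegal character occurring inside an IRI, i.e. between a '&lt;' and a '&gt;'. This is
a sketch that assumes one triple per line, mirroring the listing.</p>
```python
import re

# Each fix deletes one kind of illegal character ('}', a literal '\n'
# escape sequence, or '|') found between '<' and '>' on a line.
FIXES = [
    (re.compile(r"(<.*)}(.*>)"), r"\1\2"),
    (re.compile(r"(<.*)\\n(.*>)"), r"\1\2"),
    (re.compile(r"(<.*)\|(.*>)"), r"\1\2"),
]

def sanitize_line(line):
    """Apply all three substitutions to one line of the dump."""
    for pattern, repl in FIXES:
        line = pattern.sub(repl, line)
    return line

bad = "<http://example.org/Q1|x> <http://example.org/P1> <http://example.org/Q2}> ."
```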
        <p>Condition 3: WDumper can extract intermediate statement nodes, references,
and qualifiers exactly as they are in the input dump. This condition can be
tested by querying the qualifiers and references of some given statements.
Testing the conditions requires running different queries on the input and output
of WDumper. For the output of WDumper, we use Apache Jena Fuseki version
3.17 to import the data as TDB2 RDF datasets and perform queries.
Data Corrections. We encountered problems loading the 2015 Turtle dump
(downloaded from https://archive.org/download/wikidata-json-20150427,
accessed 20 December 2020) into Fuseki. Errors arose from bad line endings in
more than 100 cases, and from characters such as '\a'. Unacceptable characters
such as '\n' can also be seen in the WDumper outputs, which reinforces
the possibility that this problem occurs due to the conversion of information
from the JSON file to RDF format.</p>
        <p>
          For the WDumper outputs and the 2015 dump, these errors were
manually fixed using the sed commands given in Listing 1.2. The sanitized versions
are available in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In the case of the 2020 dump, we use the Wikidata Query Service
(WDQS) due to the time that would be required to fix and load this data into
Fuseki. The date of implementing our evaluation queries is approximately two
months after the creation date of the 2020 dump (27 November 2020). In this
period, new data may have entered Wikidata which is available via WDQS but
is not present in the 2020 dump (and subsequently not present in the WDumper
output). Because of this, there may be slight differences in the counts of
entities and statements between input and output that are not related to WDumper
functionality. We tried to use the Wikidata history query service
(https://www.wikidata.org/wiki/Wikidata:History_Query_Service) to quantify the
rate of Wikidata's growth in this period, but the history only covers the range from
the creation of Wikidata to 1 July 2019.
        <p>Validation of Condition 1. We use COUNT queries to validate this condition.
The purpose of these queries is to count the entities that should be in the output
according to the filter(s) of each use case.</p>
        <p>Listing 1.3. COUNT queries for evaluating condition 1. Prefixes and most of Gene
Wiki's query have been deleted for readability.
############ Politicians ##############################
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P106 wd:Q82955 . # occupation politician
}
############ UK Universities ##########################
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P31 wd:Q3918 ; # instance of university
        wdt:P17 wd:Q145 .  # country United Kingdom
}
############ General (military) Politicians ############
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P106 wd:Q82955 ;  # occupation politician
        wdt:P106 wd:Q189290 ; # occupation military officer
        wdt:P410 wd:Q83460 .  # military rank general
}
############ Gene Wiki ################################
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  { ?item wdt:P31 wd:Q423026 . }   # instance of active site
  UNION
  { ?item wdt:P31 wd:Q4936952 . }  # instance of anat. struct.
  UNION
  # ...
  UNION
  { ?item wdt:P31 wd:Q50379781 . } # instance of therap. use
}
If WDumper is performing correctly,</p>
        <sec id="sec-4-4-1">
          <title>Use case</title>
          <p>Politicians
General (military) Politicians
UK Universities</p>
          <p>Gene Wiki
2015 Dump 2020 Dump</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>Output Input Output Input</title>
          <p>246,009 246,044 641,387 646,401
165 165 597 602
73 73 183 186
19,432 19,432 3,282,560 3,283,471
the result of this count should be the same on both the input and output datasets.
These queries will be di erent for each use case, depending on the de nition of
that use case. For example, while in the Politicians use case we count the number
of people with political jobs, in the case of Gene Wiki we count the union of
entities of type disease, genes, proteins, etc. Listing 1.3 shows the queries executed
for each use case. These queries run on the each use case's output \withRQFS"
since these are included in the input dataset. The results of performing the
COUNT queries are shown in Table 3.</p>
          <p>Our results show that for the 2015 dump, the number of entities in the
output and input is equal except for the Politicians use case. In both the 2015 and
2020 dumps, the difference between input and output is less than one percent in
the cases of inequality. In the case of the 2020 dump, the difference can be attributed
to the entry of new data in the interval between our tests and the dump date.
This is reasonable especially in the case of Gene Wiki, where bots are importing
new information into Wikidata every day. In the case of the 2015 dump in the
Politicians row, the difference of 35 entities between input and output cannot be
explained this way. The reason for this difference may be the inability of WDumper
to parse the data of these entities in the input dump. WDumper uses the JSON
file as input, and to be able to fetch an entity, it must see the specific structure of
the Wikidata arrays and sub-arrays in the JSON file. Some entities may not have
this complete structure in the JSON file but they do exist in the Turtle file.</p>
          <p>Validation of Condition 2. To validate this condition, in each use case we use
DESCRIBE queries for an arbitrary entity that is in the WDumper output.
DESCRIBE queries list all triples of the given entity. We expect that the result
of the DESCRIBE queries should be the same on both the input and output
datasets. For each use case, we selected an arbitrary entity (called Tested Entity)
which is present in both the input and output dataset. We then run a DESCRIBE
wd:Q... query and count the extracted triples. Table 4 shows our results.</p>
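          <p>The DESCRIBE-based check reduces to comparing the sets of triples about one
subject in the two datasets. A toy sketch of the comparison, with plain tuples
standing in for the input and output datasets (the missing label triple here is an
invented example of a predicate a subsetter might drop):</p>
```python
def describe(triples, entity):
    """Roughly what DESCRIBE returns: all triples with the given subject."""
    return {t for t in triples if t[0] == entity}

input_dump = {
    ("Q23", "P31", "Q5"),
    ("Q23", "P106", "Q82955"),
    ("Q23", "skos:prefLabel", "George Washington"),  # label-style triple
}
output_dump = {
    ("Q23", "P31", "Q5"),
    ("Q23", "P106", "Q82955"),
}

# The mismatch is exactly the set of predicates missing from the output.
missing = describe(input_dump, "Q23") - describe(output_dump, "Q23")
missing_predicates = {p for s, p, o in missing}
```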
          <p>From Table 4 it is clear that the numbers of triples returned by the DESCRIBE queries
in the 2020 and 2015 dumps are not equal. This difference prompted us
to explore the differences using the compare module of the RDFlib library.
It was found that in the case of the 2015 dump, the input dump contains
predicates such as &lt;http://www.w3.org/2004/02/skos/core#prefLabel&gt; and
&lt;http://schema.org/name&gt;, which are not extracted by WDumper. Table 5
shows the details and total numbers of predicates that are in the input dump
(the 2015 dump) for the selected entities but that WDumper could not fetch. As we can
see, the total column is exactly the difference between the DESCRIBE query counts. In
the case of the 2020 dump, some predicates with the &lt;http://schema.org/&gt;
prefix, such as dateModified, and all &lt;http://www.wikidata.
org/prop/direct-normalized/&gt; predicates are not detectable by WDumper.
However, in both dumps the statements whose predicate is a property of
Wikidata (e.g. P31, P106, etc.) were completely extracted by WDumper.</p>
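<p>The comparison done with RDFlib's compare module can be approximated with plain Python sets, treating each graph as a set of (subject, predicate, object) triples and reporting the predicates that occur only in the input. The triples below are invented for illustration, not taken from the actual dumps.</p>

```python
# Simplified stand-in for RDFlib's compare module: a graph is a set of
# triples, and the predicates missing from the output are those that
# appear only in (input - output). Sample data is illustrative.
def missing_predicates(input_triples, output_triples):
    only_in_input = set(input_triples) - set(output_triples)
    return sorted({p for (_s, p, _o) in only_in_input})

Q23 = "wd:Q23"
input_g = {
    (Q23, "wdt:P31", "wd:Q5"),
    (Q23, "skos:prefLabel", '"George Washington"@en'),
    (Q23, "schema:name", '"George Washington"@en'),
}
output_g = {
    (Q23, "wdt:P31", "wd:Q5"),
}
missing = missing_predicates(input_g, output_g)
# missing -> ['schema:name', 'skos:prefLabel']
```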
        </sec>
        <sec id="sec-4-4-4">
          <title>Validation of Condition 3</title>
          <p>[Table 6: the selected entities (Q23, Q355643, Q1094046, Q17487737) and properties (P26, P485, P355, P680), with the qualifier and reference counts of each statement in the WDumper output and the input.]</p>
          <p>Validation of Condition 3. To validate this condition, we selected an arbitrary
entity from each use case, and for this entity, we considered one of its statements.
We then counted the qualifiers and the references of this statement in the 2020 dump
(over the WDQS) and in the output of WDumper. Table 6 shows the selected
entity, selected property, and the number of qualifiers and references for them.
From Table 6, it is clear that WDumper can extract qualifiers and references
completely from the input.</p>
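<p>In Wikidata's RDF model, qualifiers and references hang off a statement node, so this count reduces to counting pq: predicates and prov:wasDerivedFrom links on that node. The following minimal sketch uses prefixed names in place of full IRIs, and the statement node and values are hypothetical.</p>

```python
# Sketch of the qualifier/reference count for one statement node:
# qualifiers are pq:-prefixed predicates, references are
# prov:wasDerivedFrom links. Data below is invented for illustration.
PQ = "pq:"
DERIVED = "prov:wasDerivedFrom"

def count_qualifiers_and_refs(statement_triples):
    quals = sum(1 for (_s, p, _o) in statement_triples if p.startswith(PQ))
    refs = sum(1 for (_s, p, _o) in statement_triples if p == DERIVED)
    return quals, refs

stmt = "wds:Q23-abc"  # hypothetical statement node for (Q23, P26)
triples = [
    (stmt, "ps:P26", "wd:Q191789"),                      # statement value
    (stmt, "pq:P580", '"1759-01-06"^^xsd:dateTime'),     # qualifier: start time
    (stmt, "prov:wasDerivedFrom", "wdref:ref1"),         # one reference node
]
quals, refs = count_qualifiers_and_refs(triples)
# quals, refs -> 1, 1
```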
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>Summarizing Results, Strengths and Weaknesses</title>
        <p>The results of our evaluations show that WDumper, as a custom dump
production tool, can be used to create some topical subsets. This tool can correctly and
completely extract the entities specified by its filters. It also extracts almost all
statements related to entities (it is not designed to extract some prefixes).
One of the features we have been looking for is the ability to extract references
and qualifiers of Wikidata statements, which WDumper can do. Setting up this
tool is not very complicated; the user only needs to select the filters on the
entities and statements and run the tool, and it extracts all of the information at once.
Its GUI is also somewhat helpful, while the JSON structure of its specification
files is simple and understandable.</p>
        <p>Limitations and Weaknesses. The most important weakness of WDumper with
regard to topical subsets is the limitation in the definition of entity filters. In
WDumper, entities can only be filtered based on the presence of a Px property
or on having the value v for a Px property. Although it is possible to deploy any
number of such filters, this is not enough to specify some kinds of use cases.
For example, suppose we want to specify the Scottish universities subset. By
reviewing some of these universities on the Wikidata website, we find that
their corresponding entities do not have any property that directly indicates
they belong to Scotland. Of course, we can define the R function of these subsets
through indirect methods (for example, considering the geo-location of the entities
of Scottish universities), but these types of filters are not available in WDumper.</p>
        <p>The recognition of type hierarchies is another limitation of WDumper. In
the case of UK universities, for example, the University of Edinburgh (Q160302)
is not among the universities extracted by WDumper. The reason is
that the instance of property (P31) of this university refers to public
university (Q875538) instead of university (Q3918). In SPARQL queries, such cases are
handled by property paths like wdt:P31/wdt:P279*. These property paths
are not available in WDumper. The strategy of adding more filters to cover
all subtypes needs a comprehensive knowledge of the Wikidata ontology, and it
will fail if the class hierarchy changes.</p>
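<p>The "more filters" workaround can be mechanized by computing the transitive closure of the subclass of property (P279) and emitting one WDumper filter per class found. A minimal sketch, assuming the hierarchy is available as a mapping from a class to its direct subclasses (the tiny hierarchy below is illustrative):</p>

```python
# Sketch: breadth-first traversal of P279 (subclass of) to collect every
# subclass of a root class, so that one filter per class can be emitted.
# The hierarchy mapping is a stand-in for data fetched from Wikidata.
from collections import deque

def subclass_closure(root, subclass_of):
    # subclass_of maps a class QID to the set of its direct subclasses
    seen = {root}
    queue = deque([root])
    while queue:
        cls = queue.popleft()
        for sub in subclass_of.get(cls, ()):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return seen

hierarchy = {
    "Q3918": {"Q875538"},        # university -> public university
    "Q875538": {"Q62078547"},    # -> public research university (illustrative)
}
classes = subclass_closure("Q3918", hierarchy)
# classes -> {'Q3918', 'Q875538', 'Q62078547'}
```

As the text notes, this precomputed filter list goes stale whenever the class hierarchy changes, so the closure would have to be recomputed before each extraction.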
        <p>Another limitation is the inability to connect different filters
in multi-filter cases. For example, in the Gene Wiki use case, we may want
diseases that are somehow related to a gene or protein, while in the WDumper
output there are diseases that have nothing to do with genes, proteins, or the other
Gene Wiki types. A further limitation is the inability to choose an output format
other than N-Triples; in particular, the Wikibase JSON format would be more
suitable for using the produced subset in a Wikibase instance and also has a
smaller volume.</p>
        <p>The main implication of these limitations is the reduction of the flexibility of
subset extraction with this tool. With these weaknesses, users have to spend
much more time defining the desired subset.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we reviewed the issue of building topical subsets over Wikidata.
Our motivation for topical subsets is to enable efficient evaluation of complex
queries over the knowledge graph with lower costs, reproducibility of experiments
through archiving datasets, ease of use, and flexibility. We provided example use
cases for topical subsets as well as a definition of topical subsets. This definition
enables us to evaluate and compare subset creation tools.</p>
      <p>In this study we used WDumper for topical subset extraction over Wikidata
and tested it by measuring run-time and output volume on four different use
cases. We evaluated the correctness of the subsets generated by WDumper by
comparing the answers to queries over the subsets and the full knowledge graph.
Our experience shows that WDumper can be used to generate topical
subsets of Wikidata in some use cases but not all. WDumper can
extract the entities specified by its filters and most statements related
to those entities; it also fetches the statement nodes and references/qualifiers.
However, WDumper has some weaknesses regarding topical subsets. Its main
problem is the way it defines filters on entities, which reduces the power of this
tool to build topical subsets. The most tangible issue is the inability to define
and fetch subclasses of a class of entities, which is important in many use
cases. Our suggestion for future work is to explore alternative subsetting
approaches such as SPARQL queries or Shape Expressions. With selectors
like SPARQL queries or ShEx schemata, we can increase the expressivity of
subset creation. This would also allow subsets to be created on knowledge graphs
other than Wikidata.</p>
      <p>Acknowledgement. We would like to acknowledge the fruitful discussions with
the participants of project 35 of the BioHackathon-Europe 2020; Dan Brickley,</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Wdumper - a tool to create custom Wikidata RDF dumps, https://tools.wmflabs.org/wdumps/, GitHub repository: https://github.com/bennofs/wdumper</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Wikidata:WikiProject Schemas/Subsetting - Wikidata, https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas/Subsetting, accessed 2020-12-31</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Auer, S., Bizer, C., Kobilarov, G., et al.: DBpedia: A Nucleus for a Web of Open Data. In: ISWC (2007). https://doi.org/10.1007/978-3-540-76298-0_52</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Beghaeiraveri, S.A.H.: Wikidata dump 27-04-2015 fixed syntax errors (Feb 2021). https://doi.org/10.5281/zenodo.4534445</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Beghaeiraveri, S.A.H.: Wikidata Subsets and Specification Files Created by WDumper (Feb 2021). https://doi.org/10.5281/zenodo.4495855</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Bollacker, K., Tufts, P., Pierce, T., Cook, R.: A platform for scalable, collaborative, structured information integration. In: IIWeb'07 (2007)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Gayo, J.E.L., Ammar, A., Brickley, D., et al.: Knowledge graphs and Wikidata subsetting (2021). https://doi.org/10.37044/osf.io/wu9et</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Gayo, J.E.L., Prud'hommeaux, E., Boneva, I., Kontokostas, D.: Validating RDF Data, vol. 7. Morgan &amp; Claypool Publishers (2017)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Guha, R.V., Brickley, D., Macbeth, S.: Schema.org: evolution of structured data on the web. Communications of the ACM 59(2), 44-51 (2016)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Hogan, A., Blomqvist, E., Cochez, M., et al.: Knowledge Graphs. arXiv:2003.02320 [cs] (2020), http://arxiv.org/abs/2003.02320</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Khrouf, H., Troncy, R.: EventMedia: A LOD dataset of events illustrated with media (2012)</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Lopez-Pellicer, F.J., et al.: Geo Linked Data. In: DEXA (2010)</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Matsumoto, S., Yamanaka, R., Chiba, H.: Mapping RDF graphs to property graphs. arXiv preprint arXiv:1812.01801 (2018)</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Mimouni, N., Moissinac, J.C., Tuan, A.: Domain specific knowledge graph embedding for analogical link discovery. Advances in Intelligent Systems (2020)</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Tanon, T.P., Weikum, G., Suchanek, F.: YAGO 4: A Reason-able Knowledge Base. In: European Semantic Web Conference, pp. 583-596. Springer (2020)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. UniProt Consortium: UniProt: a hub for protein information. NAR 43(D1) (2015)</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. CACM 57(10), 78-85 (2014). https://doi.org/10.1145/2629489</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Waagmeester, A., et al.: Wikidata as a knowledge graph for the life sciences. eLife 9 (2020). https://doi.org/10.7554/eLife.52614</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>