<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3447772</article-id>
      <title-group>
        <article-title>WShEx: A language to describe and validate Wikibase entities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jose Emilio Labra-Gayo</string-name>
          <email>labra@uniovi.es</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>54</volume>
      <issue>2021</issue>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>Wikidata is one of the most successful Semantic Web projects. Its underlying Wikibase data model departs from RDF with the inclusion of several features like qualifiers and references, built-in datatypes, etc. Those features are serialized to RDF for content negotiation, RDF dumps and in the SPARQL endpoint. Wikidata adopted the entity schemas namespace using the ShEx language to describe and validate the RDF serialization of Wikidata entities. In this paper we propose WShEx, a language inspired by ShEx that directly supports the Wikibase data model and can be used to describe and validate Wikibase entities. The paper presents the abstract syntax and formal semantics of the WShEx language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>graphs.</p>
      <p>Wikibase was initially created as a set of MediaWiki extensions which facilitated adoption by</p>
      <sec id="sec-1-1">
        <p>Wikidata query service. In 2019, Wikidata created a new namespace for entity schemas, which can be used to describe and validate the RDF serialization of Wikidata entities using ShEx [2]. Entity schemas offer a concise language to describe Wikibase entities. Users can create new schemas for different purposes, and there are ShEx-based tools that can be used to check whether entities conform to entity schemas or to visualize entity schemas. At the time of this writing more than 370 entity schemas have been created (footnote 3), but there is no evidence of their general adoption as part of the mainstream workflow employed by Wikidata users. Although there may be several reasons for this, such as the lack of better tool support, one aspect that can also affect this situation is that entity schemas describe the RDF serialization of entities instead of their underlying Wikibase data model. This makes entity schemas more verbose and hampers their usability. In this paper, we propose a new language called WShEx, which is inspired by ShEx and can be used to directly describe and validate entities based on the Wikibase data model.</p>
        <p>Wikidata’22: Wikidata workshop at ISWC 2022
ORCID: 0000-0001-8907-5348 (J. E. Labra-Gayo)</p>
        <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073</p>
      </sec>
      <sec id="sec-1-2">
        <p>The first motivation for the development of WShEx was to create subsets of Wikidata in different domains using a concise and human-readable language. Entity schemas describe the RDF serialization, so they could not be used directly to process JSON-based Wikidata dumps. We therefore developed WShEx, a language similar to ShEx that can describe Wikibase data models directly. Some parts of this paper have been extracted from [3], a larger paper where we also describe the different subsetting techniques employed.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Wikibase data model</title>
      <p>The Wikibase data model (footnote 4) is an abstract data model that can have different serializations, such as JSON and RDF. It is defined using UML data structures and a notation called Wikidata Object Notation.</p>
      <sec id="sec-2-2">
        <p>Informally, the data model is formed from entities and statements about those entities. There are two main types of entities: items and properties (footnote 5). An item is identified by a Q followed by a number and can represent anything, such as an abstract or concrete concept. For example, Q80 represents Tim Berners-Lee in Wikidata. A property is identified by a P followed by a number and represents a relationship between an item and a value. For example, P19 represents the property place of birth in Wikidata. The values that can be associated with a property are constrained to belong to a specific datatype. There can be compound datatypes, such as geographical coordinates. Some of the Wikibase datatypes are: quantities, dates and times, geographic locations and shapes, and monolingual and multilingual texts.</p>
        <p>A statement consists of:
• A property, which is usually denoted using a P followed by a number.
• A declaration about the possible value (in Wikibase terms, a snak), which can be a specific value, a no value declaration or a some value declaration.
• A rank declaration, which can be either preferred, normal or deprecated.
• Zero or more qualifiers, which consist of a list of property-value pairs.
• Zero or more references, which consist of a list of property-value pairs.</p>
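        <p>The structure of a statement listed above can be sketched in a few lines of Python. This is only an illustrative model, not part of Wikibase or of the WShEx implementation: the class and field names (Snak, Statement, main_snak and so on) are hypothetical, values are simplified to plain strings, and the only names taken from Wikibase are the snak type strings and the rank names.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SnakType(Enum):
    VALUE = "value"            # a specific value
    NO_VALUE = "novalue"       # explicit declaration that there is no value
    SOME_VALUE = "somevalue"   # a value exists but is unknown


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Snak:
    """A property paired with a declaration about its value (hypothetical model)."""
    property_id: str
    snak_type: SnakType = SnakType.VALUE
    value: Optional[str] = None  # simplification: every value is a plain string


@dataclass
class Statement:
    """Main snak, rank, qualifiers and references (hypothetical model)."""
    main_snak: Snak
    rank: Rank = Rank.NORMAL
    qualifiers: List[Snak] = field(default_factory=list)
    references: List[List[Snak]] = field(default_factory=list)


# Employer (P108) = CERN (Q42944), qualified with start time (P580)
# and end time (P582), following the running example of the paper.
employer = Statement(
    main_snak=Snak("P108", value="Q42944"),
    qualifiers=[Snak("P580", value="1984"), Snak("P582", value="1994")],
)
```
]]></preformat>
        <p>Keeping qualifiers and references as plain lists of property-value pairs mirrors the informal description; a fuller model would replace the string values with the Wikibase datatypes.</p>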
      </sec>
      <sec id="sec-2-3">
        <p>3 A directory for entity schemas can be seen at https://www.wikidata.org/wiki/Wikidata:Database_reports/EntitySchema_directory</p>
      </sec>
      <sec id="sec-2-4">
        <p>4 https://www.mediawiki.org/wiki/Wikibase/DataModel</p>
      </sec>
      <sec id="sec-2-5">
        <p>5 There are other types of entities, such as Lexemes, which we omit for brevity.</p>
        <p>Example 2 (RDF serialization of a node). As an example, a fragment of the information about Tim Berners-Lee that declares that he is an instance of Human, has birth place London, has birth date 1955 and has employer with value CERN between 1984 and 1994 is represented in RDF (Turtle, footnote 8) as:</p>
        <preformat>wd:Q80 rdf:type wikibase:Item ;
  wdt:P31  wd:Q5 ;                                   # instance of = Human
  wdt:P19  wd:Q84 ;                                  # birthplace = London
  wdt:P569 "1955-06-08T00:00:00Z"^^xsd:dateTime ;    # birthDate
  wdt:P108 wd:Q42944 ;                               # employer = CERN
  p:P108   :Q80-4fe7940f .</preformat>
        <p>7 https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format</p>
      </sec>
      <sec id="sec-2-6">
        <p>8 The full Turtle serialization can be obtained at: https://www.wikidata.org/wiki/Special:EntityData/Q80.ttl</p>
        <p>The RDF serialization uses a direct arc, denoted by the prefix alias wdt:, to represent the preferred statement, leaving the rest of the values of a property accessible through the namespaces p:, ps: and pq:. The reification model employed by Wikidata creates auxiliary nodes that represent each statement. In the previous example, the node :Q80-4fe7940f represents the statement, which can be qualified with the start and end time.</p>
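        <p>The reification pattern just described can be illustrated with a short Python sketch that lists the triples generated for a single statement. The function name statement_triples and its parameters are hypothetical, triples are kept as prefixed-name strings rather than real RDF terms, qualifier values are simplified to plain year literals, and the sketch assumes the statement is the one exposed through the direct wdt: arc.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def statement_triples(subject, prop, value, statement_id, qualifiers):
    """Return (subject, predicate, object) strings for one reified statement.

    Assumption: the statement is the one exposed through the direct
    wdt: arc, so that arc is always emitted here.
    """
    node = f":{statement_id}"  # auxiliary node that represents the statement
    triples = [
        (f"wd:{subject}", f"wdt:{prop}", value),  # direct arc to the value
        (f"wd:{subject}", f"p:{prop}", node),     # entity to statement node
        (node, f"ps:{prop}", value),              # statement node to value
    ]
    for q_prop, q_value in qualifiers:
        triples.append((node, f"pq:{q_prop}", q_value))  # qualifier arcs
    return triples


triples = statement_triples(
    "Q80", "P108", "wd:Q42944", "Q80-4fe7940f",
    [("P580", '"1984"'), ("P582", '"1994"')],  # simplified literals
)
for s, p, o in triples:
    print(s, p, o, ".")
```
]]></preformat>
        <p>A single employer statement with two qualifiers already needs five triples and repeats the property P108 under three different prefixes, which is the duplication that entity schemas in ShEx have to describe.</p>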
        <p>The RDF serialization model is employed in Wikidata to follow the linked data principles, which make it possible to separate the logical URI of a concept from its representation in different formats like HTML, JSON or RDF. It is also employed by the Wikidata Query Service, which allows users to retrieve data using the SPARQL endpoint [6, 7], and by RDF-based Wikidata dumps.</p>
        <p>Wikidata adopted entity schemas using ShEx in a new namespace (schema entities start with the letter E followed by a number). As an example, listing 1 presents an entity schema for researcher entities (footnote 9). Lines 1–6 contain prefix declarations following the Turtle tradition. Lines 8–24 declare a &lt;Researcher&gt; shape, which in this case has 7 triple constraints. The first constraint (line 9) states that items that conform to &lt;Researcher&gt; must be instances of Human. The next line declares that the values of birth place (wdt:P19) must conform to the shape &lt;Place&gt; declared in line 25. Line 11 declares that the values of property wdt:P569 must belong to the datatype xsd:dateTime. The question mark indicates that they are optional. Line 12 declares that the values of property wdt:P108 (employer) must conform to the shape &lt;Organization&gt;, which is defined in line 28 (it is empty in this case). The star at the end indicates that there can be zero or more statements about wdt:P108. Lines 13–17 declare the constraints on the qualifiers, in this case that the pq:P580 (start time) and pq:P582 (end time) qualifiers are both optional. Notice that these declarations about qualifiers mirror the RDF serialization model, which requires the property to be repeated as p:P108 and ps:P108 for the statement. Lines 18–23 follow a similar pattern.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. WShEx</title>
      <p>As we have seen in the previous section, entity schemas in ShEx require users to be aware of how qualifiers and references are serialized in RDF, which leads to duplication and makes the definitions more verbose than necessary. Another problem of ShEx schemas is that they cannot be used to directly describe the contents of Wikidata dumps in JSON, which follow the Wikibase data model. In order to solve those issues, we propose a variant of ShEx called WShEx that can be used to describe and validate the Wikibase data model. In this way, WShEx can also be used to validate Wikibase dumps in JSON without requiring them to be serialized in RDF. Figure 2 represents the relationship between ShEx and WShEx.</p>
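      <p>To make the idea of checking JSON directly more concrete, the following Python sketch inspects a small entity document without any RDF step. It is not the WShEx matcher: the helper names item_values and looks_like_researcher are hypothetical, the embedded document is a heavily trimmed illustration of the Wikibase JSON layout (claims, mainsnak, qualifiers), and the check itself (instance of Human, employer statements qualified only by start and end time) is an assumed simplification of the researcher schema.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import json

# Trimmed, illustrative entity document in the Wikibase JSON layout.
ENTITY_JSON = """
{
  "id": "Q80",
  "claims": {
    "P31": [{
      "mainsnak": {"snaktype": "value", "property": "P31",
                   "datavalue": {"type": "wikibase-entityid",
                                 "value": {"entity-type": "item", "id": "Q5"}}},
      "rank": "normal"
    }],
    "P108": [{
      "mainsnak": {"snaktype": "value", "property": "P108",
                   "datavalue": {"type": "wikibase-entityid",
                                 "value": {"entity-type": "item", "id": "Q42944"}}},
      "rank": "normal",
      "qualifiers": {"P580": [{"snaktype": "value", "property": "P580"}],
                     "P582": [{"snaktype": "value", "property": "P582"}]}
    }]
  }
}
"""


def item_values(entity, prop):
    """Item ids that appear as values of prop in the entity's statements."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement["mainsnak"]
        if (snak["snaktype"] == "value"
                and snak["datavalue"]["type"] == "wikibase-entityid"):
            ids.append(snak["datavalue"]["value"]["id"])
    return ids


def looks_like_researcher(entity):
    """Hypothetical check: instance of Human (Q5), and every employer
    (P108) statement uses no qualifiers other than P580 and P582."""
    if "Q5" not in item_values(entity, "P31"):
        return False
    allowed = {"P580", "P582"}
    return all(set(st.get("qualifiers", {})) <= allowed
               for st in entity.get("claims", {}).get("P108", []))


entity = json.loads(ENTITY_JSON)
print(looks_like_researcher(entity))
```
]]></preformat>
      <p>The property P108 is named once per constraint, because qualifiers are read from the statement itself instead of from separate p:, ps: and pq: arcs.</p>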
      <sec id="sec-3-1">
        <p>9 The entity schema has also been created in Wikidata as E371.</p>
      </sec>
      <sec id="sec-3-2">
        <p>In the following sections we present a simplified formal definition of WShEx adapted to Wikibase graphs as defined in 1, by presenting an abstract syntax followed by its semantic specification (footnote 10).</p>
        <sec id="sec-3-2-1">
          <title>3.1. WShEx Abstract Syntax</title>
          <p>A WShEx schema is defined as a tuple ⟨L, δ⟩, where L is a set of shape labels and δ ∶ L → S is a total function from labels to w-shape expressions. The set of shape expressions se ∈ S is defined using the following abstract syntax:</p>
          <preformat>se ::= cond          Basic boolean condition on nodes (node constraint)
    |  s             Shape
    |  se1 AND se2   Conjunction
    |  @l            Shape label reference for l ∈ L
s  ::= CLOSED s′     Closed shape
    |  s′            Open shape
s′ ::= { te }        Shape definition</preformat>
          <p>Triple expressions te include te1 ; te2, te1 ∣ te2 and te∗. A qualifier specifier qs has the form ⌊ps⌋ or ⌈ps⌉, and property specifiers ps include ps , ps, ps ∣ ps and ps∗.</p>
          <p>10 The specification of WShEx is published at https://www.weso.es/WShEx/</p>
        </sec>
      </sec>
      <sec id="sec-3-8">
        <title>3.2. Semantics</title>
        <p>In order to define the semantics of WShEx we employ a conformance relation parameterized by a shape assignment, G, n, τ ⊨ se, meaning that node n in graph G conforms to shape expression se with shape assignment τ according to the rules in table 1.</p>
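        <p>We also define a conformance relation G, ts, τ ⊩ te, which declares that the triples ts in graph G conform to the triple expression te with the shape assignment τ, using the rules in table 2, which take into account qualifier specifiers.</p>
        <p>Table 1 (node constraint rule):</p>
        <disp-formula><tex-math><![CDATA[
\frac{\mathit{cond}(n) = \mathit{true}}{G, n, \tau \vDash \mathit{cond}}
]]></tex-math></disp-formula>
        <p>Table 2 (rules for te1 ; te2, te1 ∣ te2 and te∗), where part(ts) denotes the pairs (ts1, ts2) that split ts into two disjoint subsets:</p>
        <disp-formula><tex-math><![CDATA[
\frac{(ts_1, ts_2) \in \mathit{part}(ts) \quad G, ts_1, \tau \Vdash te_1 \quad G, ts_2, \tau \Vdash te_2}{G, ts, \tau \Vdash te_1 ; te_2}
]]></tex-math></disp-formula>
        <disp-formula><tex-math><![CDATA[
\frac{G, ts, \tau \Vdash te_1}{G, ts, \tau \Vdash te_1 \mid te_2}
\qquad
\frac{G, ts, \tau \Vdash te_2}{G, ts, \tau \Vdash te_1 \mid te_2}
]]></tex-math></disp-formula>
        <disp-formula><tex-math><![CDATA[
\frac{(ts_1, ts_2) \in \mathit{part}(ts) \quad G, ts_1, \tau \Vdash te \quad G, ts_2, \tau \Vdash te^{*}}{G, ts, \tau \Vdash te^{*}}
]]></tex-math></disp-formula>
        <p>The conformance relation G, pvs, τ ⊢ qs between a set pvs of property-value elements, a shape assignment τ and a qualifier specifier qs is defined with the rules presented in table 3.</p>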
        <sec id="sec-3-8-1">
          <title>3.3. Implementation and use cases</title>
          <p>WShEx is currently implemented as a module (footnote 13) inside the ShEx-s project (footnote 14), a Scala implementation of ShEx.</p>
          <p>The implementation includes a matcher for Entity documents which can be used to validate
Wikidata dumps in JSON format. In fact, the initial motivation for WShEx was the possibility
to validate JSON dumps instead of RDF dumps to create Wikidata subsets [3].</p>
        </sec>
      </sec>
      <sec id="sec-3-9">
        <p>One practical aspect that appeared is the need for a converter between entity schemas defined in ShEx and WShEx. We have already implemented a first version of this converter. This approach makes it possible to leverage the existing entity schemas, which are defined in ShEx, convert them to WShEx and use them to process Wikibase JSON dumps.</p>
        <p>13 https://github.com/weso/shex-s/tree/master/modules/wshex</p>
        <p>Table 3: Inference rules for qualifier specifiers (⌊ps⌋, ⌈ps⌉, ps1 , ps2, ps1 ∣ ps2 and ps∗).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Related work</title>
      <p>Our definition of Wikibase graphs was inspired by the formal definitions used for knowledge graphs in books like [8], which define two main data models: directed labeled graphs, which resemble RDF-based graphs, and property graphs. We were also inspired by MARS (Multi-Attributed Relational Structures) [4], which presents a generalized notion of property graphs adapted to Wikidata. That paper also defines MAPL (Multi-Attributed Predicate Logic) as a logical formalism that can be used for ontological reasoning.</p>
      <p>Since the appearance of ShEx in 2014, there has been a lot of interest in RDF validation and description. In 2017, the data shapes working group proposed SHACL (Shapes Constraint Language) as a W3C recommendation [9]. Although SHACL can be used to describe RDF, its main purpose is to validate and check constraints on RDF data. ShEx was adopted by Wikidata in 2019 to define entity schemas [10]. We consider that ShEx is better suited to describing data models than SHACL, which is more focused on constraint violations. A comparison between both is provided in [11], while [12] defines a simple language that can be used as a common subset of both.</p>
      <p>Improving the quality of knowledge graphs in general, and of Wikidata in particular, has been the focus of research such as [13, 14, 15].</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>WShEx can be used to describe and validate Wikibase graphs using an extension of Shape Expressions that also handles qualifiers. We consider that WShEx schemas are more succinct and adapt better to the Wikibase data model. The language has been partially implemented and is being used to create Wikidata subsets from JSON dumps. Future work is still necessary to finish the implementation, including a full grammar for the compact syntax and more complete support for other Wikibase constructs such as labels, descriptions, aliases, other built-in datatypes and ranks, and to implement a full validator for Wikibase graphs based on WShEx.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the Project ANGLIRU: Applying kNowledge Graphs to research data interoperabiLIty and ReUsability, code: PID2020-117912RB, and by the Alfred P. Sloan Foundation under grant number G-2021-17106 for the development of Scholia.</p>
      <sec id="sec-6-1">
        <title>References</title>
        <p>[1] S. Malyshev, M. Krötzsch, L. González, J. Gonsior, A. Bielefeldt, Getting the most out of Wikidata: Semantic technology usage in Wikipedia’s knowledge graph, in: The Semantic Web - ISWC 2018 - 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12, Springer, 2018, pp. 376–394.</p>
        <p>[2] E. Prud’hommeaux, J. E. Labra Gayo, H. Solbrig, Shape Expressions: An RDF Validation and Transformation Language, in: H. Sack, A. Filipowska, J. Lehmann, S. Hellmann (Eds.), Proceedings of the 10th International Conference on Semantic Systems, SEMANTICS 2014, Leipzig, Germany, September 4-5, 2014, ACM Press, 2014, pp. 32–40. doi:10.1145/2660517.2660523.</p>
        <p>[3] J. E. Labra Gayo, Creating knowledge graphs subsets using shape expressions, 2021. URL: https://arxiv.org/abs/2110.11709. doi:10.48550/ARXIV.2110.11709.</p>
        <p>[4] M. Marx, M. Krötzsch, V. Thost, Logic on MARS: ontologies for generalised property graphs, in: C. Sierra (Ed.), Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17), International Joint Conferences on Artificial Intelligence, 2017, pp. 1188–1194. doi:10.24963/ijcai.2017/165.</p>
        <p>[5] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, D. Vrandečić, Introducing Wikidata to the Linked Data Web, in: P. Mika, T. Tudorache, A. Bernstein, C. Welty, C. A. Knoblock, D. Vrandečić, P. T. Groth, N. F. Noy, K. Janowicz, C. A. Goble (Eds.), Proceedings of the 13th International Semantic Web Conference (ISWC 2014), volume 8796 of LNCS, Springer, 2014, pp. 50–65. doi:10.1007/978-3-319-11964-9_4.</p>
        <p>[6] A. Bielefeldt, J. Gonsior, M. Krötzsch, Practical linked data access via SPARQL: the case of Wikidata, in: T. Berners-Lee, S. Capadisli, S. Dietze, A. Hogan, K. Janowicz, J. Lehmann (Eds.), Proceedings of the WWW2018 Workshop on Linked Data on the Web (LDOW-18), volume 2073 of CEUR Workshop Proceedings, CEUR-WS.org, 2018.</p>
        <p>[7] S. Malyshev, M. Krötzsch, L. González, J. Gonsior, A. Bielefeldt, Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph, in: D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.-A. Kaffee, E. Simperl (Eds.), Proceedings of the 17th International Semantic Web Conference (ISWC’18), volume 11137 of LNCS, Springer, 2018, pp. 376–394.</p>
        <p>[8] A. Hogan, E. Blomqvist, M. Cochez, C. d’Amato, G. de Melo, C. Gutierrez, S. Kirrane, J. E. Labra Gayo, R. Navigli, S. Neumaier, A.-C. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. Sequeda, S. Staab, A. Zimmermann, Knowledge Graphs, ACM Computing Surveys 54 (2021). doi:10.1145/3447772.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>