<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Validating Danish Wikidata lexemes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Finn Årup Nielsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katherine Thornton</string-name>
          <email>katherine.thornton@yale.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Emilio Labra Gayo</string-name>
          <email>labra@uniovi.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cognitive Systems, DTU Compute, Technical University of Denmark</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Oviedo</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Yale University Library</institution>
          ,
          <addr-line>New Haven, CT</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Two of the newest features of Wikidata are support for lexicographic data (lexemes) and support for Shape Expressions (ShEx). We demonstrate the first application of ShEx for validation of entity data for Wikidata lexemes. Validation of entity data in Wikidata against ShEx schemas allows editors to discover missing or incorrect information. It may also form a basis for discussion of the data models implicitly used in Wikidata. We present a use case and benchmark for ShEx and discuss its current limitations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Since 2018, it has been possible to represent lexeme data with associated
information about forms and senses in Wikidata [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Statistics from June 2019 indicate
that more than 47,000 lexemes for more than 330 human languages have
already been added to Wikidata. For Danish alone, there are entries for over
1,700 lexemes, including nouns, verbs, adjectives, adverbs and words from
several other word classes.4 These lexemes can be described by properties specifying
forms, senses, language, lexical categories, grammatical features, hyphenation,
etc. We may also link lexemes to external linguistic resources, e.g., DanNet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a wordnet for Danish. Words found in version 2.2 of DanNet have an associated
Wikidata property. Thus, each lexeme has associated structured data suitable
for machine consumption.
      </p>
      <p>
        ShEx (Shape Expressions) is a concise, formal language for modeling and
validating RDF graphs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The ShEx compact syntax allows users to rapidly
write schemas to capture data models as they evolve. ShEx is actively being
used to validate data in Wikidata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In May 2019, Wikidata enabled a new
namespace, which allows Wikidata editors to store and collaboratively edit ShEx
schemas. Editors can subsequently use these schemas for validation of Wikidata
items.
4 Updated statistics are available from the Ordia tool: https://tools.wmflabs.org/ordia/statistics. Language statistics are available at https://tools.wmflabs.org/ordia/language/.
      </p>
      <p>Here we describe our initial experiences with using ShEx for validation of
linguistic data on Wikidata, focusing on Danish Wikidata lexemes. This is a
convenient method for a user to quickly gain an overview of the current status
of the data with regard to a specific schema, entirely using tools in the Wikidata
ecosystem. Apart from discovering missing or incorrect data, we identify cases
where ShEx pinpoints issues for discussion about a 'data model' for the Danish
lexicographic data, as well as cases where ShEx validation is difficult.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Numerous mechanisms and tools are available that allow Wikidata users to
validate entered lexeme data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Some of the mechanisms that work across Wikidata
entities are: 1) Datatype restrictions per property: The definition of a property
specifies which datatype is allowed in the value field, such that, e.g., free-form
text (a literal) cannot be used where a Wikidata item (an IRI) is required. 2)
Literal value restrictions: Wikidata also sets up constraints on literal data values
via regular expressions defined in the P1793 Wikidata property. 3) Property
constraints: Each property may also be associated with constraints that provide
hints to editors about what values are expected, and what values are not. 4)
Identifying patterns using SPARQL: With the SPARQL-based Wikidata Query
Service (WDQS), users can formulate queries that reveal inconsistencies in the
lexeme data, e.g., 'find every lexeme without any form' as suggested on the Ideas
of queries wiki page.5
      </p>
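      <p>The 'lexeme without any form' pattern above can be phrased as a WDQS query. A minimal sketch, assuming Wikidata's RDF mapping of lexemes to the ontolex vocabulary; the Python part only builds the request URL with the standard library:</p>

```python
from urllib.parse import urlencode

# SPARQL pattern for 'find every lexeme without any form'.
# Wikidata's RDF model exposes lexemes via the ontolex vocabulary.
QUERY = """SELECT ?lexeme WHERE {
  ?lexeme a ontolex:LexicalEntry .
  FILTER NOT EXISTS { ?lexeme ontolex:lexicalForm ?form . }
} LIMIT 100"""

# Build a GET request URL for the Wikidata Query Service (WDQS).
url = "https://query.wikidata.org/sparql?" + urlencode(
    {"query": QUERY, "format": "json"}
)
print(url)
```

The resulting URL can be fetched with any HTTP client; the JSON response lists the non-conforming lexemes.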
      <p>
        Scholia, a SPARQL-based web application and Wikidata frontend focusing
on scholarly data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], dynamically creates so-called subaspects with information
about (possibly) missing data in Wikidata. For instance, the `missing' subaspect
for an author displays author name strings that could be resolved to Wikidata
items and authored publications that are missing specification of one or more
main subjects.
      </p>
      <p>
        Ordia is a web application for Wikidata lexemes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Currently it has no
validation features, but by browsing its various dynamically generated web pages a user
may discover unusual patterns. For example, the web page https://tools.wmflabs.org/ordia/lexical-category/ displays the values used as lexical categories for
lexemes, where rare categories can contain errors. An example is L45350, which
has the lexical category centralized version control system, an obvious error.
      </p>
      <p>
        In addition to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], ShEx and Wikidata have also been described in connection
with schema inference [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The Wikidata ShEx Inference tool can be used to
automatically suggest a schema for a type of resource based on properties that
have been used on similar items in Wikidata.
      </p>
      <p>This is the first application of ShEx validation to the lexeme namespace in
Wikidata. Validation using ShEx provides more comprehensive validation than
has previously been possible with Wikidata tools. Datatype restrictions and
literal value restrictions can be expressed in ShEx and identified alongside property
constraints for multiple properties at the same time. This offers more complete
information, by allowing users to test situations that previously could only be
approached piece by piece according to the above rules. The constraint system
is useful for communicating expectations and potential issues to users on a
per-property basis. Validating entity data against a schema describing a data shape
can span multiple properties.
5 https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Ideas_of_queries.</p>
    </sec>
    <sec id="sec-3">
      <title>Validation of Danish lexemes</title>
      <p>We wrote ShEx schemas for Danish lexemes with the identifiers E15 (Danish
lexeme), E34 (Danish noun), E54 (lexeme), E56 (Danish verb), E62 (Danish
pronoun) and E65 (Danish numeral), as well as a ShEx schema for Danish hyphenation,
E68.6</p>
      <p>
        Many rules exist for Danish words and grammar, see, e.g., [
        <xref ref-type="bibr" rid="ref1 ref3">1,3</xref>
        ], and we
coded several in ShEx with respect to forms, senses, hyphenation, conjugation class, DanNet
identifier, grammatical gender, etc. Here we will discuss a few:
1. All Danish Wikidata lexemes should have one unique value for DanNet
words: either one unique identifier or no value. Proper Danish nouns,
adverbs, pronouns and words from a number of other word classes should
not have an associated DanNet identifier.
2. A Danish noun should have one single grammatical gender, either common
gender or neuter.
3. Each part of a hyphenated representation should contain a vowel.
4. A compound should have the same grammatical gender as the final
lexeme of the compound, except for compounds suffixed -fuld.
      </p>
      <p>
        For the first rule, related to DanNet identifiers, Wikidata has the ability to
explicitly state 'no value', and ShEx can test for the presence of this 'no value'
in the Wikidata property for DanNet (P6140) with "a [ wdno:P6140 ]". In
the cases where the DanNet identifier is present with a single value, we can
test its format with a regular expression. Current DanNet identifiers conform
to a format with 8 digits. The combined ShEx constraint for Wikidata DanNet
statements then reads "a [ wdno:P6140 ] | ps:P6140 /^[0-9]{8}$/". A test
for the uniqueness of DanNet identifiers is not currently possible in ShEx, but
it is possible that this feature will be added [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
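      <p>The 8-digit format constraint can also be checked outside ShEx. A minimal sketch in Python of the same regular expression (the helper name is ours):</p>

```python
import re

# The same 8-digit format constraint as in the ShEx regular expression
# /^[0-9]{8}$/ for ps:P6140 (DanNet 2.2 word identifier).
DANNET_ID = re.compile(r"^[0-9]{8}$")

def valid_dannet_id(value: str) -> bool:
    """Return True if value conforms to the current DanNet identifier format."""
    return DANNET_ID.fullmatch(value) is not None

print(valid_dannet_id("12345678"))  # an 8-digit string conforms
```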
      <p>The second rule, allowing the two grammatical genders, is represented by the
following ShEx expression: wdt:P5185 [ wd:Q1305037 wd:Q1775461 ]. Danish does not
reveal grammatical gender in the plural, so a plurale tantum (a word with only plural
forms) and its derived lexemes have no obvious gender, and no gender is specified
in Wikidata. The same lack of obvious gender occurs for a word such as druk
and for proper nouns. Either we need to define the gender or modify the ShEx
schema to accommodate the exceptions. We also note that a few nouns have
multiple genders, and these nouns also require exceptions.
6 The Wikidata pages with editing capabilities are at, e.g., https://www.wikidata.org/wiki/EntitySchema:E15, while the URL for ShEx import is, e.g., https://www.wikidata.org/wiki/Special:EntitySchemaText/E68.</p>
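      <p>The value-set constraint for grammatical gender can be mirrored in ordinary code. A minimal sketch, using the two QIDs from the ShEx value set above; the helper name is ours:</p>

```python
# Value set from the schema: wdt:P5185 [ wd:Q1305037 wd:Q1775461 ],
# i.e. the two Danish grammatical genders (common gender and neuter).
DANISH_GENDERS = {"Q1305037", "Q1775461"}

def valid_noun_gender(values):
    """A Danish noun should carry exactly one of the two genders.

    Plurale tantum, proper nouns and the few multi-gender nouns are the
    exceptions discussed in the text and would need separate handling.
    """
    return len(values) == 1 and values[0] in DANISH_GENDERS
```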
      <p>For the third rule, we note that a Danish hyphenation rule can apply across
most, if not all, word classes; thus we can write a general schema for hyphenation
and use ShEx's IMPORT feature to include shapes of that schema as part of another
schema. The essential part in the current Danish hyphenation schema (E68) is:
"a [ wdno:P5279 ] | ps:P5279 /.*[aeiouyæøå].*\u2027.*[aeiouyæøå].*/ ;". Here wdno:P5279 catches Wikidata's 'no value' for forms without
possible hyphenation. The regular expression uses the hyphenation point Unicode
character (U+2027). It could be further refined and extended to accommodate accented
vowels and words with multiple possible hyphenation points.</p>
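      <p>Rule 3 ('each part of a hyphenated representation should contain a vowel') is easier to state by splitting on the hyphenation point than as a single regular expression. A minimal sketch; the function name is ours, and accented vowels are deliberately left out, mirroring the E68 schema's current limitation:</p>

```python
# Danish vowels (accented vowels not yet covered); U+2027 is the
# hyphenation point character used in Wikidata hyphenation statements.
VOWELS = set("aeiouyæøå")
HYPHENATION_POINT = "\u2027"

def valid_hyphenation(hyphenated: str) -> bool:
    """Every part between hyphenation points must contain a vowel."""
    parts = hyphenated.split(HYPHENATION_POINT)
    return all(any(ch in VOWELS for ch in part.lower()) for part in parts)
```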
      <p>The fourth rule, related to the gender of compounds, requires a more
elaborate ShEx schema. Our current implementation enumerates over grammatical
gender and over the number of compound parts, as well as creating exceptions for
the -fuld suffix and for words that are not compounds. The result is a verbose ShEx schema.</p>
      <p>A ShEx schema is submitted to the ShEx2 Simple Online Validator7 together with a
list of Wikidata items to be tested, most conveniently found by a SPARQL
query to WDQS. The conformance reports produced in the ShEx validation
process provide detailed feedback about which statements on which items have
issues. By testing a subgraph for conformance to the ShEx schema, users can survey
multiple items in a single validation session. This allows users to address issues
across a set of items rather than reviewing them individually. By reading these
conformance reports we discovered numerous issues. Most of the non-conformant
items we discovered exhibit errors of omission rather than errors of commission:
in the former case, e.g., nouns missing grammatical gender; in the latter case,
e.g., nouns with a wrong gender. We corrected a number of the erroneous entries
and added many new statements.</p>
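      <p>The omission/commission distinction can be mechanized when triaging results. A minimal sketch, assuming a hypothetical simplified report format (a list of item/property/problem tuples), not the actual output of the ShEx2 Simple Online Validator:</p>

```python
# Triage entries of a conformance report into errors of omission
# (missing data) and errors of commission (wrong data). The report
# format here is a hypothetical simplification for illustration.
def triage(report):
    omissions, commissions = [], []
    for item, prop, problem in report:
        target = omissions if problem == "missing" else commissions
        target.append((item, prop, problem))
    return omissions, commissions

report = [
    ("L33", "P5185", "missing"),           # noun lacking grammatical gender
    ("L42", "P5185", "unexpected-value"),  # noun with a wrong gender value
]
omissions, commissions = triage(report)
```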
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>ShEx conformance reports can be used as a basis for concrete discussions among
editors about approaches for improving the data. Communicating via a schema
language allows editors to be explicit about their expectations and intentions.</p>
      <p>We note that some of the rules may be contested by Wikidata users. For
instance, there has been discussion on which character should indicate hyphenation
and how Wikidata should indicate that a word form cannot be hyphenated.8</p>
      <p>For people preparing to write schemas for use in the Wikidata context, we
offer the following practical recommendations. While writing schemas is an
investment of effort, the ability to quickly gain an overview of a subset of the
Wikidata graph is useful. It may be possible to reuse shapes from existing schemas,
so it is helpful to explore them. Reading existing schemas is
also a good way to get a sense of how others have expressed data models that
could serve as examples.
7 https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/doc/shex-simple.html.
8 The hyphenation discussions have taken place on the so-called talk page of the property on Wikidata: https://www.wikidata.org/wiki/Property_talk:P5279.</p>
      <p>The opportunity to provide a link to a schema in the ShEx namespace of
Wikidata supports collaborative refinement of data models, such as those for
Danish lexemes. Thus, if some of the rules discussed in this paper require
refinement, editors of Wikidata will be able to point to schemas, to query maps
indicating relevant sub-graphs of Wikidata, and to individual lexeme items which
may need to be revised.9</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Allan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lundskær-Nielsen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <source>Danish</source>
          . Routledge
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Shape Expressions (ShEx) Primer</article-title>
          (
          <year>July 2017</year>
          ), http://shex.io/shex-primer-20170713/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heltoft</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Grammatik over det Danske Sprog</article-title>
          . University Press of Southern Denmark (
          <year>February 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Labra Gayo</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boneva</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <source>Validating RDF Data</source>
          , vol.
          <volume>7</volume>
          (
          <year>September 2017</year>
          ), http://book.validatingrdf.com/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          :
          <article-title>Ordia: A Web application for Wikidata lexemes</article-title>
          .
          In:
          <source>ESWC 2019 Posters &amp; Demos</source>
          (May
          <year>2019</year>
          ), http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7137/pdf/imm7137.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mietchen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willighagen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scholia, Scientometrics and Wikidata</article-title>
          . In:
          <source>The Semantic Web: ESWC 2017 Satellite Events</source>
          (
          <year>October 2017</year>
          ), http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/7010/pdf/imm7010.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nimb</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asmussen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sørensen</surname>
            ,
            <given-names>N.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trap-Jensen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lorentzen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>DanNet: the challenge of compiling a wordnet for Danish by reusing a monolingual dictionary</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>43</volume>
          ,
          <fpage>269</fpage>
          -
          <lpage>299</lpage>
          (
          <year>August 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labra Gayo</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solbrig</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Shape expressions: an RDF validation and transformation language</article-title>
          .
          <source>SEM '14: Proceedings of the 10th International Conference on Semantic Systems</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Thornton</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solbrig</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stupp</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labra Gayo</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mietchen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waagmeester</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Using Shape Expressions (ShEx) to Share RDF Data Models and to Guide Curation with Rigorous Validation</article-title>
          .
          In:
          <source>The Semantic Web</source>
          , pp.
          <fpage>606</fpage>
          -
          <lpage>620</lpage>
          (May
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          ,
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          (
          <year>October 2014</year>
          ), http://cacm.acm.org/magazines/2014/10/178785-wikidata/fulltext
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Werkmeister</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          : Schema Inference on Wikidata (
          <year>October 2018</year>
          ), https://github.com/lucaswerkmeister/master-thesis/releases/download/final/master-thesis-Lucas-Werkmeister.pdf
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          9. This work is funded by the Innovation Fund Denmark through the DABAI and ATEL projects. The third author is partially funded by the Spanish Ministry of Economy and Competitiveness (Society challenges: TIN2017-88877-R). We appreciate feedback from Eric Prud'hommeaux on the ShEx schemas discussed in this paper.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>