Validating Danish Wikidata lexemes Finn Årup Nielsen1 , Katherine Thornton2 , and Jose Emilio Labra Gayo3 1 Cognitive Systems, DTU Compute, Technical University of Denmark, Denmark faan@dtu.dk 2 Yale University Library, New Haven, CT, USA katherine.thornton@yale.edu 3 University of Oviedo, Spain labra@uniovi.es Abstract. Two of the newest features of Wikidata are support for lexi- cographic data (lexemes), and support for Shape Expressions (ShEx). We demonstrate the first application of ShEx for validation of entity data for Wikidata lexemes. Validation of entity data in Wikidata against ShEx schemas allows editors to discover missing or incorrect information. It may also form a basis for discussion of the data models implicitly used in Wikidata. We present a use case and benchmark for ShEx and discuss its current limitations. 1 Introduction Since 2018, it has been possible to represent lexeme data with associated infor- mation about forms and senses in Wikidata [10]. Statistics from June 2019, in- dicate that more than 47,000 lexemes for more than 330 human languages have already been added to Wikidata. For Danish alone, there are entries for over 1,700 lexemes, including nouns, verbs, adjectives, adverbs and words from sev- eral other word classes.4 These lexemes can be described by properties specifying forms, senses, language, lexical categories, grammatical features, hyphenation, etc. We may also link lexemes to external linguistic resources, e.g., DanNet [7] is a wordnet for Danish. Words found in version 2.2 of DanNet have an associated Wikidata property. Thus, each lexeme has associated structured data suitable for machine consumption. ShEx (Shape Expressions) is a concise, formal language for modeling and validating RDF graphs [8]. The ShEx compact syntax allows users to rapidly write schemas to capture data models as they evolve. ShEx is actively being used to validate data in Wikidata [4]. In May 2019, Wikidata enabled a new namespace, which allows Wikidata editors to store and collaboratively edit ShEx schemas. Editors can subsequently use these schemas for validation of Wikidata items. 4 Updated statistics are available from the Ordia tool: https://tools.wmflabs.org/ ordia/statistics. Language statistics are available at https://tools.wmflabs. org/ordia/language/. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Here we describe our initial experiences with using ShEx for validation of linguistic data on Wikidata, focusing on Danish Wikidata lexemes. This is a convenient method for a user to quickly gain an overview of the current status of the data with regard to a specific schema, entirely using tools in the Wikidata ecosystem. Apart from discovering missing or incorrect data, we identify cases where ShEx pinpoints issues for discussion about a ‘data model’ for the Danish lexicographic data, as well as cases where ShEx validation is difficult. 2 Related work Numerous mechanisms and tools are available that allow Wikidata users to vali- date entered lexeme data [4]. Some of the mechanisms that work across Wikidata entities are: 1) Datatype restrictions per-property: The definition of properties specifies which datatype is allowed in the value field, such that, e.g., a free-form text (literal) cannot be used where a Wikidata item (‘IRI’) is required. 2) Lit- eral value restrictions: Wikidata also sets up constraints on literal data values via regular expressions defined in the P1793 Wikidata property. 3) Property constraints: Each property may also be associated with constraints that provide hints to editors about what values are expected, and what values are not. 4) Identifying patterns using SPARQL: With the SPARQL-based Wikidata Query Service (WDQS), users can formulate queries that can show inconsistency in the lexeme data, e.g., ‘find every lexeme without any form’ as suggested on the Ideas of queries wiki page.5 Scholia, a SPARQL-based web application and Wikidata frontend focusing on scholarly data [6], dynamically creates so-called subaspects with information about (possibly) missing data in Wikidata. For instance, the ‘missing’ subaspect for an author displays author name strings that could be resolved to Wikidata items and authored publications that are missing specification of one or more main subjects. Ordia is a web application for Wikidata lexemes [5]. Currently it has no validation, but browsing its various dynamically generated web pages a user may discover unusual patterns, e.g., the web page https://tools.wmflabs. org/ordia/lexical-category/ displays values used as lexical categories for lexemes where rare categories can contain errors. An example is L45350 which has the lexical category centralized version control system, — an obvious error. In addition to [9], ShEx and Wikidata have also been described in connection with schema inference [11]. The Wikidata ShEx Inference tool can be used to automatically suggest a schema for a type of resource based on properties that have been used on similar items in Wikidata. This is the first application of ShEx validation to the lexeme namespace in Wikidata. Validation using ShEx provides more comprehensive validation than has previously been possible with Wikidata tools. Datatype restrictions and lit- eral value restrictions can be expressed in ShEx and identified alongside property 5 https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Ideas_of_ queries. constraints for multiple properties at the same time. This offers more complete information, by allowing users to test situations that previously could only be approached piece by piece according to the above rules. The constraint system is useful for communicating expectations and potential issues to users on a per- property basis. Validating entity data against a schema describing a data shape can span multiple properties. 3 Validation of Danish lexemes We wrote ShEx schemas for Danish lexemes with the identifiers E15 (Danish lexeme), E34 (Danish noun), E54 (lexeme), E56 (Danish verb), E62 (Danish pronoun) and E65 (Danish numeral) as well as a ShEx for Danish hyphenation E68.6 Many rules exist for Danish words and grammar, see, e.g., [1,3], and we coded several in ShEx wrt. forms, senses, hyphenation, conjugation class, DanNet identifier, grammatical gender, etc. Here we will discuss a few: 1. All Danish Wikidata lexemes should have one unique value for DanNet words, — either one unique identifier or no value. Proper Danish nouns, adverbs, pronouns and words from a number of other word classes should not have an associated DanNet identifier. 2. A Danish noun should have one single grammatical gender, either common gender or neuter. 3. Each part of a hyphenated representation should contain a vowel. 4. The grammatical gender of a compound should have the same grammatical gender as the final lexeme of the compound, except for compounds suffixed -fuld. For the first rule, related to DanNet identifiers, Wikidata has the ability to explicitly state ‘no value’ and ShEx can test for the presence of this ‘no value’ in the Wikidata property for DanNet (P6140) with “a [ wdno:P6140 ]”. In the cases where the DanNet identifier is present with a single value, we can test its format with a regular expression. Current DanNet identifiers conform to a format with 8 digits. The combined ShEx constraint for Wikidata DanNet statements then reads a [ wdno:P6140 ] | ps:P6140 /^[0-9]{8}$/”. A test for the uniqueness of DanNet identifiers is not currently possible in ShEx, but it is possible that this feature will be added [2]. The second rule for two grammatical genders is represented by the follow- ing ShEx schema: wdt:P5185 [ wd:Q1305037 wd:Q1775461 ]. Danish does not reveal grammatical gender in plural, so plurale tantum (word with only plural forms) and its derived lexemes have no obvious gender and no gender is specified in Wikidata. The same lack of obvious gender occurs for a word such as a druk 6 The Wikidata page with editing capabilities are, e.g., https://www.wikidata.org/ wiki/EntitySchema:E15 while the URL for ShEx import is https://www.wikidata. org/wiki/Special:EntitySchemaText/E68 for E15 and for proper nouns. Either we need to define the gender or modify the ShEx schema to accommodate the exceptions. We also note that a few nouns have multiple genders, and these nouns also require exceptions. For the third rule, we note that a Danish hyphenation rule can apply across most, if not all, word classes, thus we can write a general schema for hyphenation and use ShEx’s IMPORT feature to include shapes of the schema as part of another schema. The essential part in the current Danish hyphenation schema (E68) is: “a [ wdno:P5279 ] | ps:P5279 /.*[aeiouyæøå] .*\u2027.*[aeiouyæøå]. */ ;”. Here wdno:P5279 catches Wikidata’s ‘no value’ for forms without pos- sible hyphenation. The regular expression uses the hyphenation point unicode character. It could be further refined and extended to accommodate accented vowels and words with multiple possible hyphenation points. The fourth rule, related to the gender of compounds, requires a more elab- orate ShEx schema. Our current implementation enumerates over grammatical gender and over the number of compound parts as well as creates exceptions for the -fuld suffix and words that are not compounds. The result is a verbose ShEx. A ShEx schema is submitted to the ShEx2 Simple Online Validator7 with a list of Wikidata items to be tested, — most conveniently found by a SPARQL query to WDQS. The conformance reports produced in the ShEx validation process provide detailed feedback about which statements on which items have issues. Testing a subgraph for conformance to the ShEx schema, users can survey multiple items in a single validation session. This allows users to address issues across a set of items rather than reviewing them individually. By reading these conformance reports we discovered numerous issues. Most of the non-conformant items we discover are errors of omission, rather than errors of commission. In the former case, e.g., nouns missing grammatical gender, while in the latter case, e.g., nouns with a wrong gender. We corrected a number of the erroneous entries and added many new statements. 4 Discussion ShEx conformance reports can be used as a basis for concrete discussions among editors about approaches for improving the data. Communicating via a schema language allows editors to be explicit about their expectations and intentions. We note that some of the rules may be contested by Wikidata users. For in- stance, there has been discussion on which character should indicate hyphenation and how Wikidata should indicate that a word form cannot be hyphenated.8 For people preparing to write schemas for use in the Wikidata context, we offer the following practical recommendations. While writing schemas is an in- vestment of effort, the ability to quickly gain an overview of a subset of the Wiki- data graph is useful. It may be possible to reuse shapes from existing schemas, 7 https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/ doc/shex-simple.html. 8 The hyphenation discussions have taken place on the so-called talk page of the property on Wikidata: https://www.wikidata.org/wiki/Property_talk:P5279. thus it may be helpful to explore existing schemas. Reading existing schemas is also a helpful way to get a sense of how others have expressed data models that could serve as examples. The opportunity to provide a link to a schema in the ShEx namespace of Wikidata supports collaborative refinement of data models, such as those for Danish lexemes. Thus, if some of the rules discussed in this paper require re- finement, editors of Wikidata will be able to point to schemas, to query maps indicating relevant sub-graphs of Wikidata, and to individual lexeme items which may need to be revised.9 References 1. Allan, R., Holmes, P., Lundskær-Nielsen, T.: Danish. Routledge (1995) 2. Baker, T., Prud’hommeaux, E.: Shape Expressions (ShEx) Primer (July 2017), http://shex.io/shex-primer-20170713/ 3. Hansen, E., Heltoft, L.: Grammatik over det Danske Sprog. University Press of Southern Denmark (February 2019) 4. Labra Gayo, J.E., Prud’hommeaux, E.G., Boneva, I., Kontokostas, D.: Validating RDF Data, vol. 7 (September 2017), http://book.validatingrdf.com/ 5. Nielsen, F.Å.: Ordia: A Web application for Wikidata lexemes. In: ESWC 2019 Posters & Demos (May 2019), http://www2.compute.dtu.dk/pubdb/views/edoc_ download.php/7137/pdf/imm7137.pdf 6. Nielsen, F.Å., Mietchen, D., Willighagen, E.: Scholia, Scientometrics and Wikidata. In: The Semantic Web: ESWC 2017 Satellite Events (October 2017), http://www2. imm.dtu.dk/pubdb/views/edoc_download.php/7010/pdf/imm7010.pdf 7. Pedersen, B.S., Nimb, S., Asmussen, J., Sørensen, N.H., Trap-Jensen, L., Lorentzen, H.: DanNet: the challenge of compiling a wordnet for Danish by reusing a monolingual dictionary. Language Resources and Evaluation 43, 269–299 (August 2009) 8. Prud’hommeaux, E.G., Labra Gayo, J.E., Solbrig, H.: Shape expressions: an RDF validation and transformation language. SEM ’14: Proceedings of the 10th Inter- national Conference on Semantic Systems (2014) 9. Thornton, K., Solbrig, H., Stupp, G., Labra Gayo, J.E., Mietchen, D., Prud’hommeaux, E.G., Waagmeester, A.: Using Shape Expressions (ShEx) to Share RDF Data Models and to Guide Curation with Rigorous Validation. In: The Semantic Web. pp. 606–620 (May 2019) 10. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Com- munications of the ACM 57, 78–85 (October 2014), http://cacm.acm.org/ magazines/2014/10/178785-wikidata/fulltext 11. Werkmeister, L.: Schema Inference on Wikidata (October 2018), https: //github.com/lucaswerkmeister/master-thesis/releases/download/final/ master-thesis-Lucas-Werkmeister.pdf 9 This work is funded by the Innovation Fund Denmark through DABAI and ATEL projects. The third author is partially funded by the Spanish Ministry of Economy and Competitiveness (Society challenges: TIN2017-88877-R). We appreciate feed- back from Eric Prud’hommeaux on the ShEx schemas discussed in this paper.