Running a reconciliation service for Wikidata

Antonin Delpeuch [0000-0002-8612-8827]
Department of Computer Science, University of Oxford, UK
antonin.delpeuch@cs.ox.ac.uk

Abstract. Data matching is a central part of many contribution workflows in Wikidata. We present a reconciliation service for Wikidata, implementing a standard API that is supported by other data providers and consumed by multiple clients. We explain the technical choices behind the architecture of the service and review its usage patterns in 2019.

Keywords: reconciliation · record linkage · web standard · discovery · Wikidata · OpenRefine

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Aligning datasets which do not share common identifiers is a crucial step in many data integration workflows. This task is known in the literature under many names: data matching, record linkage, reconciliation and many others. The task is heuristic by nature and the techniques used to tackle it vary widely depending on the application domain, but there are popular patterns for the overall architecture [1].

In Wikidata [3], matching is an important task both for contributors and for reusers. Wikidata contributors constantly need to disambiguate between items, be it for manual editing or data imports. Wikidata reusers often match their own datasets to Wikidata, for instance to pull further data from the knowledge graph, or to uncover duplicates in their own database.

The Wikidata ecosystem offers a range of tools to help with this process. First, Wikidata itself offers a search engine based on ElasticSearch, which can be used to look up items. It supports a range of search operators, such as boolean operators (AND, OR) and fuzzy search (Lovelaec~ returns Ada Lovelace (Q7259) as the first result). Custom keywords can also be used to look up entities by their property values (haswbstatement:"P298=RUS" returns Russia (Q159) as the only result). In addition, Wikidata offers autocompletion for inputs which expect entities, based on a prefix search on labels and aliases. For instance, typing USA in such an input will propose United States of America (Q30) as the first option in a drop-down menu. Another useful service to discover entities is the Wikidata Query Service, which offers a SPARQL endpoint on an RDF view of Wikidata. This can be used to formulate more elaborate logical queries than what the search endpoint supports. The search API can itself be called from SPARQL queries using a SERVICE clause, making it possible to combine fuzzy search with advanced logical querying capabilities.

In addition to these official services, the community has built a range of tools to help with matching databases to Wikidata. A popular one is Mix'n'Match (https://mixnmatch.toolforge.org), a platform where Wikidata users can match external datasets to Wikidata items interactively. Datasets can be uploaded in a tabular format, or fetched from the target websites using user-defined scrapers. When a dataset is associated with a Wikidata property, users can directly add statements to the matched entities to store the third-party identifier in Wikidata as they match the dataset. The platform also supports creating new items for entities which are not yet represented in the knowledge graph.
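As a small illustration of these built-in search facilities, the following Python sketch queries Wikidata's public API for the two examples above. The use of the requests library is an assumption made for the sake of the example, and the expected outputs are those given in the text (live results may of course drift over time):

import requests

API = "https://www.wikidata.org/w/api.php"

# Full-text search with the haswbstatement keyword: items whose
# ISO 3166-1 alpha-3 code (P298) is "RUS".
r = requests.get(API, params={
    "action": "query", "list": "search",
    "srsearch": 'haswbstatement:"P298=RUS"', "format": "json"})
print([hit["title"] for hit in r.json()["query"]["search"]])  # expect ["Q159"]

# Prefix search on labels and aliases, as used for autocompletion.
r = requests.get(API, params={
    "action": "wbsearchentities", "search": "USA",
    "language": "en", "format": "json"})
print(r.json()["search"][0]["id"])  # expect "Q30"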
In this article, we present another such matching tool: a Wikidata reconciliation service (https://github.com/wetneb/openrefine-wikibase). Reconciliation services offer a web API specifically designed to match third-party datasets to some database, such as Wikidata. We present in Section 2 the queries supported by the service. In short, it is a search API specifically tailored for automated matching, which also offers convenience endpoints for field autocompletion and entity preview. The original client for this API is OpenRefine, a data cleaning tool initially developed to carry out data imports into Freebase. Over the years, the API has been adopted by other data providers and client software. In 2019, a W3C Community Group was founded (https://www.w3.org/community/reconciliation/) to steer the evolution of this API, which had remained mostly undocumented despite its growing adoption.

2 The reconciliation API

The reconciliation API (https://reconciliation-api.github.io/specs/latest/) is a web API that data providers can offer, making it easier for clients to match their own data to the database that sits behind the service. Instead of querying a vendor-specific search API, clients can rely on this uniform interface to formulate search queries specifically tailored to data matching problems. For instance, say we are interested in matching the following dataset of films to Wikidata:

Film title     Director name
The Escape     Dominic Savage
Les Ex         Maurice Barthelemy
La Douleur     Emmanuel Finkiel
Raid Dingue    Dany Boon

Fig. 1. An example table to be matched to Wikidata

Film titles alone are ambiguous, therefore searching for each film title in Wikidata will not give very reliable results. We could concatenate the name of the director to our queries, but that would not let us specify that this name should be matched against the value of director (P57) only. As we do not know the Wikidata identifiers of the directors, we cannot use the haswbstatement syntax for this purpose. Furthermore, there is no mechanism to filter results by type using the user-contributed type system made of instance of (P31) and subclass of (P279).

Instead, we can use the reconciliation API to formulate a query, consisting of:
– a name (for instance "The Escape");
– a list of property pairs, each consisting of a property id (such as director (P57)) and a value (such as "Dominic Savage");
– a type constraint (such as film (Q11424)).

Sending this query to the service returns reconciliation candidates along with scores which quantify how well they match the query. In our running example we would get The Escape (Q39073801) as the first candidate, followed by namesakes such as The Escape (Q58814699). Reconciliation queries can be sent in batches, which speeds up the process of matching large datasets.

Because matching is a heuristic process which often needs to be supervised by humans, the reconciliation API offers auto-complete endpoints which let tools validate manual matching decisions by mapping user input to Wikidata identifiers interactively. It also provides a preview service, which lets clients display hovercards summarizing the contents of Wikidata entities. These aspects of the API are optional: services can decide to support them on an opt-in basis. The reconciliation test bench (https://reconciliation-api.github.io/testbench/) offers an overview of which aspects of the API are supported by a range of public reconciliation endpoints.
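To make the query format concrete, here is a minimal sketch, in Python, of how a client could send a batch of queries from Figure 1. The endpoint URL is illustrative (substitute the actual deployment), the field names follow the reconciliation API specification, and a real client such as OpenRefine would additionally handle throttling and errors:

import json
import requests

# Endpoint of a Wikidata reconciliation service; this URL is illustrative.
SERVICE_URL = "https://wikidata.reconci.link/en/api"

# A batch of two queries, each with a name, a type constraint (film, Q11424)
# and a property constraint on director (P57).
queries = {
    "q0": {"query": "The Escape", "type": "Q11424",
           "properties": [{"pid": "P57", "v": "Dominic Savage"}]},
    "q1": {"query": "La Douleur", "type": "Q11424",
           "properties": [{"pid": "P57", "v": "Emmanuel Finkiel"}]},
}

# The batch is sent form-encoded in a single "queries" parameter.
response = requests.post(SERVICE_URL, data={"queries": json.dumps(queries)})
for key, value in response.json().items():
    for candidate in value["result"]:
        print(key, candidate["id"], candidate["name"], candidate["score"])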
Since the creation of the W3C Entity Reconciliation Community Group, the API has evolved thanks to feedback from users and service providers. For instance, the API originally relied on JSONP, an old technique to perform cross-origin requests in JavaScript; services can now use CORS headers to this end. Other changes are being drafted, and we encourage other stakeholders to join the discussion to ensure that this API meets the diverse needs of the community at large.

3 Architecture of the service

Our Wikidata reconciliation service is a small wrapper on top of existing APIs. Its role is to translate reconciliation queries into calls to Wikibase's own API, produce scores for the reconciliation candidates and return them to the user in the expected format. This wrapper architecture is used by many reconciliation services, but some data providers also offer an official reconciliation service of their own [2].

3.1 Query resolution

Reconciliation queries can have various shapes, which influence the resolution process. Consider the example query of Section 2. To compute the corresponding reconciliation candidates, we proceed as follows:
– retrieve all the subclasses of the target type (film (Q11424)) using a SPARQL query:
  SELECT ?child WHERE { ?child wdt:P279* wd:Q11424 }
– search for "The Escape" using both Wikidata's full-text search service and its auto-complete service;
– retrieve all entities appearing in the search results;
– filter out those which do not have one of the subclasses of film (Q11424) as instance of (P31);
– retrieve the items which appear as director (P57) of the remaining candidates;
– score candidates by comparing their labels and aliases to "The Escape", and the labels and aliases of their director (P57) values to "Dominic Savage"; the scores are a linear combination of fuzzy-matching scores of these strings;
– return the candidates to the user.

The subclasses of target types and the contents of Wikidata entities are cached in a Redis database to speed up processing. Because the target types used in reconciliation queries generally remain constant over a long series of batches, the initial fetching of subclasses is normally amortized over subsequent queries.

Each reconciliation query triggers two search queries against Wikidata's API, via the action=wbsearchentities and the action=query&list=search endpoints. The reason for this is that neither endpoint can be trusted to surface the relevant candidates systematically. For instance, searching for "USA" with action=wbsearchentities returns United States of America (Q30) as the first result, but with the same query in action=query&list=search, this entity is not present in the first page of results. Conversely, searching for "Lovelace, Ada" with action=query&list=search returns Ada Lovelace (Q7259), but yields no results with action=wbsearchentities. Since the reconciliation API supports batching of requests, the service performs some of these API calls in parallel to reduce the overall processing time. This dual-search strategy is sketched below.

The service also recognizes some special values for which it avoids querying the search APIs and uses custom processing instead. This is the case for Wikidata URIs (https://www.wikidata.org/wiki/Q42, https://www.wikidata.org/entity/Q42), bare Q-ids (Q42) and Wikipedia URLs (https://en.wikipedia.org/wiki/Douglas_Adams), which are directly resolved to the corresponding items after following redirects.
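The following Python sketch illustrates the dual-search strategy described above, querying both endpoints and merging their results. Both endpoints are Wikidata's public ones; the helper function name is ours and error handling is omitted:

import requests

API = "https://www.wikidata.org/w/api.php"

def candidate_ids(text, language="en", limit=10):
    # Prefix search over labels and aliases (the auto-complete endpoint).
    r1 = requests.get(API, params={
        "action": "wbsearchentities", "search": text,
        "language": language, "limit": limit, "format": "json"}).json()
    ids = [hit["id"] for hit in r1.get("search", [])]
    # Full-text ElasticSearch query over item pages.
    r2 = requests.get(API, params={
        "action": "query", "list": "search", "srsearch": text,
        "srlimit": limit, "format": "json"}).json()
    ids += [hit["title"] for hit in r2["query"]["search"]]
    # Merge, deduplicating while preserving rank order.
    return list(dict.fromkeys(ids))

# "Lovelace, Ada" is only found by the full-text endpoint, while "USA" is
# only surfaced prominently by the prefix search endpoint.
print(candidate_ids("Lovelace, Ada"))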
Finally, when a unique identifier is provided as a property, the service first tries to retrieve the candidate items using this unique identifier (via a SPARQL query), and falls back to text search if no matches are returned.

The properties supported by the service are not limited to Wikidata properties such as director (P57): these properties can be combined into property paths, in analogy to SPARQL's property paths. The supported combinators are:
– disjunction: P57|P58 returns the directors (P57) and screenwriters (P58) of a given item;
– sequence: P57/P19 returns the place of birth (P19) of the directors (P57);
– repetition: P749* returns the network of parent organizations (P749), followed transitively;
– empty path: . returns the item itself, which can be useful with other combinators or to provide alternate labels during reconciliation;
– labels: Len returns the label of an item in English;
– descriptions: Dfr returns the description of an item in French;
– aliases: Ade returns the aliases in German;
– sitelinks: Sitwiki returns the sitelink in the Italian Wikipedia.

In addition, there are syntaxes to extract various fields of the datatypes supported by Wikibase, by appending them to the end of the property path. For instance, P625@lat will return the latitude of the coordinate location (P625). The full list of supported fields is as follows:
– @lat and @lng for coordinates;
– @year, @month and @day for dates, with @isodate to format dates in YYYY-MM-DD format;
– @urlscheme, @netloc and @urlpath to return parts of a URL (respectively "https", "www.wikidata.org" and "/wiki/Wikidata:Main_page" for the URL of Wikidata's main page).

3.2 Suggest and preview services

In the reconciliation API parlance, suggest services are the APIs that underpin the auto-complete inputs for entities, properties and types. In the Wikidata reconciliation service, these are implemented by simply forwarding calls to the corresponding Wikidata endpoints, except for properties, as we also need to validate property paths. This validation is done by parsing the path and returning it as a drop-down option if parsing succeeds.

The preview service renders a small HTML page intended to be embedded in the client as an iframe. In the Wikidata service, entities are previewed by displaying an image associated with them and their description in the target language. If no such description is found, Magnus Manske's autodesc service is used, which generates a description on the fly using a rule-based system.

3.3 Data extension

Data extension allows the reconciliation client to pull data from the target data source, by specifying entity and property identifiers. This fetching is also done in batches, and the wrapper uses the same fetching and caching strategy as for reconciliation itself. A minimal sketch of such a data extension request is shown below.
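The sketch follows the data extension part of the reconciliation API specification; as before, the endpoint URL is illustrative and the example entities (Q142 for France, Q183 for Germany) are our own choice:

import json
import requests

SERVICE_URL = "https://wikidata.reconci.link/en/api"  # illustrative URL

# Fetch capital (P36) and country calling code (P474) for two matched items.
extend = {"ids": ["Q142", "Q183"],
          "properties": [{"id": "P36"}, {"id": "P474"}]}
response = requests.post(SERVICE_URL, data={"extend": json.dumps(extend)})
data = response.json()
# The response contains one row per entity, keyed by property id.
for qid, row in data["rows"].items():
    print(qid, row["P36"], row["P474"])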
In addition, the service supports suggesting properties to fetch for a particular domain. Given the Wikidata identifier of a type, the goal is to return properties generally used on items of this type. For instance, for the type sovereign state (Q3624078), we could suggest properties such as capital (P36) or country calling code (P474). To compute these proposals, we rely on properties for this type (P1963), which was designed for this purpose. The requested type is often not annotated with any such statement; to mitigate this, we also take the superclasses of the requested type into account.

This is done using a SPARQL query which relies on the GAS (Gather Apply Scatter) service in BlazeGraph. It lets us order properties by relevance, from the most specific ones (annotated on the type itself) to more generic ones (annotated on more abstract superclasses):

SELECT ?prop ?propLabel ?depth
WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
    gas:program gas:in wd:Q3624078 .
    gas:program gas:out ?out .
    gas:program gas:out1 ?depth .
    gas:program gas:maxIterations 10 .
    gas:program gas:maxVisited 100 .
    gas:program gas:linkType wdt:P279 .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
  ?out wdt:P1963 ?prop .
}
ORDER BY ?depth
LIMIT 50

4 Statistics

In this section, we give a brief overview of the nature of the queries sent to the Wikidata reconciliation service in 2019. About 17.6 million queries were processed by the wrapper over this period.

The service takes a mandatory language parameter for all its queries, which determines in which language labels are returned, but also in which language the wbsearchentities calls are made (which can influence the query results). As Figure 2 shows, queries are overwhelmingly made in English, perhaps because it is the default language in OpenRefine, whose user interface does not make it clear that the language used for reconciliation can be changed (it is inferred from the UI language itself). Figure 3 gives an overview of the most frequently used types. Because of the hierarchical structure on types, we also include the total count of queries to subclasses of each type, to give a better idea of the popularity of a given domain.

Language          Frequency
English (en)      73.5%
German (de)        8.0%
French (fr)        7.9%
Spanish (es)       5.9%
Dutch (nl)         1.9%
Japanese (ja)      0.6%
Other languages    2.2%

Fig. 2. Proportion of languages used for querying

Target type                              Direct uses   Uses of subclasses
human (Q5)                                 4,033,719            4,033,892
organization (Q43229)                      2,043,513            4,491,792
entity (Q35120)                            1,172,760           14,132,144
Identificadores (Q21169908)                  770,820              770,820
business (Q4830453)                          463,202              604,466
scholarly article (Q13442814)                335,624              335,627
city (Q515)                                  319,695              433,434
film (Q11424)                                282,052              322,892
taxon (Q16521)                               223,538              223,547
village in India (Q56436498)                 214,534              214,534
railway station (Q55488)                     203,648              213,386
commune of France (Q484170)                  170,390              170,629
album (Q482994)                              164,653              165,267
branch post office (Q61443690)               144,909              144,909
university (Q3918)                           139,461              151,569
title (Q783521)                              139,411              139,419
family name (Q101352)                        125,531              125,671
educational institution (Q2385804)           111,726              429,278
State Bank of India branch (Q65954115)       110,508              110,508
airport (Q1248784)                           106,165              107,361
comune of Italy (Q747074)                    101,629              101,852
video game (Q7889)                            99,180               99,180
periodical (Q1002697)                         91,813              221,330
software (Q7397)                              91,772              121,857
district of India (Q1149652)                  86,730               86,730

Fig. 3. Number of times the 25 most popular types were reconciled against. The second column indicates the number of queries against the type or any of its subclasses.

5 Conclusion

This service and the underlying API could be improved in many ways. Although relying on Wikidata's existing search APIs makes the service simple to deploy and keeps it in sync with Wikidata, it also limits our capacity to adapt the indexing profile to the needs of our users (which is why we run two search queries for each reconciliation query, as explained in Section 3).
Running our own search index on top of Wikidata could lift this limitation, and it would also let us submit search queries in batches rather than individually. Ideally, the service could be offered by the Wikibase instance itself, for instance as a MediaWiki extension. In the meantime, the existing service can be used with other Wikibase instances, as long as they provide a SPARQL endpoint.

References

1. Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Science & Business Media (2012)
2. Delpeuch, A.: A survey of OpenRefine reconciliation services. arXiv:1906.08092 [cs] (Aug 2019)
3. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledge base. Communications of the ACM 57(10), 78–85 (2014). https://doi.org/10.1145/2629489