WDPlus: Leveraging Wikidata to Link and Extend Tabular Data

Daniel Garijo (dgarijo@isi.edu) and Pedro Szekely (pszekely@isi.edu)
Information Sciences Institute, University of Southern California, Los Angeles, California

SciKnow19, November 19th, 2019, Marina del Rey, Los Angeles
Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Scientific observations and other open data are usually made available online in tabular form as CSVs and spreadsheets. However, users of these data face three main challenges when attempting to use these products: finding which datasets are related to a topic of interest; determining which existing information can be used to extend a given dataset; and sharing their integrated results with the rest of the community. In this paper we present WDPlus, a framework designed to address these challenges by leveraging Wikidata. WDPlus allows searching for heterogeneous datasets, facilitates completing tabular data using Wikidata, and proposes a mechanism to extend Wikidata in a decentralized manner.

KEYWORDS
Knowledge Graphs, Entity Linking, Wikidata, RDF

1 INTRODUCTION
Today, data about any domain can be found on the web in data repositories, web APIs, and millions of spreadsheets and CSV files. These data come in a myriad of formats, layouts, terminologies, and degrees of cleanliness that make them difficult to integrate.

Users of these data face three main challenges. The first one is finding datasets related to a feature or topic of interest. For example, climate scientists often look for years of observational data from authoritative sources when estimating the climate of a region. The second challenge is how to complete a given dataset with existing knowledge: machine learning applications are data hungry and require as many data points and features as possible to improve their predictions, which often requires integrating data from different sources. The final challenge is sharing integrated results: once several datasets have been merged, how can they be made available to the rest of the community?

Knowledge graphs have become the preferred technology to address these challenges. Large organizations, including search engine providers, shopping giants, and finance institutions, are investing in large knowledge graphs to integrate and retrieve heterogeneous data. However, data integration pipelines are usually created manually, require significant expertise, and are seldom available to the general public. Similarly, linking to existing datasets in the Linked Open Data Cloud (https://lod-cloud.net/) usually requires the expertise of a knowledge engineer to properly identify the appropriate target instances to link to in other datasets.

Recent initiatives such as Data.world (https://data.world/), Google Dataset Search [2], and DataCommons (https://datacommons.org/) aim to facilitate linking, searching, and accessing some of the contents of these knowledge graphs. However, much work remains to automatically link, extend, and integrate tabular data into existing open knowledge graphs. In this paper we propose WDPlus, a framework designed to address these challenges by leveraging Wikidata (http://wikidata.org/) [5], an open crowdsourced knowledge graph with over 60 million entities, over 700 million statements describing those entities, and a thriving community of curators.

2 WDPLUS FRAMEWORK
WDPlus is a framework designed to explore and build large multi-domain knowledge graphs using the wealth of structured data available on the web. Our approach shifts the burden of semantic linking away from data publishers to communities that have an incentive to do so in a collaborative manner.

An overview of WDPlus can be seen in Figure 1. At its core, WDPlus relies on Wikidata [5]. A series of extensions to Wikidata, i.e., the Wikidata satellites (represented as red circles in the figure), surround the core, enabling different organizations to create multi-domain knowledge graphs tailored to their needs. A metadata index stores table metadata to facilitate retrieving related datasets and other candidates with potential to expand or become a Wikidata satellite.

Figure 1: Overview of the WDPlus framework

New data from the web may be processed through our proposed WDPlus toolkit, designed to automatically create prototype table models that can be extended with existing Wikidata knowledge and be refined into full Wikidata extensions. We provide more information about our toolkit, table metadata index, and satellites below.

2.1 The WDPlus Toolkit
In order to process and link tables and spreadsheets, we created a toolkit with the following capabilities:

• Entity Linking: One of the main challenges when linking a dataset to existing knowledge graphs is finding whether the entities described in the dataset are already defined in a target knowledge graph. To facilitate this task, we have created a CSV Wikifier (inspired by [1]) to link tabular data to existing Wikidata entities and disambiguate them based on their most common class. Early results on the ISWC 2019 cell-entity annotation challenge (https://bit.ly/2NefkLq) rank our system among the top performers, with over 0.9 precision and over 0.85 F1 score.
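As an illustration of the class-based disambiguation idea described above, the following Python sketch (ours, not the WDPlus implementation) picks, for each cell in a column, the candidate whose class is most common across the whole column. Candidate generation (e.g., via Wikidata's wbsearchentities API) and the lookup of each candidate's class (P31) are assumed to have happened upstream.

```python
from collections import Counter

def disambiguate_column(candidates_per_cell):
    """Disambiguate one table column by majority class.

    candidates_per_cell: one list of candidates per cell, where each
    candidate is a dict such as {"qid": "Q90", "class": "Q515"}.
    Returns one chosen candidate per cell (or None for empty cells),
    preferring candidates of the column's most frequent class.
    """
    # Count how often each class appears among all candidates in the column.
    class_counts = Counter(
        c["class"] for cands in candidates_per_cell for c in cands
    )
    if not class_counts:
        return [None] * len(candidates_per_cell)
    best_class = class_counts.most_common(1)[0][0]

    chosen = []
    for cands in candidates_per_cell:
        # Prefer a candidate of the majority class; otherwise fall back
        # to the top-ranked candidate for that cell.
        match = next((c for c in cands if c["class"] == best_class), None)
        chosen.append(match if match is not None else (cands[0] if cands else None))
    return chosen
```

For example, in a column of city names, "Paris" might have both the city (Q90, class Q515) and the film (class Q11424) as candidates; if the other cells resolve to cities, the majority class Q515 selects Q90.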
• Interactive Table Understanding: Our toolkit includes a GUI and defines a mapping language [4] to assimilate data from the web as entities and statements compatible with Wikidata. Figure 2 shows a snapshot of the application, which depicts the table to map on the left side of the figure and the currently specified mapping on the right side. The toolkit allows capturing all the qualifiers associated with a statement, representing the provenance, location, timeliness, units, and roles of each assertion. These are represented using different colors in the figure (statement values in green, qualifiers in red, and the source of the statements in blue). Table models used for converting tabular data are saved and linked as part of the WDPlus framework.

• RDF Generation: Given a table model and a target table, our toolkit includes the means to generate RDF triples that follow the Wikidata data model. These triples can then be browsed and loaded in a Wikidata satellite. WDPlus also mints new triples when new entities or properties do not exist in Wikidata.

The WDPlus toolkit is available online with an MIT License (https://github.com/usc-isi-i2/t2wml).

2.2 Creating a Metadata Index of Tabular Data
We have created an index to store table metadata, where each record is an instance of the Wikidata Dataset class (Q1172284). The rationale for the metadata index is to contain datasets that may not necessarily be materialized as a knowledge graph, but that would be interesting resources to link and extend other datasets. Our metadata schema relies on Wikidata and Schema.org [3], using terms such as title, data download, license, website source, variables, etc. Given a table of interest, WDPlus executes an automated entity linking process using our wikifier and creates an entry in the metadata index. Each entry includes a record of the main distinct Wikidata entities identified in the entity linking process and their labels, which serve to inform the search for related datasets.

The metadata index connects heterogeneous tables to Wikidata, and therefore we use it for data augmentation. For example, a user with a table containing demographic information by country may be interested in adding climate observations to find correlations between population and temperature. Thus, given a dataset to augment (e.g., a CSV with city name and population) and a search term (e.g., temperature), the metadata index returns a ranked selection of datasets that may complement the target dataset with additional metadata. Once a user selects a result dataset, WDPlus automatically adds a new column to the dataset to complete, filling in information for every row from Wikidata or materializing the search results. Missing values are currently not imputed. The API for the metadata index can be found online (http://dsbox02.isi.edu:9000/apidocs/).

2.3 Augmenting Wikidata with Satellites
Users may want to contribute tabular datasets to WDPlus as an extension of Wikidata entities. This process can be accomplished in two steps. The first one consists of an interactive table understanding step, where the user is presented with the GUI shown in Figure 2, containing entity linking candidates for all cells included in the target table. The GUI then helps users define a table model, indicating how to map a target table to the Wikidata data model. We collect all qualifiers for each assertion (e.g., source, point in time and space, units, etc.), as these are crucial for ensuring data quality and trust.

Figure 2: Interactive table understanding interface to link datasets to Wikidata records

The second step generates RDF triples from the table, storing them in a Wikidata satellite using the WDPlus toolkit. WDPlus defines a special range of URIs for each satellite, which are separated in graphs.
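A minimal sketch of such a URI scheme follows; the satellite host names and graph identifiers are hypothetical placeholders of our own, not WDPlus's actual URI ranges.

```python
# Illustrative sketch: each satellite owns a URI range and a named graph,
# so satellite-minted entities never collide with Wikidata QIDs or with
# entities from other satellites. Host names here are assumptions.
WIKIDATA_NS = "http://www.wikidata.org/entity/"

def satellite_uri(satellite, local_id):
    """Mint a URI inside a satellite's own range (hypothetical scheme)."""
    return f"https://{satellite}.example.org/entity/{local_id}"

def graph_for(uri, satellites):
    """Route a subject URI to its named graph: Wikidata core or a satellite."""
    if uri.startswith(WIKIDATA_NS):
        return "urn:graph:wikidata-core"
    for s in satellites:
        if uri.startswith(f"https://{s}.example.org/"):
            return f"urn:graph:satellite:{s}"
    raise ValueError(f"URI not in any registered range: {uri}")
```

Under this scheme, a triple about the Wikidata entity Q90 is routed to the core graph, while a new entity minted by a hypothetical "climate" satellite lands in that satellite's own graph.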
These graphs may be stored in separate triplestores, allowing Wikidata to grow in a decentralized manner while keeping all related pointers in the same metadata index. Wikidata satellites may define new entities and properties that are not part of Wikidata. However, our early entity linking process minimizes the creation of duplicate entities, ensuring that all satellites are linked together.

A key aspect of WDPlus is that we store all curated table models in our metadata index for each transformed dataset. This helps keep the provenance of all results in a satellite and may inform the transformation of other tabular data with a very similar structure. For example, in the US, demographic data is usually provided for each county as a CSV file. A single table model for one county can be used to transform the county CSVs of the whole country.

3 CONCLUSIONS
This paper introduces WDPlus, a framework that leverages Wikidata to link, search, and augment tabular data. While the framework is still under development, our WDPlus prototype integrates heterogeneous data from the crime, economic, and education domains, illustrating the feasibility of our approach.

REFERENCES
[1] Brank, J., Leban, G., and Grobelnik, M. Annotating documents with relevant Wikipedia concepts. In Proceedings of the Slovenian Conference on Data Mining and Data Warehouses (SiKDD 2017) (2017).
[2] Brickley, D., Burgess, M., and Noy, N. Google Dataset Search: Building a search engine for datasets in an open web ecosystem. In The World Wide Web Conference (New York, NY, USA, 2019), WWW '19, ACM, pp. 1365–1375.
[3] Guha, R. V., Brickley, D., and Macbeth, S. Schema.org: Evolution of structured data on the web. Commun. ACM 59, 2 (Jan. 2016), 44–51.
[4] Szekely, P., Garijo, D., Pujara, J., Bhatia, D., and Wu, J. T2WML: A cell-based language to map tables into Wikidata records. To appear in Proceedings of the 2019 International Semantic Web Conference (2019).
[5] Vrandečić, D., and Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 57, 10 (Sept. 2014), 78–85.