WDPlus: Leveraging Wikidata to Link and Extend Tabular Data

Daniel Garijo (dgarijo@isi.edu) and Pedro Szekely (pszekely@isi.edu)
Information Sciences Institute, University of Southern California, Los Angeles, California

SciKnow19, November 19th, 2019, Marina del Rey, Los Angeles
Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Scientific observations and other open data are usually made available online in tabular form as CSVs and spreadsheets. However, users of these data face three main challenges when attempting to use these products: finding which datasets are related to a topic of interest; determining which existing information can be used to extend a given dataset; and sharing their integrated results with the rest of the community. In this paper we present WDPlus, a framework designed to address these challenges by leveraging Wikidata. WDPlus allows searching for heterogeneous datasets, facilitates completing tabular data using Wikidata, and proposes a mechanism to extend Wikidata in a decentralized manner.

KEYWORDS
Knowledge Graphs, Entity Linking, Wikidata, RDF

1 INTRODUCTION
Today, data about any domain can be found on the web in data repositories, web APIs, and millions of spreadsheets and CSV files. These data come in a myriad of formats, layouts, terminologies, and degrees of cleanliness that make them difficult to integrate.

Users of these data face three main challenges. The first one is finding datasets related to a feature or topic of interest. For example, climate scientists often look for years of observational data from authoritative sources when estimating the climate of a region. The second challenge is how to complete a given dataset with existing knowledge: machine learning applications are data hungry and require as many data points and features as possible to improve their predictions, which often requires integrating data from different sources. The final challenge is sharing integrated results: once several datasets have been merged, how can they be made available to the rest of the community?

Knowledge graphs have become the preferred technology to address these challenges. Large organizations, including search engine providers, shopping giants, and finance institutions, are investing in large knowledge graphs to integrate and retrieve heterogeneous data. However, data integration pipelines are usually created manually, require significant expertise, and are seldom available to the general public. Similarly, linking to existing datasets in the Linked Open Data Cloud (https://lod-cloud.net/) usually requires the expertise of a knowledge engineer to properly identify the appropriate target instances to link to in other datasets.

Recent initiatives such as Data.world (https://data.world/), Google Dataset Search [2], and DataCommons (https://datacommons.org/) aim to facilitate linking, searching, and accessing some of the contents of these knowledge graphs. However, much work remains to automatically link, extend, and integrate tabular data into existing open knowledge graphs. In this paper we propose WDPlus, a framework designed to address these challenges by leveraging Wikidata (http://wikidata.org/) [5], an open crowdsourced knowledge graph with over 60 million entities, over 700 million statements describing those entities, and a thriving community of curators.

2 WDPLUS FRAMEWORK
WDPlus is a framework designed to explore and build large multi-domain knowledge graphs using the wealth of structured data available on the web. Our approach shifts the burden of semantic linking away from data publishers to communities that have an incentive to do so in a collaborative manner.

An overview of WDPlus can be seen in Figure 1. At its core, WDPlus relies on Wikidata [5]. A series of extensions to Wikidata, i.e., the Wikidata satellites (represented as red circles in the figure), surround the core, enabling different organizations to create multi-domain knowledge graphs tailored to their needs. A metadata index stores table metadata to facilitate retrieving related datasets and other candidates with potential to expand or become a Wikidata satellite.

Figure 1: Overview of the WDPlus framework

New data from the web may be processed through our proposed WDPlus toolkit, designed to automatically create prototype table models that can be extended with existing Wikidata knowledge and be refined into full Wikidata extensions. We provide more information about our toolkit, table metadata index, and satellites below.

2.1 The WDPlus Toolkit
In order to process and link tables and spreadsheets, we created a toolkit with the following capabilities:

• Entity Linking: One of the main challenges when linking a dataset to existing knowledge graphs is finding whether the entities described in the dataset are already defined in a target knowledge graph. To facilitate this task, we have created a CSV Wikifier (inspired by [1]) to link tabular data to existing Wikidata entities and disambiguate them based on their most common class. Early results on the ISWC 2019 cell-entity annotation challenge (https://bit.ly/2NefkLq) rank our system among the top performers, with over 0.9 precision and over 0.85 F1 score.
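As an illustration of the class-based disambiguation idea described above, the following Python sketch (ours, not the WDPlus implementation) picks, for each cell in a column, the candidate whose class is most common across the whole column. Candidate generation (e.g., via Wikidata's wbsearchentities API) and the lookup of each candidate's class (P31) are assumed to have happened upstream.

```python
from collections import Counter

def disambiguate_column(candidates_per_cell):
    """Disambiguate one table column by majority class.

    candidates_per_cell: one list of candidates per cell, where each
    candidate is a dict such as {"qid": "Q90", "class": "Q515"}.
    Returns one chosen candidate per cell (or None for empty cells),
    preferring candidates of the column's most frequent class.
    """
    # Count how often each class appears among all candidates in the column.
    class_counts = Counter(
        c["class"] for cands in candidates_per_cell for c in cands
    )
    if not class_counts:
        return [None] * len(candidates_per_cell)
    best_class = class_counts.most_common(1)[0][0]

    chosen = []
    for cands in candidates_per_cell:
        # Prefer a candidate of the majority class; otherwise fall back
        # to the top-ranked candidate for that cell.
        match = next((c for c in cands if c["class"] == best_class), None)
        chosen.append(match if match is not None else (cands[0] if cands else None))
    return chosen
```

For example, in a column of city names, "Paris" might have both the city (Q90, class Q515) and the film (class Q11424) as candidates; if the other cells resolve to cities, the majority class Q515 selects Q90.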
• Interactive Table Understanding: Our toolkit includes a GUI and defines a mapping language [4] to assimilate data from the web as entities and statements compatible with Wikidata. Figure 2 shows a snapshot of the application, which depicts the table to map on the left side of the figure and the currently specified mapping on the right side. The toolkit allows capturing all the qualifiers associated with a statement, representing the provenance, location, timeliness, units, and roles of each assertion. These are represented using different colors in the figure (statement values in green, qualifiers in red, and the source of the statements in blue). Table models used for converting tabular data are saved and linked as part of the WDPlus framework.

• RDF Generation: Given a table model and a target table, our toolkit includes the means to generate RDF triples that follow the Wikidata data model. These triples can then be browsed and loaded in a Wikidata satellite. WDPlus also mints new triples when new entities or properties do not exist in Wikidata.

The WDPlus toolkit is available online with an MIT License (https://github.com/usc-isi-i2/t2wml).

2.2 Creating a Metadata Index of Tabular Data
We have created an index to store table metadata, where each record is an instance of the Wikidata Dataset class (Q1172284). The rationale for the metadata index is to contain datasets that may not necessarily be materialized as a knowledge graph, but that would be interesting resources to link and extend other datasets. Our metadata schema relies on Wikidata and Schema.org [3], using terms such as title, data download, license, website source, variables, etc. Given a table of interest, WDPlus executes an automated entity linking process using our wikifier and creates an entry in the metadata index. Each entry includes a record of the main distinct Wikidata entities identified in the entity linking process and their labels, which serve to inform the search for related datasets.

The metadata index connects heterogeneous tables to Wikidata, and therefore we use it for data augmentation. For example, a user with a table containing demographic information by country may be interested in adding climate observations to find correlations between population and temperature. Thus, given a dataset to augment (e.g., a CSV with city name and population) and a search term (e.g., temperature), the metadata index returns a ranked selection of datasets that may complement the target dataset with additional metadata. Once a user selects a result dataset, WDPlus automatically adds a new column to the dataset to complete, filling in information for every row from Wikidata or materializing the search results. Missing values are currently not imputed. The API for the metadata index can be found online (http://dsbox02.isi.edu:9000/apidocs/).

2.3 Augmenting Wikidata with Satellites
Users may want to contribute tabular datasets to WDPlus as an extension of Wikidata entities. This process can be accomplished in two steps. The first one consists of an interactive table understanding step, where the user is presented with the GUI shown in Figure 2, containing entity linking candidates for all cells included in the target table. The GUI then helps users define a table model, indicating how to map a target table to the Wikidata data model. We collect all qualifiers for each assertion (e.g., source, point in time and space, units, etc.), as these are crucial for ensuring data quality and trust.

Figure 2: Interactive table understanding interface to link datasets to Wikidata records

The second step generates RDF triples from the table, storing them in a Wikidata satellite using the WDPlus toolkit. WDPlus defines a special range of URIs for each satellite, which are separated in graphs.
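A minimal sketch of such a URI scheme follows; the satellite host names and graph identifiers are hypothetical placeholders of our own, not WDPlus's actual URI ranges.

```python
# Illustrative sketch: each satellite owns a URI range and a named graph,
# so satellite-minted entities never collide with Wikidata QIDs or with
# entities from other satellites. Host names here are assumptions.
WIKIDATA_NS = "http://www.wikidata.org/entity/"

def satellite_uri(satellite, local_id):
    """Mint a URI inside a satellite's own range (hypothetical scheme)."""
    return f"https://{satellite}.example.org/entity/{local_id}"

def graph_for(uri, satellites):
    """Route a subject URI to its named graph: Wikidata core or a satellite."""
    if uri.startswith(WIKIDATA_NS):
        return "urn:graph:wikidata-core"
    for s in satellites:
        if uri.startswith(f"https://{s}.example.org/"):
            return f"urn:graph:satellite:{s}"
    raise ValueError(f"URI not in any registered range: {uri}")
```

Under this scheme, a triple about the Wikidata entity Q90 is routed to the core graph, while a new entity minted by a hypothetical "climate" satellite lands in that satellite's own graph.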
These graphs may be stored in separate triplestores, allowing Wikidata to grow in a decentralized manner while keeping all related pointers in the same metadata index. Wikidata satellites may define new entities and properties that are not part of Wikidata. However, our early entity linking process minimizes the creation of duplicate entities, ensuring that all satellites are linked together.

A key aspect of WDPlus is that we store all curated table models in our metadata index for each transformed dataset. This helps keep the provenance of all results in a satellite and may inform the transformation of other tabular data with a very similar structure. For example, in the US, demographic data is usually provided for each county as a CSV file. A single table model for one county can be used to transform the county CSVs of the whole country.

3 CONCLUSIONS
This paper introduces WDPlus, a framework that leverages Wikidata to link, search, and augment tabular data. While the framework is still under development, our WDPlus prototype integrates heterogeneous data from the crime, economic, and education domains, illustrating the feasibility of our approach.

REFERENCES
[1] Brank, J., Leban, G., and Grobelnik, M. Annotating documents with relevant Wikipedia concepts. In Proceedings of the Slovenian Conference on Data Mining and Data Warehouses (SiKDD 2017) (2017).
[2] Brickley, D., Burgess, M., and Noy, N. Google Dataset Search: Building a search engine for datasets in an open web ecosystem. In The World Wide Web Conference (New York, NY, USA, 2019), WWW '19, ACM, pp. 1365–1375.
[3] Guha, R. V., Brickley, D., and Macbeth, S. Schema.org: Evolution of structured data on the web. Commun. ACM 59, 2 (Jan. 2016), 44–51.
[4] Szekely, P., Garijo, D., Pujara, J., Bhatia, D., and Wu, J. T2WML: A cell-based language to map tables into Wikidata records. To appear in Proceedings of the 2019 International Semantic Web Conference (2019).
[5] Vrandečić, D., and Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 57, 10 (Sept. 2014), 78–85.