<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>deepschema.org: An Ontology for Typing Entities in the Web of Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Catasta</string-name>
          <email>Michele.Catasta@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karl Aberer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panayiotis Smeros</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EPFL</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Class Hierarchy</kwd>
        <kwd>Taxonomy</kwd>
        <kwd>Ontology</kwd>
        <kwd>Wikidata</kwd>
        <kwd>schema.org</kwd>
        <kwd>Data Extraction</kwd>
        <kwd>Data Integration</kwd>
        <kwd>Entity Typing</kwd>
      </kwd-group>
      <abstract>
        <p>Discovering the appropriate type of an entity in the Web of Data is still considered an open challenge, given the complexity of the many tasks it entails. Among them, the most notable is the definition of a generic and cross-domain ontology. While the ontologies proposed in the past function mostly as schemata for knowledge bases of different sizes, an ontology for entity typing requires a rich, accurate and easily-traversable type hierarchy. Likewise, it is desirable that the hierarchy contains thousands of nodes and multiple levels, contrary to what a manually curated ontology can offer. Such a level of detail is required to describe all the possible environments in which an entity exists. Furthermore, the generation of the ontology must follow an automated fashion, combining the most widely used data sources and following the speed of the Web. In this paper we propose deepschema.org, the first ontology that combines two well-known ontological resources, Wikidata and schema.org, to obtain a highly-accurate, generic type ontology which is at the same time a first-class citizen in the Web of Data. We describe the automated procedure we used for extracting a class hierarchy from Wikidata and analyze the main characteristics of this hierarchy. We also provide a novel technique for integrating the extracted hierarchy with schema.org, which exploits external dictionary corpora and is based on word embeddings. Finally, we present a crowdsourcing evaluation which showcases the three main aspects of our ontology, namely the accuracy, the traversability and the genericity. The outcome of this paper is published under the portal: http://deepschema.github.io.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS Concepts</title>
      <p>• Information systems → Information integration; Data
extraction and integration;</p>
      <p>Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).</p>
      <p>LDOW '17. 3 April, 2017. Perth, WA, Australia.
© 2017 Copyright held by the owner/author(s).</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>The definition of a generic and cross-domain ontology that
describes all the types of the entities of the Web is considered
a very challenging task. In the past, many approaches
that tried to address this problem proposed either manually
curated ontologies or static schemata extracted from existing
knowledge bases. However, both of these approaches have
their deficiencies. A proper ontology for entity typing
requires a rich, accurate and easily-traversable type hierarchy.
Likewise, it is desirable that this hierarchy contains
thousands of nodes and multiple levels. Such a level of detail is
required to describe all the possible environments in which
an entity exists. Furthermore, the generation of the
ontology must follow an automated fashion, combining the most
widely used data sources and following the speed of the Web.</p>
      <p>
        Currently, the most well-supported knowledge base and
schema providers are Wikidata1 and schema.org2.
Wikidata is an initiative of the Wikimedia Foundation, serving as
the central repository for the structured data of its projects
(e.g., for Wikipedia). Wikidata is also supported by Google,
which decided to shut down its related project (Freebase3)
in the middle of 2015 and has since put a lot of
effort into migrating the existing knowledge to Wikidata [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
On the other hand, schema.org is an initiative of four
sponsoring companies (Google, Microsoft, Yahoo and Yandex),
also supported by W3C, that aims at creating schemata
that describe structured data on the web.
      </p>
      <p>
        Both of these projects are trying to handle the plethora
of heterogeneous, structured data that can be found on the
web. Wikidata acts as a centralized data repository with
a decentralized, community-controlled schema with millions
of daily updates4. By contrast, schema.org proposes a very
strict and rarely-updated schema, which is widely used by
billions of pages across the web [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These two
approaches are considered complementary5. By bringing them
closer and unifying them, we form a rich, multi-level class
hierarchy that can describe millions of entities in the Web
of Data.
      </p>
      <sec id="sec-2-1">
        <title>Footnotes</title>
        <p>1http://wikidata.org
2http://schema.org
3http://freebase.com
4http://en.wikipedia.org/wiki/Wikipedia:Statistics
5http://meta.wikimedia.org/wiki/Wikidata/Notes/
Schema.org and Wikidata</p>
        <p>
          Such a class hierarchy would be a very useful tool for many
applications. For instance, in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] the authors propose
TRank, an algorithm for ranking entity types given an
entity and its context. At the heart of TRank, a reference type
hierarchy is traversed and the appropriate set of types for
each entity is obtained. This type hierarchy combines
information mostly from YAGO6 and DBpedia7. However, neither
of these two data sources seems to suffice for the specific
task of entity typing.
        </p>
        <p>On the one hand, YAGO's taxonomy inherits the
class modeling of its sources (i.e., Wikipedia
Categories8 and WordNet9). Thus, nodes like
wikicat_People_murdered_in_British_Columbia and
wordnet_person_100007846 are included in the taxonomy10,
making it hard to traverse. DBpedia's ontology,
on the other hand, has a manually-curated and meaningful
class hierarchy. Its volume, though (only 685 classes), makes
it inappropriate for accurately describing the millions of
entities existing on the Web.</p>
        <p>
          In another recent work, a knowledge graph named
VoldemortKG was proposed [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. VoldemortKG aggregates
entities scattered across several web pages, which have both
schema.org annotations and text anchors pointing to their
Wikipedia page. Since entities are always accompanied by a
schema, an ontology which contains the combined class
hierarchy of the aforementioned data sources would complement
this knowledge graph and increase its value.
        </p>
        <p>In this paper we propose deepschema.org, the first
ontology that combines two well-known ontological resources,
Wikidata and schema.org, to obtain a highly-accurate,
generic type ontology which is at the same time a first-class
citizen in the Web of Data.</p>
        <p>The main contributions of this paper are the following:
i) the automated extraction procedure of the class
hierarchy of Wikidata, which is based on RDFS entailment
rules; ii) the analysis of the main characteristics of this
hierarchy, namely the structure, the instances, the language
and the provenance; iii) a novel technique for the integration of the
extracted hierarchy with schema.org, which exploits
external dictionary corpora and is based on word
embeddings; and iv) the crowdsourced evaluation of the unified ontology,
which showcases its three main aspects,
namely the accuracy, the traversability and the
genericity.</p>
        <p>The source code, the produced ontology and all the details
about reproducing the results of this paper are published
under the portal: http://deepschema.github.io.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Footnotes</title>
        <p>6http://www.yago-knowledge.org
7http://dbpedia.org
8http://en.wikipedia.org/wiki/Wikipedia:Categorization
9http://wordnet.princeton.edu
10http://resources.mpi-inf.mpg.de/yago-naga/yago/
download/yago/yagoTaxonomy.txt</p>
        <p>The structure of the rest of the paper is organized as
follows. In Section 2 we survey the related work while in
Section 3 we provide more details on the Wikidata class
hierarchy. In Section 4 we present the methods that we use
for integrating Wikidata and schema.org. In Section 5 we
describe the implementation and analyze the basic
characteristics of the unified ontology. Finally, in Section 6 we
evaluate the proposed methods and in Section 7 we conclude
this work by discussing future directions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. RELATED WORK</title>
      <p>The related work of this paper includes general-purpose
knowledge bases whose schemata comprise class-hierarchical
information, as well as approaches that integrate such
knowledge bases.</p>
      <p>
        As mentioned above, Wikidata [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is a community-based
knowledge base, i.e., users can collaboratively add and edit
information. Wikidata is also multilingual, with the labels,
aliases, and descriptions of its entities provided in more
than 350 languages. A new dump of Wikidata is created
every week and is distributed in JSON and, experimentally,
in XML and RDF formats. All structured data from the
main and the property namespaces is available under the
Creative Commons Public Domain Dedication License version
1.0 (CC0 1.0).
      </p>
      <p>
        On the other hand, schema.org [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provides a vocabulary
which is widely used for annotating web pages and emails.
This vocabulary is distributed in various formats (e.g.,
RDFa, Microdata and JSON-LD). The sponsors' copyrights
in the schema are licensed to website publishers and other
third parties under the Creative Commons
Attribution-ShareAlike License version 3.0 (CC BY-SA 3.0).
      </p>
      <p>
        The most well-known component of the LOD cloud is
DBpedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It contains information which is automatically
extracted mainly from the infobox tables of Wikipedia pages.
Since it plays a central role in the LOD cloud, DBpedia is the
main hub to which many other datasets link. The dataset
is updated roughly once a year, whereas there is also a
live version [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] that continuously synchronizes DBpedia with
Wikipedia. The data format used is RDF and
the publishing license is CC BY-SA 3.0.
      </p>
      <p>
        Freebase [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a user-contributed knowledge base which
integrates data from various data sources, including Wikipedia
and MusicBrainz. As stated before, Freebase has now been
shut down and partially integrated into Wikidata. All the dumps
of the dataset are published in the RDF format under the
Creative Commons Attribution Generic version 2.5 (CC BY
2.5) license.
      </p>
      <p>
        Another dataset that comprises information extracted
from Wikipedia, WordNet and Geonames is YAGO [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
The current version of YAGO has knowledge of more than
10 million entities (like persons, organizations, cities, etc.)
assigned to more than 350,000 classes. All four dumps
created by YAGO are distributed in the RDF and TSV data
formats under the license CC BY-SA 3.0.
      </p>
      <p>
        Wibi, the Wikipedia Bitaxonomy project [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] also induces
a large-scale taxonomy for categories from the Wikipedia
categories network. Wibi is based on the idea that
information contained in Wikipedia pages is beneficial towards
the construction of a taxonomy of categories and vice-versa.
The most recent effort towards taxonomy induction over
Wikipedia [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposes a unified taxonomy from Wikipedia,
with pages as leaves and categories as higher-level nodes,
using a novel set of high-precision heuristics.
      </p>
      <p>Table 1: Axiomatic Triples and Entailment Rules.
(rdfs:subClassOf, rdfs:domain, rdfs:Class)
(rdfs:subClassOf, rdfs:range, rdfs:Class)
(rdf:type, rdfs:range, rdfs:Class)
rdfs2: ((A, rdfs:domain, B) ∧ (C, A, D)) ⇒ (C, rdf:type, B)
rdfs3: ((A, rdfs:range, B) ∧ (C, A, D)) ⇒ (D, rdf:type, B)
rdfs9: ((A, rdfs:subClassOf, B) ∧ (C, rdf:type, A)) ⇒ (C, rdf:type, B)</p>
      <p>
        Regarding the integration of such knowledge bases, many
approaches have been proposed [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. One interesting work
that combines two of the aforementioned datasets is PARIS
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In this work the authors present probabilistic
techniques for the automatic alignment of ontologies, not
only at the instance level but also at the schema level. The
precision they achieve when interconnecting DBpedia and
YAGO reaches 90%.
      </p>
      <p>
        The authors of YAGO [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] also proposed a technique
for constructing an augmented taxonomy derived from
Wikipedia and WordNet. The Wikipedia categories have
a hierarchical structure which contains more thematic
than ontological information (e.g., the category Football in
France). Hence, the authors extract only the leaf categories,
which semantically are closer to the notion of ontology classes.
Then they align these categories with WordNet terms using
string similarity methods which achieve a precision of around
95%. Finally, they exploit the WordNet hyponym relation
in order to construct the unified ontology.
      </p>
      <p>The integration technique that we propose is based on
word embeddings (Section 4) and, despite its simplicity, it
discovers alignments with an accuracy comparable to
that achieved by the two methods above (91%).</p>
    </sec>
    <sec id="sec-4">
      <title>3. WIKIDATA</title>
      <p>Wikidata is the main data source that we employ in
deepschema.org. In this section we describe the methods for
extracting a class hierarchy from Wikidata and we analyze
the characteristics of this hierarchy. The described
methods are not tightly coupled with a specific version of the
data source; however, in the context of this paper we use
the Wikidata 20160208 JSON dump.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Class Hierarchy Extraction</title>
      <p>The Wikidata JSON dump does not contain explicit
information about the schema that accompanies the data.
Every line of the dump consists of a unique entity and its
attributes, described in a compact form11. The entities that
represent classes and the entities that represent instances of
classes are not distinguished in the dataset. Thus we have
to apply semantic rules in order to extract them.</p>
      <sec id="sec-5-1">
        <title>3.1.1 Semantic Rules</title>
        <p>The rules that we apply to extract the taxonomy are based
on the three axiomatic RDFS triples and the RDFS
entailment rules 2 and 3 provided in Table 1. Intuitively, these
rules imply that if X is of type Y, then Y is a class and if Z is
a subclass of W, then Z and W are classes and the subclass
relation holds between them.</p>
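        <p>To make the rules concrete, the following is a minimal Python sketch (not the actual implementation) of how they identify classes and subclass edges in a stream of (subject, property, value) statements; the triple list is illustrative, with P279/P31 standing for Wikidata's subclass of / instance of properties.</p>

```python
# Sketch of the class-extraction rules over toy
# (subject, property, value) statements.
def extract_classes(triples):
    classes, subclass_of = set(), set()
    for s, p, o in triples:
        if p == "P279":        # s subClassOf o => both s and o are classes
            classes.add(s)
            classes.add(o)
            subclass_of.add((s, o))
        elif p == "P31":       # s instanceOf o => o is a class
            classes.add(o)
    return classes, subclass_of

# Illustrative statements: Quentin Tarantino (Q3772) instanceOf
# human (Q5); human subClassOf person (Q215627).
triples = [
    ("Q3772", "P31", "Q5"),
    ("Q5", "P279", "Q215627"),
]
classes, edges = extract_classes(triples)
```

<p>Here Q3772 is never added as a class: only objects of P31 and both ends of P279 qualify.</p>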
        <p>Wikidata does not contain any rdfs:subClassOf or
rdf:type properties, but it considers the properties P279
(subclass of) and P31 (instance of) as their equivalents (i.e.,
they have the same semantics). Hence we can apply the
previous rules on these properties in order to extract the
hierarchy.</p>
      </sec>
      <sec id="sec-5-2">
        <title>3.1.2 Filtering Phase</title>
        <p>As we will explain below, the raw form of the extracted
hierarchy does not satisfy our requirements. Hence, we
introduce a filtering phase in which we focus on two main
aspects: i) domain-specific data sources and ii)
non-English labeled classes.</p>
        <p>Domain-Specific Data Sources. One of the main
challenges for deepschema.org is genericity. Since data sources
that apply to very narrow domains are imported into
Wikidata, we introduce a filter with which we cleanse our hierarchy
of such domain-specific information. As we discuss below,
we had to drop more than 75% of the extracted
information in favor of keeping the hierarchy satisfyingly generic.</p>
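        <p>As a rough sketch (not the actual Wikidata Toolkit implementation), the two filters of this phase could be expressed as a single predicate, assuming each class record carries its English label (possibly missing) and the values of the provenance properties P143 (imported from) / P248 (stated in); the class records below are illustrative.</p>

```python
# Illustrative set of domain-specific sources to prune.
DOMAIN_SPECIFIC_SOURCES = {"NCBI", "UniProt", "Ensembl", "Mindat"}

def keep_class(cls):
    # Filter ii): drop classes without an English label.
    if not cls.get("label_en"):
        return False
    # Filter i): drop classes imported from domain-specific sources,
    # based on the provenance properties P143 / P248.
    provenance = set(cls.get("P143", [])).union(cls.get("P248", []))
    return provenance.isdisjoint(DOMAIN_SPECIFIC_SOURCES)

classes = [
    {"label_en": "building", "P143": ["English Wikipedia"]},
    {"label_en": "protein family", "P143": ["UniProt"]},
    {"P143": ["English Wikipedia"]},  # no English label
]
kept = [c for c in classes if keep_class(c)]
```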
        <p>In order to track the provenance of the classes, we
exploit the respective properties supported by Wikidata. The
most widely-used provenance properties are P143
(imported from) and P248 (stated in). What we discovered
is that many classes were imported to Wikidata from
biological, chemical and mineral knowledge bases (e.g., NCBI12,
UniProt13, Ensembl14 and Mindat15). We consider these
classes as very domain-specific, in terms of the objective of
our hierarchy, and thus we apply a filter that prunes them.
11More details about the data model of Wikidata can
be found here: http://www.mediawiki.org/wiki/Wikibase/
DataModel/Primer.</p>
        <p>Non-English Labeled Classes. Another filter that we
apply is based on the language of the labels of the extracted
classes. As stated above, schema.org is expressed in
English whereas Wikidata is multilingual. Including
classes from Wikidata that do not contain
English labels tangibly reduces the accuracy of our
integration techniques (described in Section 4). Hence, we eliminate
classes that do not fulfill this condition.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.2 Analysis of the Class Hierarchy</title>
      <p>In analyzing the characteristics of the extracted class
hierarchy we focus on four main aspects: i) what is the structure
of the hierarchy, ii) how the classes are populated with
instances, iii) what is the distribution of their labels' languages
and iv) what is their provenance.</p>
      <sec id="sec-6-1">
        <title>3.2.1 Structure</title>
        <p>12http://www.ncbi.nlm.nih.gov
13http://www.uniprot.org
14http://www.ensembl.org
15http://www.mindat.org</p>
        <p>The overall statistics of the hierarchy are summarized in</p>
      </sec>
      <sec id="sec-6-2">
        <title>3.2.2 Instances</title>
        <p>A large number of Wikidata classes are accompanied by
instances. If a class contains instances, then these are also
inherited by all its superclasses, because of the transitivity
of the relation subclass of.</p>
        <p>Based on this property and the RDFS entailment rule 9
(Table 1) and assuming that P 31 (instance of ) and rdf :type
relations are equivalent, we managed to extract the
instances, direct and inherited, of the Wikidata classes.</p>
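        <p>A minimal sketch of this propagation, assuming the subclass relation is given as an adjacency map from a class to its direct superclasses; the class and instance names below are illustrative.</p>

```python
# Propagate instances along "subclass of", per entailment rule rdfs9:
# an instance of a class is also an instance of all its superclasses.
from collections import defaultdict

def propagate_instances(direct_instances, subclass_of):
    inherited = defaultdict(set)

    def superclasses(c, seen):
        # Transitive closure of the superclass relation.
        for sup in subclass_of.get(c, ()):
            if sup not in seen:
                seen.add(sup)
                superclasses(sup, seen)
        return seen

    for cls, insts in direct_instances.items():
        inherited[cls] |= insts
        for sup in superclasses(cls, set()):
            inherited[sup] |= insts
    return inherited

subclass_of = {"human": {"person"}, "person": {"entity"}}
direct = {"human": {"Quentin Tarantino"}}
inherited = propagate_instances(direct, subclass_of)
```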
        <p>However, this approach does not discover all the
underlying instances, because not all the existing classes are linked
to their instances with the relation P31. For example,
Quentin Tarantino's Wikidata entry16 is an instance of the
class Human, whereas it is also connected with the class
film director via the relation P106 (occupation). Hence,
we observe that subclasses of the class Human that denote
occupations are not as well-populated as their superclass.</p>
        <p>On the other hand, if we try to add instances to classes
independently of the relation that interconnects them, we
include a lot of noise in the extracted hierarchy. In the same
example, Quentin Tarantino would be an instance of the
class English, because he is connected with it by the relation
P1413 (languages spoken, written or signed). One solution
to this problem is to involve domain-expert users in the
procedure. These experts would decide or verify the relations
that are eligible for interconnecting classes with instances
(e.g., the relation occupation). However, the fact that our
hierarchy contains thousands of classes and relations, deriving
from many different domains, makes this solution infeasible.
Also, this involvement would undermine the automated fashion
in which we want to build our hierarchy.</p>
        <p>Some interesting statistics about the instances of the
Wikidata classes are presented in Figure 1 and summarized
below:</p>
        <p>The class with the most instances is the class Entity17
(15M instances). Entity, as well as the following
top-50 classes, is an abstract class, which means that most
of its instances are inherited from subclasses based on
the aforementioned rule.</p>
        <p>Among the classes with direct instances, Human is the
top one with 3M instances. What is remarkable is
the fact that Wikidata uses the Human and not the
Person class for people; Person is anything that can
bear a personality, e.g., an artificial agent.</p>
        <p>Other well-populated classes are the Animal class (3M
instances), the Organization class (2.5M instances),
including businesses, clubs and institutions, and the Art
Work class (1.5M instances), including music albums
and movies.</p>
      </sec>
      <sec id="sec-6-3">
        <title>3.2.3 Language</title>
        <p>
          Wikidata follows a language-agnostic model according to
which the identifiers of the entities intentionally consist of
a character and a number. The multilingual support lies in
the labels of these entities, which are expressed in different
languages. Currently, Wikidata contains entities in more
than 350 languages [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>In Figure 2 we can see the coverage of the label languages
of the classes that were extracted from Wikidata. As we
explained above, as a design choice we discard classes that
do not have an English label; thus the coverage of the
English language is 100%. Interestingly, the next
language is French, with only 55% coverage. Since
English is the dominant language of our hierarchy, if we
choose to export it in any other language (e.g., export only
classes that have a French label), we lose at least around one
half of the information that we have acquired.
16http://www.wikidata.org/wiki/Q3772
17http://www.wikidata.org/wiki/Q35120</p>
      </sec>
      <sec id="sec-6-4">
        <title>3.2.4 Provenance</title>
        <p>Provenance information is very useful for crowdsourced
knowledge bases like Wikidata, because we can easily discard
needless parts (as we did in the filtering phase above). As
we can see in Figure 3, the main external contributor to
the class hierarchy of Wikidata is Freebase, with more than
40K classes. Then comes English Wikipedia with almost
30K classes, followed by Wikipedias and libraries in many other
languages. DBpedia and schema.org have a few equivalent-class
links to Wikidata, whereas almost half of the classes
do not comprise provenance information and thus their
source is unknown.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4. INTEGRATION WITH schema.org</title>
      <p>In this section we describe the process of integrating the
aforementioned Wikidata class hierarchy with the schema
provided by schema.org. In the context of this paper we
used the 2.2 JSON release of schema.org.</p>
      <p>
        We introduce several heuristics to perform the integration
between Wikidata and schema.org (Figure 4). Each
heuristic returns a candidate set of pairs of Wikidata nodes and
schema.org nodes, which are considered either as equivalent
or the one as a subclass of the other. The heuristics use
distributed vector representations of words computed by GloVe
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to measure the similarity between words. The
heuristics are described below:
      </p>
      <p>Exact Match. Maps a Wikidata node to a node in
schema.org if they have the same labels. For example,
the Wikidata node with label "hospital" is mapped to the
schema.org node with label "Hospital".</p>
      <p>
        Lemma Match. Maps a Wikidata node to a node in
schema.org if they have the same labels after
lemmatization. WordNet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is used as a source of
lemmatizations. For example, the label "Cricket players"
is converted after lemmatization into the label "cricket
player". If the label of a node contains more than one
word, then the node is lemmatized per token.
      </p>
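      <p>A sketch of the per-token lemma match follows; for simplicity, the tiny hand-made lemma table stands in for the WordNet lemmatizer used in the paper.</p>

```python
# Illustrative lemma table standing in for WordNet lemmatization.
LEMMAS = {"players": "player", "languages": "language"}

def lemmatize(label):
    # Lemmatize per token, as described for multi-word labels.
    return " ".join(LEMMAS.get(tok, tok) for tok in label.lower().split())

def lemma_match(wikidata_label, schema_label):
    return lemmatize(wikidata_label) == lemmatize(schema_label)

match = lemma_match("Cricket players", "cricket player")
```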
      <p>Single-word Similarity. Maps a Wikidata node W
to a schema.org node S if the labels of both W and S have
only one word and the cosine similarity between their
GloVe vectors is greater than a fixed threshold (Ts).
For example, the Wikidata node with label "warehouse"
is mapped to the schema.org node with label "Store",
because the cosine similarity between the GloVe vectors for
"warehouse" and "store" is greater than Ts = 0.8.</p>
      <p>Exact Head Match. Maps a Wikidata node to a
schema.org node if the head18 of the label of the Wikidata
node matches the label of the schema.org node exactly or
after lemmatization. For example, the Wikidata node with
label "Kalapuyan languages" is mapped to schema.org
as a subclass of the node with label "Language".</p>
      <p>Head Similarity. Maps a Wikidata node to a
schema.org node if the cosine similarity between the GloVe
vectors of the heads of their labels is greater than Ts.
For example, the Wikidata node with label "survey motor
boat" is mapped to schema.org as a subclass of the
node with label "Vessel", based on the cosine similarity
between "boat" and "vessel".
18The head is computed as the last token of the label before "of";
e.g., the head of "national football leagues" is "leagues" and the head
of "national football leagues of south" is "leagues" as well.</p>
      <p>Instance Similarity. Maps a Wikidata node W to a
schema.org node S if the average cosine similarity
between the instances of W and the label of S is greater
than Ts. This heuristic improves the coverage of our
approach by mapping nodes which would otherwise be
unrelated based on their corresponding labels.</p>
      <p>Subclass Similarity. Similar to the previous
heuristic, maps a Wikidata node W to a schema.org node S
if the average cosine similarity between the subclasses
of W and the label of S is greater than Ts.</p>
      <p>These heuristics result in pairs of Wikidata and
schema.org nodes which are mapped to each other. In
Table 3 we can see the number of pairs with respect
to the different values of the threshold Ts.</p>
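      <p>The vector-based heuristics above can be sketched as follows; the two-dimensional vectors are illustrative stand-ins for the pre-trained GloVe vectors, and the head() helper mirrors the head computation of footnote 18.</p>

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def head(label):
    # Head = last token of the label before "of" (footnote 18).
    tokens = label.lower().split()
    if "of" in tokens:
        tokens = tokens[:tokens.index("of")]
    return tokens[-1]

# Toy 2-dimensional "embeddings" for illustration only.
VECTORS = {"boat": [0.9, 0.1], "vessel": [0.8, 0.2], "car": [0.1, 0.9]}

def head_similarity(wikidata_label, schema_label, threshold=0.8):
    # Head Similarity heuristic: compare the heads of the two labels.
    u = VECTORS[head(wikidata_label)]
    v = VECTORS[head(schema_label)]
    return cosine(u, v) >= threshold
```

<p>With these toy vectors, "survey motor boat" maps under "Vessel" but not under "Car".</p>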
    </sec>
    <sec id="sec-8">
      <title>5. IMPLEMENTATION</title>
      <p>For the processing of the JSON dump of Wikidata and the
extraction of the class hierarchy, we extended the Wikidata
Toolkit19 that is officially released and supported by
Wikidata. In order to decrease the number of iterations through
the dump, we follow a light-weight, in-memory approach in
which we keep maps with the ids and the labels of the
discovered classes and instances, as well as with the relations
among them. We also pipeline, where possible, the
Extraction and Filtering phases. The user can choose the
Wikidata dump which will be processed, turn on/off the
various filters described above, and decide whether the output
of the process will be in JSON, RDF or TSV format.</p>
      <p>In order to analyze and compute various statistics about
the hierarchy, we then process it as a graph using the Apache
Spark GraphX library20 and the various analytics functions
that it supports.</p>
      <p>For the integration step, we used Word2Vec21, a two-layer
neural network that processes text. We also downloaded and
used the GloVe vectors trained on the Wikipedia 2014 and
Gigaword corpora22.
19https://github.com/Wikidata/Wikidata-Toolkit
20http://spark.apache.org/graphx
21http://deeplearning4j.org/word2vec
22http://nlp.stanford.edu/projects/glove</p>
      <p>Distribution. The license and the output format under
which deepschema.org is distributed are described as follows:
License. As mentioned in Section 2, Wikidata is
distributed under the CC0 1.0 License and schema.org
under the CC BY-SA 3.0 License. Since we combine
the two datasets, we chose to keep the most restrictive
license. Thus, deepschema.org is distributed under the
CC BY-SA 3.0 License.</p>
      <p>Output Format. deepschema.org is published
in various formats (JSON, RDF and TSV) which are
compatible with the most well-known ontology
engineering tools (e.g., Protege23).</p>
      <p>Releases. Since deepschema.org is generated
automatically, the tools described above can, in
principle, be executed with any underlying version of Wikidata
and schema.org. Wikidata is updated weekly, whereas
schema.org much more rarely; thus, we can potentially
release a new deepschema.org version every week.</p>
    </sec>
    <sec id="sec-9">
      <title>6. EVALUATION</title>
      <p>In this section we evaluate deepschema.org. Specifically,
with the approach that we follow we focus on three main
aspects of our ontology, namely i) the accuracy, ii) the
traversability and iii) the genericity. The platform that we
use in order to perform our crowdsourcing experiments is
CrowdFlower24.</p>
    </sec>
    <sec id="sec-10">
      <title>6.1 Accuracy</title>
      <p>In order to evaluate the accuracy we conducted a two-fold
experiment. Both of the tasks of the experiment (which we
describe in detail below) were designed to validate relations
between classes. In the first task we validate internal
relations within the employed data sources, while in the second
task we evaluate interlinks that we generated during the
integration phase (Section 4). We asked around 100 people
and the results were collated with majority voting (2 out of
3).
23http://protege.stanford.edu
24http://www.crowdflower.com</p>
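      <p>The 2-out-of-3 majority voting can be sketched as follows; the question IDs and worker answers below are illustrative.</p>

```python
from collections import Counter

def majority_vote(judgments):
    # judgments: the three "yes"/"no" answers collected per question;
    # return the answer given by at least 2 of the 3 workers.
    label, count = Counter(judgments).most_common(1)[0]
    return label if count >= 2 else None

answers = {"Q1": ["yes", "yes", "no"], "Q2": ["no", "no", "no"]}
verdicts = {q: majority_vote(a) for q, a in answers.items()}
```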
      <p>Each question is a multiple-choice question in which we
provide the classes to be connected, along with the
suggested relation. Then, we request from the user to verify
the correctness of the provided relation. In order to avoid
ambiguities in the classes, we provide an additional
description and a web link for each class. One example question
that we asked the crowd is shown in Figure 5, where we
request the verification of the subClassOf relation for the
classes Google driverless car from Wikidata and Car from
schema.org.</p>
      <sec id="sec-10-1">
        <title>6.1.1 Wikidata accuracy</title>
        <p>In the first crowdsourcing task we assess the edge-level
accuracy of the hierarchy we extracted from Wikidata. Since
schema.org is generated and evaluated manually by
domain experts, we consider it 100% accurate and do
not involve the crowd in its assessment.</p>
        <p>For Wikidata, we extracted 1000 edges at random and
asked the crowd whether the relation between them (i.e.,
subClassOf) was meaningful. The reported accuracy that we
obtained was 92%. In order to validate these results, we asked
3 ontology experts to evaluate part of the same 1000 edges.
The average accuracy reported by the experts confirmed the
results of our crowdsourcing task.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Integration accuracy</title>
        <p>In the second task, we evaluated the output of the
integration phase, as described in Section 4. The output
consists of one class from Wikidata, one class from schema.org,
and the relation that we discovered between them. Since we
used various similarity thresholds in the integration phase,
we validated the accuracy for each of them.</p>
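        <p>A minimal sketch of such threshold-based matching follows; the toy 3-dimensional vectors and the matched labels are illustrative only, since the actual integration relies on pre-trained word embeddings (e.g., GloVe):

```python
import math

# Toy "embeddings" for class labels (illustrative, not real GloVe vectors).
embeddings = {
    "car":        [0.9, 0.1, 0.2],
    "automobile": [0.85, 0.15, 0.25],
    "banana":     [0.1, 0.9, 0.3],
}

def cosine(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def match(wikidata_labels, schema_labels, threshold):
    # Keep only class pairs whose similarity reaches the threshold.
    return [(w, s) for w in wikidata_labels for s in schema_labels
            if cosine(embeddings[w], embeddings[s]) >= threshold]

print(match(["automobile", "banana"], ["car"], threshold=0.9))
# [('automobile', 'car')]
```

Raising the threshold keeps only the closest label pairs, which is why accuracy grows with the threshold while the number of discovered links shrinks.
      </p>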
        <p>The results are summarized in Figure 6. As expected,
as the threshold increases, the integration heuristics
discover more accurate pairs of classes. For the thresholds 0.8
and 0.9 the accuracy reaches 91%. The output of this
experiment was also verified by the domain experts.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Traversability</title>
      <p>In order to evaluate the traversability of deepschema.org,
we measure the number of Wikidata leaf classes which have a
direct path to the root of schema.org. Since schema.org has
a tree structure, the problem reduces to finding a path to
any node of schema.org (by definition, there is always a
unique path from every node of a tree to its root). As we can
see in Figure 6, lower
similarity thresholds lead to the generation of more links, and
thus more paths connecting Wikidata and schema.org, at
the expense of accuracy.</p>
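      <p>The traversability check can be sketched as an upward breadth-first search along subClassOf edges; the toy graph and the "wd:"/"schema:" naming below are illustrative, not the actual deepschema.org data:

```python
from collections import deque

# Toy subClassOf graph: child -> list of parents.
parents = {
    "wd:driverless_car": ["wd:car"],
    "wd:car": ["schema:Car"],
    "wd:ngc_list_page": [],            # noise: no path to schema.org
    "schema:Car": ["schema:Vehicle"],
}

def reaches_schema(leaf):
    # Breadth-first search upward along subClassOf edges; a leaf is
    # traversable as soon as any schema.org class is reached, since the
    # tree structure guarantees a unique path from there to the root.
    queue, seen = deque([leaf]), {leaf}
    while queue:
        node = queue.popleft()
        if node.startswith("schema:"):
            return True
        for p in parents.get(node, []):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return False

print(reaches_schema("wd:driverless_car"))  # True
print(reaches_schema("wd:ngc_list_page"))   # False
```
      </p>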
      <p>The overall coverage we achieve is fairly low, and this is
explained mainly by the inelegant structure of Wikidata.
Both the schema and the instance information of Wikidata
are controlled by the crowd, and thus many of the classes
that are not covered by schema.org are in fact noise (i.e.,
they are incorrectly annotated as classes, or the partOf
relation is mistakenly interpreted as the subClassOf relation).
For example, the List of NGC objects (5501-5750)
(http://www.wikidata.org/wiki/Q836200), which
has been characterized as a class, is actually a part of the
List of NGC objects, which was imported to Wikidata from
Wikipedia (http://en.wikipedia.org/wiki/List_of_NGC_objects).</p>
      <p>Furthermore, in some other cases, Wikidata classes were
found to be more general, and thus there was no actual
superclass from schema.org to cover them besides top
classes like Thing (e.g., the class Child Abuse,
http://www.wikidata.org/wiki/Q167191). An easy
workaround would be to connect every "orphan" Wikidata
class to Thing. This would give us 100% coverage, but it was
out of the scope of this paper: our goal was to construct
deepschema.org with deep, traversable and meaningful paths,
at the cost of lower coverage.</p>
    </sec>
    <sec id="sec-12">
      <title>Genericity</title>
      <p>Another goal for deepschema.org was to make it generic
and applicable to multiple domains. One way to evaluate
this characteristic is to employ a widely-used English
dictionary and measure the coverage of its most frequent words
that denote classes. In our experiment we used the Oxford
3000 subset of the Oxford English Dictionary
(http://www.oxfordlearnersdictionaries.com/wordlist/english/oxford3000).</p>
      <p>The Oxford 3000 is a list of the 3000 most important
English words. The keywords of the Oxford 3000 have been
carefully selected by a group of language experts and
experienced teachers as the words which should receive priority
in vocabulary study because of their importance and
usefulness. Despite its educational nature, the Oxford 3000 gives
a good insight into the most commonly-used words in the
English language.</p>
      <p>Since this dictionary contains all parts of speech
(verbs, adjectives, etc.), and since classes are naturally
described by nouns or noun phrases, we manually filtered the
content of the Oxford 3000 and kept only the words annotated
as nouns and noun phrases.</p>
      <p>The coverage of the filtered dictionary by deepschema.org
is 81%. This confirms the generic nature and high
coverage of our ontology.</p>
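      <p>The filter-then-measure procedure can be sketched as follows; the word list and ontology labels are a toy fragment, not the real Oxford 3000 or deepschema.org data:

```python
# Toy fragment of a POS-annotated word list and an ontology label set.
oxford = [("run", "verb"), ("car", "noun"), ("happy", "adjective"),
          ("bank", "noun"), ("unicorned", "noun")]
ontology_labels = {"car", "bank", "vehicle"}

# Step 1: keep only the entries annotated as nouns.
nouns = [w for w, pos in oxford if pos == "noun"]

# Step 2: coverage = fraction of nouns that appear as a class label.
coverage = sum(1 for w in nouns if w in ontology_labels) / len(nouns)
print(f"{coverage:.0%}")  # 67%
```
      </p>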
    </sec>
    <sec id="sec-13">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper we proposed deepschema.org, the first
ontology that combines two well-known ontological resources,
Wikidata and schema.org, to obtain a highly-accurate,
generic type ontology which is at the same time a first-class
citizen in the Web of Data. We described the automated
procedure we used for extracting a class hierarchy from
Wikidata and analyzed the main characteristics of this hierarchy.
We also provided a novel technique for integrating the
extracted hierarchy with schema.org, which exploits external
dictionary corpora and is based on word embeddings. The
overall accuracy of deepschema.org, as reported by the
crowdsourcing evaluation, is more than 90%, comparable to the
accuracy of the similar approaches discussed in
Section 2. The evaluation of the traversability and the
genericity also showed very encouraging results, fulfilling the
requirements that we set out in the beginning.</p>
      <p>
        Future work will concentrate on employing more data sources
as components of deepschema.org (e.g., Facebook's Open
Graph). By adding such data sources, deepschema.org
will be established as the most generic and cross-domain
class hierarchy. As we showed in our evaluation, in spite
of the filtering phase that we introduced, our ontology still
contains a lot of noise. As future work we will extend these
filters in order to further cleanse the noise that is imported
mainly from Wikidata. Moreover, we will leverage the
richness of the multilingual labels in Wikidata to produce
versions of deepschema.org in multiple languages, although, as
we have discussed in Section 3, the knowledge included in
these versions will be limited. Finally, we will employ
deepschema.org in real-world use-cases, like the one presented in
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], in which we will showcase the improvement obtained
by using our ontology.
      </p>
    </sec>
    <sec id="sec-14">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work was partially supported by the project
"Exploring the interdependence between scientific and public
opinion on nutrition through large-scale semantic analysis"
from the Integrative Food Science and Nutrition Center
(http://nutritioncenter.epfl.ch).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          .
          <article-title>DBpedia: A nucleus for a Web of open data</article-title>
          .
          <source>In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , volume
          <volume>4825</volume>
          LNCS, pages
          <volume>722</volume>
          -
          <fpage>735</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bollacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paritosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sturge</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          . Freebase.
          <source>In Proceedings of the 2008 ACM SIGMOD international conference on Management of data - SIGMOD '08, page 1247</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Färber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Menne</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rettinger</surname>
          </string-name>
          .
          <article-title>A comparative survey of dbpedia, freebase, opencyc, wikidata, and yago</article-title>
          .
          <source>Semantic Web Journal, July</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Flati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vannella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pasini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <article-title>Two is bigger (and better) than one: the wikipedia bitaxonomy project</article-title>
          .
          <source>In ACL (1)</source>
          , pages
          <fpage>945</fpage>
          -
          <fpage>955</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guha</surname>
          </string-name>
          .
          <article-title>Introducing schema.org: Search engines come together for a richer web</article-title>
          .
          <source>Google Official Blog</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccinno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kozhevnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pasca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Pighin</surname>
          </string-name>
          .
          <article-title>Revisiting taxonomy induction over wikipedia</article-title>
          .
          <source>COLING</source>
          <year>2016</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Wordnet: a lexical database for english</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ):
          <volume>39</volume>
          -
          <fpage>41</fpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          .
          <article-title>Dbpedia and the live extraction of structured data from wikipedia</article-title>
          .
          <source>Program</source>
          ,
          <volume>46</volume>
          (
          <issue>2</issue>
          ):
          <volume>157</volume>
          -
          <fpage>181</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          .
          <article-title>The semantic web and the semantics of the web: Where does meaning come from</article-title>
          ?
          <source>In Proceedings of the 25th International Conference on World Wide Web, WWW '16</source>
          , pages
          <fpage>1-1</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. Pellissier</given-names>
            <surname>Tanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandecic</surname>
          </string-name>
          , S. Schaffert, T. Steiner, and
          <string-name>
            <given-names>L.</given-names>
            <surname>Pintscher</surname>
          </string-name>
          . From Freebase to Wikidata:
          <article-title>The Great Migration</article-title>
          .
          <source>In Proceedings of the 25th International Conference on World Wide Web</source>
          , pages
          <volume>1419</volume>
          -
          <fpage>1428</fpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          , volume
          <volume>14</volume>
          , pages
          <fpage>1532</fpage>
          -
          <fpage>43</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          .
          <article-title>Ontology matching: state of the art and future challenges</article-title>
          .
          <source>IEEE Transactions on knowledge and data engineering</source>
          ,
          <volume>25</volume>
          (
          <issue>1</issue>
          ):
          <volume>158</volume>
          -
          <fpage>176</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          . PARIS :
          <article-title>Probabilistic Alignment of Relations , Instances , and Schema</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <volume>157</volume>
          -
          <fpage>168</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , G. Kasneci, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Yago</article-title>
          .
          <source>Proceedings of the 16th international conference on World Wide Web - WWW '07, page 697</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tonon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Catasta</surname>
          </string-name>
          , G. Demartini,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudre-Mauroux</surname>
          </string-name>
          ,
          and
          K. Aberer. TRank:
          <article-title>Ranking entity types using the web of data</article-title>
          .
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          ,
          <source>8218 LNCS(PART 1):</source>
          <volume>640</volume>
          -
          <fpage>656</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tonon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Felder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Difallah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudre-Mauroux</surname>
          </string-name>
          .
          <article-title>VoldemortKG: Mapping schema.org and web entities to linked open data</article-title>
          .
          <source>Proceedings of the 15th International Semantic Web Conference</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandecic</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          .
          <article-title>Wikidata: A free collaborative knowledgebase</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>57</volume>
          (
          <issue>10</issue>
          ):
          <volume>78</volume>
          -
          <fpage>85</fpage>
          ,
          Sept.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>