<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Profiling Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Graphs on the Web</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Mannheim, Data and Web Science Group</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge Graphs, such as DBpedia, YAGO, or Wikidata, are valuable resources for building intelligent applications like data analytics tools or recommender systems. Understanding what is in those knowledge graphs is a crucial prerequisite for selecting a Knowledge Graph for a task at hand. Hence, Knowledge Graph profiling - i.e., quantifying the structure and contents of knowledge graphs, as well as their differences - is essential for fully utilizing the power of Knowledge Graphs. In this paper, I will discuss methods for Knowledge Graph profiling, depict crucial differences of the big, well-known Knowledge Graphs, like DBpedia, YAGO, and Wikidata, and take a brief look at current developments of new, complementary Knowledge Graphs such as DBkWik and WebIsALOD.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        task. Depending on the domain and task at hand, some KGs might be better
suited than others. However, there are no guidelines or best practices on how
to choose a knowledge graph which fits a given problem. Previous works mostly
report global numbers, such as the overall size of knowledge graphs [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
and focus on other aspects, such as data quality [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], we have taken a
more in-depth look, showing detailed results for different classes.
      </p>
      <p>2 Measures and Methods for Knowledge Graph Profiling</p>
      <p>
In general, the aim of knowledge graph profiling is to understand whether a
given knowledge graph suits a certain purpose. For example, building an
application for a specific domain, backed by a knowledge graph, requires that
this knowledge graph contains a reasonable amount of information about the
entities in that domain, and describes them at a suitable level of detail.</p>
      <p>
        There is a large body of work on measures and methods for dataset profiling
for knowledge graphs and Linked datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For analyzing knowledge graphs,
we use the following three classes of metrics: global measures, which describe
the knowledge graph as a whole; class-based measures, which describe the
characteristics of entities in a given class; and overlap measures, which
describe the difference between two or more knowledge graphs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Global Measures</title>
      <p>The most basic question to ask about a knowledge graph is: How large is it?
Hence, we can count the number of instances and assertions, i.e., relations
between two entities, or relations of an entity to a literal value.</p>
      <p>Second, we are often interested in the level of detail at which entities are
described in a knowledge graph. This is usually computed as the average degree,
i.e., the average number of ingoing and/or outgoing edges of a node. Aside from
looking at averages, it is often interesting to also consider the median, which
may give a more realistic picture of the level of detail for an average entity.</p>
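      <p>As an illustration of why the median can differ noticeably from the average, the following minimal sketch computes both for a toy graph. The function name <monospace>degree_stats</monospace> and the toy triples are hypothetical, and the sketch assumes that only entity-to-entity assertions are passed in and that ingoing and outgoing edges are both counted.</p>
      <preformat preformat-type="code">
```python
from collections import Counter
from statistics import mean, median


def degree_stats(triples):
    """Return (average, median) degree over all nodes.

    `triples` is an iterable of (subject, predicate, object) tuples whose
    objects are entities; literal assertions are assumed to be excluded.
    """
    degree = Counter()
    for s, _p, o in triples:
        degree[s] += 1  # outgoing edge of the subject
        degree[o] += 1  # ingoing edge of the object
    values = list(degree.values())
    return mean(values), median(values)


# Hypothetical toy graph: one well-described hub and three sparse nodes.
toy = [
    ("hub", "linksTo", "a"),
    ("hub", "linksTo", "b"),
    ("hub", "linksTo", "c"),
    ("a", "linksTo", "hub"),
]
avg, med = degree_stats(toy)
print(avg, med)
```
      </preformat>
      <p>In this toy graph, the single hub pulls the average above the median, which is the kind of skew that makes the median a useful complement.</p>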
      <p>For entities, relations, and degrees, one can often find different numbers in
different reports. This is due to methodological
differences. For entities, some reports only count explicitly typed resources, while
others count all nodes in the graph. For relations, some reports count literal
assertions as well, while others do not. Furthermore, some reports take
special relations (e.g., owl:sameAs) into account, while others do not, which may
have an influence on the degrees reported. Finally, some reports treat schema
and instances separately, while others count, e.g., classes and instances alike
when reporting the number of entities in a graph.</p>
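      <p>The effect of such counting conventions can be sketched on a handful of triples. The two helper functions and the toy data below are hypothetical and only illustrate how the same graph yields different numbers depending on the convention; they are not the procedure used by any particular report.</p>
      <preformat preformat-type="code">
```python
def count_entities(triples, typed_only=False):
    """Count entities as explicitly typed subjects, or as all entity nodes."""
    if typed_only:
        return len({s for s, p, _o, _lit in triples if p == "rdf:type"})
    nodes = set()
    for s, p, o, is_literal in triples:
        nodes.add(s)
        # Literals and classes (objects of rdf:type) are not counted as entities.
        if not is_literal and p != "rdf:type":
            nodes.add(o)
    return len(nodes)


def count_assertions(triples, include_literals=True, skip_predicates=()):
    """Count assertions, optionally dropping literals and special relations."""
    return sum(
        1
        for _s, p, _o, is_literal in triples
        if (include_literals or not is_literal) and p not in skip_predicates
    )


# Hypothetical toy data: (subject, predicate, object, object_is_literal)
toy = [
    ("e1", "rdf:type", "City", False),
    ("e1", "locatedIn", "e2", False),
    ("e1", "population", "12345", True),
    ("e1", "owl:sameAs", "other:e1", False),
]

print(count_entities(toy, typed_only=True), count_entities(toy))
print(count_assertions(toy), count_assertions(toy, False, ("owl:sameAs",)))
```
      </preformat>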
      <p>Finally, the timeliness of a knowledge graph can also be relevant. While
some knowledge graphs are no longer developed any further, meaning that their
contents become more and more outdated, others have regular release cycles
(shorter or longer), or even provide live data.</p>
    </sec>
    <sec id="sec-3">
      <title>2.2 Class-based Measures</title>
      <p>For class-based measures, the same metrics as for global measures can be used,
i.e., how many entities exist in a certain class, and at which level of detail are
they described?</p>
      <p>
        There are two problems that may arise when counting the number of
instances in a class, and reporting that number as "There are N entities of
type X in this knowledge graph". First, the type assertions in a knowledge graph
are not guaranteed to be complete. In fact, in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], we presented an estimate of
the number of missing type assertions in DBpedia. By comparing two
knowledge graphs (DBpedia and YAGO) and counting the untyped instances in
DBpedia that have a type in YAGO which has a corresponding type in the
DBpedia ontology, we found that DBpedia has at least 2.6M missing type assertions.
Hence, counting type instances based on type assertions often underestimates
the actual counts.
      </p>
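      <p>The counting idea behind this lower bound can be sketched as follows. This is a simplified illustration, not the implementation behind the reported figure: the function name, the dictionary-based inputs, and the class mapping are hypothetical, and the sketch assumes that both graphs share instance identifiers and that each counted instance contributes at least one missing type assertion.</p>
      <preformat preformat-type="code">
```python
def missing_type_lower_bound(instances, own_types, other_types, class_mapping):
    """Count untyped instances that are typed in the other graph with a
    class that has a counterpart in our own ontology.

    instances:     set of instance identifiers in our graph
    own_types:     dict instance -> set of types in our graph
    other_types:   dict instance -> set of types in the other graph
    class_mapping: dict other-graph class -> class in our ontology
    """
    count = 0
    for instance in instances:
        if own_types.get(instance):
            continue  # already typed, nothing is provably missing
        mapped = [t for t in other_types.get(instance, ()) if t in class_mapping]
        if mapped:
            count += 1  # at least one type assertion is missing
    return count


# Hypothetical toy data
instances = {"A", "B", "C", "D"}
own_types = {"A": {"dbo:Person"}}
other_types = {
    "A": {"yago:Person"},
    "B": {"yago:Person"},
    "C": {"yago:UnmappedClass"},
}
class_mapping = {"yago:Person": "dbo:Person"}

print(missing_type_lower_bound(instances, own_types, other_types, class_mapping))
```
      </preformat>
      <p>Only B is counted here: A is already typed, C has no mappable type, and D has no type in the other graph at all, which is why the result is a lower bound.</p>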
      <p>The second problem occurs with modeling issues. Instances are
usually counted based on asserted types, but different knowledge graphs follow
different modeling paradigms. For example, DBpedia and YAGO define classes
for occupations of people (e.g., Actor or Politician), while Wikidata models those
as a relation linking a person to a profession, while the person is only assigned
the less specific type Person. Those complex mappings are not always easy to
obtain and utilize when comparing the number of entities in a given class across
knowledge graphs.</p>
    </sec>
    <sec id="sec-4">
      <title>2.3 Overlap Measures</title>
      <p>
        To quantify the similarity and difference of knowledge graphs, one has to
analyze their overlap, i.e., the number of instances they have in common. Although
many knowledge graphs are served as Linked Open Data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], using interlinks on
instance level with owl:sameAs, those interlinks are not necessarily complete,
i.e., the Open World Assumption, which holds for Web knowledge graphs in
general, also holds for their interlinks. Hence, they cannot be utilized directly as a
measure for quantifying the overlap between two knowledge graphs. For
example, from the fact that 2,000 cities in knowledge graph A are linked to cities in
knowledge graph B, we cannot simply conclude that this is the number of cities
contained in the intersection of A and B.
      </p>
      <p>
        In order to estimate the actual overlap based on explicit interlinks, we use an
approach first described in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. We first find interlinks between two knowledge
graphs using an arbitrary linkage rule, e.g., interlinking all entities with the same
name.
      </p>
      <p>Then, using the existing interlinks, we compute the quality of a linking
approach in terms of recall and precision. Given that the actual number of links
is C, the number of links found by a linkage rule is F, and that the number of
correct links in F is F+, recall and precision are defined as</p>
      <p>
        <disp-formula><label>(1)</label><tex-math>R := \frac{|F^{+}|}{|C|}</tex-math></disp-formula>
        <disp-formula><label>(2)</label><tex-math>P := \frac{|F^{+}|}{|F|}</tex-math></disp-formula>
      </p>
      <p>By solving both for |F+| and combining the equations, we can estimate |C| as
        <disp-formula><label>(3)</label><tex-math>|C| = |F| \cdot P \cdot \frac{1}{R}</tex-math></disp-formula>
Note that in the latter formula, all variables on the right-hand side (the total
number of interlinks found by a linkage rule, as well as its recall and precision)
are known, which is not true for F+ in the two formulas above. For a stable
estimate, we use a variety of different linkage rules, and average their estimates.</p>
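      <p>The estimate can be sketched in a few lines. The sketch below makes assumptions that go beyond the text: the existing interlinks are treated as a correct but incomplete sample, precision is measured only on found links whose source entity has an existing interlink, and recall is measured as the share of existing interlinks that the linkage rule rediscovers. The function names and toy links are hypothetical.</p>
      <preformat preformat-type="code">
```python
from statistics import mean


def estimate_total_links(found, known):
    """Estimate |C| = |F| * P / R for one linkage rule.

    found: set of (a, b) links produced by the linkage rule (F)
    known: set of (a, b) existing interlinks, assumed correct but incomplete
    """
    known_sources = {a for a, _b in known}
    # Only links whose source has an existing interlink can be judged.
    judged = {(a, b) for a, b in found if a in known_sources}
    correct = judged.intersection(known)
    if not correct:
        return None  # precision or recall would be zero or undefined
    precision = len(correct) / len(judged)
    recall = len(correct) / len(known)
    return len(found) * precision / recall


def averaged_estimate(found_per_rule, known):
    """Average the estimates of several linkage rules, skipping unusable ones."""
    estimates = [estimate_total_links(found, known) for found in found_per_rule]
    usable = [e for e in estimates if e is not None]
    return mean(usable) if usable else None


# Hypothetical toy links
known = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
rule_1 = {("a1", "b1"), ("a2", "b2"), ("a3", "bX"),
          ("a5", "b5"), ("a6", "b6"), ("a7", "b7")}
rule_2 = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
          ("a4", "b4"), ("a5", "b5"), ("a6", "b6")}

print(estimate_total_links(rule_1, known))
print(averaged_estimate([rule_1, rule_2], known))
```
      </preformat>
      <p>For the first rule, two of three judged links are correct and two of four known links are found, so six found links scale to an estimate of about eight actual links.</p>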
      <p>As for class-based measures, we can quantify the overlap per class, e.g.,
finding all persons that are contained in two knowledge graphs. Again, this may be
biased by missing type statements in the knowledge graphs, but usually provides
a decent approximation.</p>
      <p>3 Global Measures: Overall Size and Shape of Knowledge Graphs</p>
      <p>For the analysis in this paper, we focus on the public knowledge graphs DBpedia,
YAGO, Wikidata, OpenCyc, and NELL.<sup>3,4</sup> For those five KGs, we used the most
recent available versions at the time of this analysis, as shown in Table 1.</p>
      <p>We can observe that DBpedia and YAGO have roughly the same number
of instances, which is not surprising, due to their construction process, which
creates an instance per Wikipedia page. Wikidata, which uses additional sources
plus a community editing process, has about three times more instances. It is
remarkable that YAGO and Wikidata have roughly the same number of axioms,
although Wikidata has three times more instances. This hints at a higher level
of detail in YAGO, which is also reflected in the degree distributions.</p>
      <p>OpenCyc and NELL are much smaller. NELL is particularly smaller w.r.t.
axioms, not instances, i.e., the graph is less dense. This is also reflected in the
degree of instances, which shows that on average, each instance has fewer than
seven connections. The other graphs are much denser, e.g., each instance in
Wikidata has about 50 connections on average, each instance in DBpedia has
about 60, and each instance in YAGO has about 120 connections on average.</p>
      <p>The number of entities and the degrees are not independent. There are certain
effects caused by the distribution of entities contained in the different graphs:
While OpenCyc contains mostly head entities, DBpedia, YAGO, and Wikidata
have a larger coverage of tail entities as well. The head entities are actually
described in the larger knowledge graphs at much more detail than in the smaller
ones, but the overall degree distribution is rather skewed, which leads to lower
averages.</p>
      <p>The schema sizes also differ widely. In particular, the number of classes is
very different. This can be explained by different modeling styles: YAGO
automatically generates very fine-grained classes, based on Wikipedia categories.
Those are often complex types encoding various facts, such as "American Rock
Keyboardists". KGs like DBpedia or NELL, on the other hand, use well-defined,
manually curated ontologies with much fewer classes.</p>
      <p><sup>3</sup> Freebase was discarded as it is discontinued, and non-public KGs were not
considered, as it is impossible to run the analysis on non-public data.</p>
      <p><sup>4</sup> Scripts are available at https://github.com/dringler/KnowledgeGraphAnalysis.</p>
      <p>
        Since Wikidata provides live updates, it is the most timely source (together
with DBpedia Live, which is a variant of DBpedia fed from an update stream of
Wikipedia [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]). Among the non-live sources, NELL has the fastest release cycle,
providing a new release every few days. However, NELL uses a fixed corpus of
Web pages, which is not updated as regularly. Thus, the short release cycles do
not necessarily lead to more timely information. DBpedia has biyearly releases,
and YAGO and OpenCyc have update cycles longer than a year.
      </p>
      <p>4 Class-based Measures: Looking into Details</p>
      <p>When building an intelligent, knowledge graph backed application for a specific
use case, it is important to know how well a given knowledge graph fits the
domain and task at hand. To answer this question, we have picked 25 popular
classes in the five knowledge graphs and performed an in-depth comparison. For
those, we computed the total number of instances in the different graphs, as well
as the average in and out degree. The results are depicted in Fig. 2.</p>
      <p>While DBpedia and YAGO, both derived from Wikipedia, are rather
comparable, there are notable differences in coverage, in particular for events, where
the number of events in YAGO is more than five times larger than the number
in DBpedia. On the other hand, DBpedia has information about four times as
many settlements (i.e., cities, towns, and villages) as YAGO. Furthermore, the
level of detail provided in YAGO is usually a bit higher than in DBpedia.</p>
      <p>The other three graphs differ a lot more. Wikidata contains twice as many
persons as DBpedia and YAGO, and also outnumbers them in music albums and
books. Furthermore, it provides a higher level of detail for chemical substances
and particularly countries. On the other hand, there are also classes which are
hardly represented in Wikidata, such as songs.<sup>5</sup> As far as Wikidata is concerned,
the differences can be partially explained by the external datasets imported into
the knowledge graph.</p>
      <p>OpenCyc and NELL are generally smaller and less detailed. However, NELL
has some particularly large classes, e.g., actor, song, and chemical substance,
and for government organizations, it even outnumbers the other graphs. On the
other hand, there are classes which are not covered by NELL at all.</p>
      <sec id="sec-4-1">
        <title>5 Overlap of Knowledge Graphs</title>
        <p>We follow the approach discussed above in section 2.3. For our analysis, we
use 16 combinations of string metrics and thresholds on the instances' labels:
string equality, scaled Levenshtein (thresholds 0.8, 0.9, and 1.0), Jaccard (0.6,
0.8, and 1.0), Jaro (0.9, 0.95, and 1.0), JaroWinkler (0.9, 0.95, and 1.0), and
MongeElkan (0.9, 0.95, and 1.0). Furthermore, to speed up the computation, we
exploit token-based blocking in a preprocessing step (where each instance is only
assigned to the block of the least frequent token), and discard blocks larger
than 1M pairs.</p>
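        <p>A minimal sketch of such a blocking step is given below. Several details are assumptions made for illustration only: labels are tokenized by lowercasing and splitting on whitespace, token frequencies are counted jointly over both graphs, ties are broken alphabetically, and the function names are hypothetical.</p>
        <preformat preformat-type="code">
```python
from collections import Counter, defaultdict


def tokenize(label):
    """Lowercase a label and split it on whitespace."""
    return label.lower().split()


def build_blocks(labels_a, labels_b, max_pairs=1_000_000):
    """Assign each instance to the block of its least frequent label token.

    labels_a, labels_b: dict instance id -> label, one dict per graph.
    Returns dict token -> (instances of A, instances of B), keeping only
    blocks that contain instances of both graphs and at most max_pairs pairs.
    """
    freq = Counter()
    for label in list(labels_a.values()) + list(labels_b.values()):
        freq.update(set(tokenize(label)))

    def rarest_token(label):
        tokens = tokenize(label)
        if not tokens:
            return None
        # Ties are broken alphabetically so that the choice is deterministic.
        return min(tokens, key=lambda t: (freq[t], t))

    blocks = defaultdict(lambda: ([], []))
    for side, labels in enumerate((labels_a, labels_b)):
        for instance, label in labels.items():
            key = rarest_token(label)
            if key is not None:
                blocks[key][side].append(instance)

    return {
        key: (a, b)
        for key, (a, b) in blocks.items()
        if a and b and max_pairs >= len(a) * len(b)
    }


# Hypothetical toy labels
labels_a = {"a1": "New York City", "a2": "York"}
labels_b = {"b1": "new york city", "b2": "City of London"}
print(build_blocks(labels_a, labels_b))
```
        </preformat>
        <p>The string metrics listed above would then only be evaluated for pairs within the same block, rather than for all pairs of instances.</p>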
        <p>As incomplete link sets for estimating recall and precision, we use the links
between the knowledge graphs, if present. If there are no links, we exploit
transitivity and symmetry of owl:sameAs, and follow the link path through DBpedia
(see Fig. 1). NELL has no direct links to the other graphs, but links to Wikipedia
pages corresponding to DBpedia instances, which we use to create links to
DBpedia (indicated by the dashed line in the figure).</p>
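        <p>Following a link path through a hub graph can be sketched as a simple join. The function name and the toy identifiers are hypothetical; the sketch only assumes that both graphs provide owl:sameAs links to the same hub (DBpedia in our case).</p>
        <preformat preformat-type="code">
```python
def links_via_hub(a_to_hub, b_to_hub):
    """Derive A-B links from A-hub and B-hub owl:sameAs links.

    a_to_hub, b_to_hub: iterables of (instance, hub_instance) pairs.
    """
    hub_to_b = {}
    for b, hub in b_to_hub:
        # Symmetry: a link B = hub can be read as hub = B.
        hub_to_b.setdefault(hub, set()).add(b)

    derived = set()
    for a, hub in a_to_hub:
        # Transitivity: A = hub and hub = B imply A = B.
        for b in hub_to_b.get(hub, ()):
            derived.add((a, b))
    return derived


# Hypothetical toy links
yago_to_dbpedia = [("yago:Mannheim", "dbr:Mannheim"), ("yago:Berlin", "dbr:Berlin")]
nell_to_dbpedia = [("nell:mannheim", "dbr:Mannheim")]
print(links_via_hub(yago_to_dbpedia, nell_to_dbpedia))
```
        </preformat>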
        <p>Fig. 3 depicts the pairwise overlap of the knowledge graphs, using the 25
classes also inspected above, according to two measures: potential gain by joining
the two knowledge graphs (i.e., the relation of the union to the larger of the two
graphs), and the overlap relative to the existing KG interlinks.</p>
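        <p>Both measures reduce to simple ratios once the overlap has been estimated. The sketch below is one possible reading of the definitions above, with hypothetical function names and toy numbers: the union is computed as the sum of both sizes minus the estimated overlap, so a ratio of 1.05 corresponds to a 5% gain over the larger graph.</p>
        <preformat preformat-type="code">
```python
def potential_gain(size_a, size_b, overlap):
    """Ratio of the size of the union to the size of the larger graph."""
    union = size_a + size_b - overlap
    return union / max(size_a, size_b)


def overlap_relative_to_links(estimated_overlap, existing_links):
    """How many times larger the estimated overlap is than the explicit links."""
    return estimated_overlap / existing_links


# Hypothetical toy numbers for one class in two graphs
print(potential_gain(100, 80, 75))
print(overlap_relative_to_links(800, 100))
```
        </preformat>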
        <p>Overall, we can observe that merging two graphs would usually lead to a 5%
increase of coverage of instances, compared to using one KG alone. The largest
potential gain most often comes from merging the larger knowledge graphs with
NELL. We can therefore conclude that NELL is rather complementary to most
of the other KGs under consideration. The most complementary classes, with an
average gain of more than 10% across all pairs of knowledge graphs, are political
parties and chemical substances. When looking at the overlap relative to the
number of existing links, NELL has the weakest degree of interlinking: e.g., for
YAGO and NELL, the estimated overlap is more than eight times larger than
the number of interlinks. The classes with the weakest degree of interlinking
are countries (32 times larger overlap than explicit interlinks), movies (13 times
larger), and companies (10 times larger).<sup>6</sup></p>
        <p><sup>5</sup> As discussed above, the reason why so few politicians, actors, and athletes are listed
for Wikidata is that they are usually not modeled using explicit classes.</p>
        <p><sup>6</sup> Note that it is not necessary that the linking approach is particularly good, as long as
we can estimate its quality reasonably well. In our experiments, the agreement about
the estimated overlap is rather high, showing an intra-class correlation coefficient
(ICC) of 0.969. In contrast, the size of the actual alignments found by the different
approaches differs a lot more, showing an ICC of only 0.646.</p>
        <p>
          Fig. 2: Number of instances (a), avg. indegree (b) and avg. outdegree (c) of
selected classes. D=DBpedia, Y=YAGO, W=Wikidata, O=OpenCyc, N=NELL.
[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]
        </p>
        <p>Fig. 3: (a) Overlap as potential gain, (b) overlap relative to existing links.</p>
        <p>6 Summary of the Comparison of DBpedia, YAGO, Wikidata &amp; co.</p>
        <p>We have compared the coverage, level of detail, and overlap for 25 popular
classes. Some key findings from this comparison include:</p>
        <list list-type="bullet">
          <list-item><p>For person data, Wikidata is the most suitable source, containing twice as
many instances as DBpedia or YAGO, at a similar level of detail.</p></list-item>
          <list-item><p>Organizations, such as companies, are best described in YAGO.</p></list-item>
          <list-item><p>DBpedia contains more places than the other KGs, including almost four
times more cities, villages etc. than YAGO.</p></list-item>
          <list-item><p>While DBpedia and YAGO contain many more countries than Wikidata (due
to the inclusion of historic countries, such as the Roman Empire), Wikidata
holds the most detailed information about countries.</p></list-item>
          <list-item><p>Overall, DBpedia contains the largest number of artistic works, although
details differ for subclasses: Wikidata contains more music albums and movies,
while YAGO contains more songs. The most detailed information about
artistic works is provided by YAGO.</p></list-item>
          <list-item><p>Cars and spacecraft are best covered in YAGO, while DBpedia is the better
resource for ships.</p></list-item>
          <list-item><p>For events, YAGO is the most suitable source, both in terms of coverage and
level of detail.</p></list-item>
          <list-item><p>NELL contains the largest number of chemical substances. The highest level
of detail for chemicals, however, is provided in Wikidata.</p></list-item>
          <list-item><p>YAGO contains the largest number of astronomical objects.</p></list-item>
        </list>
        <p>Note that those numbers are not exhaustive; they merely demonstrate the need
for a careful analysis of KGs before exploiting them for a project at hand.</p>
        <p>In addition to the question of which knowledge graph serves a certain task best,
another question is whether it makes sense to combine more than one. Here,
we have observed that there is often a considerable complementarity. NELL, in
particular, is very complementary to the other KGs, although a lot less rich in details.
Thus, the coverage can often be extended significantly by combining different
KGs. This, however, requires refinement of the interlinking, since the interlinks
are usually incomplete.</p>
        <p>When combining multiple knowledge graphs, we observe that, although a
lot of interlinks have been established between the public KGs, the estimated
overlap is often much higher. In some cases, the estimated overlap exceeds the
number of explicitly set links by a factor of more than 20. Hence, for combining
KGs, improving the interlinking has to be a key step.</p>
        <p>
          Depending on the task at hand, other aspects may be important as well.
Reliability and correctness of the data in different KGs may be crucial for some
tasks, for which other studies should be consulted as well, e.g., [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Furthermore,
timeliness of the data, as discussed above, may be more important for some tasks
than for others.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>7 New Developments of Knowledge Graphs</title>
        <p>
          From the observations above, we can see that DBpedia, YAGO, and Wikidata
have a similar coverage, while OpenCyc and NELL are much smaller in their
coverage, and less detailed. Hence, alternatives to the "big three" knowledge
graphs are rare. However, for many applications, having detailed information
also about long tail instances would be desirable. Examples include, but are not
limited to:
        </p>
        <list list-type="bullet">
          <list-item><p>Recommender systems that also work well on less well-known artists and/or
works [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]</p></list-item>
          <list-item><p>Named entity recognition and linking systems that also recognize long-tail
entities [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]</p></list-item>
          <list-item><p>Data mining applications backed by knowledge graphs [
          <xref ref-type="bibr" rid="ref21 ref23">21, 23</xref>
          ] that work on
domains and/or entities not well covered in DBpedia and others.</p></list-item>
        </list>
        <p>
Hence, new developments of knowledge graphs should focus on different sets of
entities than those which are already well described in the existing ones. In the
following, we will briefly discuss two new developments, i.e., DBkWik [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and
WebIsALOD [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7.1 DBkWik</title>
      <p>The reason for the strong similarity of the big public knowledge graphs, i.e.,
DBpedia, YAGO, and Wikidata, is that they are either extracted from or strongly
oriented towards Wikipedia. Hence, their coverage is very close to that of Wikipedia,
and to that of each other.</p>
      <p>Fig. 4 (process diagram, recoverable labels): Dump Downloader, MediaWiki Dumps,
Extraction Framework (2), Extracted RDF, Interlinking with Instance Matcher and
Schema Matcher (3), Internal Linking with Instance Matcher and Schema Matcher (4),
Consolidated Knowledge Graph, DBkWik Linked Data Endpoint (5).</p>
      <p>At the same time, there are thousands of Wikis on the Web. Fandom powered
by Wikia<sup>7</sup> is one of the most popular Wiki Farms,<sup>8</sup> containing more than 385,000
individual Wikis comprising more than 350 million articles. WikiApiary reports
more than 20,000 public installations of the MediaWiki framework, which also
underlies Wikipedia.<sup>9</sup></p>
      <p>Since those Wikis are technically very similar to Wikipedia, the same tool
stack which is used to create a knowledge graph like DBpedia can also be applied
to extract a knowledge graph from any other Wiki as well. With DBkWik, we
have shown that the extraction of a joint knowledge graph from many Wikis is
technically feasible. Fig. 4 shows the process.</p>
      <p>While for DBpedia, mappings from infobox definitions in Wikipedia to a
common ontology are collected in a crowd-sourced process, for DBkWik, neither
a common ontology nor such mappings exist. Instead, the ontologies (for
each Wiki) need to be created on the fly in DBkWik.</p>
      <p>
        To create a unified knowledge graph from those individual graphs, we have
to reconcile both the instances (i.e., perform instance matching) as well as the
schemas (i.e., perform schema matching). Since pairwise matching of the
individual graphs would not be feasible due to its quadratic complexity, we follow
a two-step approach: the extracted Wikis are first linked to DBpedia (which is
linear in the number of Wikis). The links to DBpedia are then used as blocking
keys [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for matching the graphs among each other to reduce the complexity.
      </p>
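      <p>How links to DBpedia can serve as blocking keys is sketched below. This is an illustrative simplification under stated assumptions, not the DBkWik implementation: the function name and the toy data are hypothetical, and instances are only compared if they come from different Wikis and are linked to the same DBpedia resource.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(links_to_dbpedia):
    """Group instances by the DBpedia resource they are linked to.

    links_to_dbpedia: iterable of (wiki_id, instance, dbpedia_uri) triples.
    Returns the set of cross-Wiki instance pairs sharing a blocking key.
    """
    blocks = defaultdict(list)
    for wiki, instance, uri in links_to_dbpedia:
        blocks[uri].append((wiki, instance))

    pairs = set()
    for members in blocks.values():
        for left, right in combinations(sorted(members), 2):
            if left[0] != right[0]:  # skip pairs from the same Wiki
                pairs.add((left, right))
    return pairs


# Hypothetical toy links from two Wikis to DBpedia
links = [
    ("wikiA", "Spock", "dbr:Spock"),
    ("wikiB", "Mr. Spock", "dbr:Spock"),
    ("wikiA", "Kirk", "dbr:James_T._Kirk"),
]
print(candidate_pairs(links))
```
      </preformat>
      <p>Only instances within the same block are compared, so the number of comparisons depends on the block sizes rather than on the product of all Wiki sizes.</p>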
      <p>As a proof of concept, we have, so far, extracted data from 248 Wikis from
Wiki dumps from the Fandom Wiki farm, using the DBpedia Extraction
Framework.<sup>10</sup> The resulting dataset comprises 4,375,142 instances, 7,022 classes, and
43,428 properties (likely including duplicates). Out of those, 748,294 instances, 973 classes,
and 19,635 properties are mapped to DBpedia. To match the knowledge graph
to DBpedia, we use string matching on labels using surface forms [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for
entities, manually filtering out non-entity pages like list pages, and simple string
matching for classes and properties. The resulting knowledge graph encompasses
a total of 26,694,082 RDF triples.<sup>11</sup>
      </p>
      <p><sup>7</sup> http://fandom.wikia.com/</p>
      <p><sup>8</sup> http://www.alexa.com/topsites/category/Computers/Software/Groupware/Wiki/Wiki_Farms</p>
      <p><sup>9</sup> https://wikiapiary.com/wiki/Statistics</p>
    </sec>
    <sec id="sec-6">
      <title>7.2 WebIsALOD</title>
      <p>
        While Wikis are fairly easy to process, mainly since the tool stacks for creating
Wikipedia-based knowledge graphs already exist, the ultimate goal of knowledge
graph creation would be to create a knowledge graph from the entire Web. In [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ],
we have focused on a particular generic relation for information extraction, i.e.,
the hypernymy relation. That relation holds both between classes (e.g., band is
a hypernym of industrial metal band), as well as for instance-class relations (e.g.,
industrial metal band is a hypernym of Nine Inch Nails).
      </p>
      <p>
        The approach sketched in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] uses Hearst-like patterns to identify
hypernymy relations. For example, the pattern "X, such as Y" can be used to infer a
hypernymy relation between X and Y (e.g., in the sentence fragment "Industrial
metal bands, such as Nine Inch Nails"). The original approach uses more than
50 such patterns to extract hypernymy relations from the Common Crawl,<sup>12</sup> a
large-scale open crawl from the Web. The result of this extraction is the IsADB,
a database of 400 million hypernymy relations.
      </p>
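      <p>To illustrate the idea, the following sketch applies a single, deliberately naive "X, such as Y" pattern with a regular expression. It is not the extraction pipeline of the original approach: the pattern, the function name, and the restriction of Y to capitalized words are assumptions for this example, and a realistic extractor would rely on noun phrase detection instead.</p>
      <preformat preformat-type="code">
```python
import re

# Group 1: candidate hypernym phrase X; group 2: capitalized hyponym Y.
SUCH_AS = re.compile(
    r"([A-Za-z][A-Za-z ]*?),? such as ([A-Z]\w*(?: [A-Z]\w*)*)"
)


def extract_hypernymy(fragment):
    """Return (hypernym, hyponym) for the first match, or None."""
    match = SUCH_AS.search(fragment)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


print(extract_hypernymy("Industrial metal bands, such as Nine Inch Nails."))
```
      </preformat>
      <p>On the example fragment from the text, the sketch yields the pair of "Industrial metal bands" as hypernym and "Nine Inch Nails" as hyponym.</p>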
      <p>
        In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we have provided the resulting dataset as a Linked Data knowledge
graph, enriched with rich provenance metadata, confidence scores computed
using a machine learning approach, and interlinks to DBpedia and YAGO. The
final dataset consists of the original 400M hypernymy relations,
together with a confidence score and metadata, as well as 2,593,181 instance links
to DBpedia and 23,771 class links to YAGO. All in all, the dataset consists of
5.4B triples.
      </p>
      <p>In order to obtain a first content profile, we analyzed the fraction of instances
which are linked to and typed in DBpedia, and analyzed the type hierarchy in
DBpedia to estimate the distribution of those entities. The resulting distribution
is depicted in Fig. 5. We can observe that about half of the information is about
persons and organizations. Places, works, and species account for 18%, 12%,
and 5%, respectively, while the rest is a mix of other types.</p>
      <p>
        Fig. 5: Type breakdown of the instances in the WebIsALOD dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
      </p>
      <p><sup>10</sup> https://github.com/dbpedia/extraction-framework</p>
      <p><sup>11</sup> http://dbkwik.webdatacommons.org</p>
      <p><sup>12</sup> https://commoncrawl.org</p>
      <p>There are various challenges for the WebIsALOD dataset. Examples of
ongoing and future work include the learning of better scoring models and the
induction of a type hierarchy, where the latter also includes the subtask of
automatically distinguishing subclass of and instance of relations. Further, we aim
at extracting relations from pre- and post-modifiers of the terms. For example,
in the hypernymy relation between Industrial metal band and Nine Inch Nails,
Industrial metal is a pre-modifier for the head noun Band. Hence, we could
infer two additional axioms here: First, in general, the head noun is a hypernym of the
compound, i.e., Band is a hypernym of Industrial metal band. Second, using
the information that Industrial metal is also a genre, we can heuristically create
the axiom that Industrial metal is the genre of Nine Inch Nails, similar to the
approach sketched in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
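      <p>The two heuristics can be sketched as follows. This is a toy illustration of the idea, not an implemented component of WebIsALOD: the function name, the relation names, and the assumption that the head noun is the last token of the compound are hypothetical simplifications.</p>
      <preformat preformat-type="code">
```python
def modifier_axioms(instance, hypernym, known_genres):
    """Derive additional axioms from a compound hypernym.

    instance:     e.g. "Nine Inch Nails"
    hypernym:     e.g. "Industrial metal band"
    known_genres: set of lowercased terms known to be genres
    """
    tokens = hypernym.split()
    if len(tokens) in (0, 1):
        return []  # no modifier to exploit
    head = tokens[-1]  # assumption: the head noun is the last token
    modifier = " ".join(tokens[:-1])

    # Heuristic 1: the head noun is a hypernym of the compound.
    axioms = [(head, "hypernymOf", hypernym)]
    # Heuristic 2: if the modifier is a known genre, relate it to the instance.
    if modifier.lower() in known_genres:
        axioms.append((instance, "genre", modifier))
    return axioms


print(modifier_axioms("Nine Inch Nails", "Industrial metal band", {"industrial metal"}))
```
      </preformat>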
      <p>
        Another crucial issue is the identification of homonyms in the dataset. Given
the two assertions "Bauhaus is a goth band" and "Bauhaus is a German school",
it is clear that the subjects are two disjoint instances, while the subjects of
"Bauhaus is a goth band" and "Bauhaus is a post-punk band" are not. Identifying such homonyms is
an ongoing effort. Here, we will rely both on clustering related hypernyms, as
well as linking the type hierarchy to upper ontologies, as is done for DBpedia
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <sec id="sec-6-1">
        <title>8 Conclusion</title>
        <p>In this paper, we have taken an in-depth look at knowledge graphs on the
Semantic Web. We have seen that, although they are often conceived as comparable,
there are measurable differences between DBpedia, YAGO, and Wikidata.
Furthermore, we have shown how to estimate the actual overlap between knowledge
graphs.</p>
        <p>Despite their differences, one characteristic shared by the big knowledge
graphs is their focus on head entities. We have introduced two prototypes of
works in progress, DBkWik and WebIsALOD, which also encompass long
tail entities. Although still in their infancy, those new knowledge graphs can
grow to become a strong complement for the established ones.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Acknowledgements</title>
        <p>The author would like to thank Daniel Ringler for his contributions to measuring
knowledge graphs, Alexandra Hofmann, Jan Portisch, and Samresh Perchani for
their work on DBkWik, Julian Seitner, Christian Bizer, Kai Eckert, Stefano
Faralli, Robert Meusel, and Simone Ponzetto for their contributions to the original
WebIsA database, and Sven Hertling for his work on DBkWik and WebIsALOD.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2009</year>
          ), http://dx.doi.org/10.4018/jswis.2009081901
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blanco</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambazoglu</surname>
            ,
            <given-names>B.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torzec</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Entity Recommendations in Web Search</article-title>
          .
          <source>In: The Semantic Web - ISWC 2013. LNCS</source>
          , vol.
          <volume>8219</volume>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>48</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bryl</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Gathering alternative surface forms for DBpedia entities</article-title>
          .
          <source>In: Workshop on NLP&amp;DBpedia</source>
          . pp.
          <volume>13</volume>
          –
          <issue>24</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betteridge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hruschka</surname>
            <given-names>Jr</given-names>
          </string-name>
          , E.R., Mitchell, T.M.:
          <article-title>Coupled semi-supervised learning for information extraction</article-title>
          .
          <source>In: Proceedings of the third ACM international conference on Web search and data mining</source>
          . pp.
          <volume>101</volume>
          –
          <issue>110</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gabrilovich</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heitz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horn</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lao</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strohmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Zhang, W.:
          <article-title>Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion</article-title>
          .
          <source>In: 20th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <volume>601</volume>
          –
          <issue>610</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ellefi</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellahsene</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breslin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demidova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szymanski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Todorov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>RDF dataset profiling - a survey of features, methods, vocabularies and applications</article-title>
          .
          <source>Semantic Web</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ipeirotis</surname>
            ,
            <given-names>P.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verykios</surname>
            ,
            <given-names>V.S.:</given-names>
          </string-name>
          <article-title>Duplicate record detection: A survey</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering 19(1)</source>
          ,
          <volume>1</volume>
          –
          <fpage>16</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. van Erp,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.N.</given-names>
            ,
            <surname>Paulheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Ilievski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Waitelonis</surname>
          </string-name>
          , J.:
          <article-title>Evaluating entity linking: An analysis of current benchmark datasets and a roadmap for doing a better job</article-title>
          .
          <source>In: LREC</source>
          . vol.
          <volume>5</volume>
          , p.
          <year>2016</year>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Farber,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Ell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Menne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Rettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bartscherer</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO</article-title>
          . Semantic Web (to appear) (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Heist</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Language-agnostic relation extraction from Wikipedia abstracts</article-title>
          . In: International Semantic Web Conference (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stadler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>DBpedia live extraction</article-title>
          .
          <source>On the Move to Meaningful Internet Systems: OTM</source>
          <year>2009</year>
          pp.
          <volume>1209</volume>
          –
          <issue>1223</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>WebIsALOD: Providing hypernymy relations extracted from the web as linked open data</article-title>
          . In: International Semantic Web Conference (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perchani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Portisch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>DBkWik: Towards knowledge graph creation from thousands of wikis</article-title>
          .
          <source>In: International Semantic Web Conference (Posters and Demos)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            , M., van Kleef,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia</article-title>
          .
          <source>Semantic Web Journal</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lenat</surname>
          </string-name>
          , D.B.:
          <article-title>CYC: A large-scale investment in knowledge infrastructure</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <issue>11</issue>
          ),
          <volume>33</volume>
          –
          <fpage>38</fpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          .
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <volume>489</volume>
          –
          <fpage>508</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Type Inference on Noisy RDF Data</article-title>
          .
          <source>In: The Semantic Web – ISWC 2013. LNCS</source>
          , vol.
          <volume>8218</volume>
          , pp.
          <volume>510</volume>
          –
          <fpage>525</fpage>
          . Springer, Berlin Heidelberg (
          <year>2013</year>
          ), http://dx.doi.org/10.1007/978-3-642-41335-3_32
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Serving DBpedia with DOLCE - more than just adding a cherry on top</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <volume>180</volume>
          –
          <fpage>196</fpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Pellissier</given-names>
            <surname>Tanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            , Schaffert, S.,
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Pintscher</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          :
          <article-title>From Freebase to Wikidata: The great migration</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on World Wide Web</source>
          . pp.
          <volume>1419</volume>
          –
          <issue>1428</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ringler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>One knowledge graph to rule them all? Analyzing the differences between DBpedia, YAGO, Wikidata &amp; co</article-title>
          .
          <source>In: 40th German Conference on Artificial Intelligence</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Mining the web of linked data with RapidMiner</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>35</volume>
          ,
          <issue>142</issue>
          –
          <fpage>151</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mencía</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.L.</given-names>
            ,
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>A hybrid multi-strategy recommender system using linked open data</article-title>
          .
          <source>In: Semantic Web Evaluation Challenge</source>
          . pp.
          <volume>150</volume>
          –
          <fpage>156</fpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Semantic web in data mining and knowledge discovery: A comprehensive survey</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web 36</source>
          ,
          <issue>1</issue>
          –
          <fpage>22</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Schmachtenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Adoption of the Linked Data Best Practices in Different Topical Domains</article-title>
          . In: International Semantic Web Conference. LNCS, vol.
          <volume>8796</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Seitner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faralli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meusel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponzetto</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A large database of hypernymy relations extracted from the web</article-title>
          .
          <source>In: Language Resources and Evaluation Conference</source>
          , Portorož, Slovenia (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          , G.:
          <article-title>YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia</article-title>
          .
          <source>In: 16th international conference on World Wide Web</source>
          . pp.
          <volume>697</volume>
          –
          <issue>706</issue>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Krotzsch, M.:
          <article-title>Wikidata: a Free Collaborative Knowledge Base</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <volume>78</volume>
          –
          <fpage>85</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>