Enriching Wikidata with Linked Open Data
Bohui Zhang1 , Filip Ilievski1 and Pedro Szekely1
1 Information Sciences Institute, University of Southern California


Abstract
Large public knowledge graphs, like Wikidata, provide a wealth of knowledge that may support real-world use cases. Yet, practice shows that much of the relevant information that fits users’ needs is still missing in Wikidata, while current linked open data (LOD) tools are not suitable to enrich large graphs like Wikidata. In this paper, we investigate the potential of enriching Wikidata with structured data sources from the LOD cloud. We present a novel workflow that includes gap detection, source selection, schema alignment, knowledge retrieval, and semantic validation. We evaluate our enrichment method with two complementary LOD sources: a noisy source with broad coverage, DBpedia, and a manually curated source with a narrow focus on the art domain, Getty. Our experiments show that our workflow can enrich Wikidata with millions of novel statements from external LOD sources with high quality. Property alignment and data quality are key challenges, whereas entity alignment and source selection are well-supported by existing Wikidata mechanisms.




1. Introduction
Wikidata [1], the largest public knowledge graph (KG), has nearly 1.5B statements about 90M
entities. This breadth of information inspires many use cases: museum curators can use
Wikidata to describe their art collections comprehensively, while movie critics could quickly
query and aggregate statistics about recent movies, and analyze them based on their genre.
However, while Wikidata’s data model allows for this information to be present, practice
shows that much of the relevant information is still missing. For example, around half of the
artists in Wikidata have a date of birth, and only 1.88% of the movies recorded in 2020 have
information about their cost. Thus, if a user wants to analyze the cost of the films produced
in 2020, Wikidata will provide cost for only 60 out of its 2,676 recorded films. This triggers
hunger for knowledge [2], activating a goal for the user to seek the missing information. As this
information is unlikely to be found in an aggregated form, and gathering information about
thousands of films is tedious, one must rely on automated tools that can enrich Wikidata with
relevant information. Such information might be available in external linked open data (LOD)
sources like DBpedia [3] or LinkedMDB [4]; yet, no existing methods can enrich Wikidata with
LOD information. Meanwhile, the accuracy of representation-learning-based link prediction [5] is
not yet sufficient to reliably impute the missing information.
   The traditional LOD workflow includes declarative language tools for schema mapping and
templated rewriting of CSV files into RDF [6, 7, 8]. Subsequent ontology alignment [9] can be
employed to discover owl:sameAs links between the two datasets. Yet, merely collapsing nodes

Wikidata’22: Wikidata workshop at ISWC 2022
bohuizha@usc.edu (B. Zhang); ilievski@isi.edu (F. Ilievski); pszekely@isi.edu (P. Szekely)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
Table 1
Example statements found in DBpedia.


        DBpedia statement                                    Wikidata mapping                 Assessment
        dbr:Lesburlesque dbp:genre dbr:Burlesque             Q6530279 P136 Q217117            Correct
        dbr:Amanda_de_Andrade dbp:position 'Left back'en     Q15401730 P413 'Left back'en     Wrong datatype
        dbr:Diego_Torres_(singer) dbp:genre dbr:Flamenco     Q704160 P2701 Q9764              Wrong semantic type
        dbr:Eternal_Moment dbp:director dbr:Zhang_Yibai      Q5402674 P4608 Q8070394          Logical, inaccurate




based on owl:sameAs relations is not sufficient, as this may merge related but dissimilar nodes
(e.g., Barack Obama and the US government) [10]. Schema alignment between two sources
has been the main focus so far, but it may be less critical when enriching Wikidata, which
provides 6.8k external ID properties that explicitly link to entities in other sources.1 A key
challenge for Wikidata is data quality [11], as external knowledge may be noisy, contradictory,
or inconsistent with existing knowledge in Wikidata. Table 1 presents candidate enrichment
statements retrieved from DBpedia: only one out of four statements is correct, whereas the
other three have an incorrect datatype, semantic type, or veracity. It is unclear how to employ
principled solutions for large-scale, high-quality enrichment of Wikidata.
   In this paper, we investigate how to enrich Wikidata with freely available knowledge from the
LOD cloud. We introduce a novel LOD workflow to study the potential of the external identifier
mechanism to facilitate vast high-quality extensions of Wikidata. We start by aligning the
entities automatically through external identifiers, and we infer the property in the external
source based on structural and content information. This schema alignment yields candidate
statements, whose quality is rigorously checked through datatype comparison, semantic con-
straints, and custom validators. We assess the effectiveness of the enrichment workflow on two
LOD knowledge sources: Getty [12], a manually curated domain-specific KG with a narrow
focus on art, and DBpedia [3], a broad coverage KG which has been automatically extracted
from Wikipedia. Extensive experiments on these sources show that our method facilitates vast
and high-quality extension of Wikidata with missing knowledge.
   We make our code and data available to facilitate future work.2


2. Related work
Schema mapping languages, like R2RML [6], Karma [7], and RML [8] enable users to map
relational databases into RDF, optionally followed by a semi-automatic ontology alignment
step. Recent ontology alignment methods [9] typically align two ontologies in the vector space.
   1
       https://www.wikidata.org/wiki/Category:Properties_with_external-id-datatype, accessed March 2, 2022.
   2
       https://github.com/usc-isi-i2/hunger-for-knowledge
Ontology enrichment deals with noise, incompleteness, and inconsistencies of ontologies, by
discovering association rules [13], or by extracting information from WWW documents [14].
Ontology evolution seeks to maintain an ontology up to date with respect to the domain that it
models, or the information requirements that it enables [15].
   Prior work has attempted to enrich Wikidata with satellite data for a given domain, including
frame semantics [16], biodiversity data [17, 18], and cultural heritage data [19]. The commu-
nity’s commonly adopted way to automatically edit Wikidata is to develop bots that inject
the knowledge from the external source into Wikidata [17, 18]. By learning how to represent
entities in a KG like Wikidata, link prediction models [5] can predict missing values by associ-
ating them to known statements. Our work complements prior efforts that enrich Wikidata
with domain-specific data or predict missing links, as we aim to devise a method for generic,
large-scale enrichment of Wikidata with external LOD sources.
   Identity links in the LOD cloud could be explored to combine information from different
sources. However, as we show in this paper, identity links between entities are insufficient for
KG enrichment, as they do not account for the quality and the compatibility of the data. In
that sense, our work relates to prior work that explores identity links in the LOD cloud [10] or
devises mechanisms to discover latent identity links [20]. The goal of our work is different - to
enrich Wikidata by finding high-quality knowledge in well-connected sources.
   The automatic validation of our method relates to efforts that analyze the quality of large
KGs. Beek et al. [21] propose a framework for analyzing the quality of literals on the LOD
Laundromat [22], but it is unclear how to generalize this framework to entity nodes. Prior
work has studied the quality of Wikidata [23] and compared it to the quality of other KGs, like
YAGO [24] and DBpedia [25]. Shenoy et al. [26] apply five semantic constraints on Wikidata in
order to measure statement quality. None of these works has investigated how to automatically
validate external statements that can be used to enrich Wikidata.
Wikidata’s property constraint pages define existing property constraints and report the number
of violations for a single dump.3 Recoin [27] computes relative completeness of entity
information by comparing the available information for an entity against other similar enti-
ties. The Objective Revision Evaluation Service (ORES)4 provides AI-based quality scores (e.g., for
vandalism) for revisions in Wikidata. Our work complements these Wikidata efforts by
providing mechanisms for automatic alignment and semantic validation of knowledge from the
LOD cloud before it is submitted to Wikidata.


3. Method
Our enrichment method is shown in Figure 1. Given a user query, our method queries Wikidata
(𝑊), obtaining a set of statements 𝑆𝑤 = {𝑠𝑤 = (𝑒𝑤 , 𝑝, 𝑜𝑤 ) | 𝑒𝑤 ∈ 𝐸𝑤 } for known subjects 𝐸𝑤 , and
a set of subjects 𝐸𝑢 with unknown values, for industry in the example. We call this step gap
detection, as it generates a set of entities for which we seek missing knowledge in order to
satisfy the user’s query needs. Considering that the LOD cloud contains other sources that
are likely to contain information about the same entities, we perform a manual KG selection

   3
       https://www.wikidata.org/wiki/Help:Property_constraints_portal
   4
       https://www.wikidata.org/wiki/Wikidata:ORES
Figure 1: Our enrichment method, illustrated on enriching Wikidata with additional knowledge from
DBpedia about the query “industry of companies”.


step to determine a relevant set of KGs (𝐺) to consult for the entities in 𝐸𝑢 . Here, the sources 𝐺
(DBpedia in this example) are assumed to overlap with 𝑊 in terms of the entities they describe.
Our schema alignment step consolidates the entities and properties of 𝑊 with those of each 𝐺,
since their entity and property identifiers are generally different. Each pair (𝑒𝑤 , 𝑜𝑤 ) is aligned
to a DBpedia-valued pair (𝑒𝑤 ′ , 𝑜𝑤 ′ ). Similarly, the unknown entities 𝑒𝑢 ∈ 𝐸𝑢 are mapped in
the same way to DBpedia entities 𝑒𝑢 ′ . Then, the external KG DBpedia is queried for property
paths which correspond to the known subject-object pairs (𝑒𝑤 ′ , 𝑜𝑤 ′ ). Among the resulting set of
DBpedia property paths 𝑝𝑔 , our method discovers the path 𝑝 ′ (dbp:industry ) that corresponds to
the Wikidata property 𝑝. After aligning the two schemas, the knowledge retrieval step obtains values
from DBpedia, by querying for (𝑒𝑢 ′ , 𝑝 ′ ) pairs comprised of DBpedia entities and property paths,
resulting in a set of newly found statements 𝑆𝑔 = {𝑠𝑔 = (𝑒𝑢 ′ , 𝑝 ′ , 𝑜𝑔 )}. A semantic validation step
is employed in order to ensure that the semantics of the newly found statements in 𝑆𝑔 after
mapping back to 𝑊 correspond to the semantics intended by Wikidata. The set of validated
statements 𝑆𝑒 = {𝑠𝑒 = (𝑒𝑢 , 𝑝, 𝑜𝑒 )} ⊆ 𝑆𝑔 is finally used to enrich 𝑊. This procedure yields a more
complete response to the user query consisting of a union of the original and the enriched
statements, formally: 𝑆𝑡𝑜𝑡𝑎𝑙 = 𝑆𝑤 ∪ 𝑆𝑒 .

3.1. Gap Detection
We consider a structured query against a target knowledge graph 𝑊 for a query property 𝑝
(e.g., industry). The gap detection step generates a set of subject entities 𝐸𝑤 for which the value
for the property 𝑝 is known in Wikidata, and a set of entities 𝐸𝑢 for which the value of the
property 𝑝 is missing in 𝑊. 𝐸𝑤 and 𝐸𝑢 are subsets of the overall set of target entities and they
are mutually disjoint, formally: 𝐸𝑤 ⊆ 𝐸, 𝐸𝑢 ⊆ 𝐸, 𝐸 = 𝐸𝑤 ∪ 𝐸𝑢 , and 𝐸𝑤 ∩ 𝐸𝑢 = ∅.
   This work focuses on finding values for entities in 𝐸𝑢 that have zero values for a property in
Wikidata. We note that it is possible that the statements in 𝑆𝑤 do not fully answer the query
for the entities 𝐸𝑤 , as these entities may have multiple values for 𝑝, e.g., a politician may have
several spouses throughout their life. Enriching multi-valued properties will be addressed in
future work.
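As an illustration of this step, the sketch below partitions a set of target entities into 𝐸𝑤 and 𝐸𝑢 given an in-memory list of (subject, property, object) edges. It is a minimal Python sketch with made-up company identifiers; the actual pipeline operates over a full Wikidata dump (see Appendix B).

```python
def detect_gap(edges, target_entities, query_property):
    """Split target_entities into E_w (value known) and E_u (value missing),
    and collect the known statements S_w for the query property."""
    s_w = [(s, p, o) for (s, p, o) in edges
           if s in target_entities and p == query_property]
    e_w = {s for (s, _, _) in s_w}
    e_u = set(target_entities) - e_w  # E = E_w ∪ E_u and E_w ∩ E_u = ∅
    return s_w, e_w, e_u


# Toy example for P452 (industry); the company Qnodes are illustrative placeholders.
edges = [("Q100001", "P452", "Q11661"),   # a company with a known industry value
         ("Q100002", "P31", "Q4830453")]  # a company with no P452 value
s_w, e_w, e_u = detect_gap(edges, {"Q100001", "Q100002"}, "P452")
print(e_w, e_u)  # {'Q100001'} {'Q100002'}
```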

3.2. Graph Selection
The graph selection step manually selects a set of LOD sources 𝐺 that can be used to enrich
the results of the query. In this work, we consider an automatically extracted general-domain
KG (DBpedia) and a domain-specific curated KG (Getty). We experiment with using both KGs,
or selecting one based on the posed query.
DBpedia [3] is an open-source KG derived from Wikipedia through information extraction.
DBpedia describes 38 million entities in 125 languages, while its English subset describes 4.58
million entities. A large part of the content in DBpedia is based on Wikipedia’s infoboxes:
data structures containing a set of property–value pairs that summarize the information about
the subject entity of an article. We
use DBpedia infoboxes in order to enrich Wikidata, as this data is standardized, relevant, and
expected to be extracted with relatively high accuracy.
Getty [12] is a curated LOD resource with focus on art. In total, Getty contains entities covering
324,506 people and 2,510,774 places. It consists of three structured vocabularies: (1) Art &
Architecture Thesaurus (AAT) includes terms, descriptions, and other information (like gender
and nationality) for generic concepts related to art and architecture; (2) the Getty Thesaurus of
Geographic Names (TGN) has 321M triples with names, descriptions, and other information
for places important to art and architecture; and (3) the Union List of Artist Names (ULAN)
describes names, biographies, and other information about artists and architects, with 64M
statements.

3.3. Schema Alignment
As Getty and DBpedia have data models different from Wikidata’s, we first align their
schemas to Wikidata in order to query them based on missing information in Wikidata. The
schema alignment consists of two sequential steps:
1. Entity resolution maps all known subject entities in Wikidata, 𝑒𝑤 , unknown subjects 𝑒𝑢 , and
the known objects 𝑜𝑤 to external identifiers 𝑒𝑤′ , 𝑒𝑢′ , and 𝑜𝑤′ , respectively. We map Wikidata nodes to
nodes in external KGs automatically, by leveraging external identifiers and sitelinks available in
Wikidata. In total, Wikidata contains 6.8K external-id properties and 46,595,392 sitelinks, out of
which 5,461,631 link to English Wikipedia pages. Our method leverages
sitelinks to map Wikidata entities to DBpedia nodes,5 while for Getty we use vocabulary-specific
external-id properties in Wikidata: AAT ID (P1014 ) for AAT items, TGN ID (P1667 ) for TGN
items, and ULAN ID (P245 ) for ULAN items. We note that, while the entity mapping is automatic,
    5
    Wikipedia page URIs can trivially be translated to DBpedia URIs.
the selection of the external-id property itself in the current method is manual, e.g., the user
has to specify that P1667 should be used for TGN identifiers.
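As shown in the sketch below, the sitelink-based mapping to DBpedia amounts to a prefix substitution (footnote 5). This is a minimal sketch: the example sitelink is illustrative, and special characters in page titles may need extra handling in practice.

```python
EN_WIKIPEDIA_PREFIX = "https://en.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def sitelink_to_dbpedia(sitelink: str) -> str:
    """Translate an English Wikipedia sitelink into the corresponding DBpedia URI.

    Only the URI prefix is swapped; the page title (including underscores and
    percent-encoding) is kept unchanged."""
    if not sitelink.startswith(EN_WIKIPEDIA_PREFIX):
        raise ValueError("not an English Wikipedia sitelink: " + sitelink)
    return DBPEDIA_PREFIX + sitelink[len(EN_WIKIPEDIA_PREFIX):]


print(sitelink_to_dbpedia("https://en.wikipedia.org/wiki/WOWIO"))
# http://dbpedia.org/resource/WOWIO
```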
2. Property alignment maps the property 𝑝 from Wikidata to a corresponding property path
𝑝 ′ in 𝐺 by combining structural and content information. We query 𝐺 for property paths 𝑝𝑔 with
maximum length 𝐿 that connect the mapped known pairs (𝑒𝑤′ , 𝑜𝑤′ ). We aggregate the obtained
results, by counting the number of results for each 𝑝𝑔 . We preserve the top-10 most common
property paths and use string similarity to select the optimal one. We compare the Wikidata
property label against each of the top-10 candidates using a Python built-in string similarity
function based on Gestalt Pattern Matching [28].6
If the most similar property path has similarity above a threshold (0.9), we select it as the
mapped property 𝑝 ′ ; otherwise, we select the top-1 most frequent property. In the example in
Figure 1, the Wikidata property P452 maps to dbp:industry in DBpedia, which is both
the most frequent property in the aligned results and the one with maximum string similarity
to 𝑝. We expect string similarity and value frequency to complement each other. For
example, the top-1 most frequent property for P149 (architectural style) is dbp:architecture ,
while string similarity selects the correct mapping dbp:architecturalStyle from the candidates.
The property chain 𝑝 ′ can be quite complex, e.g., Wikidata’s place of birth (P19) maps to
a 4-hop path in Getty: foaf:focus → gvp:biographyPreferred → schema:birthPlace →
skos:exactMatch .
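The selection rule described above can be sketched as follows. The sketch assumes that the candidate property paths connecting the known (𝑒𝑤 ′ , 𝑜𝑤 ′ ) pairs have already been counted; difflib is the Python standard-library implementation of Gestalt Pattern Matching [28], and the example counts are made up.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_property(wikidata_label: str, candidate_paths: Counter,
                   threshold: float = 0.9) -> str:
    """Select the external property path for a Wikidata property.

    candidate_paths maps each candidate path (e.g., 'dbp:industry') to the
    number of known (e_w', o_w') pairs it connects in the external graph."""
    top10 = [path for path, _ in candidate_paths.most_common(10)]

    def similarity(path: str) -> float:
        # Gestalt Pattern Matching, as implemented in Python's difflib.
        return SequenceMatcher(None, wikidata_label.lower(), path.lower()).ratio()

    best = max(top10, key=similarity)
    # Prefer the most similar candidate if it is a near-exact match;
    # otherwise fall back to the most frequent path.
    return best if similarity(best) > threshold else top10[0]


# Illustrative counts for P452 (industry); the numbers are made up.
counts = Counter({"dbp:industry": 950, "dbp:products": 310, "dbp:type": 120})
print(align_property("industry", counts))  # dbp:industry
```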


3.4. Knowledge Retrieval
The schema alignment step produces external identifiers for the unknown entities 𝑒𝑢′ and
an external property path 𝑝 ′ that corresponds to the property in the original query. The
user can query the external graph 𝐺 with the mapped subject-property pair (𝑒𝑢′ , 𝑝 ′ ), in order
to automatically retrieve knowledge. In Figure 1, an example of (𝑒𝑢′ , 𝑝 ′ ) pair is (dbr:WOWIO,
dbp:industry ). We denote the candidate objects found in 𝐺 with this step with 𝑜𝑔 . As the
candidate object identifiers belong to the external graph, the user can perform inverse entity
resolution by following sitelinks or external identifiers from the external graph 𝐺 to 𝑊. This
step results in newly found Wikidata objects 𝑜𝑔′ for the unknown entities 𝑒𝑢 , completing their
statements (𝑒𝑢 , 𝑝, 𝑜𝑔′ ).
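A minimal sketch of this lookup, assuming the external triples are indexed by (subject, property path) and the inverse entity mapping is available as a dictionary; all identifiers in the example are illustrative placeholders rather than verified Wikidata nodes.

```python
def retrieve(unknown_ext_entities, mapped_path, external_index, to_wikidata):
    """Return candidate statements (e_u, p', o_g') for the unknown entities.

    external_index: dict (external subject, property path) -> list of objects
    to_wikidata:    dict from external identifiers back to Wikidata Qnodes"""
    candidates = []
    for e_ext in unknown_ext_entities:
        for o_ext in external_index.get((e_ext, mapped_path), []):
            e_u = to_wikidata.get(e_ext)
            o_g = to_wikidata.get(o_ext, o_ext)  # literals pass through unchanged
            if e_u is not None:
                candidates.append((e_u, mapped_path, o_g))
    return candidates


# Illustrative data for the (dbr:WOWIO, dbp:industry) pair of Figure 1;
# the Qnodes below are placeholders, not verified Wikidata identifiers.
index = {("dbr:WOWIO", "dbp:industry"): ["dbr:Electronic_publishing"]}
back = {"dbr:WOWIO": "Q100003", "dbr:Electronic_publishing": "Q100004"}
print(retrieve(["dbr:WOWIO"], "dbp:industry", index, back))
```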

3.5. Semantic Validation
Despite the schema alignment, the candidate objects 𝑜𝑔′ may be noisy: they may have a wrong
datatype (e.g., date instead of a URI), an incorrect semantic type (e.g., a nationality instead of a
country), or a literal value that is out of range (e.g., death year 2042). We trim noisy objects
with three validation functions.
1. Datatypes Each Wikidata property has a designated datatype that the candidate objects
have to conform to. For instance, the spouses are expected to be Qnodes, while movie costs
should be numeric values with units (e.g., 4 million dollars). To infer the expected datatype of a
property, we count the datatypes of the known objects 𝑜𝑤 , and select the top-1 most common


    6
Our empirical study showed that this function leads to accuracy comparable to Levenshtein distance, but it is
more efficient.
datatype. This function returns a subset of statements 𝑆𝑣1 with candidate objects that belong to
the expected datatype.
2. Property constraints We validate the object values further based on property constraints
defined in Wikidata. Specifically, we use value-type constraints to validate the semantic type
of the objects. Value type constraints are similar to property range constraints [26], but they
provide a more extensive definition that includes exception nodes and specifies whether the
type property is: P31 (instance of), P279 (subclass of), or both. Figure 2 in the Appendix shows an
example of a value type constraint for the property P452 (industry). We automatically validate
the value type of all statements for a property, by comparing their object value to the expected
type. Following [26], we encode the value type constraint for a property as a KGTK [29] query
template. Each template is instantiated once per property, allowing for efficient constraint
validation in parallel. Constraint violations for a property are computed in a two-step manner:
we first obtain the set of statements that satisfy the constraint for a property, and then we
subtract this set from the overall number of statements for that property. The constraint
validation function yields a set of validated statements 𝑆𝑣2 .
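A simplified sketch of the value-type check: the object of each candidate statement must have at least one type among the classes allowed by the property's constraint. The real validator is expressed as a KGTK query template and also honors exception lists and the constraint's relation setting (P31, P279, or both); those details are omitted here, and the type assignments in the example are illustrative.

```python
def validate_value_type(candidates, type_edges, allowed_classes, exceptions=frozenset()):
    """Keep statements whose object is typed with an allowed class (S_v2).

    type_edges:      dict Qnode -> set of classes reachable via P31/P279
    allowed_classes: classes listed in the value-type constraint of the property"""
    valid = []
    for (s, p, o) in candidates:
        if o in exceptions or type_edges.get(o, set()) & set(allowed_classes):
            valid.append((s, p, o))
    return valid


# Allowed classes for P452 (industry) are taken from Figure 2 in the Appendix;
# the candidate Qnodes and their types are illustrative placeholders.
allowed = {"Q8148", "Q268592", "Q8187769", "Q3958441", "Q121359"}
types = {"Q100004": {"Q8148"},     # typed with an allowed class: kept
         "Q100005": {"Q100006"}}   # typed with a disallowed class: dropped
cands = [("Q100003", "P452", "Q100004"), ("Q100007", "P452", "Q100005")]
print(validate_value_type(cands, types, allowed))
```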
3. Literal range We validate date properties (e.g., date of birth) by ensuring that they do not
belong to the future, i.e., that every recorded date (excluding expected dates) is earlier than the
current date. This function outputs a set of valid statements 𝑆𝑣3 .
   The set of validated statements is the intersection of the results returned from the validation
functions: 𝑆𝑣 = 𝑆𝑣1 ∩ 𝑆𝑣2 for Qnodes and 𝑆𝑣1 ∩ 𝑆𝑣2 ∩ 𝑆𝑣3 for date values. The validated statements in
𝑆𝑒 have the form (𝑒𝑢 , 𝑝, 𝑜𝑒 ). The total set of statements for the user query becomes 𝑆𝑡𝑜𝑡𝑎𝑙 = 𝑆𝑤 ∪ 𝑆𝑒 .
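The literal-range check and the final intersection are straightforward; below is a sketch under the assumption that each validator returns a set of candidate statements and that dates are ISO-formatted strings, optionally prefixed with "^" as in KGTK.

```python
from datetime import date, datetime


def validate_date_range(candidates):
    """Keep date-valued statements whose date is not in the future (S_v3)."""
    today = date.today()
    valid = set()
    for (s, p, o) in candidates:
        try:
            recorded = datetime.strptime(o.lstrip("^")[:10], "%Y-%m-%d").date()
        except ValueError:
            continue  # unparsable values are left to the other validators
        if recorded <= today:
            valid.add((s, p, o))
    return valid


def combine(s_v1, s_v2, s_v3=None):
    """S_v = S_v1 ∩ S_v2 for Qnode values, and S_v1 ∩ S_v2 ∩ S_v3 for date values."""
    s_v = set(s_v1) & set(s_v2)
    return s_v & set(s_v3) if s_v3 is not None else s_v
```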


4. Experimental Setup
Knowledge graphs We experiment with batch enrichment for 955 Wikidata properties that
have value-type constraints. We use the Wikidata 2021-02-15 dump in a JSON format and the
2021-10-27 sitelinks file. We use the 2021-12-01 DBpedia infobox file in a Turtle (.ttl) format.7
We select its canonicalized version, as it ensures that its subjects and objects can be mapped to
English Wikipedia pages. For the Getty vocabularies, we download the 2021-09-18 dump in
N-Triples (.nt) from their website.8
Evaluation To evaluate the quality of the overall enrichment, we randomly sample candidate
statements and annotate their validity manually. Specifically, we sample 100 statements from
the DBpedia set and 30 from Getty. Two annotators first annotate the 130 statements independently
by searching for each one on the Web to judge its correctness, and then discuss to resolve two
conflicts. In the sampled subset, 20/100 of the DBpedia statements and 11/30 of Getty are correct,
while the rest are incorrect. We label three reasons for incorrect statements: wrong datatype,
wrong semantic type, and inaccurate information. Out of the 80 incorrect DBpedia statements,
66 have incorrect datatype, 7 incorrect semantic type, and 7 are inaccurate. For Getty, 0 are
incorrect datatype, 17 are incorrect semantic type, and 2 are inaccurate.
   To investigate the quality of our property alignment, we take property mappings provided by
owl:equivalentProperty in DBpedia and P1628 (equivalent property) in Wikidata as ground

    7
        https://databus.dbpedia.org/dbpedia/generic/infobox-properties/
    8
        http://vocab.getty.edu/
Table 2
Batch enrichment results when using DBpedia, Getty, and both KGs. |𝑆∗ | shows numbers of statements.
In total, we consider 955 properties. |𝑝 ′ | shows the numbers of properties mapped to each of the KGs.


                            |𝑝 ′ |          |𝑆𝑤 |          |𝑆𝑔 |          |𝑆𝑒 |         |𝑆𝑡𝑜𝑡𝑎𝑙 |
               DBpedia      582      106,104,551    41,309,864     21,023,187     127,127,738
                Getty         3          195,153        10,518          5,766         200,919
                Both        582      106,104,551    41,320,382     21,028,953     127,328,657



truth. We dub this data Equivalence . In the ground truth pairs we collected, each Wikidata
property is mapped to one or multiple DBpedia properties. In total, 88 Wikidata properties are
mapped to 101 DBpedia properties. We formulate a task where the goal is to map a Wikidata
property to at least one of its corresponding DBpedia properties. Besides correct and incorrect
mappings, we annotate an intermediate category of close match for properties that match
partially. We show two F1-values of our method: hard, which only counts exact matches, and
soft, which includes partial matches. We compare our method to three baselines. The string
matching and frequency matching baselines are ablations of our method that only consider
string similarity or frequency, but not both. The third baseline embeds all DBpedia labels with
BERT [30] and uses cosine distance to select the closest property label.
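For reference, the sketch below shows one way such a baseline can be implemented: embed the Wikidata property label and each DBpedia property label with BERT and pick the candidate with the highest cosine similarity. The model name (bert-base-uncased) and mean pooling over token embeddings are assumptions, since the paper does not specify these details.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
model = AutoModel.from_pretrained("bert-base-uncased")


def embed(labels):
    """Mean-pooled BERT embeddings for a list of property labels."""
    batch = tokenizer(labels, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)


def closest_property(wikidata_label, dbpedia_labels):
    query = embed([wikidata_label])
    candidates = embed(dbpedia_labels)
    sims = torch.nn.functional.cosine_similarity(query, candidates)
    return dbpedia_labels[int(sims.argmax())]


print(closest_property("industry", ["industry", "genre", "location"]))
```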


5. Results
In this Section, we present five experimental findings that concern the potential of our method,
its overall performance per component, and its consistency when the external information is
covered by Wikidata.
Finding 1: Our method can enrich Wikidata with millions of statements about millions
of entities. Table 2 shows the results of our method for all properties in Wikidata that have a
value-type constraint. Out of 955 Wikidata properties, our method is able to align 582 properties
with DBpedia and 3 with Getty. For all properties aligned with DBpedia and Getty, we gather 21
million statements enriching the original Wikidata knowledge by 16.54%. Interestingly, while
the original Wikidata focuses on a broad coverage of entities, our method is able to enrich more
properties for a smaller set of entities, signifying a higher density and a narrower focus.
The |𝑆𝑔 | column shows that our method collects 41 million candidate statements from DBpedia
and Getty, out of which 21 million pass the semantic validation. The median number of novel
statements per property is 982. For 161 (27.66%) properties, our method provides double or
more statements relative to the original set of statements. The relative increase of knowledge
is the lowest for the properties P538 (fracturing), P209 (highest judicial authority), and P1283
(filmography). Meanwhile, the properties P66 (ancestral home), P500 (exclave of), and P3179
(territory overlaps) are relatively sparse in Wikidata, and receive many more statements from
DBpedia. Comparing the two external KGs, we observe that DBpedia overall contributes many
more statements than Getty, and brings a higher enrichment per property, averaging at 16.54%
Table 3
Evaluation results on 130 candidate triples: 100 from DBpedia and 30 from Getty. We show the accuracy,
precision, recall and F1-score of our semantic validation on this subset. Getty does not have values with
a wrong datatype (‘-’).


    KG         Accuracy     Precision     Recall    F1-score                 Accuracy per category
                                                                Correct     Datatype Sem. type Inaccurate
 DBpedia         87.00%        61.54%    84.21%       71.11%     80.00%        96.97%        71.43%         28.57%
  Getty          93.33%        84.62%    100.0%       91.67%     100.0%              -        100%              0%
  Both           88.46%        69.23%    90.00%       78.26%     87.10%        96.97%        91.67%         22.22%



vs 2.87% for Getty.
Finding 2: The overall quality of the enriched statements is relatively high. Table 3
shows that our method can distinguish between correct and incorrect statements with a relatively
high accuracy of over 88%. As the majority of the candidates are incorrect, we observe that
the F1-score is lower than the accuracy. The precision of our method is lower than the recall
on both DBpedia and Getty, which indicates that most of the disagreement of our method
with human annotators is because of false positives, i.e., incorrect statements identified as
correct. This indicates that our semantic validation is accurate but it is not complete, and it
could benefit from additional validators. This observation is further supported by the relatively
higher precision and recall of our method on Getty in comparison to DBpedia. As Getty is
manually curated and enforces stricter semantics, it has a smaller range of data quality aspects
to address, most of which are already covered by our method. The quality issues in the case of
DBpedia are heterogeneous, as a result of its automatic extraction and lack of curation. We find
that out of 130 triples, 52 had incorrect property mappings, and the semantic validation is able
to correct 44 of them. For example, Wikidata property P208 (executive body) got mapped to
dbp:leaderTitle in DBpedia, and its value Q30185 was not allowed by the Wikidata constraint
for P208 . We evaluate property alignment and semantic validation in more detail later in this
Section.
Finding 3: Property mapping performance is relatively high, but sparse properties
are difficult. Table 2 showed that around 40% of the target Wikidata properties had no match
found in DBpedia with our method. To investigate whether this is because of misalignment
between the two schemas or a limitation of our method, we evaluate the property alignment
step on the Equivalence data. The results are shown in Table 4. Our method achieves the best
F1-score for both the soft match cases (89%) and the hard match cases (66%).9 As frequency and
string matching are ablations from our method, their lower performance supports our decision
to combined them to get the best of both worlds. For instance, frequency matching tends to
prefer more general properties over specific ones, mapping P30 (continent) to dbp:location .
Thanks to string matching, our method predicts the right property dbp:continent in this
case. Conversely, string matching is easily confused by cases where the labels are close but the
    9
     We also manually evaluate the property matching methods on a separate randomly chosen set of 20 properties,
and observe similar results.
Table 4
Evaluation results on known aligned properties between DBpedia and Wikidata. We compare against
exact match and language model baselines. We also show the performance of our method per property
quartile, where the quartiles are based on the number of examples for a property.


             Method                      Hard match                       Soft match
                             Precision      Recall F1-score   Precision      Recall F1-score
       BERT Embedding          47.73%      47.73%   47.73%      72.73%     72.73%    72.73%
      Frequency Matching       52.33%      51.14%   51.72%      81.40%     79.55%    80.46%
        String Matching        52.27%      52.27%   52.27%      86.36%     86.36%    86.36%
          Our Method           66.28%      64.77%   65.52%      89.53%     87.50%    88.51%
        Our method (Q1)         71.43%     68.18%    69.77%     90.48%      86.36%    88.37%
        Our method (Q2)         63.64%     63.64%    63.64%     100.0%      100.0%    100.0%
        Our method (Q3)         81.82%     81.82%    81.82%     86.36%      86.36%    86.36%
        Our method (Q4)         47.62%     45.45%    46.51%     80.95%      77.27%    79.07%



actual meaning is not related, e.g., it maps P161 (cast member) to dbp:pastMember , whereas
our method correctly maps it to the ground truth result dbp:starring owing to its frequency
component. As our method still largely relies on the frequency of statements, we hypothesize
that its performance decreases for properties with fewer known statements. To investigate
this hypothesis, we divide the ground truth properties into four quartiles (of 22 properties)
based on the descending size of their original Wikidata statements. We evaluate the accuracy
of our property matching per quartile. We note that the performance of the first three quartiles,
which have larger numbers of statements, is better than that of the last quartile, which indicates
that the precision of our method is positively correlated with the size of known statements.
This limitation can be addressed in the future with more robust methods, e.g., based on learned
property representations.
Finding 4: Semantic validation can detect wrong datatypes and semantic types. Table 2
shows that the semantic validation has a large impact on the results: out of 41 million candidate
statements initially found by our method, around half of them satisfy our validation function.
The compatibility ratios are similar for both Getty and DBpedia (50.89% to 54.82%), which is
surprising, considering that DBpedia has been largely extracted automatically and is error-prone,
whereas Getty is well-curated and considered an authority. To study the precision and recall of
our semantic validation, we annotate three reasons for incorrect statements: wrong datatype,
wrong semantic type, and inaccurate information. We found that (Table 3) our method performs
well on identifying correct statements (F1-score 93.10%), as well as detecting errors due to
wrong datatypes (F1-score 96.97%) and incorrect semantic types (F1-score 91.67%). It performs
relatively worse when the statements satisfy the validation but are inaccurate. For example, the
enriched statement (Q6712846 P19 Q49218 ) for the property P19 (place of birth) from Getty is
logical but inaccurate, since the value Q49218 satisfies the value-type constraints while it is not
the actual birth place of Q6712846 . Among 130 triples, our method produces 7 false positive
cases that are factually incorrect. These results are expected, given that our method is designed
Table 5
Evaluation results for data consistency. Numbers of Wikidata statements, enriched statements, over-
lapping entity-property values, agreeing statements, and disagreeing statements are counted. The
agreement ratio 𝑟𝑎𝑔𝑟𝑒𝑒 is calculated by |𝑆𝑎𝑔𝑟𝑒𝑒 |/|𝑆𝑜𝑣𝑒𝑟𝑙𝑎𝑝 |.


            KG (Property)         |𝑆𝑤 |       |𝑆𝑒 |   |𝑆𝑜𝑣𝑒𝑟𝑙𝑎𝑝 |    |𝑆𝑎𝑔𝑟𝑒𝑒 |   |𝑆𝑑𝑖𝑠𝑎𝑔𝑟𝑒𝑒 |     𝑟𝑎𝑔𝑟𝑒𝑒
           DBpedia (P19)    2,711,621     467,976     884,078       461,089      422,989        52.15%
           DBpedia (P20)    1,080,900     119,161     219,447       128,523       90,924        58.57%
            Getty (P19)        65,411       2,939      16,304        13,607        2,697        83.46%
            Getty (P20)        50,295       2,556      14,722        12,594        2,128        85.55%



to filter out illogical information, while analyzing veracity is beyond its current scope.
Finding 5: Most results for functional properties are consistent between external
graphs and Wikidata, many disagreements are due to different granularities. The
analysis so far focused on the novel statements, i.e., statements that provide novel object values for
subject-property pairs that have no value in Wikidata. Here, we measure the consistency of the
validated statements about the known entities from the external graph against the known statements
from Wikidata. We select two functional properties: P19 (place of birth) and P20 (place of
death), and count the agreements and disagreements for both DBpedia and Getty. The results
in Table 5 for P19 and P20 show that 52-59% of the overlapping statements between Wikidata
and DBpedia, and 83-86% of the overlapping statements between Wikidata and Getty coincide.
Qualitative inspection of the disagreements reveals that many of the disagreements are due to
different granularity choices between the two graphs. For instance, in Wikidata, the place of
death of Q1161576 (Daniel Lindtmayer II) is Q30978 (Central Switzerland), which is a region,
while Getty provides the specific city Q4191 (Lucerne).


6. Discussion and Future Work
Our enrichment method has been shown to quickly retrieve millions of novel property values
in the LOD cloud for entities in Wikidata. As some of the LOD knowledge is extracted in an
automatic way, ensuring quality is important - our semantic validation based on datatypes and
constraints found around half of the candidate statements to be invalid. Analysis of a subset
of the enriched statements revealed that the accuracy of our method is close to 90%, which is
reasonably high. Still, our method is merely a step towards the ambitious goal of addressing the
notorious challenge of sparsity of today’s large KGs [31]. Here, we discuss three key areas of
improvement for our method:
1. Enrichment with more LOD KGs - We showed that our method is effective with two
external KGs: a general-domain and automatically extracted knowledge graph, DBpedia, and a
domain-specific, well-curated knowledge graph, Getty. As Wikidata is still largely incomplete
after this enrichment, we can use the 6.8k external identifier properties provided by Wikidata
to enrich it with thousands of other sources. While we expect that our method can be applied to
these thousands of sources, an in-depth investigation of the potential and the quality of this
enrichment is beyond the scope of the current paper.
2. Semantic validation - Our method validates candidate statements through datatype and
value type constraints. Value type constraints ensure semantic type compatibility of the retrieved
statements, yet they are only one of the 30 property constraint types defined in Wikidata. Other
Qnode constraints in Wikidata can be employed to generalize our method. For instance, Qnode-
valued statements can be further validated via constraints like one-of (Q21510859) , whereas
literals can be checked with range (Q21510860) .
3. Validating veracity - Table 3 shows that our method performs relatively well at detecting
statements with incorrect datatype or semantic type, whereas it is usually unable to detect inac-
curate statements. As discussed by Piscopo [11], veracity is a key aspect of quality of knowledge
in KGs. Our method can be further enhanced with models that detect KG vandalism [32] or
estimate trust of sources (e.g., through references) [23] to estimate veracity.


7. Conclusions
Recognizing the notorious sparsity of modern knowledge graphs, such as Wikidata, and the
promise of linked data information, like external identifiers, to facilitate enrichment, we pro-
posed a method which consists of five steps (gap detection, external graph selection, schema
alignment, knowledge retrieval, and semantic validation) for enriching Wikidata with external
KGs found in the LOD cloud. Our experiments showed that the LOD-based method can enrich
Wikidata with millions of new high-quality statements from DBpedia and Getty. High-quality
enrichment is achieved based on large-scale automated semantic validation and a hybrid algo-
rithm for property alignment. A key future direction is evaluating the generalization of our
method on thousands of LOD sources that Wikidata points to, which opens novel challenges of
source selection, more extensive semantic validation, and trust.


References
 [1] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications
     of the ACM 57 (2014) 78–85.
 [2] F. Ilievski, P. Vossen, M. Van Erp, Hunger for contextual knowledge and a road map to
     intelligent entity linking, in: International Conference on Language, Data and Knowledge,
     Springer, Cham, 2017, pp. 143–149.
 [3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, Dbpedia: A nucleus for a
     web of open data, in: The semantic web, Springer, 2007, pp. 722–735.
 [4] O. Hassanzadeh, M. P. Consens, Linked movie data base, in: LDOW, 2009.
 [5] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings
     for modeling multi-relational data, Advances in neural information processing systems 26
     (2013).
 [6] S. Das, R2RML: RDB to RDF mapping language, http://www.w3.org/TR/r2rml/ (2011).
 [7] C. A. Knoblock, P. Szekely, J. L. Ambite, A. Goel, S. Gupta, K. Lerman, M. Muslea,
     M. Taheriyan, P. Mallick, Semi-automatically mapping structured sources into the semantic
     web, in: Extended Semantic Web Conference, Springer, 2012, pp. 375–390.
 [8] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, Rml:
     a generic language for integrated rdf mappings of heterogeneous data, in: Ldow, 2014.
 [9] S.-C. Chu, X. Xue, J.-S. Pan, X. Wu, Optimizing ontology alignment in vector space, Journal
     of Internet Technology 21 (2020) 15–22.
[10] W. Beek, J. Raad, J. Wielemaker, F. Van Harmelen, sameAs.cc: The closure of 500M
     owl:sameAs statements, in: European semantic web conference, Springer, 2018, pp. 65–80.
[11] A. Piscopo, E. Simperl, What we talk about when we talk about wikidata quality: a literature
     survey, in: Proceedings of the 15th International Symposium on Open Collaboration, 2019,
     pp. 1–11.
[12] P. Harpring, Development of the getty vocabularies: Aat, tgn, ulan, and cona, Art
     Documentation: Journal of the Art Libraries Society of North America 29 (2010) 67–72.
[13] C. d’Amato, S. Staab, A. G. Tettamanzi, T. D. Minh, F. Gandon, Ontology enrichment
     by discovering multi-relational association rules from ontological knowledge bases, in:
     Proceedings of the 31st Annual ACM Symposium on Applied Computing, 2016, pp. 333–338.
[14] A. Faatz, R. Steinmetz, Ontology enrichment with texts from the www, Semantic Web
     Mining 20 (2002).
[15] F. Zablith, G. Antoniou, M. d’Aquin, G. Flouris, H. Kondylakis, E. Motta, D. Plexousakis,
     M. Sabou, Ontology evolution: a process-centric survey, The knowledge engineering
     review 30 (2015) 45–75.
[16] H. Mousselly-Sergieh, I. Gurevych, Enriching wikidata with frame semantics, in: Pro-
     ceedings of the 5th Workshop on Automated Knowledge Base Construction, 2016, pp.
     29–34.
[17] A. Waagmeester, L. Schriml, A. Su, Wikidata as a linked-data hub for biodiversity data,
     Biodiversity Information Science and Standards 3 (2019) e35206.
[18] S. Burgstaller-Muehlbacher, A. Waagmeester, E. Mitraka, J. Turner, T. Putman, J. Leong,
     C. Naik, P. Pavlidis, L. Schriml, B. M. Good, A. I. Su, Wikidata as a semantic framework
     for the Gene Wiki initiative, Database: The Journal of Biological Databases and Curation
     2016 (2016) baw015. doi:10.1093/database/baw015 .
[19] G. Faraj, A. Micsik, Enriching wikidata with cultural heritage data from the courage
     project, in: Research Conference on Metadata and Semantics Research, Springer, 2019, pp.
     407–418.
[20] J. Raad, N. Pernelle, F. Saïs, Detection of contextual identity links in a knowledge base, in:
     Proceedings of the knowledge capture conference, 2017, pp. 1–8.
[21] W. Beek, F. Ilievski, J. Debattista, S. Schlobach, J. Wielemaker, Literally better: Analyzing
     and improving the quality of literals, Semantic Web 9 (2018) 131–150.
[22] W. Beek, L. Rietveld, H. R. Bazoobandi, J. Wielemaker, S. Schlobach, Lod laundromat:
     a uniform way of publishing other people’s dirty data, in: International semantic web
     conference, Springer, 2014, pp. 213–228.
[23] A. Piscopo, L.-A. Kaffee, C. Phethean, E. Simperl, Provenance information in a collaborative
     knowledge graph: an evaluation of wikidata external references, in: International semantic
     web conference, Springer, 2017, pp. 542–558.
[24] H. Turki, D. Jemielniak, M. A. H. Taieb, J. E. L. Gayo, M. B. Aouicha, M. Banat, T. Shafee,
     E. Prud’Hommeaux, T. Lubiana, D. Das, D. Mietchen, Using logical constraints to validate
     information in collaborative knowledge graphs: a study of COVID-19 on Wikidata, 2020.
     doi:10.5281/zenodo.4445363 .
[25] M. Färber, F. Bartscherer, C. Menne, A. Rettinger, Linked data quality of dbpedia, freebase,
     opencyc, wikidata, and yago, Semantic Web 9 (2018) 77–129.
[26] K. Shenoy, F. Ilievski, D. Garijo, D. Schwabe, P. Szekely, A study of the quality of wikidata,
     Journal of Web Semantics (2021).
[27] V. Balaraman, S. Razniewski, W. Nutt, Recoin: relative completeness in wikidata, in:
     Companion Proceedings of the The Web Conference 2018, 2018, pp. 1787–1792.
[28] J. W. Ratcliff, D. E. Metzener, Pattern matching: The gestalt approach, Dr. Dobb’s Journal
     (1988) 46.
[29] F. Ilievski, D. Garijo, H. Chalupsky, N. T. Divvala, Y. Yao, C. Rogers, R. Li, J. Liu, A. Singh,
     D. Schwabe, P. Szekely, Kgtk: a toolkit for large knowledge graph manipulation and
     analysis, in: International Semantic Web Conference, Springer, Cham, 2020, pp. 278–293.
[30] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
     transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
     org/abs/1810.04805. arXiv:1810.04805 .
[31] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun,
     W. Zhang, Knowledge vault: A web-scale approach to probabilistic knowledge fusion, in:
     Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery
     and data mining, 2014, pp. 601–610.
[32] S. Heindorf, M. Potthast, B. Stein, G. Engels, Vandalism detection in wikidata, in: Pro-
     ceedings of the 25th ACM International on Conference on Information and Knowledge
     Management, 2016, pp. 327–336.
Appendix
A. Value-type Constraint Example




Figure 2: Example value-type constraint of the property industry (P452 ). The values associated with this
property should belong to one of the following types: [industry, industry, economic activity,
economic sector, infrastructure] , whose respective Qnodes are [Q8148, Q268592, Q8187769,
Q3958441, Q121359] . The type can be either encoded as an instance-of (P31 ) or a subclass-of (P279 )
property. There are no entities in Wikidata that are exceptions to this property constraint.



B. Method Implementation
We implement our method using the Knowledge Graph ToolKit (KGTK) [29]. For Getty we
obtain paths with maximum length of 𝐿 = 4, for DBpedia 𝐿 = 1. In the schema alignment step,
we count property frequency based on a sample of up to 200,000 pairs of known subjects and
objects (𝑒𝑤 , 𝑜𝑤 ).

C. Enrichment of Literals
Our method can also be used to enrich literal information about entities. We run our method
on two functional properties: P569 (date of birth) and P570 (date of death). We compare the
obtained results from the external graphs to those found in Wikidata, for entity-property pairs
where both Wikidata and the external graph have a value. We compare the results on the finest
granularity provided by the two graphs, which is dates for DBpedia and years for Getty. The
results for property P570 in Figure 3 show a clear trend of the points in scatter plots distributed
along the line 𝑦 = 𝑥, which shows high consistency of the date data between Wikidata
and the external KGs.10 Specifically, we observe that the agreement rate with Wikidata values
is 89.28% (1,271,862 out of 1,424,526) for DBpedia and 82.39% for Getty (125,913 out of 152,824).
From Getty and DBpedia, our method can enhance Wikidata with novel P569 values for 35,459
entities and novel P570 values for 20,664 entities.




   10
    We observe a similar trend for P569 .
Figure 3: Scatter plots of literal consistency for P570 (date of death) between Wikidata and the external
graph: DBpedia (a) and Getty (b). The plots show the subject-property pairs for which both Wikidata
and the external KG have a value. Ticks on both the X- and the Y-axis represent years.