Analysis of Consistency between Wikidata and Wikipedia Categories

Leila Feddoul (1,2), Frank Löffler (1,3), Sirko Schindler (2)

(1) Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Jena, Germany
(2) Institute of Data Science, German Aerospace Center DLR, Jena, Germany
(3) Competence Center for Digital Research, Michael Stifel Center, Jena, Germany

Abstract

Wikipedia categories play a significant role in organizing articles by topic. They form a hierarchy that groups related articles into larger collections. Wikidata provides a corresponding item for each category and allows membership of other items in a given category to be defined either by a SPARQL query or by specifying classes and properties. This provides us with multiple, redundant sources of category membership, which may deviate quite substantially. In this paper, we investigate inconsistencies between Wikipedia and Wikidata category members and analyze possible reasons. We propose a candidate category generation and evaluation workflow that traverses the category hierarchy of Wikipedia in all available languages and compares the results with information obtained from Wikidata. This workflow can be executed either online, using the publicly available endpoints, or offline, based on the provided dumps. Furthermore, we formulate concrete suggestions to harmonize category membership definitions between Wikipedia and Wikidata.

Keywords: Wikidata, Wikipedia, Wikipedia Category

1. Introduction

Wikipedia has grown to be a valuable source of semi-structured information, written and maintained by a large community and provided for everyone to use. As of 2022, it contains over 6.5 million articles in its English section (https://en.wikipedia.org/wiki/Wikipedia:Statistics), but is also available in 329 other languages (https://en.wikipedia.org/wiki/Wikipedia). It has a community of about 280,000 active editors and more than 100 million registered users. The basic building blocks of Wikipedia are articles that are interlinked among each other.
Wikidata [1] is, like Wikipedia, free and open, but instead of being a collection of articles intended primarily to be read by humans, it is a knowledge base intended to be read and edited by both humans and machines. Wikidata is a source of open data that other projects, including Wikipedia, can use to enrich their services. The basic building block of Wikidata is an item, which represents any kind of uniquely identified real-world topic, concept, or entity.

Wikidata'22: Wikidata workshop at ISWC 2022
leila.feddoul@uni-jena.de (L. Feddoul); frank.loeffler@uni-jena.de (F. Löffler); sirko.schindler@dlr.de (S. Schindler)
ORCID: 0000-0001-8896-8208 (L. Feddoul); 0000-0001-6643-6323 (F. Löffler); 0000-0002-0964-4457 (S. Schindler)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

[Figure 1: Wikidata category Association football video games (Q13199045), showing its target, qualifiers, and SPARQL query.]

Wikipedia established ways to structure its building blocks (articles): categories, i.e. sets of articles or subcategories, are among them. They play an important role, since they support finding sets of articles sharing the same characteristics without knowing individual articles beforehand. The Wikipedia category structure has also been exploited for other tasks like entity retrieval [2] or document classification [3]. Wikipedia's categories (https://en.wikipedia.org/wiki/Wikipedia:Categorization) group articles with similar topics. E.g., Category:Former countries groups a set of articles related to the concept of a former country. This includes not only articles about the respective former countries, like Inca Empire, but also subcategory pages, e.g., Category:Former countries in fiction.
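Category members such as those of Category:Former countries can be enumerated programmatically through the standard MediaWiki API (list=categorymembers). The following sketch only assembles the request URL; the parameter names are the documented API ones, while the helper function itself is ours:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def category_members_url(category: str) -> str:
    """Build a MediaWiki API request that lists a category's direct
    members (both article pages and subcategory pages)."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page|subcat",  # articles and subcategories
        "cmlimit": "500",         # page through larger sets via 'cmcontinue'
    }
    return API + "?" + urlencode(params)

url = category_members_url("Former countries")
```

The subcategory pages returned with cmtype=subcat are what turn the category system into a nested structure rather than a flat set of labels.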
Categories can contain subcategories, but the resulting data structure is not a tree but a more general graph, because articles and subcategories can be members of multiple parent categories and, while discouraged, even loops can exist (https://en.wikipedia.org/wiki/Wikipedia:FAQ/Categorization). There is a close connection between categories in Wikipedia and Wikidata. In general, for each page (article, category, or otherwise) in Wikipedia, there exists a corresponding Wikidata item that is unique across all languages [4]. Wikidata category items are instances of Wikimedia category (Q4167836). This type of Wikidata item has specific properties, some of which describe a criterion for membership of a given Wikidata item in the considered category: (i) Category contains (P4224) is described as "category contains elements that are instances of this item" and consists of a value together with qualifiers, if available. (Qualifiers provide additional information about a specific statement that may not be representable in a single triple; see https://www.wikidata.org/wiki/Help:Qualifiers.) The property value refers to the type of the contained items and is referred to as the target in this paper. (ii) Wikidata SPARQL query equivalent (P3921) is described as "SPARQL code that returns a set of entities that correspond with this category or list". Figure 1 shows the Wikidata item for the category Association football video games (Q13199045), with video game (Q7889) as the target, genre (P136) and sport (P641) values as qualifiers, and a corresponding SPARQL query. In addition, a list of corresponding Wikipedia articles is linked. This exemplifies the multiple sources for category membership used in Wikipedia and Wikidata. If more than one source is given, the resulting category members could in theory differ.
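The target and qualifiers of a category contains (P4224) statement can be translated mechanically into a query of the same shape as the stored P3921 query. A minimal sketch of this translation (the function is ours; the values mirror the Figure 1 category, with association football, Q2736, standing in as an illustrative value for the sport qualifier):

```python
def p4224_to_sparql(target: str, qualifiers: dict) -> str:
    """Translate a 'category contains' (P4224) statement into a SPARQL
    pattern: instances (or instances of transitive subclasses) of the
    target, restricted by each qualifier property/value pair."""
    lines = [f"?item wdt:P31/wdt:P279* wd:{target} ."]
    for prop, value in qualifiers.items():
        lines.append(f"?item wdt:{prop} wd:{value} .")
    body = "\n  ".join(lines)
    return f"SELECT ?item WHERE {{\n  {body}\n}}"

# Figure 1's category: target video game (Q7889), qualified by sport (P641).
query = p4224_to_sparql("Q7889", {"P641": "Q2736"})
```

In principle this derived query and the stored P3921 query should return the same members, but nothing in either Wikipedia or Wikidata enforces that they do.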
As we will show later, this is often the case in practice and poses a possible consistency problem. To the best of our knowledge, no previous work has analyzed the (in)consistencies between Wikipedia category members and the items retrieved via the SPARQL queries or targets attached to the respective Wikidata categories, or proposed a way to reduce such inconsistencies. In this paper, we analyze them by comparing their content, elaborating on possible reasons for and against making all sources consistent, and suggesting some potential future research directions. The key contributions of this paper are: (i) a workflow for the automatic generation of candidate categories together with their Wikidata SPARQL members and their Wikipedia members (mapped to Wikidata), derived by traversing the Wikipedia category hierarchy in all available languages; (ii) an analysis of inconsistencies within Wikidata categories and between Wikipedia and Wikidata; (iii) an automatic investigation of possible reasons for inconsistency. The source code for the dataset generation is publicly available [5, 6] under an MIT License and works both on the online Wikipedia/Wikidata public endpoints and on offline SQL/JSON dumps. All generated data [7], cache files [8] containing the data retrieved from dumps, as well as the experiment results [9] are published on Zenodo. This makes the whole analysis fully reproducible (on dumps of historic versions of the sources) as well as reusable (assuming the underlying items and articles in Wikipedia and Wikidata do not change too much).

2. Related Work

Various works have investigated different aspects of leveraging Wikidata and Wikipedia content. H. Turki et al. [10] focus on explaining how Wikipedia and Wikidata can be processed using existing techniques for data parsing and querying.
Furthermore, they raise awareness of the usefulness of integrating Wikipedia and Wikidata categories for different semantic applications and provide some ideas to enhance the quality of both sources (e.g., removing non-transitive relations from the Wikipedia category graph through the analysis of Wikidata statements). Driven by the observation that a large number of Wikidata entities lack corresponding Wikipedia articles in some languages (orphans), N. Ostapuk et al. [4] propose a pipeline to map Wikidata orphan entities to sections of Wikipedia articles. Their goal is to enrich orphans with additional facts and properties derived from their corresponding textual descriptions in Wikipedia. As a result, they provide a dataset consisting of a collection of Wikidata entities together with their potential links to related Wikipedia pages in different languages. I. Johnson [11] analyzed how Wikidata content is referenced within the English Wikipedia and proposed a taxonomy that categorizes Wikidata transclusions based on reader impact. In the context of Wikidata enrichment from external sources, A. Boschin et al. [12] proposed a method based on knowledge graph embeddings to predict new facts (e.g., triple completion) using the hyperlinks between Wikipedia articles. P. Curotto et al. [13] proposed a Wikipedia-based approach for the automatic suggestion of authoritative references for Wikidata statements. The goal is to support editors while referencing Wikidata claims. To evaluate the accuracy of the automatic recommendations, they also provide a gold standard dataset of sample claims and their corresponding external references in the English Wikipedia.

[Figure 2: Workflow for candidate generation and evaluation: dumps/endpoints feed a cache population step, followed by candidate generation, candidate cleaning, and evaluation.]

3. Approach

As the basis of our analysis, we need to retrieve categories and their members from Wikipedia and Wikidata, respectively.
The corresponding pipeline is outlined in Figure 2. It can be executed either using the public APIs offered by Wikipedia (https://en.wikipedia.org/w/api.php) and Wikidata (https://query.wikidata.org/) or the regularly provided SQL/JSON dumps from both sites (https://dumps.wikimedia.org/backup-index.html and https://dumps.wikimedia.org/wikidatawiki/entities/). We employ a cache to hold all information needed. When using the public APIs, this cache is filled successively after each request. In case the data dumps are used, a preprocessing step extracts all relevant data from the provided files and populates the cache accordingly. In both cases, the cache prevents redundant requests and speeds up the processing considerably. We start the candidate generation by retrieving all items from Wikidata that correspond to a Wikipedia category, i.e. instances of Wikimedia category (Q4167836). For each of those, we further store: (i) the Wikidata identifier; (ii) the target given by the value of category contains (P4224), including a list of its subclasses; (iii) further qualifiers attached to the target; (iv) the corresponding SPARQL query via Wikidata SPARQL query equivalent (P3921), if existing, including the results after running the query; (v) and the corresponding Wikipedia category pages in all languages. For each Wikipedia member, we also retrieve the direct types and all their properties with their corresponding values (we consider only object properties with unique values; these are used to automate the comparison during the evaluation). Some categories are removed from further consideration, as their SPARQL queries do not adhere to the structure of a single target and associated qualifiers. Among the deviations are: the use of multiple targets, the lack of an instance of (P31) relation, and queries involving property paths. Next, we turn to Wikipedia. For each of the previously identified categories, we fetch the members and traverse the hierarchy of subcategories if necessary. Wikipedia versions in different languages are maintained independently from each other (an exception being the interlanguage links between articles on the same topic). Membership in categories is maintained manually and hence also differs across languages. We therefore have to traverse categories for each language independently.
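The cache described above can be sketched as a thin memoizing layer in front of either backend (API calls or dump lookups); the class and attribute names here are ours, not those of the released code:

```python
from typing import Callable, Dict, Hashable

class Cache:
    """Serve each key from memory after the first backend request."""

    def __init__(self, fetch: Callable[[Hashable], object]):
        self._fetch = fetch                       # backend: API call or dump lookup
        self._store: Dict[Hashable, object] = {}  # previously fetched results
        self.misses = 0                           # count of actual backend requests

    def get(self, key: Hashable) -> object:
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._fetch(key)
        return self._store[key]
```

During a run, repeated lookups of the same item or category then cost nothing beyond the first request, which matters when the same Wikidata items recur across many categories and languages.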
For each member, we store the corresponding Wikidata identifier. While traversing the hierarchy of subcategories using a breadth-first search, we apply type checks using the target of the initial category: if fewer than 50% of member articles (excluding any subcategories) are instances of the target or any of its subclasses (checked via the SPARQL path ?entity wdt:P31/wdt:P279* ?target), the traversal of this branch ends. After this step, we have acquired not only the member items from Wikidata obtained through the provided SPARQL query, but also the manually curated list of members from Wikipedia. Finally, we apply a cleaning step that removes some categories from consideration. Categories are omitted if one of the following criteria applies: (i) the category has more than one target; (ii) the category has no corresponding Wikipedia members, which may be due to, e.g., the type check already failing for the members of the initial Wikipedia category; (iii) the corresponding SPARQL query yielded no results; (iv) multiple SPARQL queries were supplied.

4. Evaluation

The pipeline was executed using the Wikidata JSON dump of 2022-05-02 and the Wikipedia SQL dumps of 2022-05-01. At that time, Wikidata contained roughly 4.99 million categories (retrieved via the query SELECT (COUNT(DISTINCT ?cat) AS ?count) WHERE { ?cat wdt:P31 wd:Q4167836 . }). Out of these, only 2,280 have a corresponding SPARQL query (P3921), 749,385 have a target (category contains, P4224), and only 516 have both. Applying the restrictions outlined in Section 3 leaves us with 206 categories used for evaluation. Our goal is to perform an analysis of the consistency between Wikipedia and Wikidata with respect to the categories' content and an automatic investigation of possible reasons.
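Operationally, the consistency measure used in the following paragraph is plain set arithmetic over the two member sets; a minimal sketch with toy Wikidata identifiers:

```python
def precision_recall(wiki: set, sparql: set) -> tuple:
    """Overlap of the two member sets, relative to the Wikipedia
    members (precision) and to the SPARQL results (recall)."""
    common = len(wiki & sparql)
    return common / len(wiki), common / len(sparql)

# Toy category: four Wikipedia members, three SPARQL results, two shared.
p, r = precision_recall({"Q1", "Q2", "Q3", "Q4"}, {"Q2", "Q3", "Q5"})
# p = 2/4 = 0.5, r = 2/3
```

The cleaning step of Section 3 guarantees that both sets are non-empty (criteria (ii) and (iii)), so the divisions are always defined.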
For this purpose, we compare the two member sets by calculating the precision and recall of the set of items corresponding to Wikipedia articles, {WIKI}, with respect to the set of SPARQL query results, {SPARQL}:

    Precision = |{SPARQL} ∩ {WIKI}| / |{WIKI}|      (1)

    Recall    = |{SPARQL} ∩ {WIKI}| / |{SPARQL}|    (2)

The results reveal an average precision of ~0.65 and an average recall of ~0.75. Figure 3a and Figure 3b show the distribution of both metrics over the 206 candidate categories. Based on Figure 3a, we observe that for 136 out of 206 categories, at least 80% of the items retrieved using SPARQL also appear as Wikipedia members; 88 categories share more than 90% of the items; and 19 of the categories have a low recall of 30% or less. Figure 3b shows a rather uniform distribution of the precision, except for categories having more than 90% precision, which applies to 61 out of 206. Overall, a rather high recall can be observed. Items retrieved by SPARQL but not found via Wikipedia (causing lower recall) can be attributed to one of two reasons: either the entity was not added to the category by any Wikipedia editor, or the traversal was stopped too early and the respective subcategory was not visited. Since the overall precision provides a rather mixed picture, we conducted a more detailed investigation into possible reasons. Precision gives insight into items that were found in Wikipedia but not by SPARQL. We define the following possible reasons for an item not being found via SPARQL queries:
[Figure 3: Precision and recall per category for Wikipedia member articles with respect to the Wikidata items retrieved using the corresponding SPARQL queries.]

• Missing property (missingProp): The Wikidata item does not have a given property.
• Different property value (diffPropValue): The Wikidata item has the property, but with a different value.
• Usage of another property (otherPropUsage): The Wikidata item points to the specified value, but uses a different property.

We then analyzed the distribution of issues over all items that were not found by SPARQL (items appearing only as Wikipedia members) for all categories. Based on Figure 4a, we notice that the most widespread issue type is diffPropValue, affecting ~87% of the not-found items, followed by missingProp with ~15%, and otherPropUsage with ~0.41%. Note that the fractions of items per issue do not sum to 100%, because the same item may be counted multiple times if the SPARQL query contains multiple properties; e.g., it is possible that one property is missing while another one has a different value. We also analyzed the consistency of a SPARQL query with the target and qualifier information available in the category contains (P4224) property of the Wikidata category. Here, the same issue classes as before apply. Figure 4b shows the distribution of issues over all categories in this case. We notice that the most widespread issue type is missingProp with 52 categories, followed by otherPropUsage with 19 categories, and diffPropValue with 4 categories; the remaining 131 categories show no issues. An example category with the otherPropUsage issue is Category:Uruguayan beach volleyball players (Q22136982), whose category contains (P4224) statement carries qualifiers including country for sport (P1532), and whose SPARQL query is: ?item wdt:P31 wd:Q5; wdt:P27 wd:Q77; wdt:P106 wd:Q17361156.
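These issue classes can be assigned mechanically by checking each property/value pair required by the category's SPARQL query against an item's statements. The sketch below is our simplified reading (items reduced to single-valued object properties, as in the evaluation setup); the exact counting in the experiments may differ in detail. The toy item mirrors the Uruguayan example:

```python
def classify_issues(item: dict, expected: dict) -> set:
    """Issue classes for one Wikipedia-only item, given the
    property -> value pairs required by the category's SPARQL query."""
    issues = set()
    for prop, value in expected.items():
        if prop not in item:
            if value in item.values():
                issues.add("otherPropUsage")  # right value, different property
            else:
                issues.add("missingProp")     # property absent entirely
        elif item[prop] != value:
            issues.add("diffPropValue")       # property present, other value
    return issues

# A player modelled with country for sport (P1532) instead of the
# country of citizenship (P27) required by the query:
player = {"P31": "Q5", "P1532": "Q77", "P106": "Q17361156"}
required = {"P31": "Q5", "P27": "Q77", "P106": "Q17361156"}
issues = classify_issues(player, required)  # -> {"otherPropUsage"}
```

Since the loop runs over every required property, one item can collect several classes for a multi-property query, which is why the fractions in Figure 4a do not sum to 100%.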
In this case, the property used within the target's qualifiers, country for sport (P1532), has been replaced by a similar albeit not equal property, country of citizenship (P27). For all categories, we further considered the correlation between the fraction of items with a specific issue type and the number of Wikipedia items not found by SPARQL. Based on Figure 5a and Figure 5b, we notice that categories with no issues mostly have a small number of Wikipedia items not found by SPARQL, with some outliers that have a very low fraction of affected items but more than 1,000 not-found entities. Furthermore, categories in which all items exhibit the issue tend to have a rather small number of not-found items.

[Figure 4: Relevancy of issue classes for different sources of category membership: (a) fraction of items affected when comparing Wikipedia categories and Wikidata SPARQL queries; (b) number of categories affected when comparing the target and associated qualifiers with SPARQL queries in Wikidata.]

The remaining categories do not follow any specific trend, since for categories with a similar number of not-found items, we observe a great variety in the fraction of items affected by the issues. Figure 5c shows that most of the categories affected by the respective issue have between 100 and 10,000 Wikipedia items not found by SPARQL, with a rather low percentage of affected items, since it is the least widespread issue. In general, we do not see a clear trend for a correlation between the number of Wikipedia items not found and the items affected by the issue.

5. Conclusion and Future Work

We analyzed the consistency between Wikidata and Wikipedia categories and investigated possible reasons for inconsistencies.
We also compared the information available within a Wikidata category itself (the SPARQL query versus the target and qualifiers). For this, we proposed a workflow for the automatic generation of candidate categories. It traverses Wikipedia's category hierarchy in all available languages and retrieves the corresponding members as long as certain conditions hold. The results reveal differences of various degrees between all sources and show three possible causes. The underlying reason for the discovered inconsistencies is rooted in the manual curation of three separate sources answering in essence the same question: which items/articles should be members of a given category? To increase consistency, we suggest treating Wikidata's category contains (P4224) as the main source of truth. From an automation standpoint, this provides the most structured information. From it, SPARQL queries could be generated automatically. Finally, using these queries, the members of Wikipedia's categories can be derived.

[Figure 5: Affected items over Wikipedia items not found using SPARQL, for (a) different property value, (b) missing property, and (c) usage of another property. One point per category with precision below 100% (173 categories). X-axis using a log scale.]

As Wikidata, albeit growing, remains incomplete, we may further use the current category membership in Wikipedia together with Wikidata's category contains (P4224) to complete the information of Wikidata items. Two approaches are possible to improve this situation: first, new changes to any source are verified against the information contained in the other two.
Editors may get a warning if they seemingly violate these constraints. The cause might not be their current action but a mismatch with another source, so editors may still overrule the warning and commit their change. Second, we can create an interactive interface to review the changes proposed previously. As we cannot be certain which information is wrong or incomplete, human editors may verify the assumptions of an automated system, and only verified changes will be propagated to Wikipedia and Wikidata, respectively. Both approaches will, over time, increase the consistency and quality of both Wikipedia and Wikidata and, as a consequence, improve their usefulness in other applications. We base our work on the assumption that the definitions of categories, i.e. their semantics, are consistent across Wikidata and all languages within Wikipedia. This might fail, though. Categories like Category:American singers (Q7063228) can be seen in at least two ways, both of which are legitimate interpretations: based on the country of citizenship (P27), as Wikidata currently does, or based on the place of birth (P19). Although we do not have evidence of such divergences in existence, they would require a community process to converge on a common interpretation before applying our suggestions.

Acknowledgments

This work has been partially funded by the German Aerospace Center (DLR). We thank Prof. Dr. Birgitta König-Ries for her guidance and feedback.

References

[1] D. Vrandečić, M. Krötzsch, Wikidata: A Free Collaborative Knowledgebase, Commun. ACM 57 (2014) 78–85. doi:10.1145/2629489.
[2] R. Kaptein, J. Kamps, Exploiting the category structure of Wikipedia for entity ranking, Artificial Intelligence 194 (2013) 111–129. doi:10.1016/j.artint.2012.06.003.
[3] J. F. Medeiros, B. P. Nunes, S. W. M. Siqueira, L. A. P. P. Leme, TagTheWeb: Using Wikipedia Categories to Automatically Categorize Resources on the Web, in: Lecture Notes in Computer Science, Springer International Publishing, 2018, pp. 153–157. doi:10.1007/978-3-319-98192-5_29.
[4] N. Ostapuk, D. E. Difallah, P. Cudré-Mauroux, SectionLinks: Mapping Orphan Wikidata Entities onto Wikipedia Sections, in: Proceedings of the 1st Wikidata Workshop (Wikidata 2020) co-located with the 19th International Semantic Web Conference, Virtual Conference, volume 2773 of CEUR Workshop Proceedings, 2020.
[5] L. Feddoul, S. Schindler, fusion-jena/wiki-category-consistency, 2022. URL: https://github.com/fusion-jena/wiki-category-consistency.
[6] L. Feddoul, S. Schindler, fusion-jena/wiki-category-consistency v1.0.2, 2022. doi:10.5281/zenodo.6963599.
[7] L. Feddoul, F. Löffler, S. Schindler, wiki-category-consistency-dataset, 2022. doi:10.5281/zenodo.6913282.
[8] L. Feddoul, F. Löffler, S. Schindler, wiki-category-consistency-cache, 2022. doi:10.5281/zenodo.6913134.
[9] L. Feddoul, F. Löffler, S. Schindler, wiki-category-consistency-eval, 2022. doi:10.5281/zenodo.6913332.
[10] H. Turki, M. A. H. Taieb, M. B. Aouicha, Coupling Wikipedia Categories with Wikidata Statements for Better Semantics, in: Proceedings of the 2nd Wikidata Workshop (Wikidata 2021) co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual Conference, volume 2982 of CEUR Workshop Proceedings, 2021.
[11] I. Johnson, Analyzing Wikidata Transclusion on English Wikipedia, in: Proceedings of the 1st Wikidata Workshop (Wikidata 2020) co-located with the 19th International Semantic Web Conference, Virtual Conference, volume 2773 of CEUR Workshop Proceedings, 2020.
[12] A. Boschin, T. Bonald, Enriching Wikidata with Semantified Wikipedia Hyperlinks, in: Proceedings of the 2nd Wikidata Workshop (Wikidata 2021) co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual Conference, volume 2982 of CEUR Workshop Proceedings, 2021.
[13] P. Curotto, A. Hogan, Suggesting Citations for Wikidata Claims based on Wikipedia's External References, in: Proceedings of the 1st Wikidata Workshop (Wikidata 2020) co-located with the 19th International Semantic Web Conference, Virtual Conference, volume 2773 of CEUR Workshop Proceedings, 2020.