Expanding Wikidata’s Parenthood Information by 178%, or How To Mine Relation Cardinalities

Paramita Mirza (1), Simon Razniewski (2), and Werner Nutt (2)
(1) Max Planck Institute for Informatics
(2) Free University of Bozen-Bolzano

Abstract. While automated knowledge base construction has so far largely focused on fully qualified facts, e.g., ⟨Obama, hasChild, Malia⟩, the Web also contains extensive amounts of existential information in the form of cardinality assertions, e.g., that someone has two children without giving their names. In this paper we argue that extracting such information could substantially increase the scope of knowledge bases. Using the hasChild relation in Wikidata as an example, we show that simple regular-expression-based extraction from Wikipedia can increase the size of the relation by 178%. We also show how such cardinality information can be used to estimate the recall of knowledge bases.

1 Introduction

General-purpose knowledge bases (KBs) such as Wikidata [6], YAGO [5] or the Google Knowledge Vault [1] try to capture as much information about the world as possible. While they usually have high precision (for instance >95% for YAGO), their recall is generally much lower (e.g., only 6 out of 35 Dijkstra Prize winners are in DBpedia, and only about 0.02% of all living people are currently in Wikidata) and, in general, hard to assess [3,4]. And even though extraction techniques are continually improving, there is a fundamental barrier to high recall: many facts, for instance the favourite dishes of the authors of this paper, are simply not present on the Web.

But there is some hope. For a substantial set of topics, natural language texts at least mention the existence of information via cardinality statements, for instance "John wrote two books" or "Mary has three children". While such cardinality assertions do not allow one to recreate fully qualified facts, they still carry interesting information and can be useful, for instance, for directing KB authors towards incomplete parts, for informing data consumers about missing data, or for improving the precision of query results (e.g., a correct answer to the query "Give me the average number of children per person" does not require fully qualified facts). Most common data models support existential information: RDF via blank nodes, SQL via nulls, and OWL via cardinality constraints. Cardinality information can also be found in Wikidata, which has a property called number of children (P1971). So far, however, it is scarcely used: only 0.21% of humans in Wikidata have it (6,740 in total).

In this paper we exemplify the extraction and use of cardinality information for the hasChild relation in Wikidata. Our technical contribution is threefold:

1. We show that cardinality assertions exist in large numbers in Wikipedia, thus confirming the motivation for data models that allow one to specify cardinality constraints, blank nodes, labeled nulls, and the like.
2. We show that with simple filters, we can extract high-quality cardinality assertions with >90% precision, which allow us to learn about the existence of 178% more children than are currently in Wikidata.
3. We show how this information can be used to assess the recall of existing KBs, finding for instance that child data is almost 10 times more complete for actors (2.42%) than for association football players (0.25%).

Our extracted cardinality assertions and the hand-crafted extraction patterns are available online at http://paramitamirza.com/other/cardinality-statements/.
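To make the query-answering point above concrete, the following minimal sketch uses a made-up toy dataset (hypothetical persons and counts, not actual Wikidata content) to show that an aggregate such as the average number of children per person can be computed from bare cardinality assertions, without knowing any child's identity.

```python
# Toy cardinality assertions, for illustration only: each entry states how
# many children a (hypothetical) person has, without naming the children.
cardinalities = {
    "person_A": 2,   # e.g., extracted from "... has two children"
    "person_B": 3,   # e.g., extracted from "Mary has three children"
    "person_C": 0,   # e.g., extracted from "He never had any children"
}

# Fully qualified facts would instead enumerate triples such as
# (person_B, hasChild, child_1), (person_B, hasChild, child_2), ...
# For the aggregate query below, the cardinalities alone suffice.
average = sum(cardinalities.values()) / len(cardinalities)
print(f"Average number of children per person: {average:.2f}")  # 1.67
```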
2 Extracting Cardinality Information

In natural language texts, cardinality information for children is expressed by phrases such as:

1. The couple had 6 children.
2. He never had any children.
3. They are the parents of three beautiful daughters.
4. Barnes has 2 sons and one young daughter.

In this work, we use surface patterns via regular expressions to extract cardinalities. We manually constructed 30 patterns to find such sentences and to determine the total number of children according to the cardinal numbers found in them. Our method is able to resolve, for instance, that according to Sentence 2 the total number of children is zero, or three for Sentence 4. Existing Open IE systems, such as ReVerb [2], fail to resolve such quantification.

A major challenge in information extraction is entity resolution. We avoid this challenge by working only on biographical articles in Wikipedia and assuming that children cardinalities mentioned in the text refer to the number of children of the person the article is about. To reduce the number of incorrect assertions that may result from this, we propose two filters:

1. 1-statement filter. This filter removes all articles that contain more than one cardinality statement. The intuition is that even if the cardinalities of multiple statements match, it is hard to decide whether one of the statements is just wrong or redundant, or whether they should be summed (frequently, articles describe children counts from different marriages in separate sentences).
2. 75%-shortest filter. This filter removes the 25% longest articles, based on the observation that longer articles frequently contain children information of parents or other relatives ("His son John is a successful lawyer that lives with his wife and two children in New Hampshire").

Table 1. Precision evaluated on 50 manually checked samples (gold standard) and against the number of children property (silver standard).

                                       gold standard              silver standard
                        #statements    #stmts  #correct  prec.    #stmts  #correct  prec.
  all statements            123,885        50        43   .860     3,156     2,626   .832
  1-statement filter        112,654        45        41   .911     2,815     2,496   .887
  75%-shortest filter        92,914        37        34   .919     1,612     1,416   .878
  both filters               86,227        35        33   .943     1,506     1,366   .907

Evaluation. We evaluate the precision of our extraction in two ways: (i) manual evaluation on 50 random phrases expressing children cardinalities (gold standard), and (ii) comparison of the extracted cardinality statements with the values of the number of children property (silver standard). Table 1 shows the evaluation results: our unfiltered extraction achieves 86.0% and 83.2% precision on the gold and silver standard, respectively, for a total of 123,885 extracted assertions. After applying both filters, 86,227 assertions remain, with a precision of 94.3% and 90.7%, respectively. Note that the lower precision on the silver standard likely stems from the fact that the number of children property itself can contain errors or be outdated. For 2,289 out of these 86,227 persons, all children are already contained in Wikidata. The remaining 83,938 persons are missing 287,153 children, 178% more than the number of child facts currently contained in Wikidata.
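To give a flavour of how such surface patterns work, the sketch below implements two simplified regular expressions in Python. They are illustrative stand-ins, not the 30 hand-crafted patterns used in our system, but they reproduce the resolution behaviour described above (zero for Sentence 2, three for Sentence 4).

```python
import re

# Simplified illustration of regex-based cardinality extraction. The patterns
# below are toy stand-ins for the 30 hand-crafted patterns of the actual system.
WORD2NUM = {"any": 0, "one": 1, "two": 2, "three": 3, "four": 4,
            "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
CARDINAL = r"\b(\d+|" + "|".join(WORD2NUM) + r")\b"

# A cardinal, an optional intervening adjective ("three beautiful daughters"),
# followed by a child-denoting noun.
PATTERN = re.compile(CARDINAL + r"(?: \w+)? (children|sons?|daughters?)", re.I)
# Explicit negations such as "never had any children" denote a count of zero.
NEGATION = re.compile(r"never had any children|had no children", re.I)

def to_int(token):
    return int(token) if token.isdigit() else WORD2NUM[token.lower()]

def children_cardinality(sentence):
    """Total number of children asserted in a sentence, or None if no match."""
    if NEGATION.search(sentence):
        return 0
    counts = [to_int(num) for num, _noun in PATTERN.findall(sentence)]
    return sum(counts) if counts else None  # e.g., 2 sons + one daughter = 3

examples = [
    "The couple had 6 children.",                           # -> 6
    "He never had any children.",                           # -> 0
    "They are the parents of three beautiful daughters.",   # -> 3
    "Barnes has 2 sons and one young daughter.",            # -> 3
]
for sentence in examples:
    print(sentence, "->", children_cardinality(sentence))
```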
3 Using Cardinality Information to Estimate KB Recall

Given the cardinality statements that we extracted, children information is complete for 0.7% of the 3.14 million humans currently contained in Wikidata (which, in turn, are only about 0.03% of all the people that ever lived; see https://en.wikipedia.org/wiki/World_population#Number_of_humans_who_have_ever_lived). For those humans for which we could extract a cardinality assertion, in turn, 2.65% have complete children information in Wikidata. As it is an open challenge to know in which parts knowledge bases are more complete [3,4], in the following we present a simple analysis based on whether persons are dead or alive, and on their occupations.

Dead vs. Alive. Cardinality statements are more likely to be found in articles of persons that are dead (3.81%) than of those that are alive (1.99%). Similarly, for those having a cardinality assertion, the child relation is more likely to be complete for dead (1.72%) than for living humans (0.88%). One might conjecture that for dead people, it is easier to consolidate data.

Occupations. Based on the 20 most frequent occupations in Wikidata, we found that judges (8.22%), lawyers (7.93%), and politicians (5.11%) are the top occupations with cardinality information available in their Wikipedia articles, whereas sportsmen, e.g., association football players (0.51%), athletics competitors (1.27%), and ice hockey players (1.10%), seldom have such information. In turn, comparing actual child facts in Wikidata with the extracted cardinality information, we find that matches happen most frequently for showbiz-related professions such as actor (2.42%) or film director (2.79%), and again least frequently for sports players, e.g., ice hockey player (0.0%) or baseball player (0.13%).

4 Outlook

Given the numerous cardinality statements available for the child relation in Wikipedia, we have presented a simple method to extract high-quality cardinality assertions, which we then used to assess the completeness of the relation. A challenge in broadening this work is that for weakly-defined relations such as hobby or profession, cardinality is difficult to assert. We plan to focus next on other well-quantifiable relations such as sibling ("He has 3 older brothers"), graduatedFrom ("She holds a PhD in Chemistry"), and in particular intellectual work ("He has written two books, she composed 5 operas, he directed 12 movies").

There are several ways to improve the quantity and quality of the extracted cardinality statements. Cardinality information found in Wikipedias in other languages, as well as further pattern engineering, could be used both for retrieving more statements and for improving precision. For retrieving more statements, one could also drop the restriction to biographical Wikipedia articles, or drop the filters. This may decrease precision, though, as co-reference resolution for entities expressed via pronouns ("They"), incomplete names ("Barnes"), or generic nouns ("the couple") is still a challenging NLP task.

Acknowledgment

This work has been partially supported by the projects "MAGIC", funded by the province of Bozen-Bolzano, and "The Call for Recall", funded by the Free University of Bozen-Bolzano.

References

1. X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In SIGKDD, 2014.
2. A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
3. L. Galárraga, S. Razniewski, A. Amarilli, and F. M. Suchanek. Predicting completeness in knowledge bases. Manuscript, 2016. Available at http://luisgalarraga.de/manuscripts.
4. S. Razniewski, F. M. Suchanek, and W. Nutt. But what do we actually know? In AKBC, 2016.
5. F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In WWW, 2007.
6. D. Vrandečić and M. Krötzsch. Wikidata: A free collaborative knowledgebase. Communications of the ACM, 2014.