<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Expanding Wikidata's Parenthood Information by 178%, or How To Mine Relation Cardinalities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paramita Mirza</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Razniewski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Werner Nutt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Max Planck Institute for Informatics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>While automated knowledge base construction has so far largely focused on fully qualified facts, e.g., ⟨Obama, hasChild, Malia⟩, the Web also contains extensive amounts of existential information in the form of cardinality assertions, e.g., that someone has two children, without giving their names. In this paper we argue that the extraction of such information could substantially increase the scope of knowledge bases. For a sample of the hasChild relation in Wikidata, we show that simple regular-expression-based extraction from Wikipedia can increase the size of the relation by 178%. We also show how such cardinality information can be used to estimate the recall of knowledge bases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        General-purpose knowledge bases (KBs) such as Wikidata [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], YAGO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or
the Google Knowledge Vault [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] try to capture as much information about the
world as possible. While they usually have high precision (for instance &gt;95% for
YAGO), their recall is generally much lower (e.g., only 6 out of 35 Dijkstra prize
winners are in DBpedia, or only about 0.02% of all living people are currently
in Wikidata), and in general hard to assess [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ]. And even though extraction
techniques are continually improving, there exist fundamental barriers to
high recall: many facts, for instance the favourite dishes of the authors of
this paper, are simply not present on the Web.
      </p>
      <p>But there is some hope. For a substantial set of topics, natural language
texts at least mention the existence of information via cardinality statements,
for instance “John wrote two books”, or “Mary has three children”. While such
cardinality assertions do not allow one to recreate fully qualified facts, they still
carry interesting information, and can be useful for instance for directing KB
authors towards incomplete parts, for informing data consumers about missing
data, or for improving the precision of query results (e.g., a correct answer to
the query “Give me the average number of children per person” does not require
fully qualified facts).</p>
      <p>Most common data models support existential information, RDF for instance
via blank nodes, SQL via nulls, and OWL via cardinality constraints. Cardinality
information can also be found in Wikidata, which has a property called number
of children (P1971). So far it is scarcely used, however: only 0.21% of the humans
in Wikidata have it (6,740 in total).</p>
      <p>In this paper we exemplify the extraction and use of cardinality information
for the hasChild relation in Wikidata. Our technical contribution is threefold:
1. We show that cardinality assertions exist in large numbers in Wikipedia, thus
confirming the motivation for data models that allow one to specify cardinality
constraints, blank nodes, labeled nulls, and the like.
2. We show that with simple filters, we can extract high-quality cardinality
assertions with &gt;90% precision, which allow us to learn about the existence
of 178% more children than are currently recorded in Wikidata.
3. We show how this information can be used to assess the recall of existing
KBs, finding for instance that child data is almost 10 times more complete
for actors (2.42%) than for association football players (0.25%).
Our extracted cardinality assertions and the hand-crafted extraction patterns
used are available online at http://paramitamirza.com/other/cardinality-statements/.</p>
    </sec>
    <sec id="sec-2">
      <title>Extracting Cardinality Information</title>
      <p>In natural language texts, cardinality information for children is expressed by
phrases such as:
1. The couple had 6 children.
2. He never had any children.
3. They are the parents of three beautiful daughters.
4. Barnes has 2 sons and one young daughter.</p>
      <p>
        In this work, we use surface patterns via regular expressions to extract
cardinalities. We manually constructed 30 patterns to find such sentences and to
determine the total number of children according to the cardinal numbers found
in the sentences. Our method is able to resolve, for instance, that according to
Sentence 2 the total number of children is zero, or three for Sentence 4. Existing
Open IE systems, such as ReVerb [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], fail to resolve such quantification.
      </p>
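As a rough illustration of how such surface patterns can resolve quantification, the following sketch implements two simplified patterns in Python; the patterns, the word-to-number map, and the function names are our own illustrative simplifications, not the 30 patterns used in this work.

```python
import re

# Minimal sketch of regular-expression-based cardinality extraction.
# The patterns below are simplified illustrations, not the original set.
WORD2NUM = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
NUM = r"(\d+|one|two|three|four|five|six)"

def to_int(token):
    token = token.lower()
    return int(token) if token.isdigit() else WORD2NUM[token]

def extract_children_cardinality(sentence):
    s = sentence.lower()
    # Negated existence, e.g. "never had any children" -> 0 children.
    if re.search(r"\b(?:never had any|had no|has no)\s+(?:children|kids)\b", s):
        return 0
    # Per-gender counts that must be summed, e.g.
    # "2 sons and one young daughter" -> 2 + 1 = 3.
    parts = re.findall(NUM + r"\s+(?:\w+\s+)?(?:sons?|daughters?)", s)
    if parts:
        return sum(to_int(p) for p in parts)
    # Plain total, e.g. "had 6 children" -> 6.
    m = re.search(NUM + r"\s+children", s)
    return to_int(m.group(1)) if m else None
```

Resolving the negation in Sentence 2 to zero, and summing the per-gender counts of Sentence 4 to three, is exactly the quantification step that Open IE systems such as ReVerb do not perform.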
      <p>A major challenge in information extraction is entity resolution. We avoid this
challenge by working only on biographical articles in Wikipedia, and assuming
that children cardinalities mentioned in texts refer to the number of children of
the person the article is about. To reduce the number of incorrect assertions that
may result from this, we propose two filters:
1. 1-statement filter. This filter removes all articles that contain more than one
cardinality statement. The intuition is that when an article contains multiple
cardinality statements, it is hard to decide whether one of them is simply
wrong or redundant, or whether they should be summed (frequently, articles
describe children counts from different marriages in separate
sentences).
2. 75%-shortest filter. This filter removes the 25% longest articles, based on
the observation that longer articles frequently contain children information
of parents or other relatives (“His son John is a successful lawyer who lives
with his wife and two children in New Hampshire”).
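The two filters can be sketched as follows; the article representation (a dict with a precomputed text length and a list of extracted statements) is an illustrative assumption, not the actual data structure used in this work.

```python
# Sketch of the two precision filters. Each article is assumed to be a
# dict with "length" (text length) and "statements" (extracted
# cardinality statements); these field names are illustrative.

def one_statement_filter(articles):
    # Keep only articles with exactly one extracted cardinality statement.
    return [a for a in articles if len(a["statements"]) == 1]

def shortest_75pct_filter(articles):
    # Drop the 25% longest articles: long articles often mention
    # children of relatives rather than of the article's subject.
    ranked = sorted(articles, key=lambda a: a["length"])
    return ranked[: (3 * len(ranked)) // 4]
```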
Evaluation. We evaluate the precision of our extraction in two ways: (i) manual
evaluation on 50 random phrases expressing children cardinalities (gold
standard) and (ii) comparison of the extracted cardinality statements with the
values of the number of children property (silver standard). Table 1 shows the
evaluation results: our unfiltered extraction achieves 86.0% and 83.2%
precision on the gold and silver standards, respectively, for a total of 123,885
extracted assertions. After applying both filters, 86,227 assertions remain, with
precisions of 94.3% and 90.7%, respectively. Note that the lower precision on
the silver standard likely comes from the fact that the number of children
property itself can contain errors or can be outdated. For 2,289 out of these 86,227
persons, all children are already contained in Wikidata. The remaining 83,938
persons are missing 287,153 children, 178% more than the number of child facts
currently contained in Wikidata.
</p>
    </sec>
    <sec id="sec-2a">
      <title>Using Cardinality Information to Estimate KB Recall</title>
      <p>
Given the cardinality statements that we extracted, children information is
complete for 0.7% of the 3.14 million humans currently contained in Wikidata
(who, in turn, are only about 0.03% of all people who have ever lived, cf.
https://en.wikipedia.org/wiki/World_population#Number_of_humans_who_have_ever_lived).
For those humans for whom we could extract a cardinality assertion,
2.65% have complete children information in Wikidata. As it is an open challenge to
know in which parts knowledge bases are more complete [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ], in the following
we perform a simple analysis based on whether persons are dead or alive, and on
their occupations.
Dead vs. Alive. Cardinality statements are more likely to be found in articles
about persons who are dead (3.81%) than about those who are alive (1.99%).
Similarly, for those having a cardinality assertion, the child
relation is more likely to be complete for dead (1.72%) than for living humans
(0.88%). One might conjecture that for dead people, it is easier to consolidate
data.
      </p>
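The per-group completeness figures above can be computed by comparing, for each person, the extracted cardinality against the number of child facts in the KB. The following is a minimal sketch under our own pair representation of a person, not the code used in this work.

```python
# Sketch: estimate child-relation completeness for a group of persons.
# Each person is given as (extracted_cardinality, kb_child_count);
# this pair representation is an illustrative assumption.

def completeness_ratio(people):
    # A person's child relation counts as complete when the KB already
    # holds at least as many child facts as the extracted cardinality.
    pairs = [(n, k) for n, k in people if n is not None]
    complete = sum(1 for n, k in pairs if k >= n)
    return complete / len(pairs) if pairs else 0.0

def missing_children(people):
    # Lower bound on the number of children missing from the KB.
    return sum(max(n - k, 0) for n, k in people if n is not None)
```

Aggregating `completeness_ratio` per subgroup (dead vs. alive, per occupation) yields the comparisons reported below.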
      <p>Occupations. Based on 20 most frequent occupations in Wikidata, we found that
judges (8.22%), lawyers (7.93%), and politicians (5.11%) are the top occupations
2 https://en.wikipedia.org/wiki/World_population#Number_of_humans_who_
have_ever_lived
with cardinality information available in their Wikipedia articles; compared with
sportsmen, e.g., association football player (0.51%), athletics competitor (1.27%),
ice hockey player (1.10%) that seldom have such information. In turn, comparing
actual child facts in Wikidata with extracted cardinality information, we find
that matches most frequently happen for showbiz-related professions such as
actor (2.42%) or film director (2.79%), and again least frequently for sports
players, e.g., ice hockey player (0.0%) or baseball player (0.13%).</p>
    </sec>
    <sec id="sec-3">
      <title>Outlook</title>
      <p>Given the abundance of cardinality information for the child relation in
Wikipedia, we have presented a simple method to extract high-quality cardinality
assertions, which we then used to assess the completeness of the relation.</p>
      <p>A challenge in broadening this work is that for weakly-defined relations such
as hobby or profession, cardinality is difficult to assert. We plan to focus next on
other well-quantifiable relations such as sibling (“He has 3 older brothers”),
graduatedFrom (“She holds a PhD in Chemistry”), and in particular intellectual
work (“He has written two books”, “she composed 5 operas”, “he directed 12 movies”).</p>
      <p>There are several ways to improve the quantity and quality of extracted
cardinality statements. Cardinality information found in Wikipedias in other
languages, as well as further pattern engineering, could be used both to retrieve
more statements and to improve precision. To retrieve more statements,
one could also drop the restriction to biographical Wikipedia articles or the
filters. This may decrease precision though, as co-reference resolution for entities
expressed via pronouns (“They”), incomplete names (“Barnes”), or generic nouns
(“the couple”) is still a challenging NLP task.</p>
      <p>Acknowledgments. This work has been partially supported by the projects
“MAGIC”, funded by the province of Bozen-Bolzano, and “The Call for Recall”,
funded by the Free University of Bozen-Bolzano.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Strohmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Knowledge vault: a web-scale approach to probabilistic knowledge fusion</article-title>
          .
          <source>In SIGKDD</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Fader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soderland</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Identifying relations for open information extraction</article-title>
          .
          <source>In EMNLP</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>L.</given-names>
            <surname>Galàrraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Amarilli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          .
          <article-title>Predicting completeness in knowledge bases</article-title>
          .
          <source>Manuscript</source>
          ,
          <year>2016</year>
          . Available at http://luisgalarraga.de/manuscripts.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Nutt</surname>
          </string-name>
          .
          <article-title>But what do we actually know?</article-title>
          <source>AKBC</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>YAGO: a core of semantic knowledge</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          .
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>