<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring Demonstrated Potential Domain Knowledge with Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiyin He</string-name>
          <email>jhe@cwi.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Bron</string-name>
          <email>mbron@yahoo-inc.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CWI</institution>
          ,
          <addr-line>Amsterdam</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Yahoo!</institution>
          ,
          <addr-line>London</addr-line>
        </aff>
      </contrib-group>
      <fpage>13</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>Current search and recommendation engines enable us to effectively retrieve a set of documents based on topical relevance. What is not taken into account is the knowledge a user may already have about a topic, e.g., whether information is redundant or whether he/she is able to understand the results. We propose a method to measure demonstrated potential domain knowledge (DPDK) as a proxy for knowledge and use this metric to analyse the query log of a user spanning over 10 years.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © by the paper’s authors. Copying permitted for private and academic purposes.</p>
      <p>In: L. Dietz, C. Xiong, E. Meij (eds.): Proceedings of the First Workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis (KG4IR),
Tokyo, Japan, 11-Aug-2017, published at http://ceur-ws.org</p>
      <p><sup>1</sup> Although search engines may compile user profiles based on historical data, these are at best implicit representations of knowledge.</p>
      <p>this may allow us to gain insights into a user’s knowledge. Search logs alone, however, are not enough. They may be able
to provide an overview of what a user does know, but they do not provide any information about what is unknown to the
user, i.e., what a user could potentially learn about a particular topic. Knowledge bases are (often) manually constructed
repositories aimed at providing an overview of the concepts that exist in a particular domain. These repositories could be
seen as a representation of the knowledge of an expert in the domain.</p>
      <p>Given these two components, we propose a method to measure demonstrated potential domain knowledge (DPDK) as a
proxy for knowledge, and we use this metric to analyse the query log of one of the authors, which spans over 10 years. We chose
to use a single log in our analysis for three reasons. The first is practical: it is hard to obtain a large sample
of long-term (e.g., over a decade) user logs. The second is that it is difficult for an external assessor to judge what motivated users to
issue particular queries. The third is ethical: in our qualitative analysis we aim to understand the actions and
motivations of the user over an extended period of their life, and we feel that such an analysis should require user consent.
We note that this study is exploratory in nature and acknowledge that our findings may not generalize beyond the log
used in the study. However, we believe that our findings will motivate further study of this topic.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>
        Defining Knowledge. The definition of knowledge has been a topic of debate among scholars and philosophers since
at least ancient Greek times and is the subject of the branch of philosophy called epistemology. The work of the
Greek philosopher Plato gave rise to an early definition of knowledge. In his work the Meno he puts forward the following
paradox [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: For anything, F, either one knows F or one does not know F. If one knows F, then one cannot inquire about F.
If one does not know F, then one cannot inquire about F. Therefore, for all F, one cannot inquire about F.
      </p>
      <p>In response, Plato concedes that one cannot inquire about something one does not know; however, having a belief about
something is adequate to start inquiry. Hence Plato describes different steps on the way towards knowing something:</p>
      <sec id="sec-2-1">
        <title>Perception of a phenomenon that stimulates forming a belief;</title>
        <p>to hold a belief about something and develop explanations for the belief;</p>
        <p>to verify that the belief is true and to finally know something.</p>
        <p>Although Plato remains vague on how one actually transitions between stages, his description of how one acquires
knowledge suggests a definition of knowledge as a justifiable true belief (JTB):</p>
        <p>Belief : something that is known must have been encountered, whether it is through perception or derivation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>True : the proposition, fact, or object that is believed must be true in order to be known.</title>
        <p>
          Justifiable : a reason, explanation, or account that explains why someone holds a belief and believes that it is true.
        </p>
        <p>
          This notion of justifiable true belief has a central place in epistemology [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. More recent critics have suggested cases
where a definition of knowledge as a justifiable true belief fails, often referred to as Gettier problems or generalizations
thereof [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. An example is the “cow in the field” problem, in which a farmer checking up on his cow mistakes a piece of
black and white canvas caught in a distant bush for the cow. Since the animal actually is in the field, lying
hidden in a ditch, the farmer has a justified true belief that nevertheless does not seem to qualify as “knowledge” [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Others have
suggested modifications to the theory of justified true beliefs, adding constraints or modifying the definition
of justification so that such cases no longer count as knowledge.
        </p>
        <p>
          In the remainder of the paper we focus on two necessary conditions for justifiable true beliefs, i.e., truth and belief.
        </p>
        <p>
          Measuring Knowledge. One component in our discussion of epistemology above is truth, i.e., the objects or facts that
exist and can be known. This is the object of study of another branch of philosophy, metaphysics. In his work
Aristotle describes one of the first ontologies, which provides a high-level categorization of the things there are [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. With the
rise of the semantic web and Linking Open Data Cloud, an ever-growing ontology or knowledge base (KB) is available that
provides a reference for the facts and objects that exist in the world. It provides an excellent representation of the things in
the world that someone can have knowledge about.
        </p>
        <p>The second component is belief, i.e., in order to know an object someone should have a belief about that object.
However, it is difficult to observe one’s beliefs directly. Instead we can observe beliefs as expressed through interactions with
digital systems, where users type queries, click on articles, and write messages. These expressions may serve as observable
expressions of beliefs.</p>
        <p>
          The final step is then to take the intersection between the truth (things existing in the world) and the user’s beliefs.
Recent advances in query understanding and entity recognition have resulted in systems that are able to reliably link user
expressions with objects in knowledge bases [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For example, when a user issues the query “Michael Schumacher”, such a system could link
this expression to the entry for Michael Schumacher the race car driver and derive related information from the assertions
in the KB. This expression suggests that the user potentially knows who Michael Schumacher is and attempts to connect
his/her belief with the truth through querying.
        </p>
        <p>Having set the scene with the above definitions, we now explore quantities and their associated assumptions that can be
used as metrics to operationalize the measurement of knowledge.</p>
        <p>First of all if we accept that the collective justifiable true beliefs held by a person constitute his/her knowledge, then
counting the number of JTBs is a metric of knowledge.</p>
        <p>However, we have already relaxed the requirement for justification and only count true beliefs (TBs). Under this condition,
if we assume that all beliefs of a user are observed as expressions and that all these expressions are perfectly connected
to an exhaustive KB of things, then we measure the potential set of knowledge (TBs). A user’s actual knowledge will
be the subset of these that are justifiable (JTBs).</p>
        <p>No KB is complete. If we nevertheless assume that all beliefs of a user are observed and that those that appear in the KB
are successfully connected as true beliefs, then we measure the user’s potential knowledge within a certain domain as
specified by the KB.</p>
        <p>Even if we observe all of a user’s expressions in digital systems during his/her lifetime, not all of that user’s beliefs will be
observed. If both the observed beliefs and the KB are incomplete, then we measure the potential observable beliefs of
a user within a particular domain: one might call this “demonstrated/observed potential domain knowledge” (DPDK).
The process of linking observed expressions of beliefs to KBs is also not perfect. However, we treat this as
measurement error.</p>
        <p>A user can have beliefs about objects (entities) but also relations between objects or aspects of objects. In this
preliminary work we focus only on beliefs held about objects. We leave detecting beliefs about relations and aspects
as well as how to link these to knowledge bases as future work.</p>
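        <p>The chain of relaxations above can be summarized in set notation. This is only a sketch: the symbols W (the things that exist), K (the objects covered by the KB), B (the objects a user holds beliefs about), O (the beliefs observed through expressions), and the abbreviation PDK (potential domain knowledge) are introduced here for illustration and are not used elsewhere in the paper. Linking error is ignored, in line with its treatment as measurement error.</p>
        <preformat>
```latex
\begin{gather*}
\mathrm{JTB} \subseteq \mathrm{TB} = B \cap W \\
\mathrm{PDK} = B \cap K, \qquad K \subseteq W \\
\mathrm{DPDK} = O \cap K, \qquad O \subseteq B \\
\mathrm{DPDK} \subseteq \mathrm{PDK} \subseteq \mathrm{TB}
\end{gather*}
```
        </preformat>
        <p>Under these assumed definitions, each relaxation shrinks the measured set, so DPDK is a subset of the true beliefs. It does not bound JTB in either direction: entities in DPDK may lack justification, and justified true beliefs may go unobserved or fall outside the KB.</p>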
        <p>Finally, a note on forgetting. In the above argument we assume that the TBs of a person never decrease as he/she acquires
new information. We therefore treat forgetting as no longer being able to provide a justification for a belief. This seems
reasonable, since having observed that the belief was once held makes it a potential JTB and thus potential knowledge. We
leave the handling of forgetting for future work.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>
        Experimental setup and Data description. As our reference knowledge base of things that exist in the world we use
DBpedia. DBpedia is a knowledge base extracted from Wikipedia and contains encyclopedic knowledge on the order of
580 million facts (relations) about 4.5 million objects (entities). To obtain expressions of beliefs we downloaded the
Google search query log of one of the authors from https://takeout.google.com/settings/takeout. The
log ranges from May 2006 to May 2017 and contains about 62K queries. To link expressions of belief, i.e., queries, to the
knowledge base we use a state-of-the-art open-source entity linking tool released by Yahoo! [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. If a query is successfully
linked to one or more entities, we select the entity with the highest log-likelihood score, subject to a minimal score threshold of
-3.0. Using this procedure we link 39K queries to entities in the knowledge base.
      </p>
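      <p>The selection rule described above can be sketched as follows. This is a minimal illustration, not the interface of the entity linking tool: the candidate format (pairs of entity identifier and log-likelihood score) and the function name are hypothetical, and treating the threshold as inclusive is an assumption.</p>
      <preformat>
```python
from typing import List, Optional, Tuple

# Hypothetical candidate format: (entity_id, log_likelihood_score) pairs,
# as an entity linker might return them for a single query.
Candidate = Tuple[str, float]

SCORE_THRESHOLD = -3.0  # minimal log-likelihood score stated in the text


def select_entity(candidates: List[Candidate],
                  threshold: float = SCORE_THRESHOLD) -> Optional[str]:
    """Return the highest-scoring entity, or None if no candidate passes."""
    # Assumption: a score equal to the threshold is accepted.
    eligible = [c for c in candidates if c[1] >= threshold]
    if not eligible:
        return None
    best = max(eligible, key=lambda c: c[1])
    return best[0]
```
      </preformat>
      <p>A query whose candidates all score below the threshold yields no entity, which corresponds to the queries that remain unlinked.</p>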
      <p>Figure 1 shows the number of queries per day, week, and month in dark gray as well as the number of entities linked
from the queries to the knowledge base in light gray. We observe that the number of queries issued has high variance,
whether per day, week, or month. There are some spikes in the number of queries issued, suggesting high search activity on a particular day, week, or
month. In the monthly data we observe an increasing trend in the number of queries up to month 100 (end of
2014), after which it drops. This aligns with the start of the author’s use of an additional search engine. The number of
entities linked follows the same trends as the number of queries issued, albeit at about 2/3 the volume. On average there
are 15.57 queries per day, 107.0 queries per week, and 468.3 queries per month. The number of entities linked is
on average 9.884 per day, 67.93 per week, and 297.3 per month. In terms of time periods without queries, we find that in
the period May 2006 to May 2017 there are 701 days without queries and 887 days on which no entities were linked. In
terms of months, there are 3 months in which no queries were issued and 4 months in which no entities could be
linked. These all occurred in 2006. Therefore, in the remainder of the paper we use the period Jan 2007 to Dec 2016
and analyze the data at a monthly granularity.</p>
      <p>Quantitative Results. To measure demonstrated potential domain knowledge (DPDK) we compute the cumulative number
of unique entities expressed per month. The entities are unique since only the first occurrence of an entity is considered.
This reflects the intuition that multiple expressions of the same knowledge do not increase the overall DPDK. The left
graph in Figure 2 shows the cumulative number of queries (darkgray) and entities (lightgray) each month. We observe that
the cumulative number of queries and the cumulative number of entities follow a similar curve. This follows as well from
Figure 1 where we observed that roughly 60 to 70 percent of the queries are consistently linked to an entity over time. The
right graph in Figure 2 shows the cumulative number of unique queries and entities, when only the first occurrence of a
query (or entity) is taken. Here we no longer observe similar trends between the unique cumulative queries and unique
cumulative entities. The unique cumulative entities now exhibit a linear relationship over time. This is an interesting
finding for two reasons. First, we observe that a metric based on queries and a metric based on entities measure different
things. It suggests that the step of linking queries (or expressions of beliefs) to a knowledge base succeeds in differentiating
between searching for information and expressing beliefs about facts in a domain. Second, it suggests that over 10 years
the author sought out new knowledge at roughly the same rate. It would be interesting to investigate whether this
pattern generalizes to a wider sample and whether it correlates with curiosity or other personality traits.</p>
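      <p>The DPDK computation described above, i.e., the cumulative number of unique entities per month, can be sketched as follows. The input format (pairs of a YYYY-MM month string and an entity identifier) and the function name are assumptions made for illustration.</p>
      <preformat>
```python
from typing import Dict, Iterable, Tuple


def cumulative_unique_entities(
        linked: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map each month (YYYY-MM) to the cumulative count of unique entities.

    `linked` holds (month, entity_id) pairs. Only the first occurrence of
    an entity counts, so repeated expressions do not increase the total.
    """
    seen = set()
    new_per_month: Dict[str, int] = {}
    # YYYY-MM strings sort chronologically, so sorting orders the log by month.
    for month, entity in sorted(linked):
        new_per_month.setdefault(month, 0)
        if entity not in seen:
            seen.add(entity)
            new_per_month[month] += 1

    dpdk: Dict[str, int] = {}
    total = 0
    for month in sorted(new_per_month):
        total += new_per_month[month]
        dpdk[month] = total
    return dpdk
```
      </preformat>
      <p>Months without any linked entity do not appear in the result. A complete monthly series would need those months filled in with the previous cumulative value.</p>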
      <p>Next we examine two metrics derived from the DPDK: DPDK-velocity and DPDK-acceleration. The left
graph in Figure 3 shows the derivative at each point of the cumulative unique queries (dark gray) and entities (light gray).
We observe that the number of unique queries per month has high variance, similar to that observed for the number of queries
per month in Figure 1. In contrast, the number of unique entities per month is more stable, with one exception around
month 25. This is more clearly visible in the right graph, which shows the second derivative at each point of the cumulative
unique entities (for clarity the analogous data for queries is not shown). The author suspects this spike is due to finishing an
MSc thesis followed by a long holiday.</p>
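      <p>On a monthly series, the first and second derivatives can be approximated by finite differences. The sketch below assumes this discretization, which the text does not specify, and the cumulative counts are made-up numbers used only to show the mechanics.</p>
      <preformat>
```python
from typing import List


def first_difference(series: List[int]) -> List[int]:
    """Discrete derivative: change between consecutive months."""
    return [b - a for a, b in zip(series, series[1:])]


# Made-up cumulative unique-entity counts for five consecutive months.
dpdk = [120, 250, 370, 640, 760]

velocity = first_difference(dpdk)          # new unique entities per month
acceleration = first_difference(velocity)  # month-to-month change in velocity
```
      </preformat>
      <p>In this toy series the fourth month adds far more new entities than its neighbours, which shows up as a positive value followed by a negative value of similar size in the acceleration.</p>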
      <p>This part shows that using unique entities as a basis for measuring beliefs is different from using queries. It suggests that
query volume is not necessarily driven by quests for new knowledge but also by re-finding or perhaps knowledge outside
of the domain of Wikipedia. We have not touched on what kind of potential knowledge was expressed by the author. We
look into this next.</p>
      <p>Qualitative Results. The above analysis treats all expressions that are linked to the knowledge base as observed expressions
of beliefs about the domain of the knowledge base. It is not uncommon, however, for sub-domains to exist within a
knowledge base, especially one as broad as DBpedia. People, then, may express beliefs about a diverse set of domains
during a particular time, or such expressions could be focused on a particular domain. To analyze whether we observe
such behavior in the author’s log, we use the DBpedia categories, which represent different domains, to cluster queries that
are linked to entities within the same category in the knowledge base.</p>
      <p>[Figure panel (a): 2012-2013]</p>
      <p>The Wikipedia category structure is not a strict hierarchy. To assign each entity to a single category at a particular
level, we extract a hierarchy from the category structure. We use Category:Main_topic_classifications
as the root node of the category hierarchy, eliminate cycles, and pick the shortest path to the root
when multiple paths exist. Given this hierarchy, we find for each entity all its categories and the shortest path to the root.
We can then slice the hierarchy at a particular level and find for each category all the entities that are associated with it. The
numbers of identified entities covered by the different categories can be seen as a distribution of a user’s potential domain
knowledge. One way to use this distribution, for instance, is to compare the knowledge distribution between different
people, e.g., experts vs. novices. Since we lack the logs of multiple people, here we inspect changes in potential domain
knowledge within the same log year over year.</p>
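      <p>One way to realize the hierarchy extraction and slicing described above is a breadth-first search from the root, which visits every category once (so cycles are discarded) and reaches each category along a shortest path. The sketch below is an illustration under assumptions: the dictionary-based graph format, the function names, and the tie-breaking between equally short paths are not taken from the paper.</p>
      <preformat>
```python
from collections import defaultdict, deque
from typing import Dict, List, Set

ROOT = "Main_topic_classifications"


def shortest_paths(children: Dict[str, List[str]],
                   root: str = ROOT) -> Dict[str, List[str]]:
    """Breadth-first search over parent-to-subcategory edges.

    Returns, for every reachable category, a shortest path from the root.
    Each category is visited once, which removes cycles.
    """
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        for sub in children.get(category, []):
            if sub not in paths:
                paths[sub] = paths[category] + [sub]
                queue.append(sub)
    return paths


def slice_entities(entity_cats: Dict[str, List[str]],
                   paths: Dict[str, List[str]],
                   level: int) -> Dict[str, Set[str]]:
    """Group entities by their ancestor category at depth `level` (root = 0)."""
    by_category = defaultdict(set)
    for entity, cats in entity_cats.items():
        # Keep only categories that are reachable and deep enough to slice.
        reachable = [paths[c] for c in cats
                     if c in paths and len(paths[c]) > level]
        if not reachable:
            continue
        # Ties between equally short paths go to the first listed category.
        best = min(reachable, key=len)
        by_category[best[level]].add(entity)
    return dict(by_category)
```
      </preformat>
      <p>Counting the entities in each group at a chosen level gives the per-category distribution that is compared year over year.</p>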
      <p>We observe that in 2012-2013 most expressions were linked to the category “Probability and Statistics”. These queries
were issued during the author’s time as a PhD student in Information Retrieval, and the author suspects that this
prompted an increase in queries related to that category. In 2013-2014, just after finishing the PhD, the author was
teaching a course on Social Network Analysis to undergraduate students and suspects that this caused an increase in the
proportion of queries related to the category “Graph Theory”. In 2014-2015 it is more difficult to single out a category in
which new knowledge was demonstrated, but the authors note that they were preparing their wedding in that year,
which explains the rise in queries related to the category “Marriage”. The visualization of 2015-2016 shows a number of interesting
categories: “Human Reproduction”, “Fertility”, and “Tissues (Biology)”, as well as “Governance in the United Kingdom”
and “Taxation”. These categories again align with life events of the authors during that year, i.e., expecting
their first child and moving to the United Kingdom.</p>
      <p>A qualitative analysis of a query log like the one above would be very difficult for anyone but the person who
issued the queries. Even then it is hard to know in hindsight what prompted an acceleration in queries related to concepts in a particular
domain. What we do observe is that some of the accelerations can be explained by the occurrence of certain
life events. An interesting direction for future work would be to analyze how accelerations observed at different (lower)
levels of the hierarchy, i.e., more specialized domains, relate to events in users’ lives or to more specific tasks.</p>
      <p>In this paper we explored the potential of using search logs and knowledge bases to gain insight into a user’s potential
knowledge of certain domains. We explored the definition of a metric of knowledge based on the theory of justifiable true
beliefs. We further operationalized the measure as demonstrated potential domain knowledge (DPDK), i.e., a person’s
observed potential beliefs within a particular domain, based on a series of assumptions that relax the requirements for JTBs.
We showed that measuring the number of expressions of beliefs (e.g., queries) that can be linked to a knowledge base
is different from measuring the number of queries issued by a user. The former was observed to, surprisingly, increase
linearly over time. Further, we found that changes in DPDK within sub-domains of a knowledge base can be associated
with certain life events of a user. Although anecdotal in nature, these observations suggest that DPDK captures situations in
which a user is learning about new things and increasing his/her knowledge.</p>
      <p>This is a first step in creating a metric for demonstrated potential domain knowledge. We briefly highlight
some promising directions for future work: (i) validating the score with user experiments; (ii) improving the accuracy of
the metric by improving the linking of expressions to knowledge bases; (iii) linking expressions to relations as well as
objects; (iv) using the score in applications such as ranking with cognitive relevance or improving targeting for advertisements
and articles; and (v) extending the score to also incorporate justifications of a user’s true beliefs.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This research was supported by the Netherlands Organisation for Scientific Research (NWO) under project nr 13675.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          , G. Ottaviano, and
          <string-name>
            <given-names>E.</given-names>
            <surname>Meij</surname>
          </string-name>
          .
          <article-title>Fast and space-efficient entity linking for queries</article-title>
          .
          <source>In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining</source>
          , pages
          <fpage>179</fpage>
          -
          <lpage>188</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>101 philosophy problems</article-title>
          . Routledge,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>Aristotle's metaphysics</article-title>
          .
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Gettier</surname>
          </string-name>
          .
          <article-title>Is justified true belief knowledge?</article-title>
          <source>Analysis</source>
          ,
          <volume>23</volume>
          (
          <issue>6</issue>
          ):
          <fpage>121</fpage>
          -
          <lpage>123</lpage>
          ,
          <year>1963</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Goldman</surname>
          </string-name>
          .
          <article-title>What is justified belief?</article-title>
          <source>In Justification and Knowledge</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          . Springer,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Harter</surname>
          </string-name>
          .
          <article-title>Psychological relevance and information science</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>43</volume>
          (
          <issue>9</issue>
          ):
          <fpage>602</fpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Richardson</surname>
          </string-name>
          .
          <article-title>Learning about the world through long-term query logs</article-title>
          .
          <source>ACM Transactions on the Web (TWEB)</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>21</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Silverman</surname>
          </string-name>
          .
          <article-title>Plato's middle period metaphysics and epistemology</article-title>
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Greisdorf</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bateman</surname>
          </string-name>
          .
          <article-title>From highly relevant to not relevant: examining different regions of relevance</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>34</volume>
          (
          <issue>5</issue>
          ):
          <fpage>599</fpage>
          -
          <lpage>621</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>