<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SEKI@home, or Crowdsourcing an Open Knowledge Graph</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Steiner</string-name>
          <email>tsteiner@lsi.upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Mirea</string-name>
          <email>s.mirea@jacobs-university.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Jacobs University Bremen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya - Department LSI</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In May 2012, the Web search engine Google introduced the so-called Knowledge Graph, a graph that understands real-world entities and their relationships to one another. It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. Soon after its announcement, people started to ask for a programmatic method to access the data in the Knowledge Graph; as of today, however, Google does not provide one. With SEKI@home, which stands for Search for Embedded Knowledge Items, we propose a browser extension-based approach to crowdsource the task of populating a data store to build an Open Knowledge Graph. As people with the extension installed search on Google.com, the extension sends extracted anonymous Knowledge Graph facts from Search Engine Results Pages (SERPs) to a centralized, publicly accessible triple store, and thus over time creates a SPARQL-queryable Open Knowledge Graph. We have implemented and made available a prototype browser extension tailored to the Google Knowledge Graph; note, however, that the concept of SEKI@home is generalizable to other knowledge bases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
With the introduction of the Knowledge Graph, the search engine Google has
made a significant paradigm shift towards “things, not strings” [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as a post on
the official Google blog states. Entities covered by the Knowledge Graph include
landmarks, celebrities, cities, sports teams, buildings, movies, celestial objects,
works of art, and more. The Knowledge Graph enhances Google search in three
main ways: by disambiguation of search queries, by search log-based
summarization of key facts, and by explorative search suggestions. This triggered demand
for a method to access the facts stored in the Knowledge Graph
programmatically [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. At time of writing, however, no such programmatic method is available.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.2 On Crowdsourcing</title>
      <p>
        The term crowdsourcing was first coined by Jeff Howe in an article in the
magazine Wired [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is a portmanteau of “crowd” and “outsourcing”. Howe writes:
“The new pool of cheap labor: everyday people using their spare cycles to create
content, solve problems, even do corporate R&amp;D”. The difference from outsourcing
is that the crowd is undefined by design. We suggest crowdsourcing for the
described task of extracting facts from SERPs with Knowledge Graph results for
two reasons: (i) there is no publicly available list of the 500 million objects [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
in the Knowledge Graph, and (ii) even if there were such a list, it would not be
practicable (nor allowed by the terms and conditions of Google) to crawl it.
      </p>
    </sec>
    <sec id="sec-3">
      <title>1.3 Search Results as Social Media</title>
      <p>
        Kaplan and Haenlein have defined social media as “a group of Internet-based
applications that build on the ideological and technological foundations of Web
2.0, and that allow the creation and exchange of user-generated content” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We
argue that search results are social media as well, especially in the case of Google
with its tight integration of Google+, a feature called Search plus Your World [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>1.4 Contributions and Paper Structure</title>
      <p>In this position paper, we describe and provide a prototype implementation of
an approach, tentatively titled SEKI@home and based on crowdsourcing via
a browser extension, to make closed knowledge bases programmatically and
openly accessible. We demonstrate its applicability with the Google Knowledge
Graph. The extension can be added to the Google Chrome browser by
navigating to http://bit.ly/SEKIatHome, and the Open Knowledge Graph SPARQL
endpoint can be tested at http://openknowledgegraph.org/sparql (the SPARQL
endpoint and the extension were active from Aug. 11 to Sep. 6, 2012).</p>
      <p>The remainder of this paper is structured as follows. In Section 2, we highlight
related work in the field of extracting data from websites with RDF wrappers. In
Section 3, we describe the SEKI@home approach in detail. We provide a short
evaluation in Section 4. The paper ends with an outlook on future work in
Section 5 and a conclusion in Section 6.</p>
      <sec id="sec-4-1">
        <title>2 Related Work</title>
        <p>
          Wrappers around Web services or Web pages have been used in the past to
lift data from the original source to a meaningful, machine-readable RDF level.
Examples are the Google Art wrapper by Guéret [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which lifts the data from
the Google Art project [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], or the now discontinued SlideShare wrapper
(http://linkeddata.few.vu.nl/slideshare/) by the
same author. Such wrappers typically work by mimicking the URI scheme of the
site they are wrapping. Adapting parts of the URL of the original resource to
that of the wrapper provides access to the desired data. Wrappers do not offer
SPARQL endpoints, as their data is computed on the fly.
        </p>
        <p>With SEKI@home, we offer a related approach, which nevertheless differs in
the details, to lift closed knowledge bases like the Knowledge Graph and make
them accessible in machine-readable form. Since the entirety of the knowledge
base is unknown, crowdsourcing allows us to distribute the heavy burden of
crawling the whole Knowledge Graph across many shoulders. Finally, by storing
the extracted facts centrally in a triple store, our approach allows for openly
accessing the data via the standard SPARQL protocol.</p>
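        <p>To illustrate what access via the standard SPARQL protocol could look like, the following minimal sketch builds the request URL for a query sent via HTTP GET, where the query text travels in the query parameter. The function name buildSparqlRequestUrl and the catch-all query are illustrative assumptions; the endpoint is the one named in this paper, which was only active for a limited period, and no request is actually sent.</p>
        <preformat><![CDATA[
```javascript
// Sketch: build a SPARQL Protocol GET request URL (no network access).
// buildSparqlRequestUrl is a hypothetical helper, not part of the prototype.
function buildSparqlRequestUrl(endpoint, query) {
  const url = new URL(endpoint);
  // The SPARQL Protocol carries the query text in the "query" parameter.
  url.searchParams.set('query', query);
  return url.toString();
}

// A generic query that returns up to ten arbitrary triples.
const query = [
  'SELECT ?s ?p ?o',
  'WHERE { ?s ?p ?o }',
  'LIMIT 10',
].join('\n');

const requestUrl = buildSparqlRequestUrl(
  'http://openknowledgegraph.org/sparql',
  query
);
console.log(requestUrl);
```
]]></preformat>
        <p>The resulting URL could be handed to any HTTP client. URL encoding of the query text is left to the URL class, so the query itself can be written in plain SPARQL.</p>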
      </sec>
      <sec id="sec-4-2">
        <title>3 Methodology</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.1 Browser Extensions</title>
      <p>We have implemented our prototype browser extension for the Google Chrome
browser. Chrome extensions are small software programs that users can install
to enrich their browsing experience. Via so-called content scripts, extensions can
inject code into Web pages and modify their contents. Our extension
is activated whenever a user searches the Web with Google.</p>
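      <p>As a rough illustration of when such an extension would become active, a content script could first check that the current page is a Google search results page. The sketch below is illustrative only and not taken from the prototype; the function name isGoogleSearchPage and the exact host and path that are checked are hypothetical.</p>
      <preformat><![CDATA[
```javascript
// Sketch: decide whether a page URL looks like a Google search results page.
// Host and path are assumptions chosen for illustration.
function isGoogleSearchPage(href) {
  let url;
  try {
    url = new URL(href);
  } catch (e) {
    return false; // not a parseable absolute URL
  }
  const isGoogleHost = url.hostname === 'www.google.com';
  const isSearchPath = url.pathname === '/search';
  const hasQuery = url.searchParams.has('q');
  return isGoogleHost && isSearchPath && hasQuery;
}

console.log(isGoogleSearchPage('https://www.google.com/search?q=chuck+norris'));
```
]]></preformat>
      <p>In an actual content script, the argument would typically be window.location.href.</p>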
    </sec>
    <sec id="sec-6">
      <title>3.2 Web Scraping</title>
      <p>
        Web scraping is a technique to extract data from Web pages. We use CSS
selectors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to retrieve page content from SERPs that have an associated
real-world entity in the Knowledge Graph. An example selector is .kno-desc
(which matches all elements with the class name “kno-desc”); passed to the
JavaScript method document.querySelector, it returns the first matching
element, which holds the description of a Knowledge Graph entity.
      </p>
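      <p>The selector-based extraction described above can be sketched as follows. This is a minimal example under stated assumptions, not the code of the prototype: the function name extractDescription is hypothetical, and the function accepts any object that offers a querySelector method, so that it can be tried outside a browser with a small stand-in object.</p>
      <preformat><![CDATA[
```javascript
// Sketch: read the description of a Knowledge Graph entity from a SERP.
// "doc" is any object with a querySelector method, e.g., window.document.
function extractDescription(doc) {
  const node = doc.querySelector('.kno-desc');
  if (!node) {
    return null; // SERP without a Knowledge Graph description
  }
  return node.textContent.trim();
}

// Stand-in for a real DOM so that the sketch also runs outside a browser.
const fakeDocument = {
  querySelector: (selector) =>
    selector === '.kno-desc' ? { textContent: '  Example description  ' } : null,
};

console.log(extractDescription(fakeDocument));
```
]]></preformat>
      <p>In a content script, calling extractDescription(document) would apply the same logic to the live page.</p>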
    </sec>
    <sec id="sec-7">
      <title>3.3 Lifting the Extracted Knowledge Graph Data</title>
      <p>
        Although the claim of the Knowledge Graph is “things, not strings” [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], what gets
displayed to search engine users are strings, as can be seen in a screenshot
available at http://twitpic.com/ahqqls/full. In order to make this data
meaningful again, we need to lift it. We use JSON-LD [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a JSON representation
format for expressing directed graphs that can mix both Linked Data and
non-Linked Data in a single document. JSON-LD allows for adding meaning by simply
including or referencing a so-called (data) context. The syntax is designed to not
disturb already deployed systems running on JSON, but to provide a smooth
upgrade path from JSON to JSON-LD.
      </p>
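      <p>The idea of adding meaning by including a context can be sketched in a few lines. The example is a simplified illustration, not the code of the extension: the function name liftToJsonLd is hypothetical, and the two term mappings are taken from the context shown in Listing 1.</p>
      <preformat><![CDATA[
```javascript
// Sketch: attach a JSON-LD context to plain JSON without touching its keys.
function liftToJsonLd(plain, context) {
  // All original keys remain, so existing JSON consumers keep working.
  return { '@context': context, ...plain };
}

// Term-to-IRI mappings as they appear in the context of Listing 1.
const context = {
  Name: 'http://xmlns.com/foaf/0.1/name',
  Height: 'http://dbpedia.org/ontology/height',
};

const lifted = liftToJsonLd({ Name: 'Chuck Norris', Height: ['5\' 10"'] }, context);
console.log(JSON.stringify(lifted, null, 2));
```
]]></preformat>
      <p>A consumer that ignores the @context key still sees the same plain JSON as before, which is the smooth upgrade path mentioned above.</p>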
      <p>We have modeled the plaintext Knowledge Graph terms (or predicates) like
“Born”, “Full name”, “Height”, “Spouse”, etc. in an informal Knowledge Graph
ontology under the namespace okg (for Open Knowledge Graph) with spaces
converted to underscores. This ontology has already been partially mapped to
common Linked Data vocabularies. One example is okg:Description, which
directly maps to dbpprop:shortDescription from DBpedia. Just as there is no
known list of objects in the Knowledge Graph (see Subsection 1.2), there is no
known list of Knowledge Graph terms, which makes a complete mapping
impossible. We have collected roughly 380 Knowledge Graph terms at the time of
writing; mapping them to other Linked Data vocabularies, however, will be a
permanent work in progress. As an example, Listing 1 shows the lifted, meaningful
JSON-LD as returned by the extension.</p>
      <preformat>{
  "@id": "http://openknowledgegraph.org/data/H4sIAAAAA[...]",
  "@context": {
    "Name": "http://xmlns.com/foaf/0.1/name",
    "Topic_Of": {
      "@id": "http://xmlns.com/foaf/0.1/isPrimaryTopicOf",
      "@type": "@id"
    },
    "Derived_From": {
      "@id": "http://www.w3.org/ns/prov#wasDerivedFrom",
      "@type": "@id"
    },
    "Fact": "http://openknowledegraph.org/ontology/Fact",
    "Query": "http://openknowledegraph.org/ontology/Query",
    "Full_name": "http://xmlns.com/foaf/0.1/givenName",
    "Height": "http://dbpedia.org/ontology/height",
    "Spouse": "http://dbpedia.org/ontology/spouse"
  },
  "Derived_From": "http://www.google.com/insidesearch/features/search/knowledge.html",
  "Topic_Of": "http://en.wikipedia.org/wiki/Chuck_Norris",
  "Name": "Chuck Norris",
  "Fact": ["Chuck Norris can cut thru a knife w/ butter."],
  "Full_name": ["Carlos Ray Norris"],
  "Height": ["5' 10\""],
  "Spouse": [
    {
      "@id": "http://openknowledgegraph.org/data/H4sIA[...]",
      "Query": "gena o'kelley",
      "Name": "Gena O'Kelley"
    },
    {
      "@id": "http://openknowledgegraph.org/data/H4sIA[...]",
      "Query": "dianne holechek",
      "Name": "Dianne Holechek"
    }
  ]
}</preformat>
      <p>Listing 1. Subset of the meaningful JSON-LD from the Chuck Norris Knowledge
Graph data. The mapping of the Knowledge Graph terms can be seen in the @context.</p>
    </sec>
    <sec id="sec-8">
      <title>3.4 Maintaining Provenance Data</title>
      <p>
        The facts extracted via the SEKI@home approach are derived from existing
third-party knowledge bases, like the Knowledge Graph. A derivation is a
transformation of an entity into another, a construction of an entity into another,
or an update of an entity, resulting in a new one. In consequence, it is
considered good form to acknowledge the original source, i.e., the Knowledge Graph,
which we have done via the property prov:wasDerivedFrom from the PROV
Ontology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for each entity.
      </p>
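      <p>Attaching the provenance property to an entity can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code of the extension: the function name withProvenance is hypothetical, while the property IRI and the source URL are the ones that appear in Listing 1.</p>
      <preformat><![CDATA[
```javascript
const PROV_WAS_DERIVED_FROM = 'http://www.w3.org/ns/prov#wasDerivedFrom';

// Sketch: return a copy of a lifted entity that acknowledges its source.
// withProvenance is a hypothetical helper; the input entity is not modified.
function withProvenance(entity, sourceUrl) {
  const context = {
    ...(entity['@context'] || {}),
    // "@type": "@id" tells JSON-LD processors that the value is an IRI.
    Derived_From: { '@id': PROV_WAS_DERIVED_FROM, '@type': '@id' },
  };
  return { ...entity, '@context': context, Derived_From: sourceUrl };
}

const entity = {
  '@context': { Name: 'http://xmlns.com/foaf/0.1/name' },
  Name: 'Chuck Norris',
};
const annotated = withProvenance(
  entity,
  'http://www.google.com/insidesearch/features/search/knowledge.html'
);
console.log(JSON.stringify(annotated, null, 2));
```
]]></preformat>
      <p>Returning a copy keeps the extraction step and the provenance step separate, so the same entity could be annotated with a different source when the approach is applied to another knowledge base.</p>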
      <sec id="sec-8-1">
        <title>4 Evaluation</title>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4.1 Ease of Use</title>
      <p>At the time of writing, we have evaluated the SEKI@home approach for the
criterion ease of use with 15 users with medium to advanced computer
and programming skills who had installed a pre-release version of the browser
extension and who simply browsed the Google Knowledge Graph by
following links, starting from the URL
https://www.google.com/search?q=chuck+norris, which triggers Knowledge Graph
results. One of our design goals when we conceived SEKI@home was to make it as
unobtrusive as possible. We asked the users to install the extension and to
tell us whether they noticed any difference at all when using Google. None of
them did, although in the background the extension was sending extracted
Knowledge Graph facts to the RDF triple store at full pace.</p>
    </sec>
    <sec id="sec-10">
      <title>4.2 Data Statistics</title>
      <p>On average, 31 triples are added to the triple store per SERP
with a Knowledge Graph result. Knowledge Graph results vary in their level of
detail. We have calculated an average of about 5 Knowledge Graph terms
(or predicates) per SERP with a Knowledge Graph result. While some
Knowledge Graph values (or objects) are plaintext strings like the value “Carlos Ray
Norris” for okg:Full_name, others are references to other Knowledge Graph
entities, like a value for okg:Movies_and_TV_shows. The ratio of reference
values to plaintext values is about 1.5, which suggests that the Knowledge Graph is
well interconnected.</p>
    </sec>
    <sec id="sec-11">
      <title>4.3 Quantitative Evaluation</title>
      <p>In its short lifetime from August 11 to September 6, 2012, the users of the
extension collected exactly 2,850,510 RDF triples. In that period, a total of 39 users
had the extension installed in production.</p>
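      <p>The reported figures allow for a small back-of-the-envelope calculation. The sketch below derives the average number of triples per user and a rough estimate of the number of processed SERPs. The latter rests on the assumption that the average of 31 triples per SERP held over the whole collection period, so it is an estimate and not a measured result.</p>
      <preformat><![CDATA[
```javascript
// Figures reported in the evaluation above.
const TOTAL_TRIPLES = 2850510;
const USERS = 39;
const AVG_TRIPLES_PER_SERP = 31;

// Average number of triples contributed per user.
const triplesPerUser = TOTAL_TRIPLES / USERS;

// Rough estimate of processed SERPs, assuming the per-SERP average
// held over the whole period (an assumption, not a measurement).
const estimatedSerps = Math.round(TOTAL_TRIPLES / AVG_TRIPLES_PER_SERP);

console.log(triplesPerUser); // 73090
console.log(estimatedSerps); // 91952
```
]]></preformat>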
      <sec id="sec-11-1">
        <title>5 Future Work</title>
        <p>
          A concrete next step for the current application of our approach to the
Knowledge Graph is to provide a more comprehensive mapping of Knowledge Graph
terms to other Linked Data vocabularies, a task whose difficulty was outlined in
Subsection 3.3. At time of writing, we have applied the SEKI@home approach
to a concrete knowledge base, namely the Knowledge Graph. In the future, we
want to apply SEKI@home to similar closed knowledge bases. Videos from video
portals like YouTube or Vimeo can be semantically enriched, as we have shown
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for the case of YouTube. We plan to apply SEKI@home to semantic video
enrichment by splitting the computationally heavy annotation task and storing the
extracted facts centrally in a triple store to allow for open SPARQL access.
In [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we have proposed the creation of a comments archive of things people
said about real-world entities on social networks like Twitter, Facebook, and
Google+, which we plan to realize via SEKI@home.
        </p>
      </sec>
      <sec id="sec-11-2">
        <title>6 Conclusion</title>
        <p>In this paper, we have shown a generalizable approach to first open up closed
knowledge bases by means of crowdsourcing, and then make the extracted facts
universally and openly accessible. As an example knowledge base, we have used
the Google Knowledge Graph. The extracted facts can be accessed via the
standard SPARQL protocol from the Google-independent Open Knowledge Graph
website (http://openknowledgegraph.org/sparql). Just as knowledge bases, and
the Knowledge Graph in particular, evolve over time, the facts extracted via the
SEKI@home approach eventually mirror those changes. Provided that
provenance of the extracted data is handled appropriately, we hope to have
contributed a useful, socially enabled link in the chain to the Linked Data world.</p>
      </sec>
      <sec id="sec-11-3">
        <title>Acknowledgments</title>
        <p>T. Steiner is partially supported by the EC under Grant No. 248296 FP7 (I-SEARCH).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Guéret</surname>
          </string-name>
          . “
          <article-title>GoogleArt - Semantic Data Wrapper (Technical Update)</article-title>
          ”, SemanticWeb.com, Mar.
          <year>2011</year>
          . http://semanticweb.com/googleart-semantic-data-wrapper-technical-update_b18726.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Howe</surname>
          </string-name>
          .
          <article-title>The Rise of Crowdsourcing</article-title>
          .
          <source>Wired</source>
          ,
          <volume>14</volume>
          (
          <issue>6</issue>
          ), June
          <year>2006</year>
          . http://www.wired.com/wired/archive/14.06/crowds.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>L.</given-names>
            <surname>Hunt</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>van Kesteren</surname>
          </string-name>
          .
          <article-title>Selectors API Level 1</article-title>
          . Candidate Recommendation, W3C, June
          <year>2012</year>
          . http://www.w3.org/TR/selectors-api/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Haenlein</surname>
          </string-name>
          .
          <article-title>Users of the world, unite! The challenges and opportunities of Social Media</article-title>
          .
          <source>Business Horizons</source>
          ,
          <volume>53</volume>
          (
          <issue>1</issue>
          ):
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
          , Jan.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T.</given-names>
            <surname>Lebo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corsar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soiland-Reyes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zednik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>PROV-O: The PROV Ontology</article-title>
          . Working Draft, W3C, July
          <year>2012</year>
          . http://www.w3.org/TR/prov-o/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Questioner on Quora.com. “
          <article-title>Is there a Google Knowledge Graph API (or another third party API) to get semantic topic suggestions for a text query?</article-title>
          ”, May
          <year>2012</year>
          . http://bit.ly/Is-there-a-Google-Knowledge-Graph-API.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Singhal</surname>
          </string-name>
          . “
          <article-title>Introducing the Knowledge Graph: things, not strings</article-title>
          ”, Google Blog, May
          <year>2012</year>
          . http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Singhal</surname>
          </string-name>
          . “
          <article-title>Search, plus Your World</article-title>
          ”, Google Blog, Jan.
          <year>2012</year>
          . http://googleblog.blogspot.com/2012/01/search-plus-your-world.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Sood</surname>
          </string-name>
          . “
          <article-title>Explore museums and great works of art in the Google Art Project</article-title>
          ”, Google Blog, Feb.
          <year>2011</year>
          . http://googleblog.blogspot.com/2011/02/explore-museums-and-great-works-of-art.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>M.</given-names>
            <surname>Sporny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Longley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kellogg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanthaler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Birbeck</surname>
          </string-name>
          .
          <article-title>JSON-LD Syntax 1.0, A Context-based JSON Serialization for Linking Data</article-title>
          . Working Draft, W3C, July
          <year>2012</year>
          . http://www.w3.org/TR/json-ld-syntax/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>T.</given-names>
            <surname>Steiner</surname>
          </string-name>
          .
          <article-title>SemWebVid - Making Video a First Class Semantic Web Citizen and a First Class Web Bourgeois</article-title>
          .
          <source>In Proceedings of the ISWC 2010 Posters &amp; Demonstrations Track</source>
          , Nov.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. T. Steiner,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gabarro</surname>
          </string-name>
          , and R. V. de Walle.
          <article-title>Adding Realtime Coverage to the Google Knowledge Graph</article-title>
          .
          <source>In Proceedings of the ISWC 2012 Posters &amp; Demonstrations Track</source>
          . (Accepted for publication.)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>