<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Programmatic Access to Crowdsourced Human Computation for Designing and Enhancing Interlinking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristina Sarasua</string-name>
          <email>csarasua@uni-koblenz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Web Science and Technologies (WeST) University of Koblenz-Landau</institution>
        </aff>
      </contrib-group>
      <fpage>29</fpage>
      <lpage>34</lpage>
      <abstract>
        <p>Despite the growing number of LOD datasets and the increasing variety of topical domains they cover, most of the links connecting RDF resources of different datasets are identity links (often between descriptions that match perfectly), and a high number of datasets still do not contain any out-links. The creation of different types of links and the analysis of the Linked Data space for new interlinking possibilities are time-consuming and tedious. This paper describes a crowd-powered approach to knowledge integration that aims at supporting data publishers in designing new interlinking processes, as well as in validating and enhancing automatically computed links.</p>
      </abstract>
      <kwd-group>
        <kwd>data interlinking</kwd>
        <kwd>microtask crowdsourcing</kwd>
        <kwd>human computation</kwd>
        <kwd>relevance</kwd>
        <kwd>enhancement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Data interlinking is one of the critical tasks towards the realization of the global data
space on the Web [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As Schmachtenberg et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] reported, most of the interlinking
efforts have been focused on defining identity links between RDF resources of different
and distributed datasets (using the owl:sameAs predicate), few datasets have become
prominent interlinking hubs (e. g. DBpedia and Geonames) receiving a high number
of in-links, and still 44% of the analyzed datasets do not contain out-links. In order to
improve the current interlinking status in terms of heterogeneity (e. g. creating more
domain-specific links) and quantity (e. g. connecting each dataset to more datasets—as
long as it is semantically possible), there are at least two issues that need to be addressed:
on the one hand, general-purpose (semi-)automatic link discovery methods have
computational limitations: domain- and dataset-independent interlinking systems lack the
specific comparison functions required for creating particular domain-specific links
(e. g. Khrouf et al. extended state-of-the-art tools for their specific needs in the context
of the EventMedia dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). Moreover, the (semi-)automatic discovery of links
between resources with heterogeneous descriptions can be troublesome (e. g. when
trying to geolocate a Point of Interest whose description does not provide explicit location
information). On the other hand, as the number of datasets increases, data publishers
require methods that assist them in deciding which datasets to target and how to define
the interlinks.
      </p>
      <p>
        Methods that address these issues relying exclusively on machine computation [
        <xref ref-type="bibr" rid="ref5 ref9">9,5</xref>
        ]
have shortcomings: not all scenarios provide authority files to support the matching,
and dataset recommendation based on existing interlinking merely reproduces what already exists.
Complementing these approaches with human computation is therefore valuable, because
humans can process and relate other sources, solve matching tasks that an automatic
method cannot solve due to the lack of evidence to learn heuristics from, and judge the
relevance of particular information within a context.
      </p>
      <p>This paper presents CROWDKI, a system that automatically collects human input
that becomes useful for two steps of data interlinking: (1) the design of interlinking
processes and (2) the enhancement of automatically computed links. In order to do so,
CROWDKI uses microtask crowdsourcing.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 CROWDKI: Knowledge Integration for and with the Crowd</title>
      <p>
        CROWDKI2 is a system that automatically creates and publishes microtasks (i. e. simple
tasks) in online labor marketplaces (e. g. Clickworker, ClixSense, etc.), in which people
all around the world and with different backgrounds [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] accomplish such tasks in return
for a small amount of money. The main advantage of microtask crowdsourcing compared
to other crowdsourcing genres is that its large available workforce facilitates fast results
in a constant and cost-effective manner.
      </p>
      <sec id="sec-2-1">
        <title>2.1 Use cases</title>
        <p>Assessing the relevance of different interlinking possibilities: given a list of
interlinking possibilities (i. e. D1.uriClass1,uriPredicate,D2.uriClass2) applicable to pairs
of RDF datasets, the DCAT and VoiD descriptions of the content of the datasets3, and a
description of the context in which the integrated data will be consumed, CROWDKI
generates survey-style microtasks that ask humans how relevant the information enabled by
the interlinking possibilities is for the specific context. Figure 1a shows the kind of
questions that CROWDKI asks the crowd in order to collect relevance judgments. The context in
this example is the official Web site of a car company, and there are three interlinking
possibilities: connecting Cars and Persons with any of the predicates wasDesignedBy,
wasDrivenBy, or wasRecommendedBy. For each of these possibilities CROWDKI asks workers (1)
to rate the relevance of the type of information in the context of the official Web site
of the company, (2) to (voluntarily) explain the reason for such a judgment, and (3) to
classify the relevance judgment. The last question is intended to reveal what makes a type
of link less relevant (i. e. the predicate or the objects of the target dataset).</p>
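The structure of these relevance microtasks can be illustrated with a short sketch (the class and method names are illustrative, not CROWDKI's actual API):

```java
// Minimal sketch (hypothetical names, not CROWDKI's actual classes): an
// interlinking possibility as described above, rendered as the relevance
// question of a survey-style microtask.
public class RelevanceMicrotaskSketch {

    /** One interlinking possibility: D1.uriClass1, uriPredicate, D2.uriClass2. */
    record InterlinkingPossibility(String sourceClass, String predicate, String targetClass) {}

    /** Renders the relevance-rating question for a given consumption context. */
    static String relevanceQuestion(InterlinkingPossibility p, String context) {
        return String.format(
                "In the context of %s, how relevant is connecting %s to %s via %s? (1-5)",
                context, p.sourceClass(), p.targetClass(), p.predicate());
    }

    public static void main(String[] args) {
        String context = "the official Web site of a car company";
        for (String pred : new String[] {"wasDesignedBy", "wasDrivenBy", "wasRecommendedBy"}) {
            System.out.println(relevanceQuestion(
                    new InterlinkingPossibility("Car", "Person", pred), context));
        }
    }
}
```

One such question is generated per interlinking possibility, together with the free-text and classification follow-up questions described above.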
        <p>Validating and enhancing automatically computed links: given two RDF datasets
and a set of candidate links (e. g. from a link discovery algorithm), CROWDKI generates
a set of microtasks that ask the crowd to review each candidate link. The set of candidate
links can include links both accepted and rejected by an automatic interlinking tool, in
order to validate and enhance its results. CROWDKI can also generate the Cartesian
product set, as well as a balanced set of correct and incorrect links based on a reference
interlinking (however, these link generators do not currently support large datasets).
Figure 1b shows an example of such microtasks, in which the descriptions of two RDF
resources are displayed and the user is asked about the relation between the two resources.</p>
        <sec id="sec-2-1-1">
          <title>2 CROWDKI https://github.com/criscod/CROWDKI</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>3 DCAT http://www.w3.org/TR/vocab-dcat/ and VoiD http://www.w3.org/TR/void/</title>
        </sec>
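The two candidate-link generators mentioned above can be sketched as follows (illustrative names, not CROWDKI's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two candidate-link generators described above (names are
// illustrative, not CROWDKI's actual classes): the Cartesian product of two
// resource sets, and a balanced sample of correct/incorrect links drawn
// from a reference interlinking.
public class CandidateLinkGenerators {

    record Link(String source, String target) {}

    /** Every possible pair of resources from the two datasets. */
    static List<Link> cartesianProduct(List<String> d1, List<String> d2) {
        List<Link> candidates = new ArrayList<>();
        for (String s : d1)
            for (String t : d2)
                candidates.add(new Link(s, t));
        return candidates;
    }

    /** n correct links from the reference plus n incorrect ones not in it. */
    static List<Link> balancedSet(List<Link> all, List<Link> reference, int n) {
        List<Link> result = new ArrayList<>();
        reference.stream().limit(n).forEach(result::add);
        all.stream().filter(l -> !reference.contains(l)).limit(n).forEach(result::add);
        return result;
    }
}
```

The quadratic size of the Cartesian product also makes clear why these generators do not scale to large datasets.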
      </sec>
      <sec id="sec-2-2">
        <title>2.2 System description</title>
        <p>CROWDKI uses CrowdFlower4 to publish microtasks, because it distributes microtasks in
multiple marketplaces reaching millions of users, and it provides support for gold-standard-based
quality assurance (i. e. microtasks with a known answer, used for instructing and
testing the accuracy of crowd workers). The communication between CROWDKI and
CrowdFlower is done using the CrowdFlower RESTful API5. CROWDKI is implemented
in Java and it is divided into different components grouped in packages according to the
microtask management cycle:
Microtask generation: This package includes functionality for parsing and preparing
the data to be included in the microtasks (either interlinking possibilities or candidate
links), as well as classes for generating the different types of microtasks (i. e. different
templates for each of the use cases).</p>
        <sec id="sec-2-2-1">
          <title>4 CrowdFlower http://www.crowdflower.com/</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>5 CrowdFlower API http://goo.gl/D2BZUZ</title>
          <p>Microtasks in CrowdFlower are created using the CrowdFlower Markup Language6, in combination with HTML markup and
Javascript (if needed). The basic structure of each CROWDKI microtask template
is stored in a separate text file; the templates are therefore reusable and extensible. The
microtask settings (e. g. the number of trusted judgments required per microtask, the
number of microtasks per page, the payment, and the minimum number of gold microtasks
required to pass) are read from a configuration file. Apache Jena7 is used to parse
and query RDF data with SPARQL (to get the list of property values to be included
in the microtasks, a list that is also defined in the configuration file). Read/write access to
other, non-RDF files is done using the Guava IO library8.</p>
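The kind of SPARQL query issued through Jena to collect the property values displayed in a microtask could look like the following sketch (the class and property names are hypothetical, not taken from an actual configuration file):

```sparql
# Illustrative query: fetch the label and one configured property of each
# resource of a given class, to be displayed in the microtask interface.
# The ex: namespace and its terms are hypothetical placeholders.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/vocab#>

SELECT ?resource ?label ?value
WHERE {
  ?resource a ex:PointOfInterest ;
            rdfs:label ?label ;
            ex:someConfiguredProperty ?value .
}
LIMIT 100
```

The selected values are then bound into the microtask template before publication.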
          <p>Microtask publication: This package contains the classes for publishing the generated
microtasks in CrowdFlower. Gold microtasks are created separately from the
normal microtasks, but following the same design. The data for such microtasks are
provided directly from an RDF file. While it is possible to create gold microtasks
programmatically, it is better to write the explanations on the gold microtasks manually:
even though this requires some time, gold answers need to explain to crowd workers
how the specific microtasks work. CrowdFlower offers the possibility to launch
microtasks with a QuizMode (i. e. crowd workers have to pass a test consisting of
only gold microtasks before being able to work on the rest of the microtasks). Targeted
crowdsourcing is outside the scope of CrowdFlower, but it is possible to select between
high-speed or high-quality workers, filter geographical regions, and restrict the work
to people speaking a particular language. CROWDKI may easily be extended to
support other crowdsourcing platforms.</p>
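As an illustration of the publication step, the following sketch builds (but does not send) a job-creation request with the JDK's HTTP client; the endpoint and form fields are hypothetical placeholders, not the actual CrowdFlower API, whose endpoints are documented in the API reference cited above:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Illustration of the publication step. The endpoint and form fields below
// are hypothetical placeholders, not the actual CrowdFlower API.
public class MicrotaskPublisherSketch {

    /** Builds (but does not send) a POST request that would create a job. */
    static HttpRequest buildCreateJobRequest(String apiKey, String title, String cml) {
        String body = "key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&job[title]=" + URLEncoder.encode(title, StandardCharsets.UTF_8)
                + "&job[cml]=" + URLEncoder.encode(cml, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.org/v1/jobs.json")) // placeholder host
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

A real client would send the request with `java.net.http.HttpClient` and inspect the response for the created job identifier.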
          <p>Response collection: Once the required number of trusted workers has accomplished
the microtasks, CrowdFlower generates several reports containing the results and
information about the crowd workers (e. g. the marketplace they have worked from,
their location and the time spent on the microtask). The platform also generates
further statistics which can be seen exclusively from the requester GUI (e. g. the
agreement of crowd workers with the majority vote). Such reports are the output of
the first CROWDKI use case (i. e. assessing the relevance of interlinking possibilities).
However, for the realization of the second use case (i. e. enhancement of generated
links), CROWDKI contains further classes to process the responses of the crowd,
compute the aggregated response (defined by majority voting), and serialize the crowd
interlinking in N-Triples.</p>
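The aggregation step described above can be sketched as follows (illustrative names, not CROWDKI's actual classes): the answer given by the majority of trusted workers is kept, and accepted links are serialized as N-Triples lines.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the response-collection step (hypothetical names): majority
// voting over the collected judgments, and N-Triples serialization of an
// accepted link.
public class ResponseAggregationSketch {

    /** Returns the most frequent answer among the collected judgments. */
    static String majorityVote(List<String> judgments) {
        return judgments.stream()
                .collect(Collectors.groupingBy(j -> j, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow()
                .getKey();
    }

    /** One N-Triples line for an accepted link between two resources. */
    static String toNTriple(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }
}
```

Running `majorityVote` over the trusted judgments of each candidate link and emitting one `toNTriple` line per accepted link yields the crowd interlinking file.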
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Lessons Learned</title>
      <p>Communication is key to success: crowd workers require detailed instructions and
contextual information about the data. Crowd workers complained about gold standard
questions that indicated that two persons were the same when they had a different date
of birth—without knowing about the existence of typos in the data, their claim made
perfect sense. Iterating after collecting feedback from crowd workers (also through forms
outside the platform) can improve the task design considerably.</p>
      <sec id="sec-3-3-1">
        <title>Communities of crowd workers and requesters are emerging</title>
        <p>Crowd workers and requesters discuss on Twitter and in specialized forums, and they report on
their satisfaction with accomplished/offered work. Adopting de facto standards (e. g. for the
reward to pay) is important in order to remain competitive with regard to other requesters.</p>
      </sec>
      <sec id="sec-3-3-2">
        <title>First decide what and whether to crowdsource</title>
        <p>Since microtask crowdsourcing comes at a cost, it is better to restrict its use to cases
that require it: connecting datasets by location when both datasets contain country ISO
codes can be perfectly done by Silk9. Testing the feasibility on a subset of the data can
save money and time. The limitations are that CROWDKI requires several hours/days to get
input from the crowd, and that selecting particular crowd workers in CrowdFlower may only
be done after testing the workers and programming workarounds (e. g. filtering worker IDs
with Javascript / publishing multiple sets of microtasks).</p>
      </sec>
      <sec id="sec-3-1">
        <title>6 CML http://goo.gl/2rqzYk</title>
      </sec>
      <sec id="sec-3-2">
        <title>7 Apache Jena https://jena.apache.org/</title>
      </sec>
      <sec id="sec-3-3">
        <title>8 Guava IO http://goo.gl/T66Dih</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Related Work</title>
      <p>
        The use of human computation in Semantic Web tasks has been acknowledged as an
effective way to overcome the limitations of automatic methods [
        <xref ref-type="bibr" rid="ref1 ref10 ref9">10,1,9</xref>
        ]. There have
been several works on using microtask crowdsourcing for tasks related to RDF data
interlinking: CrowdER [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and ZenCrowd [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are two of the most significant ones
for entity linking and instance matching. OpenRefine also includes a crowdsourcing
extension for LOD URI reconciliation10. While the work presented in this paper, which is
an extension of our previous work in ontology alignment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], shares some commonalities
with these approaches (e. g. the instance matching validation microtasks), the scenarios
covered by CROWDKI are broader (i. e. other domain-specific links and relevance
assessment) and, therefore, the challenges faced are different. The goal of CROWDKI is
to provide the infrastructure to extend interlinking tools with human computation.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>
        The system introduced in this paper enables crowd-powered knowledge integration on
the Web of Data. Dataset recommendation systems for interlinking could leverage the
human labels that CROWDKI can collect about the different interlinking possibilities,
and combine this information with other automatically extracted criteria such as the
current popularity of LOD datasets as defined by DING! [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Different interlinking
possibilities can be generated by querying the LOV dataset11. Additionally, relevance
microtasks could also be used to assess the relevance of predicates in different contexts
or perspectives. Interlinking validation microtasks become a useful post-processing
extension for interlinking systems. Future work will focus on the optimization of the
approach (e. g. including task assignment procedures) and the definition of new use cases.
      </p>
      <sec id="sec-5-1-1">
        <title>Acknowledgments</title>
        <p>The author would like to thank Elena Simperl, Steffen Staab, Natasha Noy and Matthias
Thimm for the discussions about crowdsourced interlinking. The research leading to
these results has received funding from the European Union’s FP7 under grant agreement
no. 611242 (Sense4Us project).</p>
      </sec>
      <sec id="sec-5-1">
        <title>9 http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/</title>
        <p>10 http://goo.gl/6hfuTk
11 Linked Open Vocabularies http://lov.okfn.org/dataset/lov/</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The global brain semantic web interleaving human-machine knowledge and computation</article-title>
          .
          <source>In: Workshop on What will the Semantic Web Look Like 10 Years From Now? at ISWC</source>
          <year>2012</year>
          , Boston, MA. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Demartini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Difallah</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cudré-Mauroux</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Large-scale linked data integration using probabilistic reasoning and crowdsourcing</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>22</volume>
          (
          <issue>5</issue>
          ),
          <fpage>665</fpage>
          -
          <lpage>687</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Linked data: Evolving the web into a global data space</article-title>
          .
          <source>Synthesis lectures on the semantic web: theory and technology 1(1)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>136</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Khrouf</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
          </string-name>
          , R.:
          <article-title>Eventmedia: A lod dataset of events illustrated with media</article-title>
          .
          <source>Semantic Web journal</source>
          , Special Issue on Linked Dataset descriptions
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leme</surname>
            ,
            <given-names>L.A.P.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casanova</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Two approaches to the dataset interlinking recommendation problem</article-title>
          .
          <source>In: Web Information Systems Engineering - WISE</source>
          <year>2014</year>
          , pp.
          <fpage>324</fpage>
          -
          <lpage>339</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irani</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silberman</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaldivar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomlinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Who are the crowdworkers?: shifting demographics in mechanical turk</article-title>
          .
          <source>In: CHI '10 Extended Abstracts on Human Factors in Computing Systems</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sarasua</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          :
          <article-title>Crowdmap: Crowdsourcing ontology alignment with microtasks</article-title>
          .
          <source>In: The Semantic Web-ISWC</source>
          <year>2012</year>
          , pp.
          <fpage>525</fpage>
          -
          <lpage>541</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Schmachtenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Adoption of the linked data best practices in different topical domains</article-title>
          .
          <source>In: The Semantic Web-ISWC</source>
          <year>2014</year>
          , pp.
          <fpage>245</fpage>
          -
          <lpage>260</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Euzenat</surname>
          </string-name>
          , J.:
          <article-title>Ontology matching: state of the art and future challenges</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>25</volume>
          (
          <issue>1</issue>
          ),
          <fpage>158</fpage>
          -
          <lpage>176</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Siorpaes</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Human intelligence in the process of semantic content creation</article-title>
          .
          <source>World Wide Web</source>
          <volume>13</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>33</fpage>
          -
          <lpage>59</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Toupikov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tummarello</surname>
          </string-name>
          , G.:
          <article-title>DING! Dataset ranking using formal descriptions</article-title>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kraska</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
          </string-name>
          , J.:
          <article-title>CrowdER: Crowdsourcing entity resolution</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>5</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1483</fpage>
          -
          <lpage>1494</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>