<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Crowdsourcing Feedback for Pay-As-You-Go Data Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fernando Osorno-Gutierrez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norman W. Paton</string-name>
          <email>norm@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alvaro A. A. Fernandes</string-name>
          <email>alvaro@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, University of Manchester</institution>
          ,
          <addr-line>Oxford Road, Manchester, M13 9PL</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, University of Manchester</institution>
          ,
          <addr-line>Oxford Road, Manchester, M13 9PL</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>Providing an integrated representation of data from heterogeneous data sources involves the specification of mappings that transform the data into a consistent logical schema. With a view to supporting large-scale data integration, the specification of such mappings can be carried out automatically using algorithms and heuristics. However, automatically generated mappings typically provide partial and/or incorrect results. Users can help to improve such mappings; expert users can act on the mappings directly using data integration tools, and end users or crowds can provide feedback in a pay-as-you-go fashion on results from the mappings. Such feedback can be used to inform the selection and refinement of mappings, thus improving the quality of the integration and reducing the need for expensive and potentially scarce expert staff. In this paper, we investigate the use of crowdsourcing to obtain feedback on mapping results that inform mapping selection and refinement. The investigation involves an experiment in Amazon Mechanical Turk that obtains feedback from the crowd on the correctness of mapping results. The paper describes this experiment, considers generic issues such as reliability, and reports the results for different mappings and reliability strategies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Large-scale data integration, for example over web sources,
is challenging due to the heterogeneities that inevitably
result from multiple autonomous data publishers. Classical
data integration is labour-intensive, and tends to be applied
to produce high-quality but high-cost integrations in
reasonably stable environments. As a result, there has been
a growing interest in pay-as-you-go data integration, where
an initial integration is generated automatically, the quality
of which is improved incrementally over time [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
incremental improvement can take many forms, but is often
informed by feedback on the current integration [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Crowdsourcing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] has recently emerged as a way of
tapping into human expertise through the web, and systems
such as Amazon Mechanical Turk<sup>1</sup> (AMT) and CrowdFlower<sup>2</sup>
provide systematic mechanisms for recruiting and paying
workers for carrying out specific tasks. This paper explores
the hypothesis that crowdsourcing can provide cost-effective
feedback of a form that can support data integration. The
paper contributes an experiment design that tests the
hypothesis, and an analysis of the results of the experiment.
Specifically, given automatically generated mappings, we use
the crowd to provide feedback on the correctness of the
results produced by those mappings. Such feedback has been
shown to be useful by several authors. For example,
Belhajjame et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] showed how such feedback could be used to
select between and inform the generation of new mappings;
and Talukdar et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] used such feedback to identify
effective ways of answering keyword queries over structured
sources.
      </p>
      <p>The remainder of this paper is structured as follows.
Section 2 describes related work on data integration and
crowdsourcing. Section 3 describes the data integration context
for the experiment. Section 4 presents the design of the
experiment including the role of redundancy in validating
results. Section 5 presents and analyses the results of the
experiment. Section 6 draws some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>This section reviews related work, focusing on results in
data integration and in crowdsourcing for data management.</p>
      <p>
        In terms of data integration, our research builds on the
work of Belhajjame et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], who use feedback on mapping
results to annotate mappings with estimates of their
precision and recall. More specifically, feedback takes the form
of true positive, false positive and false negative annotations
on tuples returned by mappings, and such feedback allows
estimates for precision and recall to be obtained; the more
feedback, the more accurate the estimates are likely to be.
The estimates of precision and recall are then used to
support the selection of mappings for answering a query that
meet specific user requirements (e.g. by selecting mappings
in a way that maximises recall for a precision above some
threshold), and the generation of new mappings whose
precision and recall can be estimated in the light of the feedback.
Belhajjame et al. evaluate the techniques using
synthetically generated feedback; this paper explores the collection
of such feedback using crowdsourcing.
      </p>
      <p>Copyright © 2013 for the individual papers by the papers' authors. Copying
permitted for private and academic purposes. This volume is published and
copyrighted by its editors.</p>
      <p><sup>1</sup> http://mturk.amazon.com/</p>
      <p><sup>2</sup> http://crowdflower.com/</p>
      <sec id="sec-2-1">
        <p>
          Our work is one of a growing collection of contributions
in crowdsourcing, for which a survey has been carried out
by Doan et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This survey presents a classification of
crowdsourcing systems, under which the application developed
in our work would be classified as a standalone
application with explicit collaboration of users. In terms of
crowdsourcing for data management, there is a range of
other approaches that share this classification. Several
proposals have been made in which crowdsourcing plays a role
in query answering, including CrowdDB [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and CrSS [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ];
such systems extend standard query evaluation over static
data sources with techniques for consulting the crowd for
information that is not available through other means. In
relation to data integration, McCann et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] propose using
online communities to support the matching of attributes
from different sources; such work complements our results,
as matches are often used as a foundation for the
construction of mappings. At a later stage in the data integration
pipeline, CrowdER [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] carries out entity resolution with a
technique that combines machine and human work; as in
this paper, data is first processed by automatic techniques,
the results of which are then verified using the crowd. Our
results complement these recent contributions by evaluating
the use of the crowd to obtain an additional type of feedback,
and by including comparative evaluations of different
techniques for ascertaining the reliability of the feedback from
the crowd.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. DATA INTEGRATION CONTEXT</title>
      <p>This paper tests the hypothesis that feedback from the
crowd can inform the annotation of mappings with
information about their quality, where the feedback takes the form
of true positive or false positive feedback on tuples produced
by the mappings. This section describes the data
integration context for the experiment, including the data that is
to be integrated, mapping generation, and the sampling of
data on which feedback is to be obtained.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Experimental data</title>
      <p>As music is a well-known domain, we use two music
databases as data sources: Musicbrainz<sup>3</sup> and Discogs<sup>4</sup>. The schemas
of Musicbrainz and Discogs contain a range of information
about artists and their recordings. In our experiment we
focus on the artist entity of each database and on the attributes
name, real name, gender, country, type and begin date year,
which essentially provide simplified views of the artist
information from the sources. Given this focus, the tables
musicbrainz_artist (name, gender, country, type, begin_date_year)
and discogs_artist (name, realname) were created to form the
source schema in the experiment.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Generation of schema mappings</title>
      <p>
        We used Spicy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to automatically generate the mappings on
which feedback is obtained. Spicy is a schema mapping tool
that generates candidate schema mappings as SQL views
that can be used to map data from a source schema into a
global schema.
      </p>
      <p><sup>3</sup> Musicbrainz - http://www.musicbrainz.org/</p>
      <p><sup>4</sup> Discogs - http://www.discogs.com/</p>
      <sec id="sec-5-1">
        <p>
          Spicy requires as input one source schema
and one global schema. In the source schema, a foreign key
was defined between the attributes artist_discogs.name and
artist_musicbrainz.name. The global schema consists of a
table artist that contains the union of the attributes from
the tables artist_musicbrainz and artist_discogs. Figure 1
presents the architecture of Spicy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>Spicy requires sample instance values in the source schemas
to generate candidate schema mappings. With a large
number of tuples (more than 250), Spicy generates only one
mapping, which is insufficient for an experiment that compares
alternative mappings. Because ample information about the
sources leads Spicy to generate fewer alternative mappings,
we provided the tool with fewer (viz., 200) source tuples.
This caused Spicy to generate several alternative mappings
on which we were then able to obtain feedback. In addition,
the data was constrained to artists that started
to play in 1980 or later and that are from the United States<sup>5</sup>.
This process generated ten candidate mappings. We then
selected the attributes name, type and country
on which to obtain feedback. These attributes are common
to most of the artists, whereas some attributes are relevant
only for certain artists. For example, the attribute gender
is relevant only to person artists and not to group artists.
The three mappings that met the criteria to be used in the
experiment are presented in Table 1.</p>
        <p><sup>5</sup> This action was in response to a study of the demographics
of the workers that participate in Amazon Mechanical Turk.
Most workers are from the US and in an age bracket that
suggests that their knowledge will be sharper for post-1980
artists and groups.</p>
        <p>The mappings produced by Spicy are SQL statements
inferred from the source schemas and a sample of source
tuples. When the Spicy-inferred mappings were run against
the complete tuple content of the Musicbrainz and Discogs
databases, we obtained the results that populate the global
schema characterized by those mappings. Each mapping,
as a SQL query against the global schema, then produced
4203 rows. However, our experiment requires more
than three mappings, and we want some
mappings whose precision is likely to lie between 0 and
1 (the mappings of Table 1 have a precision of either 0 or 1);
for this reason, we guided the generation of additional
mappings, using M1 as the starting point. First, we made a copy of
the results from M1, which has a precision of 1. We then
changed the country attribute of a percentage of the tuples
in the results to a different value from the set {Canada,
Australia, New Zealand, France, United Kingdom}. By this
means, four more mappings were created, with a different
percentage of tuples modified in each:
20% for Mapping 4 (M4), 40% for Mapping 5 (M5), 60%
for Mapping 6 (M6) and 80% for Mapping 7 (M7). In total,
then, we have seven mappings: three generated directly
using Spicy, and four that are variants of one of the
Spicy mappings.</p>
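        <p>The derivation of M4 to M7 from the results of M1 can be illustrated with the following minimal sketch. It is not the procedure actually used in the experiment: the function names, the row layout, the placeholder artist names and the fixed random seeds are all assumptions made for illustration. Only the set of replacement countries, the 4203 rows and the modification percentages come from the text.</p>
        <preformat><![CDATA[
```python
import random

# Replacement values for the country attribute, as listed in the text.
OTHER_COUNTRIES = ["Canada", "Australia", "New Zealand", "France", "United Kingdom"]


def derive_mapping_result(rows, fraction, seed=0):
    """Copy `rows` and give a wrong country to `fraction` of the tuples."""
    rng = random.Random(seed)
    derived = [dict(row) for row in rows]
    n_modified = round(fraction * len(derived))
    for index in rng.sample(range(len(derived)), n_modified):
        derived[index]["country"] = rng.choice(OTHER_COUNTRIES)
    return derived


def precision_against(reference, candidate):
    """Fraction of candidate tuples identical to the reference tuples."""
    correct = sum(1 for ref, cand in zip(reference, candidate) if ref == cand)
    return correct / len(candidate)


# Hypothetical stand-in for the 4203 rows returned by M1 (precision 1).
m1_rows = [
    {"name": f"artist {i}", "type": "Person", "country": "United States"}
    for i in range(4203)
]

fractions = {"M4": 0.2, "M5": 0.4, "M6": 0.6, "M7": 0.8}
variants = {
    name: derive_mapping_result(m1_rows, fraction, seed=k)
    for k, (name, fraction) in enumerate(fractions.items())
}
gold_standard = {name: precision_against(m1_rows, rows) for name, rows in variants.items()}
```
]]></preformat>
        <p>Because every replacement country differs from the original value, the gold standard precision of each variant in this sketch is one minus the fraction of modified tuples, up to rounding of the number of modified tuples.</p>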
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.3 Sampling mapping results</title>
      <p>
        Having defined the mappings, it was necessary to decide
on how many tuples feedback should be obtained, given that
it would be too expensive to obtain feedback from the crowd
on all the tuples produced by the mappings. For this
purpose, we used a statistical method, simple random sampling
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to determine the sample size for populations with
variables that can take only two values (i.e. Correct or
Incorrect). The method to determine the sample size considers a
confidence level and a standard error, which are commonly
used in social sciences. For our study we computed the
sample size for a confidence of 95% that the mean (i.e. the
percentage of values that are annotated as Correct or
Incorrect) would be within 5% of the correct mean. For these
requirements, the resulting sample size for a population of
4203 is 352 tuples.
      </p>
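      <p>The text does not give the formula behind this figure. As a hedged illustration, the usual sample-size formula for a proportion with a finite population correction reproduces it, assuming z = 1.96 for 95% confidence, the worst-case proportion of 0.5, and rounding to the nearest integer. The function name and parameters below are ours.</p>
      <preformat><![CDATA[
```python
def sample_size(population, z=1.96, margin=0.05, proportion=0.5):
    """Sample size for estimating a proportion in a finite population."""
    # Size needed for an effectively infinite population.
    unbounded = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    # The finite population correction shrinks the sample for small populations.
    corrected = unbounded / (1 + (unbounded - 1) / population)
    return round(corrected)


print(sample_size(4203))  # 352
```
]]></preformat>
      <p>Rounding up instead of rounding to the nearest integer would give 353, so the rounding convention is itself an assumption.</p>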
    </sec>
    <sec id="sec-8">
      <title>4. HUMAN INTELLIGENCE TASK GENERATION</title>
      <p>Having identified the tuples on which feedback is to be
obtained, the information is now in place to enable the design
of the tasks to be completed by the crowd. For the
experiment, we used the AMT crowdsourcing platform, within
which user activities are referred to as Human Intelligence
Tasks (HITs). This section describes how tuples are
allocated to HITs, including redundancy and screen design.</p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Distribution of result tuples</title>
      <p>The seven samples, one of 352 tuples for each
mapping, are distributed into groups of 25 unique tuples that
feature in questionnaires, such that each questionnaire
is a HIT, and each result tuple is the subject of one question.
The distribution ensures that no HIT contains
more than one tuple with information about the same
artist, and controls the number of tuples in a HIT that are produced by
the same mapping. The number of resulting
HITs is presented in Table 2.</p>
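      <p>The allocation algorithm is not specified in detail. One simple way to satisfy the artist constraint is a greedy packing, sketched below under that assumption. The function name and the (mapping, artist) representation of a result tuple are hypothetical, and the control over the number of tuples per mapping in a HIT is deliberately left out because the text does not say how it was applied.</p>
      <preformat><![CDATA[
```python
def allocate_to_hits(result_tuples, hit_size=25):
    """Greedily pack (mapping, artist) tuples into HITs of at most `hit_size`
    tuples, never placing two tuples about the same artist in one HIT."""
    hits = []  # each entry: {"tuples": [...], "artists": set()}
    for mapping, artist in result_tuples:
        for hit in hits:
            if len(hit["tuples"]) < hit_size and artist not in hit["artists"]:
                break  # first HIT with room and no tuple about this artist
        else:
            hit = {"tuples": [], "artists": set()}
            hits.append(hit)
        hit["tuples"].append((mapping, artist))
        hit["artists"].add(artist)
    return [hit["tuples"] for hit in hits]


# Toy input: three mappings that each return a tuple for the same 30 artists.
sample = [(f"M{m}", f"artist {a}") for m in range(1, 4) for a in range(30)]
hits = allocate_to_hits(sample)
```
]]></preformat>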
    </sec>
    <sec id="sec-10">
      <title>4.2 Reliability</title>
      <p>In our experiment we obtain feedback from humans, who,
of course, may fail to provide reliable answers (e.g. answers
that contradict those of other users, or even self-contradictory
ones). The goal for the investigation is to minimise the risk
that unreliable data is obtained, or to manage the
unreliability when it is encountered.</p>
      <p>
        The reliability of a single observer can be estimated by way of
Intra Observer Reliability (IaOR) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To estimate IaOR, some of the questions are asked more
than once: each respondent (a worker in
AMT) answered two HITs at least two hours apart, to reduce
the risk that the worker remembered their first answers and
answered from memory, which is called the practice effect
in social sciences [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Only a subset of questions in each HIT is redundant. The
HITs are organised in pairs such that each HIT contains
three random questions from the other HIT in its pair. In
Figure 2(a), each arrow represents three questions.
Therefore, in each pair there are 6 redundant questions used to
assess IaOR. After introducing redundancy for IaOR, each
HIT contains 28 questions. In Figure 2(a), HIT1 and HIT4
form one pair, which is answered by Worker 1. The
reliability is then estimated as the percentage of agreement on
the 6 redundant questions: the worker obtains an IaOR of
16.7% for answering 1 question consistently, 33.3% for
2 questions, and so on.</p>
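      <p>The IaOR calculation can be sketched as a percentage of agreement between the two occasions on which a worker answered each redundant question. The function name, the question identifiers and the dictionary layout are illustrative assumptions, not the data model of the experiment.</p>
      <preformat><![CDATA[
```python
def intra_observer_reliability(first_answers, second_answers):
    """Percentage of redundant questions that one worker answered
    identically on both occasions (question id -> answer)."""
    shared = first_answers.keys() & second_answers.keys()
    if not shared:
        raise ValueError("no redundant questions to compare")
    consistent = sum(1 for q in shared if first_answers[q] == second_answers[q])
    return 100.0 * consistent / len(shared)


# One worker, six redundant questions, one inconsistent answer (q5).
first = {"q1": "Correct", "q2": "Incorrect", "q3": "Correct",
         "q4": "Correct", "q5": "Incorrect", "q6": "Correct"}
second = dict(first, q5="Correct")
iaor = intra_observer_reliability(first, second)  # 5 of 6 consistent
```
]]></preformat>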
      <p>
        However, there exists the possibility that the answers of a
worker are not correct; it is possible to provide (consistently)
wrong answers, which would give rise to a high estimate for
IaOR. To avoid this situation, we can compare answers
between workers, by way of Inter Observer Reliability (IrOR).
We assume that respondents are reliable if they provide the
same answers [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ]. To introduce redundancy for IrOR,
first we group the HITs in groups of three. Then, inside
each group, we choose at random two questions from each
HIT to occur in another HIT too, thereby introducing the
redundancy required to obtain evidence of IrOR. The
selected questions are different from the questions selected for
IaOR. Figure 2(b) shows how redundancy was introduced in
order to estimate IrOR; each arrow represents 2 questions.
In each group there are 6 redundant questions for IrOR.
      </p>
      <p>After introducing redundancy for IaOR and IrOR, each
HIT contains 32 questions (25 unique + 3 for IaOR + 4 for
IrOR).</p>
      <p>In each group of three HITs for IrOR, we estimate the
percentage of agreement of each pair in the group.
Therefore, we obtain three evaluations, and each worker receives
two IrOR evaluations in the group. Then, when reporting
results that take into account IrOR in the experiments, we
apply an IrOR threshold, and discard the HITs from workers
that obtained IrOR evaluations below the threshold.</p>
      <p>As an example, consider the group of HITs HIT1, HIT2
and HIT3, answered by Worker1, Worker2 and Worker3,
respectively. Worker1 and Worker2 agree on 4
questions and obtain 66.7%; Worker1 and Worker3 agree on 4
questions and obtain 66.7%; finally, Worker2 and Worker3
agree on 6 questions and obtain 100%. If we set an
inter observer reliability threshold of 100%, in this example
we would ignore the answers in this group of HITs from
Worker1, both of whose reliability evaluations are below the
threshold. However, Worker2 and Worker3 are considered
reliable because each has only one evaluation of reliability
below the threshold, which is not enough to determine that
they are unreliable (we assume that it was Worker1 who
provided incorrect answers).</p>
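      <p>The example can be reproduced with the following sketch. It assumes, as the example suggests, that a worker is discarded from a group only when all of that worker's pairwise evaluations fall below the threshold. The function names and the answer values are illustrative; only the agreement counts (4, 4 and 6 out of 6) come from the text.</p>
      <preformat><![CDATA[
```python
from itertools import combinations


def pairwise_agreement(answers):
    """answers: worker -> {question id: answer}. Returns the percentage of
    agreement on shared questions for every pair of workers."""
    scores = {}
    for a, b in combinations(sorted(answers), 2):
        shared = answers[a].keys() & answers[b].keys()
        if not shared:
            continue  # nothing to compare for this pair
        agreed = sum(1 for q in shared if answers[a][q] == answers[b][q])
        scores[(a, b)] = 100.0 * agreed / len(shared)
    return scores


def unreliable_workers(scores, threshold=100.0):
    """Workers whose evaluations in the group are all below the threshold."""
    per_worker = {}
    for (a, b), score in scores.items():
        per_worker.setdefault(a, []).append(score)
        per_worker.setdefault(b, []).append(score)
    return {w for w, values in per_worker.items()
            if all(value < threshold for value in values)}


questions = ["q1", "q2", "q3", "q4", "q5", "q6"]
group = {
    "Worker1": dict(zip(questions, ["Correct"] * 4 + ["Incorrect"] * 2)),
    "Worker2": dict(zip(questions, ["Correct"] * 6)),
    "Worker3": dict(zip(questions, ["Correct"] * 6)),
}
scores = pairwise_agreement(group)
discarded = unreliable_workers(scores, threshold=100.0)
```
]]></preformat>
      <p>With these answers, the two pairs that involve Worker1 agree on 4 of 6 questions and the remaining pair agrees on all 6, so only Worker1 is discarded at a threshold of 100%.</p>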
      <p>Note that every worker answered two HITs in the
experiment. Therefore, each worker is evaluated twice for inter observer
reliability, but in different groups. For example, Worker1
answered HIT1 and HIT4, which are in different groups
for IrOR, as illustrated in Figure 2(b).</p>
    </sec>
    <sec id="sec-11">
      <title>4.3 HIT Design</title>
      <p>
        An example HIT is presented in Figure 3. The possible
answers to a question are Correct or Incorrect. All the
questions in the survey have the same structure. To set the
questionnaire length, we carried out pilot tests. We estimated
that 32 questions would take users less than 20 minutes to
answer. We paid $1.00 (one US dollar) per HIT, which is a
higher-than-average payment per completed task, with the
goal of making the HIT attractive to workers [
        <xref ref-type="bibr" rid="ref12 ref9">12, 9</xref>
        ]. The
workers could find the HITs by browsing the AMT task list
or by searching for the keywords music, survey or artists. The
HITs in the experiment were available to workers located in
the US. We accepted workers who had previously completed
at least 1000 HITs successfully and who had successfully finished
at least 95% of all the HITs that they had ever responded to
(approval rate). Thus we have been quite selective in terms
of the experience and ratings of participants.
      </p>
    </sec>
    <sec id="sec-12">
      <title>4.4 Experiment Setup on AMT</title>
      <p>A crowdsourcing application was developed to control the
publishing of HITs in AMT. The application consists of:
a controller that configures and posts HITs in the AMT
platform; a database that stores the result tuples that are
assigned to the HITs, the AMT IDs of the workers, and data
used to control which HITs are assigned to which workers;
and an Apache Tomcat web server that hosts JavaServer
Pages (JSP) forms for the HITs. The JSP forms use the
AMT ID of the worker requesting to view a HIT to retrieve
the tuples that are chosen to be part of the HIT assigned to
that worker.</p>
      <p>
        The crowdsourcing application went live, and 90 HITs
were completed in 40 days. This is a substantial elapsed time
compared with that reported by other AMT users (e.g. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
We suspect that this can be explained by the complex and
somewhat unconventional pairing of HITs to enable IaOR,
the results of which are discussed further below. On
average, a HIT took 12 minutes 40 seconds to answer. New HITs were
made available every week, or when previous HITs were
finished, so that they would appear in the first pages of the AMT task
lists. This is a common practice followed by other AMT
requesters [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-13">
      <title>5. EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-14">
      <title>5.1 Precision and Error in Precision</title>
      <p>
        The candidate mappings used in our experiment are
annotated to indicate if they meet the requirements of users. For
this purpose, we annotate the mappings with values of
precision and the error in precision as in Belhajjame et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Precision is the fraction of retrieved tuples that are
indicated to be correct by the user [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We can estimate the
precision of a mapping j after i feedback instances with the
formula below.
      </p>
      <disp-formula><label>(1)</label><tex-math>\mathrm{Precision}_{ij} = \frac{\mathrm{true\ positives}_{ij}}{\mathrm{true\ positives}_{ij} + \mathrm{false\ positives}_{ij}}</tex-math></disp-formula>
      <p>where, for mapping j, true positives<sub>ij</sub> (false positives<sub>ij</sub>) is the
number of correct (incorrect) tuples retrieved after i
feedback instances. The calculation of precision is incrementally
updated as the user provides feedback; in the experiment,
i changes from 0 to 352, which is the number of tuples of
the mapping evaluated by the workers. We are interested in
measuring how user feedback can contribute to the
evaluation of the mappings. For this reason, we compare the
estimated precision for a mapping j with i feedback instances
to a known precision value, which is a gold standard
precision (GSP) for the mapping j. The error in precision can
be used for this purpose.</p>
      <disp-formula><label>(2)</label><tex-math>\mathrm{Error\ in\ Precision}_{ij} = \left| \mathrm{GSP}_{j} - \mathrm{Precision}_{ij} \right|</tex-math></disp-formula>
      <p>
        where GSP<sub>j</sub> is the gold standard precision of the mapping
j. As in Belhajjame et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we use the average error in
precision (AEP) to measure the quality of an annotation,
i.e., the difference between the estimated precision and the
GSP.
      </p>
      <disp-formula><label>(3)</label><tex-math>\mathrm{Average\ Error\ in\ Precision}_{i} = \frac{\sum_{j=1}^{K} \mathrm{Error\ in\ Precision}_{ij}}{K}</tex-math></disp-formula>
      <p>where K is the total number of mappings. The AEP is,
likewise, incrementally updated as feedback from users arrives.</p>
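      <p>Formulas 1 to 3 can be computed incrementally as sketched below. The function names and the toy feedback sequences are ours, and the sketch starts at the first feedback instance because precision is undefined before any feedback has been received.</p>
      <preformat><![CDATA[
```python
def error_in_precision(gsp, feedback):
    """Formulas 1 and 2: |GSP - Precision| after each feedback instance.
    `feedback` is a sequence of booleans (True = tuple marked Correct)."""
    true_positives = false_positives = 0
    errors = []
    for is_correct in feedback:
        if is_correct:
            true_positives += 1
        else:
            false_positives += 1
        precision = true_positives / (true_positives + false_positives)
        errors.append(abs(gsp - precision))
    return errors


def average_error_in_precision(gsps, feedback_by_mapping):
    """Formula 3: mean error over the K mappings after each feedback instance."""
    per_mapping = [error_in_precision(gsps[m], feedback_by_mapping[m]) for m in gsps]
    return [sum(column) / len(column) for column in zip(*per_mapping)]


# Toy data: two mappings with five feedback instances each.
gsps = {"M1": 1.0, "M4": 0.8}
feedback_by_mapping = {
    "M1": [True, True, True, True, True],
    "M4": [True, False, True, True, True],
}
aep = average_error_in_precision(gsps, feedback_by_mapping)
```
]]></preformat>
      <p>In this toy run the AEP starts at 0.1, rises to 0.15 after the first Incorrect annotation on M4, and falls to 0 once the estimated precision of M4 reaches 0.8.</p>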
    </sec>
    <sec id="sec-15">
      <title>5.2 Annotation Quality</title>
      <p>Using the definitions from Section 5.1, the precision of
the mappings was estimated based on the feedback obtained
from the crowd. To understand how effective the feedback
from the crowd has been at estimating the precision,
Figure 4 shows the error in the estimated precision of each of
the mappings as the amount of feedback obtained increases.
The following can be observed: (i) The error in precision
drops rapidly as feedback is collected, such that most
mappings have an error of less than 0.1 from around 50 feedback
instances. (ii) The errors obtained for mappings M1, M2
and M3, where the gold standard precision is 0 or 1, are in
the same range as those obtained for M4, M5, M6 and M7,
where the gold standard precision is between 0 and 1. We
had expected larger errors in precision for M4 to M7 because
these mappings seem to have less obvious errors than those
in M2 and M3. In M2 and M3, the error in the mapping
is that values are presented in the wrong columns, whereas
in M4 to M7 incorrect but plausible values are provided
for an attribute. Nevertheless, the users were able to
identify errors in country values with similar reliability to column
transposition. (iii) The error in precision for some mappings
increases towards the end of feedback collection; this is most
likely explained by the effectively random order in which
different users provide feedback, with several fairly unreliable
users participating late in the experiment.</p>
      <p>Abstracting over the plots for the different mappings,
Figure 5 shows the AEP from Formula 3 as feedback is
collected. We observe that when fewer than 70 feedback
instances per mapping have been obtained, the plot is quite
unstable, but that thereafter, errors are both quite small
and quite stable. This suggests that reasonably reliable
estimates of mapping quality can be obtained with quite small
amounts of feedback, and thus at a modest financial cost.</p>
    </sec>
    <sec id="sec-16">
      <title>5.3 Feedback Reliability</title>
      <p>Applying the reliability techniques from Section 4.2 to the
feedback analysed in Section 5.2, with a reliability threshold of 100%,
44 out of 51 workers are reliable according to intra observer reliability,
35 out of 51 according to inter observer
reliability, and 28 out of 51 according to both.</p>
      <p>Some users were found to be reliable only for IaOR, some
users were found to be reliable only for IrOR, and some
users were found to be reliable for both IaOR and IrOR.
Figure 6 shows the distribution of the workers that were
found reliable against each of the reliability methods.</p>
      <p>We estimate the AEP with the feedback obtained filtered
to remove the users who are considered to be unreliable by
the different techniques. After filtering the feedback, we
report the AEP for the results filtered with IaOR, the results
filtered with IrOR, and the results filtered with IaOR and
IrOR in Figure 7. The results reflect the order in which
the workers provided the feedback. Note that the Feedback
Amount in the horizontal axis is the total amount of
feedback obtained, and that the different reliability schemes all
discard some of that feedback. The following can be
observed: (i) there is significant variation in the error during
the early parts of feedback collection, but this stabilises quite
rapidly to a low error as the feedback is increased; and (ii)
the different reliability schemes provide better results for
different amounts of feedback, reflecting the impact on the
conclusions that can be drawn of the order in which users
provide feedback. To remove this effect, we repeatedly
randomised the order in which the feedback was
obtained from the users, until the addition
of further random orderings made no difference to the plot.
The resulting plot is provided in Figure 8. This plot shows
that while the combined filtering eventually yields the
greatest reduction in error, inter observer reliability is almost as
effective, and is more effective than intra observer reliability.
This is an important observation, because inter observer
reliability is much easier to implement, as it does not involve
users carrying out repeated tasks at different times.</p>
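      <p>The repeated random reordering can be sketched as follows for a single mapping. This is an illustration rather than the procedure used to produce Figure 8: the stopping rule (a maximum change in the averaged curve below an assumed tolerance between batches of orderings), the batch size, the cap on orderings and the function names are all assumptions.</p>
      <preformat><![CDATA[
```python
import random


def error_curve(gsp, feedback):
    """Error in precision after each feedback instance, in the given order."""
    true_positives = 0
    curve = []
    for count, is_correct in enumerate(feedback, start=1):
        true_positives += 1 if is_correct else 0
        curve.append(abs(gsp - true_positives / count))
    return curve


def order_independent_curve(gsp, feedback, tolerance=0.005, batch=50,
                            max_orderings=5000, seed=0):
    """Average the error curve over random orderings of a non-empty feedback
    list until a further batch of orderings changes it by under `tolerance`."""
    rng = random.Random(seed)
    order = list(feedback)
    totals = [0.0] * len(order)
    orderings = 0
    previous = None
    while orderings < max_orderings:
        for _ in range(batch):
            rng.shuffle(order)
            for position, error in enumerate(error_curve(gsp, order)):
                totals[position] += error
            orderings += 1
        current = [total / orderings for total in totals]
        if previous is not None:
            change = max(abs(c - p) for c, p in zip(current, previous))
            if change < tolerance:
                break
        previous = current
    return current


# Toy feedback for a mapping with a gold standard precision of 0.8.
toy_feedback = [True] * 80 + [False] * 20
curve = order_independent_curve(0.8, toy_feedback)
```
]]></preformat>
      <p>The last point of the averaged curve does not depend on the ordering, since it uses all the feedback; the early points are the ones that the averaging smooths.</p>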
    </sec>
    <sec id="sec-17">
      <title>6. CONCLUSIONS</title>
      <p>
        This paper has studied the use of crowdsourcing to collect
feedback on the correctness of query/mapping results; such
feedback has been shown to be useful for different data
integration tasks, including keyword query evaluation and
mapping refinement [
        <xref ref-type="bibr" rid="ref1 ref15">1, 15</xref>
        ]. The following contributions have
been made:
        <list list-type="bullet">
          <list-item><p>An experiment has been designed that collects true
positive and false positive annotations for query results
using the crowd, including techniques for estimating
sample sizes and for integrating reliability tests.</p></list-item>
          <list-item><p>The results of the experiment show that precision
estimates derived from crowd feedback improve rapidly
as feedback is accumulated, suggesting that the crowd
can be used as a cost-effective way of selecting between
collections of automatically generated mappings. This
confirms the experimental result obtained with
synthetic feedback reported by Belhajjame et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].</p></list-item>
          <list-item><p>The experiment design included both inter- and
intra-observer reliability. Although simpler for both
experimenters and users, inter-observer reliability turned out
to be more effective than intra-observer reliability.</p></list-item>
        </list>
      </p>
      <p>Acknowledgment. Fernando Osorno-Gutierrez is supported
by a grant from the Mexican National Council for Science
and Technology (CONACYT).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Embury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Hedeler</surname>
          </string-name>
          .
          <article-title>Feedback-based annotation, selection and refinement of schema mappings for dataspaces</article-title>
          .
          In
          <source>EDBT</source>
          , pages
          <fpage>573</fpage>
          -
          <lpage>584</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Raunich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Summa</surname>
          </string-name>
          .
          <article-title>The Spicy system: towards a notion of mapping quality</article-title>
          .
          In
          <source>Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08</source>
          , pages
          <fpage>1289</fpage>
          -
          <lpage>1294</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>de Vaus</surname>
          </string-name>
          . Surveys In Social Research (Social Research Today).
          <source>Routledge</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Crowdsourcing systems on the World-Wide Web</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>54</volume>
          (
          <issue>4</issue>
          ):
          <fpage>86</fpage>
          -
          <lpage>96</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fink</surname>
          </string-name>
          .
          <article-title>The Survey Handbook. The Survey Kit</article-title>
          .
          <source>SAGE Publications</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          .
          <article-title>From databases to dataspaces: a new abstraction for information management</article-title>
          .
          <source>SIGMOD Rec.</source>
          ,
          <volume>34</volume>
          (
          <issue>4</issue>
          ):
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
          , Dec.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Xin</surname>
          </string-name>
          .
          <article-title>CrowdDB: answering queries with crowdsourcing</article-title>
          .
          In
          <source>Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>72</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hedeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Embury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          .
          <article-title>Dimensions of dataspaces</article-title>
          .
          In
          <source>BNCOD</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>66</lpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          .
          <article-title>Analyzing the Amazon Mechanical Turk marketplace</article-title>
          .
          <source>XRDS</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          , Dec.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          .
          <article-title>How to Measure Survey Reliability and Validity. The Survey Kit</article-title>
          .
          <source>SAGE Publications</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          . Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Mason</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Watts</surname>
          </string-name>
          .
          <article-title>Financial incentives and the “performance of crowds”</article-title>
          .
          <source>SIGKDD Explor. Newsl.</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          , May
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          .
          <article-title>Matching schemas in online communities: A Web 2.0 approach</article-title>
          .
          In
          <source>ICDE 2008</source>
          , pages
          <fpage>110</fpage>
          -
          <lpage>119</lpage>
          , Apr.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          .
          <article-title>Answering queries using humans, algorithms and databases</article-title>
          .
          In
          <source>CIDR 2011</source>
          . Stanford InfoLab, Jan.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Talukdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Mehmood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. G.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          .
          <article-title>Learning to create data-integrating queries</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>785</fpage>
          -
          <lpage>796</lpage>
          , Aug.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>CrowdER: crowdsourcing entity resolution</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1483</fpage>
          -
          <lpage>1494</lpage>
          , July
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>