                    Human Computation Must Be Reproducible

                                                               Praveen Paritosh
                                                                   Google
                                                                345 Spear St,
                                                           San Francisco, CA 94105.
                                                              pkp@google.com


Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. CrowdSearch 2012 workshop at WWW 2012, Lyon, France.

ABSTRACT

Human computation is the technique of performing a computational process by outsourcing some of the difficult-to-automate steps to humans. In the social and behavioral sciences, when using humans as measuring instruments, reproducibility guides the design and evaluation of experiments. We argue that human computation has similar properties, and that the results of human computation must be reproducible, in the least, in order to be informative. We might additionally require the results of human computation to have high validity or high utility, but the results must be reproducible in order to measure the validity or utility to a degree better than chance. Additionally, a focus on reproducibility has implications for the design of tasks and instructions, as well as for the communication of the results. It is humbling how often the initial understanding of the task and guidelines turns out to lack reproducibility. We suggest ensuring, measuring and communicating the reproducibility of human computation tasks.

1. INTRODUCTION

Some examples of tasks using human computation are: labeling images [Nowak and Ruger, 2010], conducting user studies [Kittur, Chi, and Suh, 2008], annotating natural language corpora [Snow, O'Connor, Jurafsky and Ng, 2008], annotating images for computer vision research [Sorokin and Forsyth, 2008], search engine evaluation [Alonso, Rose and Stewart, 2008; Alonso, Kazai and Mizzaro, 2011], content moderation [Ipeirotis, Provost and Wang, 2010], entity reconciliation [Kochhar, Mazzocchi and Paritosh, 2010], and conducting behavioral studies [Suri and Mason, 2010; Horton, Rand and Zeckhauser, 2010].

These tasks involve presenting a question, e.g., "Is this image offensive?", to one or more humans, whose answers are aggregated to produce a resolution, a suggested answer for the original question. The humans might be paid contributors. Examples of paid workforces include Amazon Mechanical Turk [www.mturk.com] and oDesk [www.odesk.com]. An example of a community of volunteer contributors is Foldit [www.fold.it], where human computation augments machine computation for predicting protein structure [Cooper et al., 2010]. Other examples involving volunteer contributors are games with a purpose [von Ahn, 2006] and the upcoming Duolingo [www.duolingo.com], where the contributors translate previously untranslated web corpora while learning a new language.

The results of human computation can be characterized by accuracy [Oleson et al., 2011], information theoretic measures of quality [Ipeirotis, Provost and Wang, 2010], and utility [Dai, Mausam and Weld, 2010], among others. In order for us to have confidence in any such criteria, the results must be reproducible, i.e., not a result of chance agreement or irreproducible human idiosyncrasies, but a reflection of the underlying properties of the questions and task instructions, on which others could agree as well. Reproducibility is the degree to which a process can be replicated by different human contributors working under varying conditions, at different locations, or using different but functionally equivalent measuring instruments. A total lack of reproducibility implies that the given results could have been obtained merely by chance agreement. If the results are not differentiable from chance, there is little information content in them. Using human computation in such a scenario is wasteful of an expensive resource, as chance is cheap to simulate.

A much stronger claim than reproducibility is validity. For a measurement instrument, e.g., a vernier caliper, a standardized test, or a human coder, the reproducibility is the extent to which a measurement gives consistent results, and the validity is the extent to which the tool measures what it claims to measure. In contrast to reproducibility, validity concerns truths. Validity requires comparing the results of the study to evidence obtained independently of that effort. Reproducibility provides assurances that particular research results can be duplicated, that no (or only a negligible amount of) extraneous noise has entered the process and polluted the data or perturbed the research results; validity provides assurances that claims emerging from the research are borne out in fact.

We might want the results of human computation to have high validity, high utility, low cost, among other desirable characteristics. However, the results must be reproducible in order for us to measure the validity or utility to a degree better than chance.
More than a statistic, a focus on reproducibility offers valuable insights regarding the design of the task and the guidelines for the human contributors, as well as the communication of the results. The output of human computation is thus akin to the result of a scientific experiment, and it can only be considered meaningful if it is reproducible — that is, if the same results could be replicated in an independent exercise. This requires clearly communicating the task instructions and the criteria for selecting the human contributors, ensuring that they work independently, and reporting an appropriate measure of reproducibility. Much of this is well established in the methodology of content analysis in the social and behavioral sciences [Armstrong, Gosling, Weinman and Marteau, 1997; Hayes and Krippendorff, 2007], being required of any publishable result involving human contributors. In Sections 2 and 3, we argue that human computation resembles the coding tasks of the behavioral sciences. However, in the human computation and crowdsourcing research community, reproducibility is not commonly reported.

We have collected millions of human judgments regarding entities and facts in Freebase [Kochhar, Mazzocchi and Paritosh, 2010; Paritosh and Taylor, 2012]. We have found reproducibility to be a useful guide for task and guideline design. It is humbling how often the initial understanding of the task and guidelines turns out to lack reproducibility. In Section 4, we describe some of the widely used measures of reproducibility. We suggest ensuring, measuring and communicating the reproducibility of human computation tasks.

In the next section, we describe the sources of variability in human computation, which highlight the role of reproducibility.

2. SOURCES OF VARIABILITY IN HUMAN COMPUTATION

There are many sources of variability in human computation that are not present in machine computation. Given that human computation is used to solve problems that are beyond the reach of machine computation, these problems are, by definition, incompletely specified. Variability arises due to the incomplete specification of the task. This is convolved with the fact that the guidelines are subject to differing interpretations by different human contributors. Some characteristics of human computation tasks are:

• Task guidelines are incomplete: A task can span a wide set of domains, not all of which are anticipated at the beginning of the task. This leads to incompleteness in guidelines, as well as varying levels of performance depending upon the contributor's expertise in that domain. Consider the task of establishing the relevance of an arbitrary search query [Alonso, Kazai and Mizzaro, 2011].

• Task guidelines are not precise: Consider, for example, the task of declaring whether an image is unsuitable for a social network website. Not only is it hard to write down all the factors that go into making an image offensive, it is hard to communicate those factors to human contributors with vastly different predispositions. The guidelines usually rely upon shared common sense knowledge and cultural knowledge.

• Validity data is expensive or unavailable: An oracle that provides the true answer for any given question is usually unavailable. Sometimes, for a small subset of gold questions, we have answers from another independent source. This can be useful in making estimates of validity, subject to the degree that the gold questions are representative of the set of questions. These gold questions can be very useful for training and feedback to the human contributors; however, we have to ensure their representativeness in order to make warranted claims regarding validity.

Each of the above might be true to a different degree for different human computation tasks. These factors are similar to the concerns of behavioral and social scientists in using humans as measuring instruments.

3. CONTENT ANALYSIS AND CODING IN THE BEHAVIORAL SCIENCES

In the social sciences, content analysis is a methodology for studying the content of communication [Berelson, 1952; Krippendorff, 2004]. Coding of subjective information is a significant source of empirical data in the social and behavioral sciences, as it allows techniques of quantitative research to be applied to complex phenomena. These data are typically generated by trained human observers who record or transcribe textual, pictorial or audible matter in terms suitable for analysis. This task is called coding, which involves assigning categorical, ordinal or quantitative responses to units of communication.

An early example of a study based on content analysis is "Do newspapers now give the news?" [Speed, 1893], which tried to show that between 1881 and 1893 New York newspapers dropped coverage of religious, scientific and literary matters in favor of gossip, sports and scandals. Conclusions from such data can only be trusted if the reading of the textual data, as well as of the research results, is replicable elsewhere, i.e., if the coders demonstrably agree on what they are talking about. Hence, the coders need to demonstrate the trustworthiness of their data by measuring its reproducibility. To perform reproducibility tests, additional data are needed, obtained by duplicating the research under various conditions. Reproducibility is established by independent agreement between different but functionally equal measuring devices, for example, by using several coders with diverse personalities. The reproducibility of coding has been used for comparing the consistency of medical diagnosis [e.g., Koran, 1975], for drawing conclusions from meta-analyses of research findings [e.g., Morley et al., 1999], for testing industrial reliability [Meeker and Escobar, 1998], and for establishing the usefulness of a clinical scale [Hughes et al., 1982].

3.1 Relationship between Reproducibility and Validity

• Lack of reproducibility limits the chance of validity: If the coding results are a product of chance, they may well include a valid account of what was observed, but researchers would not be able to identify that account to a degree better than chance. Thus, the more unreliable a procedure, the less likely it is to result in data that lead to valid conclusions.

• Reproducibility does not guarantee validity: Two observers of the same event who hold the same conceptual system, prejudice, or interest may well agree on what they see but still be objectively wrong, based on some external criterion. Thus a reliable process may or may not lead to valid outcomes.

In some cases, validity data might be so hard to obtain that one has to settle for reproducibility. In tasks such as interpretation and transcription of complex textual matter, suitable accuracy standards are not easy to find. Because interpretations can only be compared to interpretations, attempts to measure validity presuppose the privileging of some interpretations over others, and this puts any claims regarding validity on epistemologically shaky grounds. In some tasks, like psychiatric diagnosis, even reproducibility is hard to attain for some questions. Aboraya et al. [2006] review the reproducibility of psychiatric diagnosis. Lack of reproducibility has been reported for judgments of schizophrenia and affective disorder [Goodman et al., 1984], calling such diagnoses into question.
3.2 Relationship with Chance Agreement

In this section, we look at some properties of chance agreement and its relationship to reproducibility. Given two coding schemes for the same phenomenon, the one with fewer categories will have higher chance agreement. For example, in reCAPTCHA [von Ahn et al., 2008], two independent humans are shown an image containing text and asked to transcribe it. Assuming that there is no collusion, the chance agreement, i.e., the probability of two different humans typing in the same word or phrase by chance, is very small. However, in a task in which there are only two possible answers, e.g., true and false, the probability of chance agreement between two answers is 0.5.

If a disproportionate amount of data falls under one category, then the expected chance agreement is very high, so in order to demonstrate high reproducibility, even higher observed agreement is required [Feinstein and Cicchetti 1990; Di Eugenio and Glass 2004].

Consider a task of rating a proposition as true or false. Let p be the probability of the proposition being true. An implementation of chance agreement is the following: toss a biased coin with the same odds, i.e., p is the probability that it turns up heads, and declare the proposition to be true when the coin lands heads. Now, we can simulate n judgments by tossing the coin n times. Let us look at the properties of unanimous agreement between two judgments. The likelihood of a chance agreement on true is p². Independent of this agreement, the probability of the proposition being true is p; therefore, the accuracy of this chance agreement on true is p³. By design, these judgments do not contain any information other than the a priori distribution across answers. Such data has close to zero reproducibility; however, it can sometimes show up in surprising ways when looked at through the lens of accuracy.
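To make the coin-tossing argument concrete, the following minimal Python sketch (ours, added here for illustration; the function name, default number of trials and seed are arbitrary) simulates two contributors who answer by tossing the biased coin described above. The observed rate of unanimous agreement on true approaches p², and the rate at which they agree on true and the proposition is actually true approaches p³.

```python
import random

def simulate_chance_coding(p, n=1_000_000, seed=0):
    """Two contributors answer true/false by tossing a biased coin with
    P(true) = p, ignoring the proposition, which is itself true with
    probability p.  Returns the observed rates of (i) unanimous agreement
    on true and (ii) unanimous agreement on true when the proposition is
    actually true; these approach p**2 and p**3 respectively."""
    rng = random.Random(seed)
    agree_true = agree_true_and_correct = 0
    for _ in range(n):
        truth = rng.random() < p
        c1 = rng.random() < p   # contributor 1 tosses the biased coin
        c2 = rng.random() < p   # contributor 2 tosses it independently
        if c1 and c2:
            agree_true += 1
            if truth:
                agree_true_and_correct += 1
    return agree_true / n, agree_true_and_correct / n

print(simulate_chance_coding(0.7))   # roughly (0.49, 0.343)
```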
For example, consider the task of an airport agent declaring a bag as safe or unsafe for boarding the plane. A bag can be unsafe if it contains toxic or explosive materials that could threaten the safety of the flight. Most bags are safe. Let us say that one in a thousand bags is potentially unsafe. Random coding would allow two agents to jointly assign "safe" 99.8% of the time, and since 99.9% of the bags are safe, this agreement would be accurate 99.7% of the time! This leads to the surprising result that when data are highly skewed, the coders may agree on a high proportion of items while producing annotations that are accurate, but of low reproducibility. When one category is very common, high accuracy and high agreement can also result from indiscriminate coding. The test for reproducibility in such cases is the ability to agree on the rare categories. In the airport bag classification problem, while chance accuracy on safe bags is high, chance accuracy on unsafe bags is extremely low, 10⁻⁷%. In practice, the costs of errors vary: mistakenly classifying a safe bag as unsafe causes far less damage than classifying an unsafe bag as safe.

In this case, it is dangerous to consider an averaged accuracy score, as different errors do not count equally: a chance process that does not add any information has an average accuracy higher than 99.7%, most of which reflects the original bias in the distribution of safe and unsafe bags. A misguided interpretation of accuracy, or a poor estimate of accuracy, can be less informative than reproducibility.
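The figures in the airport example can be checked directly; the snippet below (illustrative, not from the original text) plugs the one-in-a-thousand base rate into the same chance-coding model.

```python
p_safe = 0.999                               # 999 out of 1000 bags are safe

agree_safe = p_safe ** 2                     # ~0.998: both agents say "safe" by chance
accurate_agreement = p_safe ** 3             # ~0.997: ...and the bag really is safe
chance_accuracy_unsafe = (1 - p_safe) ** 3   # 1e-9, i.e. 1e-7 percent

print(f"{agree_safe:.3f} {accurate_agreement:.3f} {chance_accuracy_unsafe:.1e}")
```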
3.3 Reproducibility and Experiment Design

The focus on reproducibility has implications for the design of task and instruction materials. Krippendorff [2004] argues that any study using observed agreement as a measure of reproducibility must satisfy the following requirements:

• It must employ exhaustively formulated, clear, and usable guidelines;

• It must use clearly specified criteria concerning the choice of contributors, so that others may use such criteria to reproduce the data;

• It must ensure that the contributors who generate the data used to measure reproducibility work independently of each other. Only if such independence is assured can covert consensus be ruled out and the observed agreement be explained in terms of the given guidelines and the task.

The last point cannot be stressed enough. There are potential benefits from multiple contributors collaborating, but data generated in this manner neither ensure reproducibility nor reveal its extent. In groups like these, humans are known to negotiate and to yield to each other in quid pro quo exchanges, with prestigious group members dominating the outcome [see, for example, Esser, 1998]. This makes the results of collaborative human computation a reflection of the social structure of the group, which is nearly impossible to communicate to other researchers and replicate. The data generated by collaborative work are akin to data generated by a single observer, while reproducibility requires at least two independent observers. To substantiate the contention that collaborative coding is superior to coding by separate individuals, a researcher would have to compare the data generated by at least two such groups and two individuals, each working independently.

A model in which the coders work independently, but consult each other when unanticipated problems arise, is also problematic. A key source of these unanticipated problems is the fact that the writers of the coding instructions did not anticipate all the possible ways of expressing the relevant matter. Ideally, these instructions should include every applicable rule on which agreement is being measured. However, discussing emerging problems could create re-interpretations of the existing instructions in ways that are a function of the group and not communicable to others. In addition, as the instructions become reinterpreted, the process loses its stability: data generated early in the process use instructions that differ from those used later.

In addition to the above, Craggs and McGee Wood [2005] discourage researchers from testing their coding instructions on data from more than one domain. Given that the reproducibility of the coding instructions depends to a great extent on how complications are dealt with, and that every domain displays different complications, the sample should contain sufficient examples from all domains which have to be annotated according to the instructions.

Even the best coding instructions might not specify all possible complications. Besides the set of desired answers, the coders should also be allowed to skip a question. If the coders cannot prove that any of the other answers is correct, they skip that question. For any other answer, the instructions define an a priori model of agreement on that answer, while skip represents the unanticipated properties of questions and coders. For instance, some questions might be too difficult for certain coders. Providing the human contributors with an option to skip is a nod to the openness of the task, and can be used to explore the poorly defined parts of the task that were not anticipated at the outset. Additionally, the skip votes can be removed from the analysis for computing reproducibility, as we do not have an expectation of agreement on them [Krippendorff, 2012, personal communication].

4. MEASURING REPRODUCIBILITY

In measurement theory, reliability is the more general guarantee that the data obtained are independent of the measuring event, instrument or person. There are three different kinds of reliability:

Stability: measures the degree to which a process is unchanging over time. It is measured by agreement between multiple trials of the same measuring or coding process. This is also called the test-retest condition, in which one observer does a task and, after some time, repeats the task again. This measures intra-observer reliability. A similar notion is internal consistency [Cronbach, 1951], which is the degree to which the answers on the same task are consistent. Surveys are designed so that the subsets of similar questions are known a priori, and internal consistency metrics are based on the correlation between these answers.

Reproducibility: measures the degree to which a process can be replicated by different analysts working under varying conditions, at different locations, or using different but functionally equivalent measuring instruments. Reproducible data, by definition, are data that remain constant throughout variations in the measuring process [Kaplan and Goldsen, 1965].

Accuracy: measures the degree to which the process produces valid results. To measure accuracy, we have to compare the performance of contributors with the performance of a procedure that is known to be correct. In order to generate estimates of accuracy, we need accuracy data, i.e., valid answers to a representative sample of the questions. Estimating accuracy gets harder in cases where the heterogeneity of the task is poorly understood.

The next section focuses on reproducibility.

4.1 Reproducibility

There are two different aspects of reproducibility: inter-rater reliability and inter-method reliability. Inter-rater reliability focuses on reproducibility as agreement between independent raters, and inter-method reliability focuses on the reliability of different measuring devices. For example, in survey and test design, parallel forms reliability is used to create multiple equivalent tests, of which more than one are administered to the same human. We focus on inter-rater reliability as the measure of reproducibility typically applicable to human computation tasks, where we generate judgments from multiple humans per question. The simplest form of inter-rater reliability is percent agreement; however, it is not suitable as a measure of reproducibility, as it does not correct for chance agreement.

For an extensive survey of measures of reproducibility, refer to Popping [1988] and Artstein and Poesio [2007]. The different coefficients of reproducibility differ in the assumptions they make about the properties of coders, judgments and units. Scott's π [1955] is applicable to two raters and assumes that the raters have the same distribution of responses, whereas Cohen's κ [1960; 1968] allows for a separate distribution of chance behavior per coder. Fleiss' κ [1971] is a generalization of Scott's π for an arbitrary number of raters. All of these coefficients of reproducibility correct for chance agreement similarly. First, they find how much agreement is expected by chance: let us call this value A_e. The data from the coding give a measure of the observed agreement, A_o. The various inter-rater reliabilities measure the proportion of the possible agreement beyond chance that was actually observed:

    S, π, κ = (A_o − A_e) / (1 − A_e)
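As a concrete instance of this family of chance-corrected coefficients, the sketch below computes Scott's π for two contributors who label the same questions on a nominal scale, estimating A_e from the pooled answer distribution. It is an illustrative implementation of the formula above; the function name and the toy data are ours.

```python
from collections import Counter

def scotts_pi(coder1, coder2):
    """Scott's pi for two coders labelling the same questions with nominal
    categories.  A_e is estimated from the pooled distribution of answers."""
    assert len(coder1) == len(coder2)
    n = len(coder1)
    a_o = sum(c1 == c2 for c1, c2 in zip(coder1, coder2)) / n       # observed agreement
    pooled = Counter(coder1) + Counter(coder2)
    a_e = sum((count / (2 * n)) ** 2 for count in pooled.values())  # chance agreement
    return (a_o - a_e) / (1 - a_e)

# Two coders rating ten propositions as "T" or "F":
print(scotts_pi(list("TTTFFTTFTF"), list("TTFFFTTTTF")))   # about 0.58
```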
Krippendorff's α [1970; 2004] generalizes many of these coefficients: it applies to an arbitrary number of raters, not all of whom have to answer every question. Krippendorff's α has the following desirable characteristics:

• It is applicable to an arbitrary number of contributors and invariant to the permutation and selective participation of contributors. It corrects itself for varying amounts of reproducibility data.

• It constitutes a numerical scale between at least two points with sensible reproducibility interpretations, 0 representing absence of agreement, and 1 indicating perfect agreement.

• It is applicable to several scales of measurement: ordinal, nominal, interval, ratio, and more.

Alpha's general form is:

    α = 1 − D_o / D_e

where D_o is the observed disagreement:

    D_o = (1/n) Σ_c Σ_k o_ck · δ²_ck

and D_e is the disagreement one would expect when the answers are attributable to chance rather than to the properties of the questions:

    D_e = (1 / (n(n−1))) Σ_c Σ_k n_c · n_k · δ²_ck

The δ²_ck term is the distance metric for the scale of the answer space. For a nominal scale,

    δ²_ck = 0 if c = k, and 1 if c ≠ k.
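The computation above can be written down directly for nominal data using the coincidence-matrix bookkeeping implied by these formulas. The sketch below is a minimal illustration, assuming a small dictionary-based input format of our choosing (one dict per question, mapping contributor id to answer); it is not code from the paper or from a reference implementation.

```python
from collections import Counter

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal categories.  `ratings` is a list with
    one dict per question, mapping a contributor id to that contributor's
    answer.  Questions answered by fewer than two contributors contribute no
    pairable judgments and are dropped."""
    units = [list(r.values()) for r in ratings if len(r) >= 2]
    n = sum(len(u) for u in units)                 # number of pairable judgments
    if n <= 1:
        return None

    # Coincidence matrix o_ck, accumulated question by question.
    o = Counter()
    for u in units:
        m = len(u)
        counts = Counter(u)
        for c in counts:
            for k in counts:
                pairs = counts[c] * (counts[k] - (1 if c == k else 0))
                o[(c, k)] += pairs / (m - 1)

    n_c = Counter()                                # marginals n_c = sum_k o_ck
    for (c, _k), v in o.items():
        n_c[c] += v

    # Nominal distance metric: delta^2 is 0 when c == k and 1 otherwise.
    d_o = sum(v for (c, k), v in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    # Degenerate case: a single category everywhere leaves alpha undefined;
    # we treat it as perfect agreement here.
    return 1.0 if d_e == 0 else 1 - d_o / d_e

ratings = [{"r1": "safe", "r2": "safe", "r3": "safe"},
           {"r1": "safe", "r2": "unsafe", "r3": "unsafe"},
           {"r1": "unsafe", "r2": "unsafe"},
           {"r1": "safe"}]                         # single answer: dropped
print(krippendorff_alpha_nominal(ratings))         # about 0.56
```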
4.2 Statistical Significance

The goal of measuring reproducibility is to ensure that the data do not deviate too much from perfect agreement, not that the data are different from chance agreement. In the definition of α, chance agreement is one of the two anchors for the agreement scale, the other, more important, reference point being that of perfect agreement. As the distribution of α is unknown, confidence intervals on α are obtained from the empirical distribution generated by bootstrapping — that is, by drawing a large number of subsamples from the reproducibility data and computing α for each. This gives us a probability distribution of hypothetical α values that could occur within the constraints of the observed data. This can be used to calculate q, the probability of failing to reach the smallest acceptable reproducibility α_min, or a two-tailed confidence interval for a chosen level of significance.
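A minimal bootstrap along these lines is sketched below. It reuses the krippendorff_alpha_nominal sketch from Section 4.1 and resamples whole questions with replacement, which is a simplification relative to Krippendorff's own bootstrap procedure; the function name and defaults are ours.

```python
import random

def bootstrap_alpha(ratings, n_boot=1000, alpha_min=0.667, seed=0):
    """Empirical distribution of alpha obtained by resampling questions with
    replacement.  Returns a 95% interval and q, the estimated probability of
    failing to reach the smallest acceptable reproducibility alpha_min."""
    rng = random.Random(seed)
    alphas = []
    for _ in range(n_boot):
        resample = [ratings[rng.randrange(len(ratings))] for _ in ratings]
        a = krippendorff_alpha_nominal(resample)
        if a is not None:
            alphas.append(a)
    alphas.sort()
    low = alphas[int(0.025 * len(alphas))]
    high = alphas[int(0.975 * len(alphas))]
    q = sum(a < alpha_min for a in alphas) / len(alphas)
    return (low, high), q
```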
4.3 Sampling Considerations

To generate an estimate of the reproducibility of a population, we need to generate a representative sample of the population. The sampling needs to ensure that we have enough units from the rare categories of questions in the data. Assuming that α is normally distributed, Bloch and Kraemer [1989] provide a suggestion for the minimum number of questions from each category to be included in the sample, N_c:

    N_c = z² · (1 + α_min)(3 − α_min) / (4 p_c (1 − p_c)(1 − α_min)) − α_min

where

• p_c is the smallest estimated proportion of values in category c in the population,

• α_min is the smallest acceptable reproducibility below which data will have to be rejected as unreliable, and

• z is the desired level of statistical significance, represented by the corresponding z value for one-tailed tests.

This is a simplification, as it assumes that α is normally distributed and that the data are binary, and it does not account for the number of raters. A general treatment of the sampling requirements is an open problem.
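The suggestion can be evaluated directly; the sketch below implements the formula exactly as quoted above, inheriting the same simplifying assumptions (the helper name and the rounding up to a whole number of questions are ours).

```python
from math import ceil
from statistics import NormalDist

def min_questions_per_category(p_c, alpha_min=0.667, significance=0.05):
    """Bloch and Kraemer style minimum number of questions N_c for a category,
    as given in the text: p_c is the smallest estimated proportion of values
    in that category, and z is the one-tailed value for the chosen
    significance level."""
    z = NormalDist().inv_cdf(1 - significance)
    n_c = z ** 2 * (1 + alpha_min) * (3 - alpha_min) / (
        4 * p_c * (1 - p_c) * (1 - alpha_min)) - alpha_min
    return ceil(n_c)

# Rare categories drive the sample size: with one "unsafe" bag per thousand,
# thousands of questions are needed.
print(min_questions_per_category(p_c=0.001))
```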
                                                                   reproducibility. We suggest ensuring, measuring and com-
4.4 Acceptable Levels of Reproducibility                           municating reproducibility of human computation tasks.
  Fleiss [1981] and Krippendorff [2004] present guidelines
for what should acceptable values of reproducibility based
on surveying the empirical research using these measures.
                                                                   6.    ACKNOWLEDGMENTS
Krippendorff suggests,                                               The author would like to thank Jamie Taylor, Reilly Hayes,
                                                                   Stefano Mazzocchi, Mike Shwe, Ben Hutchinson, Al Marks,
   • Rely on variables with reproducibility above α = 0.800.       Tamsyn Waterhouse, Peng Dai, Ozymandias Haynes, Ed
     Additionally don’t accept data if the confidence inter-       Chi and Panos Ipeirotis for insightful discussions on this
     val reaches below the smallest acceptable reproducibil-       topic.
     ity, αmin = 0.667, or, ensure that the probability, q, of
     the failure to have less than smallest acceptable repro-
     ducibility alphamin is reasonably small, e.g., q < 0.05.      7.    REFERENCES
                                                                     [1] Aboraya, A., Rankin, E., France, C., El-Missiry, A.,
   • Consider variables with reproducibility between α =
                                                                   John, C. The Reliability of Psychiatric Diagnosis Revisited,
     0.667 and α = 0.800 only for drawing tentative con-
                                                                   Psychiatry (Edgmont). 2006 January; 3(1): 41-50.
     clusions.
                                                                     [2] Alonso, O., Rose, D. E., Stewart, B.. Crowdsourcing
   These are suggestions, and the choice of thresholds of ac-      for relevance evaluation. SIGIR Forum, 42(2):9-15, 2008.
ceptability depend upon the validity requirements imposed            [3] Alonso, O., Kazai, G. and Mizzaro, S., 2011, Crowd-
on the research results. It is perilous to “game” α by vio-        sourcing for search engine evaluation, Springer 2011.
lating the requirements of reproducibility: for example, by          [4] Armstrong, D., Gosling, A., Weinman, J. and Marteau,
removing a subset of data post-facto to increase α. Parti-         T., 1997, The place of inter-rater reliability in qualitative
tioning data by agreement measured after the experiment            research: An empirical study, Sociology, vol 31, no. 3, 597-
will not lead to valid conclusions.                                606.
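One possible reading of these cut-offs as a mechanical decision rule is sketched below; the thresholds are those quoted above, but the rule itself is only an illustration and not part of the original text.

```python
def acceptability(alpha, ci_low, q, alpha_min=0.667, q_max=0.05):
    """Apply the suggested thresholds: rely on a variable only if alpha is
    above 0.800 and either the confidence interval stays above alpha_min or
    the probability q of falling below alpha_min is small."""
    if alpha > 0.800 and (ci_low >= alpha_min or q < q_max):
        return "rely on this variable"
    if alpha >= alpha_min:
        return "tentative conclusions only"
    return "reject as unreliable"

print(acceptability(alpha=0.82, ci_low=0.71, q=0.02))   # rely on this variable
```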
4.5 Other Measures of Quality

Ipeirotis, Provost and Wang [2010] present an information theoretic quality score, which measures the quality of a contributor by comparing their score to that of a spammer who is trying to take advantage of chance accuracy. In that regard, it is a metric similar to Krippendorff's alpha, and it additionally models a notion of cost:

    QualityScore = 1 − ExpCost(Contributor) / ExpCost(Spammer)

TurKontrol [Dai, Mausam and Weld, 2010] uses models of both quality and utility with decision-theoretic control to make trade-offs between quality and utility for workflow control.

Le, Edmonds, Hester and Biewald [2010] develop a gold standard based quality assurance framework that provides direct feedback to the workers and targets specific worker errors. This approach requires an extensive, manually generated collection of gold data. Oleson et al. [2011] further develop this approach to include pyrite: programmatically generated gold questions on which contributors are likely to make an error, produced, for example, by mutating data so that it is no longer valid. These are very useful mechanisms for training, feedback and protection against spammers, but they do not reveal the accuracy of the results. The gold questions, by design, are not representative of the original set of questions. This leads to wide error bars on the accuracy estimates, and it might be valuable to measure the reproducibility of the results as well.

5. CONCLUSIONS

We describe reproducibility as a necessary but not sufficient requirement for the results of human computation. We might additionally require the results to have high validity or high utility, but our ability to measure validity or utility with confidence is limited if the data are not reproducible. Additionally, a focus on reproducibility has implications for the design of tasks and instructions, as well as for the communication of the results. It is humbling how often the initial understanding of the task and guidelines turns out to lack reproducibility. We suggest ensuring, measuring and communicating the reproducibility of human computation tasks.

6. ACKNOWLEDGMENTS

The author would like to thank Jamie Taylor, Reilly Hayes, Stefano Mazzocchi, Mike Shwe, Ben Hutchinson, Al Marks, Tamsyn Waterhouse, Peng Dai, Ozymandias Haynes, Ed Chi and Panos Ipeirotis for insightful discussions on this topic.

7. REFERENCES

[1] Aboraya, A., Rankin, E., France, C., El-Missiry, A., John, C. The Reliability of Psychiatric Diagnosis Revisited. Psychiatry (Edgmont), 3(1):41-50, January 2006.
[2] Alonso, O., Rose, D. E., Stewart, B. Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2):9-15, 2008.
[3] Alonso, O., Kazai, G. and Mizzaro, S. Crowdsourcing for search engine evaluation. Springer, 2011.
[4] Armstrong, D., Gosling, A., Weinman, J. and Marteau, T. The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3):597-606, 1997.
[5] Artstein, R. and Poesio, M. Inter-coder agreement for computational linguistics. Computational Linguistics, 2007.
[6] Bennett, E. M., Alpert, R., and Goldstein, A. C. Communications through limited questioning. Public Opinion Quarterly, 18(3):303-308, 1954.
[7] Berelson, B. Content analysis in communication research. Free Press, New York, 1952.
[8] Bloch, D. A. and Kraemer, H. C. 2 x 2 kappa coefficients: Measures of agreement or association. Biometrics, 45(1):269-287, 1989.
[9] Bollacker, K., Evans, C., Paritosh, P., Sturge, T. and Taylor, J. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the 28th ACM SIGMOD International Conference on Management of Data, Vancouver, 2008.
[10] Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46, 1960.
[11] Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213-220, 1968.
[12] Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., and Popovic, Z. Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010.
[13] Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334, 1951.
[14] Craggs, R. and McGee Wood, M. A two-dimensional annotation scheme for emotion in dialogue. In Proc. of AAAI Spring Symposium on Exploring Attitude and Affect in Text, Stanford, 2004.
[15] Dai, P., Mausam, and Weld, D. S. Decision-Theoretic Control of Crowd-Sourced Workflows. AAAI, 2010.
[16] Di Eugenio, B. and Glass, M. The kappa statistic: A second look. Computational Linguistics, 30(1):95-101, 2004.
[17] Esser, J. K. Alive and Well after 25 Years: A Review of Groupthink Research. Organizational Behavior and Human Decision Processes, 73(2-3):116-141, 1998.
[18] Feinstein, A. R. and Cicchetti, D. V. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543-549, 1990.
[19] Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382, 1971.
[20] Goodman, A. B., Rahav, M., Popper, M., Ginath, Y., Pearl, E. The reliability of psychiatric diagnosis in Israel's Psychiatric Case Register. Acta Psychiatrica Scandinavica, 69(5):391-397, May 1984.
[21] Hayes, A. F. and Krippendorff, K. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77-89, 2007.
[22] Hughes, C. D., Berg, L., Danziger, L., Coben, L. A., Martin, R. L. A new clinical scale for the staging of dementia. The British Journal of Psychiatry, 140:56, 1982.
[23] Ipeirotis, P., Provost, F., and Wang, J. Quality management on Amazon Mechanical Turk. In KDD-HCOMP '10, 2010.
[24] Krippendorff, K. Bivariate agreement coefficients for reliability of data. Sociological Methodology, 2:139-150, 1970.
[25] Krippendorff, K. Content Analysis: An Introduction to Its Methodology, Second edition. Sage, Thousand Oaks, 2004.
[26] Kochhar, S., Mazzocchi, S., and Paritosh, P. The Anatomy of a Large-Scale Human Computation Engine. In Proceedings of the Human Computation Workshop at the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2010, Washington D.C., 2010.
[27] Koran, L. M. The reliability of clinical methods, data and judgments (parts 1 and 2). New England Journal of Medicine, 293(13/14), 1975.
[28] Kittur, A., Chi, E. H. and Suh, B. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the twenty-sixth annual SIGCHI Conference on Human Factors in Computing Systems, Florence, 2008.
[29] Mason, W. and Suri, S. Conducting Behavioral Research on Amazon's Mechanical Turk. Working Paper, Social Science Research Network, 2010.
[30] Meeker, W. and Escobar, L. Statistical Methods for Reliability Data. Wiley, 1998.
[31] Morley, S., Eccleston, C. and Williams, A. Systematic review and meta-analysis of randomized controlled trials of cognitive behavior therapy and behavior therapy for chronic pain in adults, excluding headache. Pain, 80:1-13, 1999.
[32] Nowak, S. and Ruger, S. How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In Multimedia Information Retrieval, pages 557-566, 2010.
[33] Oleson, D., Hester, V., Sorokin, A., Laughlin, G., Le, J., and Biewald, L. Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing. In HCOMP '11: Proceedings of the Third AAAI Human Computation Workshop, 2011.
[34] Paritosh, P. and Taylor, J. Freebase: Curating Knowledge at Scale. To appear in the 24th Conference on Innovative Applications of Artificial Intelligence, IAAI 2012.
[35] Popping, R. On agreement indices for nominal data. Sociometric Research: Data Collection and Scaling, 90-105, 1988.
[36] Scott, W. A. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321-325, 1955.
[37] Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast, but is it good?: evaluating non-expert annotations for natural language tasks. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008.
[38] Sorokin, A. and Forsyth, D. Utility data annotation with Amazon Mechanical Turk. In First International Workshop on Internet Vision, CVPR 08, 2008.
[39] Speed, J. G. Do newspapers now give the news? Forum, August 1893.
[40] von Ahn, L. Games With A Purpose. IEEE Computer Magazine, pages 96-98, June 2006.
[41] von Ahn, L. and Dabbish, L. Labeling Images with a Computer Game. In ACM Conference on Human Factors in Computing Systems, CHI 2004, pages 319-326, 2004.
[42] von Ahn, L., Maurer, B., McMillen, C., Abraham, D. and Blum, M. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, pages 1465-1468, September 12, 2008.