     Expert Disagreement in Sequential Labeling:
          A Case Study on Adjudication in
            Medical Time Series Analysis⋆

       Mike Schaekermann1 , Edith Law1 , Kate Larson2 , and Andrew Lim3
           1
              HCI Lab, School of Computer Science, University of Waterloo
 2
   Artificial Intelligence Group, School of Computer Science, University of Waterloo
 3
   Division of Neurology, Sunnybrook Health Sciences Centre, University of Toronto



        Abstract. Low inter-rater agreement is typical in various expert do-
        mains that rely in part on subjective evaluation criteria. Prior work has
        predominantly focused on expert disagreement with respect to individual
        cases in isolation. In this work, we report results from a case study on
        expert disagreement in sequential labeling tasks where the interpretation
        of one case can affect the interpretation of subsequent or previous cases.
        Three board-certified sleep technologists participated in face-to-face ad-
        judication sessions to resolve disagreement in the context of sleep stage
        classification. We collected 1,920 independent scoring decisions from each
        expert on the same dataset of eight 2-hour long multimodal medical time
        series recordings. From all disagreement cases (29% of the dataset), a rep-
        resentative subset of 30 cases was selected for adjudication and expert
        discussions were analyzed for sources of disagreement. We present our
        findings from this case study and discuss future application scenarios of
        expert discussions for the training of non-expert crowdworkers.

        Keywords: Inter-rater disagreement · Adjudication · Sequence data.


1      Introduction

One of the most common use cases for crowdsourcing is the classification of
objects into categories. While crowdsourced classification tasks traditionally fo-
cused on problems not requiring domain expertise, recent work suggests that
crowdsourcing can also be effective for expert-level classification. Examples of
such expert tasks from the medical domain include the identification of low-level
patterns in sleep-related biosignals [22], the annotation of retinal images [12],
and medical relation extraction [3].
    In many mission-critical expert domains including the interpretation of med-
ical data, low inter-rater agreement rates are the norm [5, 11, 15, 16]. Expert
disagreement, however, poses fundamental challenges to quality control proce-
dures in crowdsourcing, and to the use of data labels in supervised machine
⋆
     Supported by NSERC CHRP (CHRP 478468-15) and CIHR CHRP (CPG-140200).




Fig. 1. Visualization of one 30-second epoch of biosignal data to be scored into one of
five sleep stages




learning, as it is not immediately obvious how cases at the inter-subjective de-
cision boundary should be disambiguated if multiple equally-qualified domain
experts exhibit genuine disagreement.

     Prior work has predominantly paid attention to the nature, sources and re-
solvability of expert disagreement on individual classification tasks in isolation
[1, 4, 13, 21]. Many interpretation tasks, however, are sequential in nature, i.e.,
the interpretation of one case affects the interpretation of subsequent or previ-
ous cases. For example, in text translation, the semantic interpretation of one
phrase or sentence can affect the translation of subsequent or previous phrases or
sentences. Heidegger called this reciprocity of text and context the hermeneutic
circle. Overall, sequential labeling makes up a large and diverse class of problems
from numerous expert domains.

    In this work, we present findings from a case study on expert disagreement in
the context of sleep stage classification, the expert task of mapping a sequence of
fixed-length pages of continuous multimodal medical time series (polysomnogram, PSG;
see Figure 1) to a sequence of discrete sleep stages (hypnogram). Prior work has
established that inter-rater agreement in sleep staging averages around 82.6%
[17]. The objective of this case study is to identify various sources of expert
disagreement in sleep stage classification and to investigate if and to what extent
disagreement may be specific to the sequential nature of the labeling task and
underlying data.
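    As an illustration only, the minimal sketch below captures this input/output structure: a recording is a sequence of fixed-length epochs, and the output is one stage label per epoch drawn from the standard five-stage vocabulary. The names and array shapes are assumptions made for this example and do not reflect any implementation used in the study.

```python
# Illustrative sketch of the task structure; names and array shapes are
# assumptions made for this example, not part of the study materials.
from typing import Callable, List, Sequence
import numpy as np

SLEEP_STAGES = ("Wake", "N1", "N2", "N3", "REM")  # 5-class label set

def score_recording(epochs: Sequence[np.ndarray],
                    scorer: Callable[[np.ndarray], str]) -> List[str]:
    """Map a sequence of 30-second PSG epochs to a hypnogram.

    Each epoch is an array of shape (n_channels, n_samples); `scorer`
    stands in for a human expert assigning one stage label per epoch.
    """
    return [scorer(epoch) for epoch in epochs]
```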

    To answer these questions, we collected 1,920 independent sleep scoring de-
cisions from a committee of three board-certified sleep technologists. We then
selected a representative subset of the resulting disagreement cases which were
resolved through in-person adjudication among the members of the expert com-
mittee. The rest of this paper describes the related work, then details our study
for collecting and analyzing the expert deliberation data, and concludes with a
discussion of application scenarios for the training of non-expert crowdworkers.

2     Related Work

2.1   Ambiguity and Sources of Inter-rater Disagreement

Ambiguity, the quality of being open to more than one interpretation, and the
phenomenon of expert disagreement are central to the justification of knowledge,
and have been extensively discussed in the epistemic literature [1, 4, 13, 21]. An
early theoretical investigation named three types of expert disagreement [13]:
personality-based disagreement arising from the incompetence, ideology, or ve-
nality of experts, judgment-based disagreement arising from information gaps,
and structural disagreement that arises because experts adopt different organizing
principles or problem definitions. Garbayo [4], on the other hand, distinguished
legitimate disagreement, which arises when experts have access to the same
evidence but still diverge in their interpretations, from verbal disagreement, i.e.,
misunderstanding among experts due to discrepancies in terminology.
    Recent work in the field of human-computer interaction (HCI) has explored
the issue of disagreement in the context of crowdsourcing tasks. Gurari and
Grauman [6] analyzed visual question answering tasks and found that disagreement
can be attributed to ambiguous and subjective questions, insufficient or am-
biguous visual evidence, differing levels of annotator expertise, and vocabulary
mismatch. Chang et al. [2] proposed to elicit help from the crowd for refinement
of category definitions, based on the finding that workers may disagree because
of incomplete or ambiguous classification guidelines. Kairam and Heer [8] intro-
duced a technique to identify clusters of workers with diverging, but legitimate
interpretations of the same task. Their work shows that disagreement can arise
from differences in how liberally or conservatively workers interpret classification
guidelines.
    Our study revolves around the task of biomedical time series classification, a
field with typically low inter-scorer reliability. For example, Rosenberg and van
Hout [17] conducted a large-scale study on inter-scorer reliability in sleep stage
classification and found that average expert agreement is as low as 82.6%. In a
comment on this study, Penzel et al. [15] explained that systematic studies on
the inter-rater reliability of sleep scoring automatically bring up the question of truth,
claiming that the “true” state (i.e., sleep stage) is unknown and can only be
approximated through aggregation of expert opinions.


2.2   Group Deliberation as a Method for Disambiguation

Group deliberation is an interactive form of decision making among humans
which typically involves group members with conflicting beliefs who try to reach
consensus on a given question by presenting arguments, weighing evidence and
reconsidering individual positions.
    Several works have explored factors that affect the process and outcomes of group
deliberation. Solomon [20] appreciates conflict as an important phenomenon of
any fruitful deliberation process. She argues that dissent is both required and
useful—as “dissenting positions are associated with particular data or insights




Fig. 2. Sources of disagreement by transition type. The vertical axis plots the number of
times a particular source of disagreement was mentioned in an expert discussion about
a case from one of two transition types: Last Wake before sleep onset and transitions
from N1 sleep to N2 sleep. Note the two vertical axes, one for each transition type, are
re-scaled to facilitate a visual comparison of both distributions relative to the number
of epochs discussed (# Epochs Discussed) for each transition type. Expert discussions
could mention more than one source of disagreement.



that would be otherwise lost in consensus formation”—and criticizes procedures
endowed with the a priori aim of reaching consensus. Instead, she advocates for a
structured deliberation procedure that avoids the undesired effects of groupthink
[7] by actively encouraging dissent, organizing individual subgroups to deliberate
on the same question, and ensuring diverse group compositions.
    Kiesler and Sproull [9] found that time limits imposed on deliberation tend
to polarize discussions and to decrease the number of arguments exchanged. The
same work suggests the use of voting techniques or explicit decision protocols to
structure the deliberation process.
    Recent work by Schaekermann et al. [19] introduced a real-time deliberation
framework to disambiguate edge cases in crowdsourced classification tasks, draw-
ing inspiration from some of these early design considerations. The same work
also introduced a novel public deliberation dataset including all deliberation di-
alogues, original and revised classification decisions, and evidence regions from
two different text classification tasks.
    Navajas et al. [14] studied the effectiveness of in-person group deliberation
for general-knowledge questions, reporting that averaging consensus decisions
yielded better results than averaging individual responses.

2.3   Consensus Scoring in Medical Data Analysis

Group deliberation has also been proposed as a technique for disambiguating
edge cases in the interpretation of medical data. Rajpurkar et al. [16] employed
group deliberation among cardiologists to generate a high-quality validation data
set in the context of arrhythmia detection from electrocardiograms (ECGs). Their
work revealed that a convolutional neural network trained on independent la-
bels (i.e., labels collected without deliberation) exceeded the classification per-
formance of individual cardiologists when benchmarked against the consensus
validation set.
    Krause et al. [11] compared majority vote to in-person deliberation as tech-
niques for aggregating expert opinions for diagnosing eye diseases from retinal
fundus photographs. Compared to majority vote, in-person deliberation yielded
substantially higher recall, suggesting the potential of group deliberation for
mitigating underdiagnosis of diabetic retinopathy and diabetic macular edema.
Krause et al. also showed that performing group deliberation on a small portion
of the entire data set can make tuning of hyperparameters for deep learning
models more effective. The same consensus data set was later used by Guan
et al. [5] to validate the classification performance of a novel machine learning
approach involving the training of multiple grader-specific models. They demon-
strated that training and aggregating separate grader-specific models can be
more effective than training a single prediction model on majority labels.
    In the context of sleep stage classification, Penzel et al. [15] refer to the
concept of in-person group deliberation as consensus scoring, concluding that
an “optimal training for [...] sleep scorers is participation in consensus scoring
rounds”. In this work, we translate this idea to the non-expert domain suggesting
a method to augment training procedures for crowdworkers through the use of
edge-case examples and the associated expert discussion dialogues in the context
of sleep stage classification.


3     Expert Deliberation Data Set

An in-person deliberation study was conducted with an expert committee of
three board-certified sleep technologists at Sunnybrook Health Sciences Centre
in Toronto to investigate the extent and potential sources of inter-rater disagree-
ment in sleep stage classification, and the effectiveness of group deliberation as
a method for consensus formation.


3.1   Data Set

We prepared a data set of eight 2-hour-long PSG recording fragments. Each 2-
hour-long fragment contained a sequence of 240 30-second epochs of biosignal
data, resulting in 1,920 (240 × 8) epochs for the entire data set. Half of the
fragments were from healthy subjects, the other half from patients with Parkin-
son’s disease. Both parts of the data set (Healthy and Parkinson) contained




Fig. 3. Sources of disagreement by disease state. The vertical axis plots the number
of times a particular source of disagreement was mentioned in an expert discussion
about a case from one of two disease states: Healthy and Parkinson’s Disease. Note
the two vertical axes, one for each disease state, are re-scaled to facilitate a visual
comparison of both distributions relative to the number of epochs discussed (# Epochs
Discussed) for each disease state. Expert discussions could mention more than one
source of disagreement.


examples of different transition types. We included examples from four different
transition types identified by Rosenberg and van Hout [17] as regions with typically low
inter-rater agreement: the last epoch of stage Wake before sleep onset, the first
epoch of stage N2 after stage N1, the first epoch of stage REM after stage N2,
and transitions between stages N2 and N3.
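    For concreteness, the arithmetic behind these counts, together with one possible way of splitting a fragment into epochs, is sketched below; the sampling rate and array layout are assumptions made only for this illustration.

```python
# Sketch of the epoching arithmetic; the 256 Hz sampling rate and the
# (n_channels, n_samples) layout are assumptions for illustration only.
import numpy as np

EPOCH_SECONDS = 30
FRAGMENT_HOURS = 2
FS = 256  # assumed sampling rate in Hz

epochs_per_fragment = FRAGMENT_HOURS * 3600 // EPOCH_SECONDS  # 240
total_epochs = 8 * epochs_per_fragment                        # 1,920

def split_into_epochs(signal: np.ndarray, fs: int = FS) -> np.ndarray:
    """Split a (n_channels, n_samples) recording into 30-second epochs."""
    samples_per_epoch = EPOCH_SECONDS * fs
    n_epochs = signal.shape[1] // samples_per_epoch
    trimmed = signal[:, :n_epochs * samples_per_epoch]
    # Result shape: (n_epochs, n_channels, samples_per_epoch)
    return trimmed.reshape(signal.shape[0], n_epochs,
                           samples_per_epoch).swapaxes(0, 1)
```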

3.2   Procedure
The full data set was first scored independently by each sleep technologist, result-
ing in 5,760 individual scoring decisions, three for each of the 1,920 epochs. We
then identified all epochs with disagreement among scorers and selected a subset
of 30 epochs for in-person group deliberation. The selected disagreement epochs
represented both disease states and all four transition types. All 30 epochs were
discussed in person by the three scorers using a graphical scoring interface to
facilitate detailed discussions about patterns present in the time series data. The
experts participants were not explicitly required to reach unanimous consensus,
and could instead choose to declare a case as irresolvable. We did not impose an
explicit voting scheme or limit the amount of time available per discussion, but
instead left the discussion dynamics open until all experts either agreed on one


                          Tech A  Tech B  Tech C  Majority  # Obs.
           Tech B          0.71    —       —       —
           Tech C          0.71    0.68    —       —         N=1920
           Majority        0.87    0.83    0.84    —
           Deliberation    0.63    0.50    0.02    0.54      N=30

Table 1. Pairwise agreement between all sleep technologists (Tech A, Tech B, Tech
C), as well as the group labels as determined by majority vote and the deliberation
process. Agreement is measured by Cohen’s kappa.



sleep stage or declared a case as irresolvable. Unanimous decisions were reached
for all 30 epochs through a process of verbal argumentation and re-interpretation
of the patterns shown in the biosignal data. The irresolvable option was never
used. Discussions were recorded (screen capture and audio), transcribed and
qualitatively coded for the different sources of disagreement.
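    As an illustration of how disagreement epochs and per-epoch majority labels can be derived from the three independent scorings, consider the following sketch (our own code, written for illustration only and not part of the study materials):

```python
# Sketch of deriving disagreement epochs and per-epoch majority labels
# from three independent scorings; written for illustration only.
from collections import Counter
from typing import List, Optional

def disagreement_epochs(a: List[str], b: List[str], c: List[str]) -> List[int]:
    """Indices of epochs on which the three scorers do not all agree."""
    return [i for i, labels in enumerate(zip(a, b, c)) if len(set(labels)) > 1]

def majority_label(a: str, b: str, c: str) -> Optional[str]:
    """Majority vote over one epoch; None if all three labels differ."""
    label, count = Counter([a, b, c]).most_common(1)[0]
    return label if count >= 2 else None
```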

3.3   Inter-rater Disagreement
We measured pairwise agreement between all scorers (Tech A, Tech B, Tech C),
as well as the group labels as determined by majority vote (Majority) and the
deliberation process (Deliberation). Agreement was measured by Cohen’s kappa.
Table 1 summarizes all agreement results. Pairwise agreement among scorers was
moderate, ranging between 0.68 and 0.71 (N=1920). Agreement between indi-
vidual scorers and the majority vote was high, between 0.84 and 0.87 (N=1920).
For the epochs discussed in person, we measured pairwise agreement between the
deliberation decision and individual scorers’ decisions. Two of the three scorers
showed weak agreement with deliberation outcomes (Cohen’s kappa of 0.63 and
0.50, N=30), while the third scorer showed no systematic agreement with the
deliberation outcomes (Cohen’s kappa of 0.02, N=30). Agreement between the
majority vote and deliberation decisions was low (Cohen’s kappa of 0.54, N=30).
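    The agreement values reported above can be computed from per-epoch label sequences along the lines of the sketch below. The rater names and data structures are placeholders, and scikit-learn's cohen_kappa_score is only one possible implementation; we make no claim that it was the tool used in this study.

```python
# Sketch of the pairwise agreement computation; rater names and data are
# placeholders, and scikit-learn is used here only as one possible tool.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(scorings: dict) -> dict:
    """Cohen's kappa for every pair of raters in `scorings`.

    `scorings` maps a rater name (e.g., "Tech A" or "Majority") to a list
    of per-epoch stage labels; all lists must have equal length.
    """
    return {(r1, r2): cohen_kappa_score(scorings[r1], scorings[r2])
            for r1, r2 in combinations(scorings, 2)}
```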

3.4   Sources of Disagreement
Initial qualitative coding of the expert discussions for 17 cases from two major
transition types revealed a broad range of reasons why sleep technologists may
disagree on the correct sleep stage label. Figures 2 and 3 compare the relative
frequency of different sources of disagreement across two transition types (Last
W before sleep and First N2 after N1 ) and across two disease states (Healthy and
Parkinson) respectively. Overall, we identified two sources of disagreement which
occurred with the highest frequency in both transition types and disease states.
These were (a) the presence of multiple stages in one epoch causing disagreement
about which stage was the dominant one, and (b) different configurations of
the graphical scoring interface in terms of amplitude scaling causing divergent
interpretations of visual patterns in the signal.


[Figure 4 diagram: a two-panel overview. Left panel (Experts): independent labeling, followed by in-group deliberation in which each expert presents arguments, yielding a data set in which expert disagreement is resolved through group deliberation. Right panel (Crowd Workers): a training phase with feedback and a testing phase without feedback; independent variables include ambiguity levels (clear cases, edge cases, or both), the ambiguity measure (machine uncertainty or expert disagreement), and whether expert discussions are shown for edge cases; dependent variables are answer accuracy, task completion time, and NASA-TLX. A training interface mock-up lets workers classify a biosignal, receive feedback (“Your answer is wrong! This is ‘Wake’. Click OK to watch an expert discussion about this particular case.”), and learn from an excerpt of the associated expert discussion (“The frequency is not fast enough to be a spindle. It’s just that person’s alpha.”).]

Fig. 4. Application scenario of using expert discussions for improving example-based
training for non-expert crowdworkers.



    While these two sources of disagreement could persist on individual cases
without the sequential context, we identified two other sources of disagreement
that explicitly depend on the sequential nature of the labeling task and under-
lying data:

 – Number of scoring passes:
   For 3 out of 30 adjudicated cases, experts explicitly mentioned that their
   scoring decision depended on the number of passes they had taken on a par-
   ticular recording. In other words, experts indicated that their interpretation
   of biosignals is often updated once certain patient-specific patterns are ob-
   served towards the end of the recording. A subsequent re-interpretation (i.e.,
   second scoring pass) would then allow experts to take into account observa-
   tions they have made in the other parts of the data sequence in one of the
   earlier scoring passes. Disagreement could therefore arise if one expert had
   only performed one initial pass whereas other experts may have performed
   two or more passes.
 – Cascade from previous disagreement:
   3 out of 30 adjudicated cases could be resolved automatically once the dis-
   agreement on one of the close-by preceding cases had been resolved. This
   dynamic arises because evidence for specific sleep stages may some-
   times be observed only at the transition point from one sleep stage to an-
   other. Consequently, disagreement may arise at a “critical” transition point
   and persist over multiple steps in the sequence. Once the disagreement at
   the transition point is resolved, the resolution can cascade to the subsequent
   steps until the next transition point.

   These two sources of disagreement co-occurred once, meaning that 5 out of
30 adjudicated cases (17%) were associated with sources of disagreement that
depend on the sequential nature of the labeling task and underlying data.


4   Discussion

In this work, we provided an initial investigation of expert disagreement in the
context of sequential labeling tasks, studying the effectiveness of in-person ad-
judication for resolving disagreement and for surfacing information about the
original sources of disagreement.
    Our results suggest that majority vote is not necessarily a good proxy for
group deliberation decisions in sleep staging. This finding provides some confi-
dence in the usefulness of expert discussions for the purpose of resolving disagree-
ment cases. Beyond that, our qualitative analysis of expert discussion dialogues
uncovered a diverse set of different reasons why domain experts disagree in the
context of sleep stage classification, most of which go beyond the notion of mere
input mistakes.
    Perhaps most importantly, we identified two sources of disagreement with
a clear connection to the sequential nature of the labeling task and underlying
data. This observation provides some support for our hypothesis that the reci-
procity of data and context in sequential labeling may lead to unique forms of
expert disagreement that are characteristic for sequential labeling tasks, where
the interpretation of one case affects the interpretation of subsequent or pre-
vious cases. One exciting avenue for future research is the problem of whether
it is possible to detect the “critical” tasks that might set up a cascade of dis-
agreement and potentially incorrect labels. Successful detection of such “critical”
tasks would allow for a more cost-effective use of expert resources by focusing
disambiguation procedures on those cases and saving expert resources on other
cases that may be resolved automatically.
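    Purely as a speculative sketch of this future direction (our own illustration, not a method proposed or evaluated in this paper), one very simple heuristic would be to flag the first epoch of every contiguous run of inter-rater disagreement as a candidate cascade origin and to prioritize those epochs for adjudication:

```python
# Speculative heuristic sketch: treat the first epoch of each contiguous
# run of disagreement as a candidate "critical" transition point. This is
# an illustration only and was not evaluated in the paper.
from typing import List

def candidate_critical_epochs(a: List[str], b: List[str], c: List[str]) -> List[int]:
    """Start indices of contiguous runs of disagreement among three scorers."""
    disagree = [len({x, y, z}) > 1 for x, y, z in zip(a, b, c)]
    return [i for i, d in enumerate(disagree)
            if d and (i == 0 or not disagree[i - 1])]
```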
    Picking up on Penzel et al.’s comment on the nature of “truth” in sleep
staging [15], some of the inherent difficulty may arise because there exists a
certain degree of both temporal and spatial continuity at transitions between
states. In other words, despite the fact that any single neuron or cortical circuit
may be thought of as existing in one state or another at any given moment, it is
possible for local assemblies of neurons to take some time to transition from one
state to another, and also that distant assemblies of neurons in different parts
of the brain can exist in different states at the same time. These transitions
may take minutes [18, 23], spanning several 30-second epochs. Thus, we
hypothesize that some of the ambiguity stems from the need to force transitional
states into one sleep stage category or another.
    We posit that expert disagreement in complex tasks can be used as a signal
to identify ambiguous edge cases, and as a driver for eliciting conclusive expert
discussions to disambiguate such edge cases. For future work, we propose the
idea that example-based training procedures for non-expert crowdworkers may
benefit from the presentation of edge cases and their associated expert discus-
sions. While expert disagreement may be one signal for the identification of edge
cases, other techniques for the automatic selection of edge case examples, e.g.,
based on measures of machine uncertainty, have been proposed in prior work
[10]. We believe that expert disagreement and the associated expert discussions
open up interesting opportunities for optimizing example-based training proce-
dures for human learners, e.g., to improve disambiguation skills and depth of
understanding.
    Figure 4 illustrates a high-level overview of some of these future directions. In
summary, we hope to conduct research on augmenting example-based training
procedures for non-expert crowdworkers using edge-cases and their associated
expert discussions to help human learners develop more accurate classification
strategies for expert-level tasks exhibiting a certain amount of ambiguity.
    Another promising avenue for future work will be to explore the minimum
“bandwidth” and effective protocols of communication between experts needed
to result in successful disambiguation in the context of sequential labeling set-
tings like the one presented in this work. Comparisons may include different
styles of expert communication ranging from online text-based asynchronous
approaches, to in-person verbal real-time communication.


5   Conclusion
In this work, we reported results from a case study on expert disagreement
in sequential labeling tasks where the interpretation of one case can affect the
interpretation of subsequent or previous cases. Three board-certified sleep tech-
nologists scored 1,920 cases in a sequential 5-class labeling task. Out of all dis-
agreement cases, 30 cases were discussed and resolved through face-to-face ad-
judication. We identified various sources of disagreement that are specific to the
sequential nature of the underlying data and labeling procedure. Our work con-
cluded with a discussion of promising application scenarios of expert discussions
for the training of non-expert crowdworkers that we hope to explore in future
work.


References
 1. Beatty, J., Moore, A.: Should We Aim for Consensus? Episteme 7(3), 198–214
    (2010). https://doi.org/10.3366/E1742360010000948
 2. Chang, J.C., Amershi, S., Kamar, E.: Revolt: Collaborative Crowdsourcing
    for Labeling Machine Learning Datasets. In: Proceedings of the 2017 CHI
    Conference on Human Factors in Computing Systems - CHI ’17. pp. 2334–2346.
    ACM Press, New York, New York, USA (2017). https://doi.org/10.1145/3025453.3026044,
    http://dl.acm.org/citation.cfm?doid=3025453.3026044
 3. Dumitrache, A., Aroyo, L., Welty, C.: Crowdsourcing Ground Truth
    for Medical Relation Extraction. ACM Transactions on Interactive In-
    telligent Systems 8(2), 1–20 (7 2018). https://doi.org/10.1145/3152889,
    http://dl.acm.org/citation.cfm?doid=3232718.3152889

 4. Garbayo, L.: Epistemic Considerations on Expert Disagreement, Normative Justi-
    fication, and Inconsistency Regarding Multi-criteria Decision Making. Constraint
    Programming and Decision Making 539, 35–45 (2014)
 5. Guan, M., Gulshan, V., Dai, A., Hinton, G.: Who said what: Modeling individ-
    ual labelers improves classification. In: AAAI Conference on Artificial Intelligence
    (2018), https://arxiv.org/pdf/1703.08774.pdf
 6. Gurari, D., Grauman, K.: CrowdVerge: Predicting If People Will Agree on the
    Answer to a Visual Question. In: Proceedings of the 2017 CHI Conference on
    Human Factors in Computing Systems - CHI ’17. pp. 3511–3522. ACM Press,
    New York, New York, USA (2017). https://doi.org/10.1145/3025453.3025781,
    http://dl.acm.org/citation.cfm?doid=3025453.3025781
 7. Jones, A.M.: Victims of Groupthink: A Psychological Study of Foreign Policy Deci-
    sions and Fiascoes. The ANNALS of the American Academy of Political and Social
    Science 407(1), 179–180 (5 1973). https://doi.org/10.1177/000271627340700115,
    http://journals.sagepub.com/doi/10.1177/000271627340700115
 8. Kairam, S., Heer, J.: Parting Crowds: Characterizing Divergent In-
    terpretations in Crowdsourced Annotation Tasks. In: Proceedings of
    the 19th ACM Conference on Computer-Supported Cooperative Work
    & Social Computing - CSCW ’16. pp. 1635–1646. ACM Press, New
    York, New York, USA (2016). https://doi.org/10.1145/2818048.2820016,
    http://dl.acm.org/citation.cfm?doid=2818048.2820016
 9. Kiesler, S., Sproull, L.: Group decision making and communication
    technology. Organizational Behavior and Human Decision Processes
    52(1), 96–123 (6 1992). https://doi.org/10.1016/0749-5978(92)90047-B,
    http://linkinghub.elsevier.com/retrieve/pii/074959789290047B
10. Kim, J., Park, J., Lee, U.: EcoMeal: A Smart Tray for Promoting Healthy Dietary
    Habits. In: Proceedings of the 2016 CHI Conference Extended Abstracts on Hu-
    man Factors in Computing Systems - CHI EA ’16. pp. 2165–2170. ACM Press,
    New York, New York, USA (2016). https://doi.org/10.1145/2851581.2892310,
    http://dl.acm.org/citation.cfm?doid=2851581.2892310
11. Krause, J., Gulshan, V., Rahimy, E., Karth, P., Widner, K., Corrado, G.S.,
    Peng, L., Webster, D.R.: Grader Variability and the Importance of Reference
    Standards for Evaluating Machine Learning Models for Diabetic Retinopathy.
    Ophthalmology (3 2018). https://doi.org/10.1016/j.ophtha.2018.01.034,
    http://arxiv.org/abs/1710.01711
    http://linkinghub.elsevier.com/retrieve/pii/S0161642017326982
12. Mitry, D., Zutis, K., Dhillon, B., Peto, T., Hayat, S., Khaw, K.T., Mor-
    gan, J.E., Moncur, W., Trucco, E., Foster, P.J.: The Accuracy and Reliabil-
    ity of Crowdsource Annotations of Digital Retinal Images. Translational Vi-
    sion Science & Technology 5(5), 6 (2016). https://doi.org/10.1167/tvst.5.5.6,
    http://tvst.arvojournals.org/article.aspx?doi=10.1167/tvst.5.5.6
13. Mumpower, J.L., Stewart, T.R.: Expert Judgement and Expert Disagreement.
    Thinking & Reasoning 2(2-3), 191–212 (7 1996). https://doi.org/10.1080/135467896394500,
    https://www.tandfonline.com/doi/full/10.1080/135467896394500
14. Navajas, J., Niella, T., Garbulsky, G., Bahrami, B., Sigman, M.: Aggregated knowl-
    edge from a small number of debates outperforms the wisdom of large crowds.
    Nature Human Behaviour (1 2018). https://doi.org/10.1038/s41562-017-0273-4,
    http://www.nature.com/articles/s41562-017-0273-4

15. Penzel, T., Zhang, X., Fietze, I.: Inter-scorer reliability between sleep centers can
    teach us what to improve in the scoring rules. Journal of Clinical Sleep Medicine
    9(1), 81–87 (2013)
16. Rajpurkar, P., Hannun, A.Y., Haghpanahi, M., Bourn, C., Ng, A.Y.: Cardiologist-
    Level Arrhythmia Detection with Convolutional Neural Networks (7 2017),
    http://arxiv.org/abs/1707.01836
17. Rosenberg, R.S., van Hout, S.: The American Academy of Sleep
    Medicine Inter-scorer Reliability Program: Sleep Stage Scoring. Journal
    of Clinical Sleep Medicine (1 2013). https://doi.org/10.5664/jcsm.2350,
    http://www.aasmnet.org/jcsm/ViewAbstract.aspx?pid=28772
18. Saper, C.B., Fuller, P.M., Pedersen, N.P., Lu, J., Scammell, T.E.: Sleep
    State Switching. Neuron 68(6), 1023–1042 (12 2010).
    https://doi.org/10.1016/j.neuron.2010.11.032,
    http://linkinghub.elsevier.com/retrieve/pii/S0896627310009748
19. Schaekermann, M., Goh, J., Larson, K., Law, E.: Resolvable vs. Irresolvable Dis-
    agreement: A Study on Worker Deliberation in Crowd Work. In: Proceedings of the
    2018 ACM Conference on Computer Supported Cooperative Work and Social Com-
    puting (CSCW’18). New York City, NY (2018). https://doi.org/10.1145/3274423
20. Solomon, M.: Groupthink versus The Wisdom of Crowds : The Social Epis-
    temology of Deliberation and Dissent. The Southern Journal of Philoso-
    phy 44(S1), 28–42 (3 2006). https://doi.org/10.1111/j.2041-6962.2006.tb00028.x,
    http://doi.wiley.com/10.1111/j.2041-6962.2006.tb00028.x
21. Solomon, M.: The social epistemology of NIH consensus conferences. In: Establish-
    ing medical reality, pp. 167–177. Springer (2007)
22. Warby, S.C., Wendt, S.L., Welinder, P., Munk, E.G.S., Carrillo, O.,
    Sorensen, H.B.D., Jennum, P., Peppard, P.E., Perona, P., Mignot, E.:
    Sleep-spindle detection: crowdsourcing and evaluating performance of experts,
    non-experts and automated methods. Nature Methods 11(4), 385–392 (2 2014).
    https://doi.org/10.1038/nmeth.2855,
    http://www.nature.com/doifinder/10.1038/nmeth.2855
23. Wright Jr, K.P., Badia, P., Wauquier, A.: Topographical and temporal patterns
    of brain activity during the transition from wakefulness to sleep. Sleep 18(10),
    880–889 (1995)