Training-Induced Class Imbalance in Crowdsourced Data
Shawn Ogunseye1, Jeffrey Parsons2 and Doyinsola Afolabi3
1
  Bentley University, Waltham, MA, USA
2
  Memorial University of Newfoundland and Labrador, St. John's, NL, Canada
3
  University of Lagos, Akoka, Lagos State, Nigeria


                                  Abstract
                                  In this paper, we examine how the design of data-collection systems can lead to imbalanced
                                  data. Specifically, we scrutinize how training affects the imbalance of data in a data
                                  crowdsourcing experiment. We randomly assigned contributors to explicitly trained, implicitly
                                  trained, and untrained (control) groups and asked them to report artificial insect sightings in a
                                  simulated crowdsourcing task. We posit that training contributors can lead them to selectively
                                  pay attention to and report specific aspects of observations while ignoring others. In the
                                  experiment, explicitly trained contributors reported less balanced data than untrained and
                                  implicitly trained contributors did. We then explored the effect of training-induced imbalance
                                  on an unsupervised classification task and found that the purity of classes formed was lower
                                  for explicitly trained contributors than for the other two types of contributors. We conclude by
                                  discussing the implications of artificial imbalance for the usefulness and insightfulness of
                                  crowdsourced data.

                                  Keywords 1
                                  Data Imbalance, Crowdsourcing Design, Crowd Knowledge, Data-driven insight

1. Introduction

    Data crowdsourcing is one effective way that organizations can access information about a
phenomenon of interest from willing human contributors. This method has been widely used to collect
data across diverse domains, from monitoring invasive mosquito species that transmit diseases such as
dengue fever and the Zika virus (The Invasive Mosquito Project, 2017), to providing guidance for
consumer purchase decisions [1], to tracking the outpatient health of seniors [2]. In
data-crowdsourcing projects, information is collected from an undefined sample population, which is a
source of concern for data consumers – organizations and individuals who gather input via
crowdsourcing platforms – who wish to have the best data possible. Research has therefore sought to
guide the design of data-crowdsourcing systems to ensure high-quality data is collected.
    However, scholarship on data quality has centered chiefly on the representational quality of data –
its capacity to adequately represent observed real-world phenomena. Consideration of what
constitutes high-quality data (data fit for a specific use) is limited, usually focusing on accuracy (the
extent to which the information correctly represents observed real-world phenomena) and completeness
(the extent to which it contains all the attributes of a phenomenon of interest) [3]. But data is
increasingly repurposed to answer questions not anticipated when it was collected. There is, therefore,
a need to better understand how to improve the quality and quantity of insights that data can provide
[3], [4]. This makes understanding other intrinsic properties of data that can affect the usefulness of
crowdsourced data – the quality of information derivable from an analyzed dataset – a critical success
factor for data consumers.


VLDB 2021 Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, August 20, 2021,
Copenhagen, Denmark
EMAIL: sogunseye@bentley.edu (S. Ogunseye); jeffreyp@mun.ca (J. Parsons); dogunbiyi@unilag.edu.ng (D. Afolabi)
ORCID: 0000-0001-5774-4965 (S. Ogunseye); 0000-0002-4819-2801 (J. Parsons); 0000-0001-8442-7367 (D. Afolabi)
                               © 2021 Copyright for this paper by its authors.
                               Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                               CEUR Workshop Proceedings (CEUR-WS.org)
    One intrinsic property of crowdsourced data that can affect its usefulness is class imbalance (also
called data imbalance) [5]–[7]. A balanced dataset is one in which the attributes needed to classify all
instances of classes reported in the dataset are sufficiently proportional to allow classification
algorithms to differentiate these instances into recognizable classes. When data is balanced, the
instances that make up different classes in the dataset are evenly or equitably represented. The
classification of the data is not skewed towards some instances to the detriment of others. Data
imbalance is a pervasive problem usually originating from the world from which data is collected. The
attributes of entities are not uniformly distributed in nature; some occur more frequently than others.
For example, a dataset of fraud markers in financial transactions will show an imbalance because
fraudulent transactions are much rarer than legitimate transactions. Likewise, markers in intrusion-
detection datasets and in cancer-detection data will be imbalanced for the same reason – rareness of the
target attribute or class.
    Imbalance in data is a persistent problem in analytics [8]–[11], negatively affecting the accuracy of
models in terms of regression [12], [13], classification (supervised classification) [14], clustering
(unsupervised classification) [11], and artificial neural networks [15]. In many cases, the rarely
occurring instances are deemed important, and their sparse occurrence implies algorithms may
misclassify them into more prevalent classes. This is because machine-learning algorithms identify
members of classes using models made from the broadest set of similar attributes of instances found in
data. That is, they show a "maximum-generality bias" [16, p. 201], assuming the data has enough
attributes of the relevant states of a phenomenon of interest. The use of the most commonly occurring
attributes to form classes constrains the ability of classification algorithms to form classes from rarely
occurring attributes. Instances that would have formed a minority class get subsumed into a majority
class [9], [16].
    Imbalance is typically assumed to be caused solely by the inherent nature of the observed phenomenon
and is mainly addressed after data has been collected, using algorithmic and data-level strategies [15], [17]
(referred to here as after-the-fact measures). But imbalance may also be caused "artificially" [9, p. 1].
The design choices that data consumers make about data crowdsourcing projects can bias contributors
to provide more or less balanced data. These decisions include how the crowdsourcing system is
designed, the task(s) assigned to the sample population, who is recruited, and what motivates
contributors [18].
    Preventing imbalance, when possible, would benefit data consumers who must trust the
effectiveness of after-the-fact measures when they do not have sufficient domain knowledge to know
what classes are expected in their datasets. Without domain knowledge, or when data consumers do not
know what insights may lie in a dataset, the potential for discoveries may be lost or significantly reduced
as minority classes get subsumed [21]. Moreover, after-the-fact strategies can introduce new problems
that degrade a machine-learning model's performance. Random over-sampling, for example, increases
the likelihood of over-fitting because exact replicas of minority-class instances are added to the initial
dataset [19], [20], and the enlarged dataset lengthens the learning process. Algorithmic approaches, in
turn, may not accurately detect all the minority classes and their instances [10]. Increasing our
understanding of how to prevent or mitigate artificial imbalance can improve the usefulness of
crowdsourced data when machine learning is applied to it.
    In this paper, we consider a design decision that can affect the imbalance of data in crowdsourcing
projects – the decision to recruit knowledgeable contributors. Data consumers prefer to recruit
knowledgeable contributors, but when these are scarce, they train novices to become more proficient in
a crowdsourcing task. We investigate the effect of this design choice on the balance of contributed data.
Through training, we induce the acquisition of task knowledge in contributors in a simulated
crowdsourcing experiment and compare the level of imbalance in the data they provided. The study
provides evidence that trained contributors report more imbalanced data than untrained contributors.

2. Knowledge and Data Contribution
   Training contributors is a common design decision that data consumers make to improve the quality
of crowdsourced data. However, this decision may affect the balance of data contributors' reports.
Consider that humans are overloaded with sensory information every second. In a reporting task, such
as identifying an entity, we learn about objects by paying selective attention to the relevant features that
aid in identification. Consequently, irrelevant features (those not helpful for determining class
membership) are safely ignored. Selective attention is the cognitive process of attending to one or more
sensory stimuli while ignoring others considered irrelevant to a task [21]. Although selective attention
leads to efficient learning, especially when making connections between instances with few similar
features, it comes with costs. The direct cost of selective attention is a learned inattention to features
that are not relevant to a particular data-reporting task [22]–[24]. These features, however, may be
critical for classification in another context, so a failure to capture them precludes the possibility of
using these features in a different context [25].
    Learning leads trained contributors to focus on relevant diagnostic features (i.e., for a specific task
such as species identification), making them less likely to attend to non-diagnostic attributes than
novices. We consider two forms of learning from the literature [23]: supervised learning – engendered
by some form of explicit training (e.g., by a teacher) with sufficient feedback to improve the classifier's
skill – and unsupervised learning – without explicit training (self-taught). Unsupervised or implicit
learning may involve less rule-based processing and, consequently, more attentiveness to attributes,
while supervised learning or explicit learning leads to a sharper focus on the acquisition of rules.
Trained contributors will therefore tend to attend selectively to the attributes they were exposed to and
learned to prioritize in training, leading to a disproportionate distribution of attributes in a dataset.
Implicitly trained contributors may form different inclusion rules involving different attributes of the
entity; the attributes they attend to and report will be influenced by the salience of features more than
is the case for explicitly trained contributors. Explicitly trained contributors, in contrast, share a uniform
set of attributes they have been taught. They will focus mainly on the attributes and entities they have
learned about and will be less likely to report attributes and entities that deviate from that knowledge.
As a result, explicitly trained contributors will report the most imbalanced data. Untrained contributors,
in turn, will attend to more attributes and report them consistently across the entities they observe. We
therefore predict that data from untrained crowds will be more balanced than data from trained crowds.

Proposition 1: Untrained contributors will report more balanced data than implicitly or explicitly
trained contributors.

A consequence of imbalanced data is a reduced ability to infer additional classes from the data. We
therefore predict that the decision to train contributors will negatively affect the usefulness of
crowdsourced data, most severely for the explicitly trained group and least for the untrained group.

Proposition 2: Data from untrained contributors will be more accurately classified using
classification-based machine learning algorithms than data from trained contributors.


3. Research Method

    To test these propositions, we designed an experiment to simulate a data crowdsourcing task. These
types of projects often seek knowledgeable contributors [26] and sometimes provide training to ensure
they can provide the type and quality of data needed by scientists [27]–[29]. In addition, many data
crowdsourcing projects are interested in discoveries [33]. Data crowdsourcing is, therefore, an appropriate
context in which to test the impact of training on data imbalance. Following the example of [31], we designed an
experiment with two classes of artificial insects: tyrans and nontyrans. We used artificial creatures as
primary entities of interest to limit the effect of contributor prior knowledge on the study. We defined
tyran as a class (species) of artificial insects whose members meet a classification rule consisting of
five requirements: (1) short tail, (2) light blue bodies, (3) two or three buttons on their light blue bodies,
(4) blue wings, and (5) either one or two rings on each blue wing. Similar artificial stimuli that do not
satisfy this classification rule (i.e., they meet some, but not all, of the conditions) are nontyrans. Figure
1 shows a sample tyran and a sample nontyran used in the experiment.
Figure 1: Sample Tyran and Nontyran Images. The left image shows a tyran with its parts labeled: the
antennae and any secondary entities are sources of nondiagnostic information, while the light blue body,
buttons, blue wings, rings, and tail are sources of diagnostic information. The right image shows a
nontyran that has three rings on each wing; the number of legs is not diagnostic.
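
    To make the classification rule concrete, the sketch below encodes the five conditions as a Python
predicate. The record format, attribute names, and example values are hypothetical illustrations, not part
of the study's materials.

    # Hypothetical encoding of the five-condition tyran rule described above.
    def is_tyran(insect: dict) -> bool:
        """Return True only if an insect record satisfies all five tyran conditions."""
        return (
            insect["tail"] == "short"                    # (1) short tail
            and insect["body_color"] == "light blue"     # (2) light blue body
            and insect["buttons"] in (2, 3)              # (3) two or three buttons on the body
            and insect["wing_color"] == "blue"           # (4) blue wings
            and all(rings in (1, 2) for rings in insect["rings_per_wing"])  # (5) one or two rings per wing
        )

    # A stimulus that meets only some of the conditions is a nontyran,
    # like the sample in Figure 1 with three rings on each wing.
    sample = {"tail": "short", "body_color": "light blue", "buttons": 2,
              "wing_color": "blue", "rings_per_wing": [3, 3]}
    print(is_tyran(sample))  # False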

    The experiment consisted of twenty images, each presented on a separate slide. Sixteen slides (the
test images) showed a mixture of tyrans and nontyrans. Four images containing unrelated items were
placed intermittently within the sequence of test images to check whether participants paid attention to
the task. These items were differently shaped/colored stimuli that were not insects (e.g., a triangle), and
each participant was expected to report these stimuli correctly. The slides were presented in a
nonrandomized order to all three groups, with images 5, 10, 15, and 20 showing items not related to the
actual task. Each image had one primary entity and 13 of the 16 slides also included secondary entities.
Secondary entities presented in the images are everyday objects, such as birds, insects, and fences. We
displayed images of the entities in separate PowerPoint slides and asked contributors a nonleading
question: "What do you see?"2

3.1.         Task
    We asked participants to imagine that designers of a game, similar to Pokémon Go, required their
assistance in designing an aspect of the game. In the game, players interact with artificial insects.
Insects called tyrans are harmful and can kill a player's character, while similar insects called
nontyrans, which lack some defining features, can provide energy to a player's character in the game.
The designers, therefore, needed to test whether participants could report data about these insects to
help improve the game. Participants were further informed that the goal of the
experiment was to examine how people report the entities they observe.
    Participants were issued data-entry booklets in which to write down their observations. Each
numbered page in a booklet had space to record observations about the correspondingly numbered
image slide projected on a screen. The prompt on each page of
the booklet was, "What do you see?" Participants were also required to complete a demographic section
after the study.

3.2.         Participants
   The 93 participants were undergraduate students at a Canadian university. Students chose to receive
either course credit or a donation to their class graduation, and each participant was entered into a draw
for a campus bookstore gift card. After screening for completeness and attentiveness using embedded
'catch' items, responses from 84 participants were analyzed. Submissions from the remaining nine
participants were excluded due to illegible writing, failure to report at least 3 of the 4 catch items
correctly, or incomplete reports. Thirty-six participants identified as male and 48 as female.

2
    This is similar to the prompt used by eBird, a popular data crowdsourcing platform (www.ebird.org).
We randomly assigned participants to three groups: (1) explicitly trained, (2) implicitly trained, and (3)
untrained. The explicitly trained group members were taught the classification rule introduced above
for identifying the primary entities as tyrans or nontyrans. To increase their familiarity with the task,
participants in the explicitly trained group were also shown five sample tyrans, asked if they were
tyrans, and given feedback on why these entities qualified as tyrans. We only showed participants
images of tyrans because there are unlimited ways the attributes of a primary entity may violate the
classification rules (here, we also follow [31], [32]). We briefed participants in the implicitly trained
group on the task they would perform and showed them the same five target stimuli used to teach the
explicitly trained group, one at a time, to allow them to infer classification criteria on their own.
Participants were allowed to study each image; however, we did not provide explicit rules to members
of this group, nor did we give them feedback on their ability to determine whether an entity was a tyran. Members
of the untrained group were not shown any sample images. However, like those of the other groups,
they were informed that we were interested in examining how people report information.

3.3.    Measures
    We developed a coding scheme that accounts for attributes of both the primary and secondary
entities reported by participants. Two of the authors coded the first ten reports to establish consensus
and conformance with the coding scheme. The first author coded the remaining reports, while the
second author reviewed the coded data at different stages of the coding process. The variables coded
for are presented in Table 1.

Table 1.
Variables Coded in the Contributed Data
 Codes for Attribute Types                      Description
 Behavior Attribute (Primary Entity)            The number of attributes describing the behavior of the primary entity
 Mutual Attribute (Primary Entity)              The number of primary-entity mutual attributes (attributes that describe an interaction between the primary entity and secondary entities)
 Diagnostic Attribute (Primary Entity)          The number of attributes intrinsic to the primary entity that can be used to identify it
 Non-diagnostic Attribute (Primary Entity)      The number of attributes intrinsic to the primary entity that cannot be used to identify it
 Secondary Entity                               The number of secondary entities reported
 Diagnostic Attribute (Secondary Entity)        The number of attributes that can be used to identify the secondary entity
 Mutual Attribute (Secondary Entity)            The number of attributes describing an interaction between the secondary entity and other entities
 Behavior Attribute (Secondary Entity)          The number of attributes describing the behavior of the secondary entity

   Using the coded data, we analyzed each image's reports to determine their attribute composition.
Eight classes of attributes are present in different proportions in the dataset. Our goal is to understand
how likely these attribute classes are to be classified correctly by a classification-based machine-
learning algorithm.
3.4.    Manipulation Check

    Before testing for imbalance, we confirmed that the trained contributors exhibited selective attention
to the primary entity and its attributes more than untrained contributors. Since secondary entities in the
images were familiar objects (such as birds and fences), we examined the degree to which each group
reported their presence as evidence of a differing degree of selective attention to primary entities.
Applying one-way ANOVA to the data of the 13 images that included secondary entities, we found that
the untrained contributors reported more secondary entities than did trained contributors. The implicitly
trained group reported more secondary entities than did the explicitly trained group. The results indicate
that those we expected to show more selective attention (i.e., the explicitly trained group) did so by
reporting the fewest secondary entities (Table 2). This shows that our training was effective and that
selective attention occurred to different degrees across the groups.

 Table 2.
 Differences in the Reporting of Secondary Entities
       A B mean(A) mean(B)               Mean Diff.                  Std. Err           T       p-value
       E I        1.036        1.544        −0.508                      0.109         −4.674    0.001
       E U        1.036        2.289        −1.253                      0.109         −11.521   0.001
       I U        1.544        2.289        −0.745                      0.109         −6.847    0.001
        E=Explicitly trained group, I= Implicitly trained group, U= Untrained group
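
   For illustration, the sketch below reproduces the shape of this analysis: a one-way ANOVA across the
three groups followed by pairwise comparisons, using SciPy. The per-participant counts are made-up
placeholders, and the pairwise t-tests are a stand-in for the post-hoc procedure behind Table 2, which the
paper does not specify.

    # Illustrative manipulation-check analysis on hypothetical counts of
    # secondary entities reported per participant (not the study's data).
    from scipy import stats

    explicit  = [1, 1, 2, 0, 1, 1]   # explicitly trained group
    implicit  = [2, 1, 2, 1, 2, 1]   # implicitly trained group
    untrained = [2, 3, 2, 3, 2, 2]   # untrained (control) group

    # Omnibus one-way ANOVA across the three groups
    f_stat, p_value = stats.f_oneway(explicit, implicit, untrained)
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

    # Pairwise follow-up comparisons, as laid out in Table 2
    pairs = {"E vs I": (explicit, implicit),
             "E vs U": (explicit, untrained),
             "I vs U": (implicit, untrained)}
    for label, (a, b) in pairs.items():
        t_stat, p = stats.ttest_ind(a, b)
        print(f"{label}: T = {t_stat:.2f}, p = {p:.4f}")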


3.5.    Comparing Imbalance in Attribute Classes Between Datasets

    We used only images containing secondary entities in our analysis. This is because some of the
attributes coded for included Behavioral and Mutual Attributes for both the primary and secondary
entities. Images with other entities provided a reference frame for contributors to report behavior and
attribute types that involve more than one entity. We assume a setting in which mutual and behavioral
attributes can give more insight into the observed entity than intrinsic attributes alone. To
compare the level of imbalance in the datasets collected by the three groups of contributors, we used
the Shannon diversity index (H), a mathematical measure of variability that can be used to compare
entities in a specific space. Two different aspects contribute to the measurement of diversity: richness
and evenness [30]. Richness is the total number of unique classes of a thing (attribute classes) in the
data, and evenness is the distribution of the number of instances (attributes) for the available classes in
the dataset.
     The Shannon diversity index is given by the formula below:

   H = −∑_{i=1}^{s} [p_i × ln(p_i)]

   where p_i is the proportion of the total sample represented by class i (the number of instances
belonging to class i divided by the total number of instances), ln is the natural logarithm, ∑ denotes the
sum over classes, and s is the number of classes.
    Although H is sometimes used as an indicator of imbalance, it is most sensitive to the number of
classes in an observation, so it is usually biased towards measuring class richness. Evenness (EH) is a
better measure of imbalance because it focuses on the distribution of instances (attributes) across the
available classes, regardless of how many classes there are. EH is measured
as a number between 0 and 1, with 1 denoting perfect evenness:
    • S = number of attribute types (class richness)
    • Hmax = ln(S) = maximum diversity possible
    • EH = evenness = H / Hmax
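
   As an illustration, the sketch below computes H and EH for a single image's reports from per-class
attribute counts. The counts and the helper function are assumptions made for the example, not the
study's data.

    # Shannon diversity (H) and equitability/evenness (E_H) for one image's reports.
    import math

    def shannon_evenness(counts):
        """Return (H, E_H) given the number of instances reported per attribute class."""
        present = [c for c in counts if c > 0]            # classes actually observed
        total = sum(present)
        proportions = [c / total for c in present]        # p_i for each class
        h = -sum(p * math.log(p) for p in proportions)    # H = -sum p_i ln(p_i)
        h_max = math.log(len(present))                    # H_max = ln(S)
        e_h = h / h_max if h_max > 0 else 1.0             # E_H = H / H_max
        return h, e_h

    # Hypothetical counts for the eight attribute classes of Table 1
    counts = [4, 1, 6, 2, 3, 1, 0, 1]
    h, e_h = shannon_evenness(counts)
    print(f"H = {h:.3f}, E_H = {e_h:.3f}")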
   In our experiment, each group observed and reported on the same thirteen images. We, therefore,
compared the evenness in the number of attributes reported for each attribute class we expected. We
calculated values of Shannon's equitability index (EH) for each group across each of the 13 images. Using
ANOVA, we compared the EH values of the groups across all the images. There was no significant difference
between the untrained and implicitly trained groups (U and I), with a p-value of 0.7989. However, there
was a significant difference between the EH values for explicitly trained contributors and those of the other groups.
The explicitly trained contributors reported more imbalanced data than the implicitly trained and
untrained contributors (see Table 3).

Table 3.
Comparison of the level of imbalance in the datasets
              A    B        Mean A            Mean B       Mean Diff              F        p-value
              E    I        0.594             0.755          -0.161               17.922   0.000
              E    U        0.594             0.748          -0.154               15.914   0.000
              I    U        0.755             0.748          0.007                0.066    0.799
          E=Explicitly trained group, I= Implicitly trained group, U= Untrained group

    Even though the Shannon equitability index gives insight into which dataset should be more
balanced in theory, there is little empirical evidence that a higher or lower Shannon equitability
index translates into more or less balance in real classification situations. There is therefore a need to
examine the effect of the Shannon index score on the actual classification of data. Given that we have
different datasets about the same observations from the same number of people and under the same
conditions, we have a unique opportunity to better understand how evenness translates into classification
quality.

3.6.    Relating Shannon Index to Classification Quality

    We applied unsupervised classification to the data from the three groups. We clustered the data
reported for each image using affinity propagation, an algorithm that automatically determines the
number of clusters suitable for a dataset by optimizing the fit function [34]. Using the same model, the
dataset from the explicitly trained group generated 122 clusters, while the dataset from the implicitly
trained group generated 166 clusters. The dataset from the untrained group generated 171 clusters.
Again, since these datasets were provided under the same conditions and within the same time limits,
the key differentiating factor in determining the number of clusters formed was the level of training
provided to contributors. The higher number of clusters may indicate a higher volume of data supplied
by the contributors in the untrained group or a higher capacity for the model to cluster the data for the
untrained group.
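
    As a sketch of this clustering step, the snippet below applies scikit-learn's affinity propagation to
reports encoded as vectors of the eight attribute-class counts. The encoding, parameter settings, and
example matrix are assumptions for illustration; the authors' implementation [34] may differ.

    # Clustering per-image reports with affinity propagation (illustrative only).
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    # Each row is one contributor's report for an image, encoded as counts
    # of the eight attribute classes in Table 1 (hypothetical values).
    reports = np.array([
        [4, 1, 6, 2, 3, 1, 0, 1],
        [3, 0, 5, 1, 2, 1, 0, 0],
        [5, 2, 6, 3, 3, 2, 1, 1],
        [1, 0, 4, 0, 1, 0, 0, 0],
        [2, 0, 3, 1, 1, 0, 0, 0],
        [4, 2, 5, 2, 2, 1, 1, 1],
    ])

    # Affinity propagation chooses the number of clusters itself; higher damping
    # helps convergence on small samples.
    model = AffinityPropagation(damping=0.9, random_state=0).fit(reports)
    n_clusters = len(model.cluster_centers_indices_)
    print(f"{n_clusters} clusters; labels: {model.labels_}")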
    To better understand the effect of training on the quality of classification, we calculated the purity
of the clusters formed from each dataset. Purity evaluation estimates the homogeneity of members of a
cluster using human judgment and, therefore, estimates the degree of imbalance in a dataset [35]. To
compute purity scores, we calculate the percentage of objects that are correctly and wrongly classified
in a cluster based on "ground truths" or knowledge we have about the observed entities. Two people
examined and coded attributes in the clusters for the number of right and wrong class members in each
cluster. The coders included one of the authors and a student who was not briefed on the purpose or
context of the study but was presented with the clusters and asked to judge the suitability of the members
of each cluster for class membership. The two coders achieved an interrater reliability score (Cohen's
Kappa) of 0.86 without any need for discussion or resolution. Then we applied the
purity formula:

   Purity = (1/N) ∑_{i=1}^{k} max_j |c_i ∩ t_j|

   where N is the number of objects (data points), k is the number of clusters, c_i is a cluster in C, and
t_j is the class that has the maximum count in cluster c_i. We found that the clusters formed from data
from the untrained group were purer than clusters formed from the data from the trained groups. Also,
the purity of clusters from the implicitly trained group did not differ significantly from that of the explicitly trained group.
Table 4 shows the purity scores compared using ANOVA.

Table 4:
  Comparing the Purity of Classification of Resulting Datasets
       A    B        mean(A)         mean(B)        diff                     SE      T        p-value
       E    I        0.789           0.830          -0.040                   0.025   -1.638   0.232
       E    U        0.789           0.895          -0.106                   0.025   -4.265   0.001
       I    U        0.830           0.895          -0.065                   0.025   -2.627   0.025
     E=Explicitly trained group, I= Implicitly trained group, U= Untrained group

   From the purity evaluation, we found that attribute instances that would have formed minority
classes are subsumed into majority classes more often in datasets provided by trained contributors than
was the case for the dataset provided by untrained contributors.
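
    For concreteness, the sketch below computes the purity measure defined above from hypothetical
cluster assignments and coder-judged class labels. The labels, values, and helper function are
illustrative, not the study's data.

    # Purity = (1/N) * sum over clusters of the size of the majority class in each cluster.
    from collections import Counter

    def purity(cluster_labels, true_labels):
        """Compute cluster purity given cluster assignments and coder-judged classes."""
        clusters = {}
        for cluster, true_class in zip(cluster_labels, true_labels):
            clusters.setdefault(cluster, []).append(true_class)
        majority_total = sum(Counter(members).most_common(1)[0][1]
                             for members in clusters.values())
        return majority_total / len(true_labels)

    cluster_labels = [0, 0, 0, 1, 1, 2, 2, 2]                    # from the clustering step
    true_labels    = ["a", "a", "b", "b", "b", "c", "c", "a"]    # coder judgments
    print(purity(cluster_labels, true_labels))                   # 0.75 for this toy example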

4. Discussion

    Decisions made in the design of data crowdsourcing systems can affect the balance of data. We
examine one such design decision – training contributors in a data collection task. Training contributors
can induce an imbalance in crowdsourced datasets because it leads contributors to focus mainly on
relevant features they have learned. Trained contributors will selectively attend to attributes to which
they have been exposed and have learned to prioritize in training. We see this in the results of our
manipulation check in Table 2. Implicitly trained contributors reported more secondary entities than
explicitly trained contributors, and both trained groups show more selective attention than the untrained
group.
    Training contributors has consequences for the balance of crowdsourced data reported. Untrained
and implicitly trained contributors have a higher propensity to report balanced data than explicitly
trained contributors. Training contributors with explicit rules rather than allowing them to learn on their
own therefore has the most adverse effect on the balance of data. Data from the explicitly trained
contributors also resulted in the most impure classifications, potentially limiting the insights that can be
gathered from data.
    Our focus on the effect of training on imbalance differs from the majority of research on data imbalance,
which seeks to address imbalance after data has been collected. Our approach is preventative, seeking to
proactively design crowdsourcing systems that collect balanced data. A key value of this approach is
that it emphasizes that imbalance is not solely inherent in phenomena: it can also be caused by the design
of data-crowdsourcing systems, and some imbalance, particularly design-induced imbalance, is
preventable.
    Nonetheless, there are limits to the generalizability of our results. Indexes such as the Shannon
Equitability Index give us an insight into how a design choice can affect the balance of data, but they
may not accurately predict how a machine-learning algorithm will process data. Also, our hypotheses
have not been tested in other conditions beyond this experiment.
    Beyond these limits, emphasizing that imbalance can be artificial should prompt further research
into preventing data imbalance where possible. Generally, future research on identifying how to design
crowdsourcing systems so that they do not inadvertently promote the collection of imbalanced data
would benefit data consumers. More specific to training, it would be helpful to understand how to
mitigate the impact of selective attention on the balance of data when contributors need to be trained.
Also, it would be interesting to know if contributors could be trained to report balanced data.

5. Conclusion

  Imbalance is usually considered an inherent consequence of collecting data about things in the real
world and has been mainly addressed after data has been collected. In this paper, we showed that
imbalance could also be caused artificially by the design choices made for data-crowdsourcing systems.
The paper emphasizes the effect of training on the balance of data and the resulting usefulness of
collected data. Design decisions, such as the choice to train contributors, can have negative
consequences for the insightfulness of data because they encourage cognitive biases that limit the data
contributed. Data requirements can change at several stages of a decision-making process – during
collection or even after the initial analytics results come in. Thus, imbalanced data may support present
known uses of data but fail to support emergent uses. How we design our data-crowdsourcing systems
should therefore be determined by the priority we place on the insightfulness of our crowdsourced data,
now and in the future.

6. References

[1] D. C. Edelman, "Branding in the digital age," Harvard business review, vol. 88, no. 12, pp. 62–
     69, 2010.
[2] S. Ogunseye, S. X. Komiak, and P. Komiak, "The Impact of Senior-Friendliness Guidelines on
     Seniors' Use of Personal Health Records," in 2015 International Conference on Healthcare
     Informatics, 2015, pp. 597–602.
[3] S. Ogunseye and J. Parsons, "Designing for Information Quality in the Era of Repurposable
     Crowdsourced User-Generated Content," in International Conference on Advanced Information
     Systems Engineering, 2018, pp. 180–185.
[4] W. A. Günther, M. H. R. Mehrizi, M. Huysman, and F. Feldberg, "Debating big data: A literature
     review on realizing value from big data," The Journal of Strategic Information Systems, 2017.
[5] B. J. Hecht and M. Stephens, "A Tale of Cities: Urban Biases in Volunteered Geographic
     Information.," ICWSM, vol. 14, no. 14, pp. 197–205, 2014.
[6] J.-X. Liu, Y.-D. Ji, W.-F. Lv, and K. Xu, "Budget-aware dynamic incentive mechanism in spatial
     crowdsourcing," Journal of Computer Science and Technology, vol. 32, no. 5, pp. 890–904,
     2017.
[7] Q. Xu, J. Xiong, Q. Huang, and Y. Yao, "Robust evaluation for quality of experience in
     crowdsourcing," in Proceedings of the 21st ACM international conference on Multimedia, 2013,
     pp. 43–52.
[8] R. Caruana, "Learning from imbalanced data: Rank metrics and extra tasks," in Proc. Am. Assoc.
     for Artificial Intelligence (AAAI) Conf, 2000, pp. 51–57.
[9] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Editorial: special issue on learning from imbalanced
     data sets," SIGKDD Explor. Newsl., vol. 6, no. 1, pp. 1–6, Jun. 2004, DOI:
     10.1145/1007730.1007733.
[10] J. M. Johnson and T. M. Khoshgoftaar, "Survey on deep learning with class imbalance," Journal
     of Big Data, vol. 6, no. 1, p. 27, 2019.
[11] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Prog
     Artif Intell, vol. 5, no. 4, pp. 221–232, Nov. 2016, DOI: 10.1007/s13748-016-0094-0.
[12] T. Oommen, L. G. Baise, and R. M. Vogel, "Sampling bias and class imbalance in maximum-
     likelihood logistic regression," Mathematical Geosciences, vol. 43, no. 1, pp. 99–120, 2011.
[13] J. M. Snyder Jr, O. Folke, and S. Hirano, "Partisan imbalance in regression discontinuity studies
     based on electoral thresholds," Political Science Research and Methods, vol. 3, no. 2, p. 169,
     2015.
[14] F. Thabtah, S. Hammoud, F. Kamalov, and A. Gonsalves, "Data imbalance in classification:
     Experimental evaluation," Information Sciences, vol. 513, pp. 429–441, 2020.
[15] M. Buda, A. Maki, and M. A. Mazurowski, "A systematic study of the class imbalance problem
     in convolutional neural networks," Neural Networks, vol. 106, pp. 249–259, 2018.
[16] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of
     imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, Dec. 2007, DOI:
     10.1016/j.patcog.2007.04.009.
[17] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on knowledge and
     data engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[18] T. W. Malone, R. Laubacher, and C. Dellarocas, "The collective intelligence genome," MIT
     Sloan Management Review, vol. 51, no. 3, p. 21, 2010.
[19] N. V. Chawla, "Data mining for imbalanced datasets: An overview," Data mining and knowledge
     discovery handbook, pp. 875–886, 2009.
[20] D. D. Margineantu and T. G. Dietterich, "Bootstrap methods for the cost-sensitive evaluation of
     classifiers," 2000.
[21] G. Murphy and C. M. Greene, "Perceptual Load Affects Eyewitness Accuracy and Susceptibility
     to Leading Questions," Front. Psychol., vol. 7, 2016, DOI: 10.3389/fpsyg.2016.01322.
[22] B. Colner and B. Rehder, "A new theory of classification and feature inference learning: An
     exemplar fragment model," in Proceedings of the 31st Annual Conference of the Cognitive
     Science Society, 2009, pp. 371–376. Accessed: Jan. 12, 2017.
[23] A. B. Hoffman and B. Rehder, "The costs of supervised classification: The effect of learning task
     on conceptual flexibility.," Journal of Experimental Psychology: General, vol. 139, no. 2, p. 319,
     2010.
[24] S. Ogunseye, J. Parsons, and R. Lukyanenko, "Do Crowds Go Stale? Exploring the Effects of
     Crowd Reuse on Data Diversity," Workshop on Information Technology and Systems, Seoul,
     Korea 2017.
[25] O. S. Ogunseye, "Understanding information diversity in the era of repurposable crowdsourced
     data," Ph.D. Thesis, Memorial University of Newfoundland, 2020.
[26] A. Wiggins, G. Newman, R. D. Stevenson, and K. Crowston, "Mechanisms for Data Quality and
     Validation in Citizen Science," in 2011 IEEE Seventh International Conference on e-Science
     Workshops, Dec. 2011, pp. 14–19. DOI: 10.1109/eScienceW.2011.27.
[27] U. Gadiraju, B. Fetahu, and R. Kawase, "Training Workers for Improving Performance in
     Crowdsourcing Microtasks," in Design for Teaching and Learning in a Networked World, vol.
     9307, G. Conole, T. Klobučar, C. Rensing, J. Konert, and E. Lavoué, Eds. Cham: Springer
     International Publishing, 2015, pp. 100–114. DOI: 10.1007/978-3-319-24258-3_8.
[28] G. Newman et al., "Teaching citizen science skills online: Implications for invasive species
     training programs," Appl. Environ. Educ. Commun., vol. 9, no. 4, pp. 276–286, 2010, DOI:
     10.1080/1533015X.2010.530896.
[29] F. I. Paez Wulff, "Recruitment, Training, and Social Dynamics in Geo-Crowdsourcing for
     Accessibility," 2014. Accessed: May 04, 2017. [Online]. Available:
     http://digilib.gmu.edu/jspui/handle/1920/9042
[30] R. Lukyanenko, J. Parsons, Y. F. Wiersma, and M. Maddah, "Expecting the unexpected: effects
     of data collection design choices on the quality of crowdsourced user-generated content," MIS
     Quarterly, vol. 43, no. 2, pp. 623–647, 2019.
[31] H. Kloos and V. M. Sloutsky, "What's behind different kinds of kinds: Effects of statistical
     density on learning and representation of categories.," Journal of Experimental Psychology:
     General, vol. 137, no. 1, p. 52, 2008.
[32] S. Ogunseye, J. Parsons, and R. Lukyanenko, "To Train or Not to Train? How Training Affects
     the Diversity of Crowdsourced Data," International Conference on Information Systems, India,
     2020.
[33] A. J. Daly, J. M. Baetens, and B. De Baets, "Ecological diversity: Measuring the unmeasurable,"
     Mathematics, vol. 6, no. 7, 2018, DOI: 10.3390/math6070119.
[34] D. Dueck, Affinity propagation: clustering data by passing messages. Citeseer, 2009.
[35] W. Prachuabsupakij and N. Soonthornphisaj, "Cluster-based sampling of multiclass imbalanced
     data," Intelligent Data Analysis, vol. 18, no. 6, pp. 1109–1135, 2014.