<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Crowdsorting Timed Comments about Music: Foundations for a New Crowdsourcing Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karthik Yadati</string-name>
          <email>n.k.yadati@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavala S.N. Chandrasekaran Ayyanathan</string-name>
          <email>p.s.n.chandrasekaranayyanathan@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <email>m.a.larson@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Delft University of Technology</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper provides an overview of the Crowdsorting Timed Comments about Music Task, a new task in the area of crowdsourcing for social media offered by the MediaEval 2014 Multimedia Benchmark. The data for this task is a set of Electronic Dance Music (EDM) tracks collected from the online music sharing platform SoundCloud. Given a set of noisy labels for segments of EDM tracks that were collected on Amazon Mechanical Turk, the task is to predict a single `correct' label. The labels indicate whether or not a `drop' occurs in the particular music segment. The larger aim of this task is to contribute to the development of hybrid human/conventional computation techniques to generate accurate labels for social multimedia content. For this reason, participants are also encouraged to predict labels by combining input from the crowd (i.e., human computation) with automatic computation (i.e., processing techniques applied to textual metadata and/or audio signal analysis).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Multimedia content in the form of audio clips, videos and
images is abundantly available on the internet, and supervised
machine learning algorithms that analyze such multimedia
content require accurate labels. Crowdsourcing platforms
such as Amazon Mechanical Turk (AMT) have created
possibilities for labeling multimedia content by reaching out to
a wider audience with different levels of expertise. Such
platforms have simplified the process of obtaining labels for
multimedia content from human annotators, which,
previously, was an expensive and time-consuming task.
Labels gathered from these platforms are noisy, since not all
workers are dedicated to the task, the task may be complex, or
workers may diverge in how they interpret and apply the labels.
The quality and other characteristics of
the labeled data directly affect how useful these labels are
in applications. For example, if labels are used as input to
machine learning algorithms, their quality will have a strong
impact on performance. For this reason, it is imperative that
efficient algorithms are developed that can generate reliable
labels, given multiple noisy labels from the crowd.
Generating a single, useful, `correct' label from multiple noisy labels
is in itself a challenging task requiring significant research.</p>
      <p>
        Simple aggregation algorithms, such as majority
voting, can refine the noisy labels to a certain extent [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The inherent limitation of simple aggregation algorithms,
however, is that they require several labels per instance for
acceptable quality, and for this reason incur high costs. A
way to address the cost is to take the performance of
individual workers into account. For example, Ipeirotis et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
developed a quality management technique that involves
assigning scalar values to workers by taking into account the
quality of the workers' responses. Such a score can then be
used as a weight on individual labels to obtain a more accurate
estimate of the aggregated label. In general, however, it
remains difficult to outperform a majority-vote baseline, as
demonstrated by the MediaEval 2013 Crowdsourcing Task,
which was devoted to social images related to fashion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
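      <p>As a rough illustration of the aggregation strategies discussed above, the following Python sketch (over purely hypothetical label data, not part of the task release) contrasts plain majority voting with a vote in which each worker's label is weighted by a scalar quality score, in the spirit of the approach of Ipeirotis et al. [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
      <preformat>
from collections import Counter, defaultdict

def majority_vote(labels):
    """Return the most frequent label among the noisy labels for one segment."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(worker_labels, worker_quality):
    """Aggregate (worker_id, label) pairs, scaling each vote by a
    per-worker quality score in [0, 1]."""
    scores = defaultdict(float)
    for worker, label in worker_labels:
        scores[label] += worker_quality.get(worker, 0.5)  # 0.5 for unknown workers
    return max(scores, key=scores.get)

# Hypothetical example: three workers label one 15-second segment.
print(majority_vote([1, 3, 3]))                            # -> 3
print(weighted_vote([("w1", 1), ("w2", 3), ("w3", 3)],
                    {"w1": 0.9, "w2": 0.3, "w3": 0.4}))    # -> 1: the reliable worker outweighs the majority
      </preformat>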
      <p>
        Conventional computing approaches (for example, signal
analysis) can be used to generate labels, and these can be
combined with labels contributed by human annotators to
achieve a better overall result. Such a combination is
interesting in cases in which labels are to be used directly in an
application. Investigation of such hybrid approaches that
intelligently and effectively combine human input with
conventional computation is a secondary area of focus for the
Crowdsourcing Task. A further area related to hybrid
methods is the Active Learning paradigm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where the algorithm
interactively queries for labels for specific data points, which
can then be obtained through crowdsourcing.
      </p>
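      <p>As a minimal sketch of such a hybrid combination, assume (hypothetically) that an automatic audio classifier outputs a probability for each of the three labels of a segment; the crowd votes can then be turned into a distribution and mixed with the classifier output before taking the most probable label. The mixing weight and the classifier are placeholders for illustration only, not part of the task definition.</p>
      <preformat>
from collections import Counter

CLASSES = (1, 2, 3)  # complete drop, partial drop, no drop

def crowd_distribution(labels):
    """Turn a list of noisy crowd labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [counts.get(c, 0) / total for c in CLASSES]

def hybrid_label(crowd_labels, classifier_probs, alpha=0.6):
    """Mix crowd and classifier evidence; alpha is the weight on the crowd."""
    crowd_probs = crowd_distribution(crowd_labels)
    mixed = [alpha * c + (1 - alpha) * a
             for c, a in zip(crowd_probs, classifier_probs)]
    return CLASSES[mixed.index(max(mixed))]

# Hypothetical segment: the crowd is split, the classifier leans towards label 2.
print(hybrid_label([1, 2, 3], classifier_probs=[0.2, 0.7, 0.1]))  # -> 2
      </preformat>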
      <p>The remainder of this paper presents an overview of the
task and describes the dataset. We then explain the
procedure used to collect ground-truth labels and the evaluation metric
used for the task.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK OVERVIEW</title>
      <p>
        The basic objective of the task is to predict labels for all
the 15-second music segments in the dataset. The music
and the associated information have been retrieved from
the online music sharing platform SoundCloud (www.soundcloud.com). The music
segments are represented as triplets: track identifier,
start-time and end-time of the 15-second segment. The labels
reflect whether or not the segment contains a drop. A drop
is a characteristic music event in Electronic Dance Music
(EDM). Within the EDM community, a drop is described as
a moment of emotional release where people start to dance
like crazy, and it can be more formally characterized as a
building up of tension, which is followed by the re-introduction
of the full bassline [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For each 15-second segment, the
participants predict one of three labels: Label 1 (segment
contains a complete drop), Label 2 (segment contains a
partial drop), and Label 3 (segment does not contain a drop).
      </p>
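      <p>A segment and its label can thus be represented by a small record, as in the sketch below; the field and class names are illustrative only and do not reflect the release format of the dataset.</p>
      <preformat>
from dataclasses import dataclass
from enum import IntEnum

class DropLabel(IntEnum):
    COMPLETE_DROP = 1  # segment contains a complete drop
    PARTIAL_DROP = 2   # segment contains a partial drop
    NO_DROP = 3        # segment does not contain a drop

@dataclass
class Segment:
    track_id: str  # SoundCloud track identifier
    start: float   # start time of the 15-second segment, in seconds
    end: float     # end time of the 15-second segment, in seconds

# Hypothetical prediction for one segment.
segment = Segment(track_id="12345678", start=120.0, end=135.0)
prediction = DropLabel.PARTIAL_DROP
      </preformat>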
      <p>The participants have three sources of information which
they can exploit in order to infer the correct label of a music
segment: a) a set of `basic human labels', which are labels
collected from crowdworkers using an AMT microtask with
basic quality control, b) the metadata associated with the
music tracks (such as title, description, comments), c) the
audio in the form of mp3 files. Participants were encouraged
to use audio signal processing techniques to gain more
insight into the music segments. They were also allowed to
collect labels by designing their own microtasks (including
the quality control mechanism) and running them on a
crowdsourcing platform of their choice.</p>
    </sec>
    <sec id="sec-3">
      <title>3. TASK DATASET</title>
      <p>The dataset for the MediaEval 2014 Crowdsourcing Task
consists of music tracks collected from online music sharing
platform SoundCloud. The tracks were uploaded to
SoundCloud with a Creative Commons Attribution license, which
enables the dataset to be used for research purposes. The
music tracks and the associated metadata were crawled
using the SoundCloud API. An interesting feature of
SoundCloud is that the user comments are associated with a
timestamp and they refer to a particular time-point in the track.
We exploited this feature to create a list of short
15-second segments that might contain a drop. We collected
all the timed comments which had the word `drop' in them
and extracted a 15-second segment centered at the
timestamp of the comment. The dataset comprises 382 tracks
belonging to various sub-genres of EDM (e.g., dubstep,
electro) and their associated metadata in the form of XML files.
The dataset also contains two sets of human generated
labels. These labels are given to short 15-second segments
from the music tracks based on the occurrence of a drop.
The dataset contains a total of 591 15-second music
segments. The first set of labels (referred to as `basic human
labels' or `low-fidelity ground truth') has been generated by
AMT workers under the application of basic quality control.
The second set of labels (referred to as the `ground truth'
or `high-fidelity ground truth') contains more reliable labels
that were created by trusted annotators. The task data was
released in a single round and did not have a separate
development set.</p>
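      <p>To illustrate the segment construction step (independently of the actual crawler), the sketch below centers a 15-second window on the timestamp of each timed comment that mentions the word `drop'. The comment representation is hypothetical and does not correspond to the SoundCloud API response format.</p>
      <preformat>
SEGMENT_LENGTH = 15.0  # seconds

def drop_segments(timed_comments, track_duration):
    """Given (timestamp_in_seconds, text) pairs for one track, return
    15-second (start, end) segments centered on comments mentioning `drop'."""
    segments = []
    for timestamp, text in timed_comments:
        if "drop" not in text.lower():
            continue
        start = max(0.0, timestamp - SEGMENT_LENGTH / 2)
        end = min(track_duration, start + SEGMENT_LENGTH)
        segments.append((start, end))
    return segments

# Hypothetical timed comments for a 300-second track.
comments = [(95.2, "that drop is insane"), (200.0, "nice vocals here")]
print(drop_segments(comments, track_duration=300.0))  # -> [(87.7, 102.7)]
      </preformat>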
    </sec>
    <sec id="sec-4">
      <title>4. ‘BASIC HUMAN LABELS’</title>
      <p>Each 15-second music segment in the Crowdsourcing Task
data is associated with three labels collected from three
crowdworkers. The crowdworkers listen to the segments
in order to judge whether the segment should be
labeled with Label 1, 2, or 3. Since we required the workers
to be familiar with EDM, we conducted a recruitment task
in which the workers listened to two EDM tracks and identified
the drop moments. Additional questions to judge their
familiarity with EDM were asked. Based on their answers,
they received a qualification that allowed them to carry out
the labeling microtask. Crowdworkers were required to answer
all questions in the labeling microtask, and their answers had
to be consistent. This simple quality control
mechanism was designed so that the `basic human labels' produced
using this microtask would have noise levels characteristic of
human annotations generated without a sophisticated
mechanism for quality control. Out of the 591 assignments on
AMT, there was no agreement between the workers for 61
assignments, partial agreement (2 out of 3 workers gave the
same response) for 313 assignments, and complete agreement
(all 3 workers agreed on the same response) for 218
assignments.</p>
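      <p>The agreement levels reported above (no, partial, and complete agreement among three workers) follow directly from the multiset of labels collected per segment, as in this small sketch over hypothetical responses.</p>
      <preformat>
from collections import Counter

def agreement_level(labels):
    """Classify agreement among exactly three worker labels for one segment."""
    top_count = Counter(labels).most_common(1)[0][1]
    if top_count == 3:
        return "complete"
    if top_count == 2:
        return "partial"
    return "none"

# Hypothetical worker responses for three segments.
print(agreement_level([1, 1, 1]))  # -> complete
print(agreement_level([1, 1, 3]))  # -> partial
print(agreement_level([1, 2, 3]))  # -> none
      </preformat>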
    </sec>
    <sec id="sec-5">
      <title>5. GROUND TRUTH AND EVALUATION</title>
      <p>The ground-truth labels were created by a panel of eight
experts. Each music segment was labeled by three different
experts, and a single label was obtained through majority
vote. These trusted annotations served as ground truth for
evaluating the task. Out of the expert labels for the 591 music
segments, there was no agreement between
the experts for 33 segments, partial agreement (2 out of
3 experts gave the same response) for 380 segments, and
complete agreement (all 3 experts agreed on the same
response) for 178 segments. Since we are dealing with a
multi-class classification problem (three classes), we chose
the weighted F-measure as the official evaluation metric.
It is the average of the per-class F-measures, weighted by
the number of true examples of each class.</p>
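      <p>The official metric corresponds to the `weighted' averaging mode of the F-measure in scikit-learn, so predicted labels could, for instance, be scored against the high-fidelity ground truth as in the sketch below (the label lists are hypothetical).</p>
      <preformat>
from sklearn.metrics import f1_score

# Hypothetical ground-truth and predicted labels for a handful of segments.
y_true = [1, 3, 3, 2, 1, 3]
y_pred = [1, 3, 2, 2, 3, 3]

# Per-class F-measures averaged with weights equal to the number of
# true examples of each class, i.e., the task's official metric.
print(f1_score(y_true, y_pred, average="weighted"))
      </preformat>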
    </sec>
    <sec id="sec-6">
      <title>6. OUTLOOK</title>
      <p>The Crowdsourcing for Multimedia Task ran in
MediaEval 2014 in its first year as a so-called `Brave New Task'.
The results and interest of participants in the task will
inform the development of possible future tasks. In particular,
we are interested in understanding how to collect `basic
human labels' in the way most useful for experimentation and
how best to create high-fidelity ground truth against which
predicted labels can be evaluated. We hope that the
experiences this year will help us to develop better methods for
studying hybrid human/conventional computation.</p>
    </sec>
    <sec id="sec-7">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>This task is partly supported by funding from the
European Commission's 7th Framework Programme under grant
agreement No 287704 (CUbRIK), No 610594 (CrowdRec) and
No 601166 (PHENICX).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Butler</surname>
          </string-name>
          .
          <article-title>Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music</article-title>
          .
          <article-title>Profiles in popular music</article-title>
          . Indiana University Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Provost</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Quality management on Amazon Mechanical Turk</article-title>
          .
          <source>In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP '10</source>
          , pages
          <fpage>64</fpage>
          –
          <lpage>67</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Loni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Georgescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morchid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Getting by with a little help from the crowd: Practical approaches to social image labeling</article-title>
          .
          <source>In CrowdMM 2014: ACM Multimedia Workshop on Crowdsourcing for Multimedia</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowak</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rüger</surname>
          </string-name>
          .
          <article-title>How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation</article-title>
          .
          <source>In Proceedings of the international conference on Multimedia information retrieval</source>
          ,
          <source>MIR '10</source>
          , pages
          <fpage>557</fpage>
          –
          <lpage>566</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          .
          <article-title>Active learning literature survey</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>