<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detection of Musical Event Drop from Crowdsourced Annotations Using a Noisy Channel Model</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
  <string-name>Naveen Kumar</string-name>
</contrib>
<contrib contrib-type="author">
  <string-name>Shrikanth S. Narayanan</string-name>
</contrib>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper describes the algorithm for our submission to the MediaEval 2014 crowdsourcing challenge. We perform a Maximum Likelihood (ML) estimation of the true label, using only the multiple noisy labels. Each annotator's decision is modeled by a die-toss based on which the annotator changes the true label. We learn parameters of this noisy channel model using the Expectation-Maximization algorithm. We also show that using a smaller number of annotators in the model than the actual number can give better accuracy because there is more data per annotator to estimate the parameters reliably.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The Mediaeval 2014 crowdsourcing challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] involves
multiple noisy annotations for the presence of the musical event
drop in 15-second clips of Electronic Dance Music (EDM).
Each annotator assigns one of 3 class labels depending on the
extent to which the event is present in the clip. For each such
clip, at least 3 unique annotations are available from different
annotators. The total number of unique annotators is 30;
however, the bulk of the annotations are done by a handful
of them (Fig.1).
      </p>
      <p>
        The typical approach to modeling multiple noisy
annotations is to model each of the $M$ annotators as a noisy
channel that distorts the true label $Y$ into a noisy
annotation $\tilde{Y}_k$ for each of the $K$ annotations, $k = 1, \ldots, K$, per
song. This can be done either in a data-independent [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
or a data-dependent fashion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, these methods are not readily applicable
to the problem at hand because of our limited understanding
of good features for the task. The 2014 crowdsourcing
challenge dataset comprises only noisy annotations, and
without any ground truth the process of feature design is difficult.
      </p>
      <p>
        Hence, we use a much simpler model instead, based on [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
which only uses the multiple noisy annotations and models
each annotator as a noisy channel that corrupts the "true
label" ($Y$) by tossing a $B$-faced die, where $B$ is the number
of classes and the die is chosen depending on $Y$.
      </p>
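<p>As an illustration, this die-toss channel can be sketched as follows; the 3-class confusion matrix below is a hypothetical example, not one learned from the challenge data.</p>

```python
import numpy as np

# Hypothetical confusion matrix ("dice") for one annotator:
# row b holds the die faces Pr(noisy label = a | true label = b).
theta_m = np.array([
    [0.8, 0.1, 0.1],   # true class 0
    [0.2, 0.7, 0.1],   # true class 1
    [0.1, 0.2, 0.7],   # true class 2
])

rng = np.random.default_rng(0)

def annotate(true_label: int) -> int:
    """Corrupt a true label by tossing the B-faced die chosen by that label."""
    return int(rng.choice(len(theta_m), p=theta_m[true_label]))

# A reasonably reliable annotator mostly reproduces the true label.
labels = [annotate(1) for _ in range(1000)]
frac_correct = labels.count(1) / len(labels)
```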
    </sec>
    <sec id="sec-2">
<title>2. NOISY CHANNEL MODEL</title>
      <p>Figure 1: Number of annotations per annotator.
Note that most of the annotations are from a few
annotators.</p>
      <p>For each song $i$, the number of annotations $K$ may
vary. In addition, we denote the annotator
id for each annotation $\tilde{Y}_{ik}$ by $A_{ik}$. This information is
provided in our dataset. For each annotator $m$ we denote
the parameters of her noisy channel model by $\theta_m$. As an
example, if we denote $p_{ik} = \Pr(Y_i = k)$ and $q_{ij} = \Pr(\tilde{Y}_i = j)$,
then the $m$-th annotator distorts her label as $q_i = \theta_m p_i$.</p>
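<p>The distortion $q_i = \theta_m p_i$ amounts to marginalizing the true class out of the joint distribution; a small numerical check with a hypothetical confusion matrix:</p>

```python
import numpy as np

# Hypothetical confusion matrix: entry [b, a] = Pr(noisy = a | true = b).
theta_m = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

# True-label distribution p_i for one song.
p = np.array([0.5, 0.3, 0.2])

# Distorted label distribution: q_a = sum_b p_b * Pr(noisy = a | true = b),
# i.e. the matrix-vector product theta_m^T p.
q = theta_m.T @ p
```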
      <p>We treat the true label $Y_i$ as a hidden variable and
perform Expectation-Maximization to estimate it for each
song, learning the model parameters $\theta_m$ at the same
time. Since the annotator ids for each annotation are known
to us, it is straightforward to compute the total data
likelihood. For a dataset $D = \{Y, \tilde{Y}, A\}$ it is shown in Eqn. (1).</p>
      <p>$$\Pr(D; \theta_1 \ldots \theta_M) = \prod_{i=1}^{N} \Pr(Y_i) \prod_{k} \Pr(\tilde{Y}_{ik}, A_{ik} \mid Y_i, \theta_1 \ldots \theta_M) \quad (1)$$</p>
      <p>$$= \prod_{i=1}^{N} \Pr(Y_i) \prod_{k} \theta^{m}_{ab}, \quad \text{if } A_{ik} = m,\ \tilde{Y}_{ik} = a,\ Y_i = b \quad (2)$$
Note that since $Y_i$ is a latent variable, in practice we shall
maximize a lower bound of this likelihood function by
taking an expectation w.r.t. the posterior distribution of the latent
variable. This amounts to replacing the hard assignment $Y_i = b$ by a soft label $p_{ib}$,</p>
      <p>$$E[\Pr(D)] = \prod_{i=1}^{N} \Pr(Y_i) \prod_{k} \prod_{b} \left(\theta^{m}_{ab}\right)^{p_{ib}}, \quad \text{if } A_{ik} = m,\ \tilde{Y}_{ik} = a \quad (3)$$
where $p_{ib}$ is defined as earlier. We compute $p_{ib}$ formally in
the Expectation Step by estimating the posterior distribution of
the true labels given the noisy ones.</p>
      <p>$$\pi_b = \frac{1}{N} \sum_{i=1}^{N} p_{ib} \quad (4)$$</p>
      <p>$$\theta^{m}_{ab} = \Pr(\tilde{Y}_{ik} = a \mid Y_i = b, A_{ik} = m) = \frac{\sum_{i=1}^{N} \sum_{k} p_{ib}\, \mathbf{1}(\tilde{Y}_{ik} = a, A_{ik} = m)}{\sum_{i=1}^{N} \sum_{k} p_{ib}\, \mathbf{1}(A_{ik} = m)} \quad (5)$$
To estimate the parameter $\theta^{m}_{ab}$ we count all the probability mass
for true class $b$ from annotator $m$ when the noisy label
annotated was $a$. This is divided by the probability mass for
annotator $m$ for true class $b$, irrespective of the
annotator's label.</p>
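<p>The count-and-divide update can be sketched as follows; the posteriors and annotations are toy values, and the smoothing constant is an assumed choice rather than one stated in the text.</p>

```python
import numpy as np

# Toy setup (hypothetical data): N songs, B classes, M annotators.
N, B, M = 4, 3, 2
# Soft posteriors p_ib over true labels (rows sum to 1).
p = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.7, 0.2, 0.1],
])
# Noisy annotations as (song i, annotator m, label a) triples.
annotations = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 2), (2, 0, 2), (3, 1, 0)]

# M-step: count-and-divide with Laplace smoothing (eps) to avoid
# dividing by zero for sparsely observed annotators.
eps = 1e-6
num = np.full((M, B, B), eps)      # num[m, b, a]: mass for (true b, noisy a)
for i, m, a in annotations:
    num[m, :, a] += p[i]           # add song i's posterior mass to column a
theta = num / num.sum(axis=2, keepdims=True)

# Class priors: average posterior mass per class.
pi = p.sum(axis=0) / N
```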
      <p>We repeat these steps until the change in log-likelihood
falls below a certain threshold.</p>
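<p>Putting the steps together, a minimal EM loop with a log-likelihood stopping criterion might look as follows; the toy annotations and the tolerance are assumptions, not the challenge data.</p>

```python
import numpy as np

B, M = 3, 3
# Hypothetical annotations: annotations[i] lists (annotator m, label a) pairs.
annotations = [
    [(0, 0), (1, 0), (2, 1)],   # song 0: majority class 0
    [(0, 1), (1, 1), (2, 1)],   # song 1: unanimous class 1
    [(0, 2), (1, 0), (2, 2)],   # song 2: majority class 2
    [(0, 0), (1, 0), (2, 0)],   # song 3: unanimous class 0
]
N = len(annotations)
eps = 1e-6

def e_step(pi, theta):
    """Posterior p_ib of the true label given the noisy annotations."""
    p = np.tile(pi, (N, 1))
    for i, anns in enumerate(annotations):
        for m, a in anns:
            p[i] *= theta[m, :, a]
    return p / p.sum(axis=1, keepdims=True)

def m_step(p):
    """Count-and-divide updates with Laplace smoothing."""
    pi = p.sum(axis=0) / N
    num = np.full((M, B, B), eps)
    for i, anns in enumerate(annotations):
        for m, a in anns:
            num[m, :, a] += p[i]
    return pi, num / num.sum(axis=2, keepdims=True)

def log_likelihood(pi, theta):
    ll = 0.0
    for anns in annotations:
        per_class = pi.copy()
        for m, a in anns:
            per_class = per_class * theta[m, :, a]
        ll += np.log(per_class.sum())
    return ll

# Initialize the posteriors with a soft majority vote over annotations.
p = np.full((N, B), eps)
for i, anns in enumerate(annotations):
    for m, a in anns:
        p[i, a] += 1.0
p /= p.sum(axis=1, keepdims=True)

# Iterate until the log-likelihood improvement drops below a tolerance.
prev = -np.inf
for _ in range(100):
    pi, theta = m_step(p)
    p = e_step(pi, theta)
    ll = log_likelihood(pi, theta)
    if ll - prev > 1e-8:
        prev = ll
    else:
        break

estimates = p.argmax(axis=1)   # hard estimates of the true labels
```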
    </sec>
    <sec id="sec-3">
<title>2.3 Uniqueness and Initialization</title>
      <p>The EM algorithm can be shown to be a gradient ascent
on the log-likelihood and hence is prone to getting stuck in
local optima. Moreover, for this specific model there is an
inherent non-uniqueness resulting from the assignment of class
labels: by changing the order of columns in
the parameters $\theta_m$ we can obtain a different permutation of the
true class labels. Each such permutation still yields the
same value of the log-likelihood and is hence an equally optimal
solution. This makes a good initialization of the EM
algorithm important in this case. We use labels obtained by
a majority vote as the initial estimates for $p_{ib}$.</p>
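<p>The majority-vote initialization can be sketched as follows; the tie-breaking rule (lowest label wins) is an assumed choice, not one specified in the text.</p>

```python
from collections import Counter
import numpy as np

# Hypothetical annotations per clip; a majority vote gives the initial
# hard estimates of the posteriors p_ib.
B = 3
annotations = [
    [0, 0, 1],
    [2, 2, 2],
    [1, 2, 2, 1],   # tie between classes 1 and 2
]

def majority_init(anns, B):
    """One-hot initial posterior from a majority vote (lowest label wins ties)."""
    counts = Counter(anns)
    top = max(sorted(counts), key=counts.get)  # sorted keys: ties -> lowest label
    p = np.zeros(B)
    p[top] = 1.0
    return p

p0 = np.array([majority_init(a, B) for a in annotations])
```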
      <p>Additionally, as pointed out earlier, most of the annotations
are from a handful of annotators. This can lead to poor
parameter estimates for annotators with few annotations.
Thus, we choose the number of annotators $M$ within the
model to be smaller than the actual number of annotators.
We use $M = 8$ for the submitted runs based on a rough
estimate from Fig.1. The annotator ids for the top $M-1$
annotators by number of annotations are retained; the rest
are grouped together under the $M$-th annotator id. The effect
of varying the parameter $M$ is shown in Fig.2. Finally, to
deal with numerical instabilities resulting from dividing by
small numbers in the M-step, we use Laplace smoothing.</p>
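<p>The grouping of annotators can be sketched as follows, with hypothetical annotator ids and a smaller $M$ than the $M = 8$ used for the submitted runs.</p>

```python
from collections import Counter

# Hypothetical raw annotator ids, heavily skewed toward a few annotators.
raw_ids = [3, 3, 3, 3, 7, 7, 7, 12, 12, 25, 25, 9, 14, 30]
M = 4  # number of annotators kept inside the model

# Keep the top M-1 annotators by annotation count; merge the rest under a
# single catch-all id so their parameters are estimated from pooled data.
counts = Counter(raw_ids)
top = [a for a, _ in counts.most_common(M - 1)]
remap = {a: i for i, a in enumerate(top)}           # ids 0 .. M-2
model_ids = [remap.get(a, M - 1) for a in raw_ids]  # all others -> id M-1
```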
    </sec>
    <sec id="sec-4">
<title>3. RESULTS AND CONCLUSIONS</title>
      <p>We submitted two systems using the proposed method:
one using a random initialization and the other using
majority-voted labels. The results are shown in Table 1. We compare
the results against a high-fidelity annotation that is assumed
to be the ground truth for the purposes of this challenge.</p>
      <p>The accuracies of the submitted systems indicate that a
proper initialization of the EM algorithm is important, as
anticipated. Initializing with labels obtained through majority voting
of the multiple noisy annotations allows us to obtain better
results than simple sample-level majority voting.
Results are also sensitive to the number of annotators selected,
and in the future we would like to learn the
parameter $M$ automatically.</p>
      <p>Table 1: WF and UWF scores for the two submitted systems.</p>
    </sec>
    <sec id="sec-5">
<title>2.1 Expectation Step</title>
      <p>In this step, we estimate the posterior probability of the
latent variable $Y_i$ given the noisy annotations $\tilde{Y}_{ik}$, the model
parameters $\theta_{1:M}$, and the class priors $\pi_b = \Pr(Y_{ib} = 1)$. This
is done as follows:</p>
      <p>$$\Pr(Y_{ib} = 1 \mid \tilde{Y}_{i,1:K}) = \frac{\Pr(\tilde{Y}_{i,1:K} \mid Y_{ib} = 1)\Pr(Y_{ib} = 1)}{\sum_{j} \Pr(\tilde{Y}_{i,1:K} \mid Y_{ij} = 1)\Pr(Y_{ij} = 1)} \quad (6)$$
We denote this posterior as $p_{ib}$ and note that it can be computed from
the parameters $\theta_{1:M}$ and $\Pr(Y_{ib} = 1)$, which we
refer to as $\pi_b$.</p>
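<p>For a single clip, the posterior above reduces to Bayes' rule over the per-annotation likelihoods; a sketch with hypothetical parameters:</p>

```python
import numpy as np

# Hypothetical parameters: class priors pi_b and per-annotator confusion
# matrices theta[m, b, a] = Pr(noisy = a | true = b, annotator = m).
B = 3
pi = np.array([0.5, 0.3, 0.2])
theta = np.array([
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
    [[0.6, 0.2, 0.2], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6]],
])
anns = [(0, 1), (1, 1)]   # (annotator m, noisy label a) for one clip

# Bayes rule: posterior proportional to prior times the annotation likelihoods.
post = pi.copy()
for m, a in anns:
    post *= theta[m, :, a]
post /= post.sum()
```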
    </sec>
    <sec id="sec-6">
<title>2.2 Maximization Step</title>
      <p>This step optimizes the alternate likelihood
function. The M-step in this case can be performed by a
simple count-and-divide over the posterior probabilities.</p>
      <p>Figure 2: The effect of the assumed
number of annotators $M$ on system F1-score.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Audhkhasi</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Data-dependent evaluator modeling and its application to emotional valence classification from speech</article-title>
          .
          <source>In INTERSPEECH</source>
          , pages
          <volume>2366</volume>
          –
          <fpage>2369</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Dawid</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Skene</surname>
          </string-name>
          .
          <article-title>Maximum likelihood estimation of observer error-rates using the EM algorithm</article-title>
          . Applied Statistics, pages
          <volume>20</volume>
          –
          <fpage>28</fpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yadati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S. N. C.</given-names>
            <surname>Ayyanathan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Crowdsorting timed comments about music: Foundations for a new crowdsourcing task</article-title>
          .
          <source>In MediaEval 2014 Workshop</source>
          , Barcelona, Spain, October
          <volume>16</volume>
          –17,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Raykar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jerebko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Florin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Valadez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bogoni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Moy</surname>
          </string-name>
          .
          <article-title>Supervised learning from multiple experts: whom to trust when everyone lies a bit</article-title>
          .
          <source>In Proceedings of the 26th Annual international conference on machine learning</source>
          , pages
          <volume>889</volume>
          –
          <fpage>896</fpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>