=Paper=
{{Paper
|id=Vol-3630/paper32
|storemode=property
|title=Cover Song Identification in Practice with Multimodal Co-Training
|pdfUrl=https://ceur-ws.org/Vol-3630/LWDA2023-paper32.pdf
|volume=Vol-3630
|authors=Simon Hachmeier,Robert Jäschke
|dblpUrl=https://dblp.org/rec/conf/lwa/HachmeierJ23
}}
==Cover Song Identification in Practice with Multimodal Co-Training==
Simon Hachmeier, Robert Jäschke
L3S Research Center, Appelstr. 9a, Hanover, 30167, Germany
School of Library and Information Science, Dorotheenstr. 26, Humboldt-Universität zu Berlin, Berlin, 10117, Germany
Abstract
The task of cover song identification (CSI) deals with the automatic matching of audio recordings by
modeling musical similarity. CSI is of high relevance in the context of applications such as copyright
infringement detection on online video platforms. Since online videos include metadata (e.g., video titles and descriptions), one could leverage it for more effective CSI in practice. In this work, we experiment with state-of-the-art models for CSI and entity matching in a Co-Training ensemble. Our results show slight improvements of the entity matching model. We further provide suggestions for improving our approach to overcome the overfitting of the CSI models that we observed.
Keywords
co-training, cover song identification, entity matching
1. Introduction
Cover song identification (CSI) aims at matching audio recordings to their respective musical
cliques based on musical similarity. One typical application of CSI is copyright infringement
detection on online video platforms or social networks.
Recent state-of-the-art CSI models have shown great success [1, 2, 3, 4, 5, 6, 7]. However,
these models are solely audio-based. Prior approaches have also demonstrated the effectiveness
of metadata for the task [8, 9].
In this work, we model the task of CSI as a multimodal problem incorporating music similarity
and entity matching. We design a Co-Training algorithm that leverages the natural split of two
views: a text view and an audio view. We utilize the two models to iteratively generate pseudo
labels for each other for an unlabeled dataset of YouTube videos. We evaluate the performance
of both models on publicly available CSI datasets. In the following, we first introduce Co-Training and outline related work. We then propose our Co-Training algorithm and document details about our dataset and implementation in Section 3 to Section 5. We present experimental results in Section 6 before closing with Section 7, which outlines ideas to improve our approach.
LWDA’23: Lernen, Wissen, Daten, Analysen. October 09–11, 2023, Marburg, Germany
hachmeier@l3s.de (S. Hachmeier); jaeschke@l3s.de (R. Jäschke)
© 2023 Copyright by the paper’s authors. Copying permitted only for private and academic purposes. In: M. Leyer, Wichmann, J. (Eds.): Proceedings of the
LWDA 2023 Workshops: BIA, DB, IR, KDML and WM. Marburg, Germany, 09.-11. October 2023, published at http://ceur-ws.org
Figure 1: Own illustration of Co-Training with two views and models.
2. Preliminaries and Related Work
Co-Training was initially proposed by Blum and Mitchell [10] and refers to the idea of leveraging automatically generated pseudo labels to improve the performance of the models in an ensemble. This enables training models in cases where only a small subset of the available data is labeled, which applies to many real-world scenarios. Co-Training relies on the availability of multiple
views which are required to fulfill the following assumptions:
1. Sufficiency: Each view is sufficient to address the task at hand.
2. Independence: The views are conditionally independent.
As illustrated in Figure 1, the models within the ensemble iteratively provide pseudo labels
for each other. One key component is the selection of a fraction of predictions as pseudo labels, based on a constraint such as a confidence threshold [11, 12], limiting the selection to a fraction of the most confident samples by ranking [13, 14], or other methods [15, 16, 17, 18].
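To make the contrast between these selection strategies concrete, the following is a minimal Python sketch (our own illustration, not taken from any of the cited works); the function names and values are purely illustrative.

import numpy as np

def select_by_threshold(confidences: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Keep indices of unlabeled samples whose confidence exceeds a fixed threshold."""
    return np.flatnonzero(confidences > threshold)

def select_by_ranking(confidences: np.ndarray, fraction: float = 0.1) -> np.ndarray:
    """Keep the top fraction of samples ranked by confidence."""
    k = max(1, int(len(confidences) * fraction))
    return np.argsort(confidences)[::-1][:k]

if __name__ == "__main__":
    conf = np.random.default_rng(0).uniform(size=20)  # mock model confidences
    print(select_by_threshold(conf, 0.9))
    print(select_by_ranking(conf, 0.2))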
Recently, various applications of Co-Training with deep learning models have been proposed
for the modalities text [13, 16], images [14, 19, 11, 12, 20, 21] and on multiple modalities [22, 15].
Lang et al. [13] improve prompt-based learning by using GPT-3 output probabilities and frozen representations of openly available large language models. Both of their proposed methods for pseudo label selection are based on the ranking
of samples. An approach by Wu et al. [16] applies Q-learning to improve the selection policy
for the partition of unlabeled data to be pseudo-labeled. They demonstrate the effectiveness of
reinforced Co-Training on text classification tasks. Peng et al. [14] use adversarial examples in
an ensemble of multiple models for model diversity to improve the ensemble performance for
image segmentation on medical images. To select pseudo labels, a fraction parameter is used
that increases over iterations. In contrast, Yang et al. [11] apply a Co-Training framework with a fixed threshold to the task of domain adaptation. Xian and Hu [12] use a fixed threshold
parameter for pseudo-labeling in the task of person re-identification.
Some approaches successfully make use of views arising from different modalities. A multimodal approach by Hinami et al. [15] leverages multiple views of the modalities text, audio, and video found in web videos to improve concept classification. Pseudo labels are selected based on a voting approach within the ensemble. Another multimodal approach, with text and images from web articles, by Bhattacharjee et al. [22] improves the task of fake news detection. Their pipeline includes an attention-aware step to fuse two views. The models then co-train based on sampling of hard positive samples. In the following, we present our Co-Training algorithm, which is based on fixed thresholds.
3. Multimodal Co-Training Algorithm
We have access to an entity matching model 𝑇𝑀 based on a language model and a cover song identification model 𝐴𝑀 based on metric learning. Both models are pretrained and achieve state-of-the-art performance for the task at hand. However, we aim to further improve their performance by training these models on a labeled dataset 𝐷𝐿 and an unlabeled dataset 𝐷𝑈. Each item 𝑣 within either of the datasets is a YouTube video, represented by a text view (YouTube metadata) and an audio view based on audio features (cf. Section 5).
Accordingly, 𝑇𝑀(𝑣𝑖, 𝑣𝑗) computes the entity matching confidence 0 < 𝑡𝑚 < 1 for a pair 𝑣𝑖, 𝑣𝑗 ∈ 𝐷𝑈 ∪ 𝐷𝐿, and 𝐴𝑀(𝑣𝑖, 𝑣𝑗) the musical similarity, modeled as cosine similarity −1 < 𝑎𝑚 < 1. A labeled item 𝑣 ∈ 𝐷𝐿 belongs to a known clique (musical work), represented by 𝑤(𝑣) ∈ 𝑊. Unlabeled items 𝑣 ∈ 𝐷𝑈 have a candidate clique ŵ(𝑣) ∈ 𝑊. It is unknown whether 𝑣 belongs to this clique. However, among all possible cliques this is the most likely one, because 𝑣 was found with queries formulated to find items for this clique, as explained in Section 4.
We argue that we can address the problem of multimodal CSI by Co-Training. Since both
models are pretrained, we expect the first Co-Training assumption (cf. Section 2) to hold. We
further argue that both views are conditionally independent, due to the natural split given by
modalities.
In Algorithm 1, we show the Co-Training loop. We randomly sample three labeled (𝑙 = 3) and three unlabeled videos (𝑢 = 3) for two randomly selected cliques (𝑠 = 2) per iteration. We make use of hard threshold parameters. The output of 𝑇𝑀 is based on softmax layers on top of a language model. We therefore simply denote 𝛾 as the outer boundary for confidence, indicating that a pseudo label is positive if 𝑡𝑚 > 1 − 𝛾 and negative if 𝑡𝑚 < 𝛾. For the audio model we impose two thresholds, since we observed that the CSI models we use do not output similarity values that spread evenly to the boundaries of the cosine similarity range. Hence, we use a lower threshold to set negative pseudo labels if 𝑎𝑚 < 𝜏lower and an upper threshold to set positive pseudo labels if 𝑎𝑚 > 𝜏upper.
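The decision rules above can be summarized in a short sketch. The following Python snippet is our own illustration of the thresholding; the function names and the {1, −1, 0} encoding of positive, negative, and uncertain pseudo labels are assumptions that merely mirror the masks used in Algorithm 2.

def text_pseudo_label(tm: float, gamma: float) -> int:
    """Pseudo label from the entity matching confidence tm in (0, 1)."""
    if tm > 1.0 - gamma:
        return 1      # confident positive
    if tm < gamma:
        return -1     # confident negative
    return 0          # uncertain, ignored during training

def audio_pseudo_label(am: float, tau_lower: float, tau_upper: float) -> int:
    """Pseudo label from the cosine similarity am in (-1, 1)."""
    if am > tau_upper:
        return 1
    if am < tau_lower:
        return -1
    return 0

# Example with the Co-CQT configuration reported in Section 6 (gamma=0.1, tau_lower=0.2, tau_upper=0.7):
print(text_pseudo_label(0.95, 0.1), audio_pseudo_label(0.75, 0.2, 0.7))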
In Algorithm 2, we show one iteration of Co-Training, where MAX(𝑀, 0) denotes the element-wise maximum operator applied to the matrix 𝑀 and 0, and MAX(𝑀) denotes the respective row-wise maximum operation applied to 𝑀.
We first predict the entity matching confidences and musical similarities for each pair of items in the batch and assign those to matrices Ŷtext and Ŷaudio. Subsequently, the similarity square matrices are masked to retain only pairwise relationships with a known ground truth label from 𝐷𝐿 or with a sufficiently confident pseudo label.
Algorithm 1 Multimodal Cover Song Co-Training Loop
1: Initialize: maximum number of iterations 𝐼max, number of cliques per batch 𝑠, set of clique identifiers 𝑊, number of labeled items per batch 𝑙, number of unlabeled items per batch 𝑢, outer boundary for text model 𝛾, lower threshold for audio model 𝜏lower, upper threshold for audio model 𝜏upper, labeled dataset 𝐷𝐿, unlabeled dataset 𝐷𝑈, learning rate 𝜂, audio model 𝐴𝑀, text model 𝑇𝑀
2: for 𝑖 ← 1 to 𝐼max do
3:     Sample 𝑊𝐵 = {𝑤1, . . . , 𝑤𝑠} from 𝑊
4:     for 𝑤 ∈ 𝑊𝐵 do
5:         Sample 𝐿𝑤 = {𝑣1, . . . , 𝑣𝑞} from 𝐷𝐿 where 𝑤(𝑣) ∈ 𝑊𝐵
6:         Sample 𝑈𝑤 = {v̂1, . . . , v̂𝑐} from 𝐷𝑈 where ŵ(v̂) ∈ 𝑊𝐵
7:     end for
8:     𝐿𝐵 = ⋃𝑤∈𝑊𝐵 𝐿𝑤
9:     𝑈𝐵 = ⋃𝑤∈𝑊𝐵 𝑈𝑤
10:    CoTrainIter(𝐿𝐵, 𝑈𝐵, 𝜏lower, 𝜏upper, 𝛾, 𝜂, 𝐴𝑀, 𝑇𝑀)
11: end for
As masking values, we select 1 to indicate a positive relationship among the items (both are from the same clique) and −1 to indicate the contrary. Additionally, we select 0 to mask out uncertain relationships, i.e., pairs without a ground truth label and with insufficient confidence of the model generating the pseudo label. The pseudo label masks 𝑀text and 𝑀audio are used to sample the similarity values for the training updates with hard triplet mining, as proposed by Xuan et al. [23] and applied to train prior CSI models [24, 25]. The lowest distances of the positive relationships in DIST⁺audio and DIST⁺text and the highest distances of the pairwise negative relationships in DIST⁻audio and DIST⁻text represent the components of the hard triplets that are used for the training updates.
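The following numpy sketch illustrates this mask-based hard mining (cf. Algorithm 2); it is a simplified reading, not the authors' implementation. Note that in Algorithm 2 the mask of one view is combined with the similarity matrix of the other view (cross-view pseudo-labeling), and that edge cases (e.g., rows without any positive or negative entry, or zero-masked entries dominating the minimum) are ignored here.

import numpy as np

def hard_mining_distances(mask: np.ndarray, sims: np.ndarray):
    """mask: n x n matrix with entries in {1, -1, 0}; sims: n x n similarity matrix.

    Returns, per row, the distance of the hardest (least similar) positive and
    the hardest (most similar) negative, i.e. DIST+ and DIST-.
    """
    pos = np.maximum(mask, 0.0) * sims    # keep similarities of positive pairs only
    neg = np.maximum(-mask, 0.0) * sims   # keep similarities of negative pairs only
    dist_pos = 1.0 - pos.min(axis=1)      # lowest positive similarity -> largest distance
    dist_neg = 1.0 - neg.max(axis=1)      # highest negative similarity -> smallest distance
    return dist_pos, dist_neg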
We train the metric learning model 𝐴𝑀 with the triplet loss, which is defined as

𝐿tri𝑖 = max(𝐷(𝑣𝑖, 𝑣+) − 𝐷(𝑣𝑖, 𝑣−) + 𝑚, 0),     (1)

where 𝑚 = 1 is the margin parameter, and 𝑣+ and 𝑣− are the positive and negative to anchor 𝑣𝑖, which are used to compute the distances 𝐷(𝑣𝑖, 𝑣+) and 𝐷(𝑣𝑖, 𝑣−) as found in DIST⁺audio and DIST⁻audio, respectively.
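A minimal PyTorch sketch of Equation (1), assuming the tensors dist_pos and dist_neg hold the mined distances 𝐷(𝑣𝑖, 𝑣+) and 𝐷(𝑣𝑖, 𝑣−) per anchor; averaging over the batch is our assumption.

import torch

def triplet_loss(dist_pos: torch.Tensor, dist_neg: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    # L_i = max(D(v_i, v_+) - D(v_i, v_-) + m, 0), averaged over the anchors in the batch
    return torch.clamp(dist_pos - dist_neg + margin, min=0.0).mean()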
Our entity matching model 𝑇𝑀 is based on a large language model, which we train with the binary cross entropy loss:
Algorithm 2 Co-Training Iteration for One Batch (Triplet Loss with Hard Triplet Mining)
1: Initialize: set of labeled items per batch 𝐿𝐵, set of unlabeled items per batch 𝑈𝐵, outer boundary for text model 𝛾, lower threshold for audio model 𝜏lower, upper threshold for audio model 𝜏upper, learning rate 𝜂, audio model 𝐴𝑀, text model 𝑇𝑀
2: Set 𝑛 = |𝐿𝐵 ∪ 𝑈𝐵|
3: Predict
4:     Init. empty matrix Ŷaudio ∈ R𝑛×𝑛
5:     Ŷaudio[𝑖, 𝑗] = 𝐴𝑀(𝑣𝑖, 𝑣𝑗) where 𝑣𝑖, 𝑣𝑗 ∈ 𝐿𝐵 ∪ 𝑈𝐵
6:     Init. empty matrix Ŷtext ∈ R𝑛×𝑛
7:     Ŷtext[𝑖, 𝑗] = 𝑇𝑀(𝑣𝑖, 𝑣𝑗) where 𝑣𝑖, 𝑣𝑗 ∈ 𝐿𝐵 ∪ 𝑈𝐵
8: Ground Truth Square Mask
9:     Init. empty matrix 𝑀label ∈ R𝑛×𝑛
10:    𝑀label[𝑖, 𝑗] = 1 if 𝑤(𝑣𝑖) = 𝑤(𝑣𝑗); −1 if 𝑤(𝑣𝑖) ≠ 𝑤(𝑣𝑗); 0 if undefined ∈ {𝑤(𝑣𝑖), 𝑤(𝑣𝑗)}
11: Pseudo Label Masks
12:    Init. empty matrices 𝑀audio ∈ R𝑛×𝑛 and 𝑀text ∈ R𝑛×𝑛
13:    𝑀audio[𝑖, 𝑗] = 𝑀label[𝑖, 𝑗] if 𝑀label[𝑖, 𝑗] ≠ 0; 1 if Ŷaudio[𝑖, 𝑗] > 𝜏upper; −1 if Ŷaudio[𝑖, 𝑗] < 𝜏lower; 0 otherwise
14:    𝑀text[𝑖, 𝑗] = 𝑀label[𝑖, 𝑗] if 𝑀label[𝑖, 𝑗] ≠ 0; 1 if Ŷtext[𝑖, 𝑗] > 1 − 𝛾; −1 if Ŷtext[𝑖, 𝑗] < 𝛾; 0 otherwise
15: Hard Triplet Mining
16:    DIST⁺audio = 1 − min(max(𝑀text, 0) * Ŷaudio) ∈ R𝑛×1
17:    DIST⁻audio = 1 − max(max(−1 * 𝑀text, 0) * Ŷaudio) ∈ R𝑛×1
18:    DIST⁺text = 1 − min(max(𝑀audio, 0) * Ŷtext) ∈ R𝑛×1
19:    DIST⁻text = 1 − max(max(−1 * 𝑀audio, 0) * Ŷtext) ∈ R𝑛×1
20: Loss Computation
21:    LOSSaudio = 𝐿tri(DIST⁺audio, DIST⁻audio)
22:    LOSStext = 𝐿ce(DIST⁺text, DIST⁻text)
23: Update
24:    𝜃𝐴𝑀 ← 𝜃𝐴𝑀 − 𝜂∇LOSSaudio(𝜃𝐴𝑀)
25:    𝜃𝑇𝑀 ← 𝜃𝑇𝑀 − 𝜂∇LOSStext(𝜃𝑇𝑀)
Table 1
Datasets with numbers of cliques and songs/videos used in our implementation for training, validation,
and testing.
Subset Dataset Cliques Items
Training Train-YT 50 50,395
Training Train-SHS 50 1,121
Validation Val-SHS 882 3,172
Test Da-Tacos 2,797 13,707
Test Test-SHS 50 1,259
Test Test-YT 50 628
𝐿ce𝑖 = ∑_{𝑐=1}^{𝑀} 𝑦𝑖,𝑐 log(ŷ𝑖),     (2)

where ŷ𝑖 is one prediction as found in either DIST⁻text or DIST⁺text and hence 𝑦𝑖,𝑐 ∈ {0, 1}. In the following, we outline details about our dataset, preprocessing, and training implementation.
4. Dataset
We provide an overview of the datasets used in Table 1, as well as CSV files containing clique identifiers and YouTube identifiers.1 The cliques used for implementation rely on two datasets from prior research in CSI: SHS100K2 for training, validation, and testing, and Da-Tacos [26] for testing.
Based on the test subset of SHS100K, we formulated around 44 text queries per clique to crawl YouTube3 to find additional songs for these cliques, similarly to our prior work [27]. We split this crawl into two parts with 50 cliques each: one for training, composed of Train-SHS (labeled dataset 𝐷𝐿) and Train-YT (unlabeled dataset 𝐷𝑈), and one for testing, composed of Test-SHS and Test-YT. Test-SHS is a subset of songs that are represented by YouTube videos in the initial SHS100K test subset, and Test-YT contains other YouTube videos found by the query procedure. We annotated these 628 crawled videos with the help of two students and up to five workers on Mechanical Turk. We only considered labels with full agreement among the students and aggregated the worker labels by majority vote.4
For validation we use the validation subset of SHS100K which we denote by Val-SHS.5 We
additionally use the larger Da-Tacos dataset for testing.6
1 https://github.com/progsi/datasets_shs_yt_cotraining
2 cf. https://github.com/NovaFrost/SHS100K provided by Yu et al. [1]
3 cf. https://pypi.org/project/youtube-search-python/
4 We report an agreement in Krippendorff's 𝛼 of 0.43 (workers) and a Cohen's 𝜅 of 0.83 (students). While the worker agreement is quite low, measuring the agreement between the students and the labels aggregated by majority vote for a subset of 210 songs yields a Cohen's 𝜅 of 0.81.
5 81% were retrievable from YouTube.
6 The authors of Da-Tacos provide CREMA features publicly. However, we needed to extract CQT spectrograms from MP3s for CQTNet. Hence, we only include the subset of videos which were available on YouTube, which makes up around 92% of the full Da-Tacos.
For each video, we downloaded the MP3 file with a sampling rate of 22,050 Hertz7 to extract audio features. We extracted CREMA8 features and constant-Q transform (CQT) features9 with 84 frequency bins.
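A hedged sketch of the CQT extraction step using librosa (referenced in the footnote); the hop length is left at librosa's default, and taking the magnitude of the complex CQT is our assumption, not necessarily the authors' exact preprocessing.

import librosa
import numpy as np

def extract_cqt(mp3_path: str, sr: int = 22050, n_bins: int = 84) -> np.ndarray:
    y, sr = librosa.load(mp3_path, sr=sr)        # decode the MP3 at 22,050 Hz
    cqt = librosa.cqt(y, sr=sr, n_bins=n_bins)   # constant-Q transform with 84 frequency bins
    return np.abs(cqt)                           # magnitude spectrogram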
Furthermore, we retrieved the metadata for each video. To ensure that semantics are preserved
independently of the Unicode font, we mapped various Unicode fonts to basic Latin characters
using Unicodedata10 .
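A minimal sketch of such a normalization using the unicodedata module; NFKD compatibility decomposition maps many styled Unicode variants (e.g., fullwidth or "mathematical" letters) to basic Latin characters. Whether the authors used exactly this normalization form is an assumption.

import unicodedata

def to_basic_latin(text: str) -> str:
    # Compatibility decomposition, then drop combining marks (accents etc.)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(to_basic_latin("Ｓｏｍｅ 𝐅𝐚𝐧𝐜𝐲 𝐓𝐢𝐭𝐥𝐞"))  # -> "Some Fancy Title"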
5. Implementation Details
We use the BERT-based entity matching model Ditto [28] as our text model, which is publicly available on GitHub.11 Ditto requires fine-tuning specific to the structure of attributes in the entities, in our case YouTube videos. We use the SHS100K-Train subset as the Ditto pretraining dataset, which does not overlap with any of the other datasets shown in Table 1. Following the splits applied by Li et al. [28], we created a training, validation, and test set with a ratio of 3:1:1, each containing positive and negative pairs of YouTube videos in a 1:4 ratio. We gathered the negative pairs by randomly sampling videos from another randomly selected work. We use only the video titles and channel names as YouTube metadata representations. We additionally experimented with YouTube descriptions, but preliminary experiments showed inferior results (F1 score of 27% against 95%) compared to using only video titles and channels. We used all of the proposed data augmentation techniques and the best performing language model (RoBERTa) as described in [28]. We applied the best model checkpoint evaluated on the test set after 50 epochs for our matching task.
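The pair construction could look like the following sketch; the data structures and the helper name build_pairs are our own illustration under the stated 1:4 positive-to-negative ratio, not the authors' code.

import itertools
import random

def build_pairs(cliques: dict[str, list[str]], neg_per_pos: int = 4, seed: int = 0):
    """cliques maps a clique id to the serialized video metadata (e.g. title + channel) it contains."""
    rng = random.Random(seed)
    pairs = []
    clique_ids = list(cliques)
    for cid, videos in cliques.items():
        for a, b in itertools.combinations(videos, 2):
            pairs.append((a, b, 1))                      # positive pair: same clique
            for _ in range(neg_per_pos):                 # negatives: sample from another clique
                other = rng.choice([c for c in clique_ids if c != cid])
                pairs.append((a, rng.choice(cliques[other]), 0))
    rng.shuffle(pairs)
    return pairs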
We use two different state-of-the-art CSI models which are publicly available12: CQTNet [1] and Re-MOVE [3]. In both cases, we initialize the pretrained models from the best model checkpoints provided by the authors.
Re-MOVE processes CREMA features, which are a variant of pitch class profiles and mainly represent harmonic information. CQTNet processes constant-Q transform (CQT) features, which are spectrograms with a logarithmically spaced frequency axis.
Following the Co-Training approach by Yang et al. [11], we use stochastic gradient descent as the optimizer with a learning rate of 0.01 and momentum ∈ {0, 0.9}. We validate the audio model in use and Ditto every 100 iterations. Since predicting a square matrix is expensive with Ditto, we initialize a random subset of the validation set at the beginning of each training run and use it throughout the training.
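A minimal PyTorch sketch of this optimizer setup; audio_model is a placeholder standing in for CQTNet or Re-MOVE and is an assumption, not the authors' code.

import torch

def make_optimizer(audio_model: torch.nn.Module, momentum: float = 0.9) -> torch.optim.Optimizer:
    # SGD with learning rate 0.01 and momentum 0 or 0.9, as described above
    return torch.optim.SGD(audio_model.parameters(), lr=0.01, momentum=momentum)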
7 cf. https://github.com/yt-dlp/yt-dlp
8 cf. https://github.com/bmcfee/crema
9 cf. https://librosa.org/doc/latest/index.html
10 cf. https://docs.python.org/3/library/unicodedata.html
11 cf. https://github.com/megagonlabs/ditto
12 We experimented with the ByteCover implementation by Orfium: https://github.com/Orfium/bytecover. However, the implementation was not provided by the authors of the paper and achieves lower performance than both models we use.
Figure 2: Comparison of audio models for the first 1,000 iterations with momentum of 0. (a) Loss of audio models. (b) Validation mAP of audio models.
6. Experiments
We evaluate our proposed Co-Training algorithm on ensembles with Ditto [28] as 𝑇𝑀 paired with one of the pretrained audio models CQTNet [1] and Re-MOVE [3] as 𝐴𝑀. Our baselines are the pretrained models before Co-Training. We further compare to a simple baseline: the Levenshtein-based function token set ratio13. We report the mean average precision (mAP), which is the main evaluation metric used in cover song identification [1, 2, 3, 4, 5, 6, 7]. Results are shown in Table 2 for the two best ensembles we found per pair of 𝑇𝑀 and 𝐴𝑀:
• Co-CQT : with CQTNet and 𝛾 = 0.1, 𝜏upper = 0.7, 𝜏lower = 0.2.
• Co-ReM: with Re-MOVE and 𝛾 = 0.2, 𝜏upper = 0.5, 𝜏lower = 0.3.
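For reference, the token set ratio baseline mentioned above can be computed with RapidFuzz (the library linked in the footnote); scaling the score to [0, 1] and the example titles are our own choices.

from rapidfuzz import fuzz

def metadata_similarity(title_a: str, title_b: str) -> float:
    # token_set_ratio returns a score in [0, 100]; rescale to [0, 1]
    return fuzz.token_set_ratio(title_a, title_b) / 100.0

print(metadata_similarity("Artist - Song (Official Video)", "Song by Artist [Live Cover]"))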
6.1. Experiment 1: CQTNet Versus Re-MOVE
We compare the two audio models with 𝜏upper ∈ {0.5, 0.6}, 𝛾 = 0.2 and 𝜏lower = 0.3. In Figure 2 we show the triplet loss over 1,000 iterations as well as the validation mAP. The observable drop in both mAP and loss for Re-MOVE strongly suggests overfitting. As we show in Table 2, Re-MOVE generally performs worse than CQTNet. We therefore focus on experimenting with various thresholds for CQTNet. We further observe that the convergence of the loss of CQTNet is rather slow. Thus, we use a momentum of 0.9 in the next experiments.
6.2. Experiment 2: CQTNet Threshold Tuning
We experimented with different hyperparameter configurations: 𝛾 ∈ {0.1, 0.2, 0.49}, 𝜏upper ∈
{0.5, 0.6, 0.7}, 𝜏lower ∈ {0.2, 0.3, 0.4}.
In Figure 3 we show the loss and validation mAP of Co-CQT. We observe that CQTNet overfits, shown by the jointly decreasing loss and mAP. The triplet loss converges rather close to the margin 𝑚 = 1. We observed this result consistently across configurations.
13 cf. https://github.com/maxbachmann/RapidFuzz
Figure 3: Losses and validation mAPs of Co-CQT. (a) Triplet loss. (b) CQTNet validation mAP. (c) Binary cross entropy loss. (d) Ditto mAP.
However, we also observe an increase in loss but a constant validation mAP for Ditto14.
As shown in Table 2, Ditto is the only model which actually improves with the Co-Training
procedure. Given these two key observations, we hypothesize that balancing the two very
different models is a key challenge. In the closing section, we therefore outline some of the
potential issues with our approach and ideas to address these.
7. Conclusion and Outlook
In this paper, we applied a Co-Training algorithm for multimodal CSI using an audio-based
CSI model along with an entity matching model. We slightly improved the entity matching
model Ditto for our task. This might suggest that further training iterations can improve Ditto.
However, both audio-based models seem to overfit quite rapidly.
In the following, we outline some ideas which might have an impact on this problem.
14 Note that the sampling of a subset of 100 items from the full Val-SHS, as mentioned in Section 5, can have a major impact on the validation mAP.
Table 2
mAP of the ensembles Co-CQT (Ditto & CQTNet) and Co-ReM (Ditto & Re-MOVE). We report results for the best ensembles achieved with our tested hyperparameter configurations. *The computation of predictions is more expensive for Ditto than for the CSI models. We therefore report the performance on Da-Tacos for a random subset of 1,259 items (the size of the Test-SHS dataset).
Ensemble Model Val-SHS Test-SHS Test-YT Da-Tacos
- Levenshtein 0.30 0.50 0.26 0.12
- Ditto* 0.62 0.80 0.40 0.24
Co-CQT Ditto (best) - 0.84 0.44 0.28
- Re-MOVE 0.57 0.69 0.43 0.23
Co-ReM Re-MOVE (best) 0.57 0.70 0.44 0.23
Co-ReM Re-MOVE (last) 0.35 0.46 0.29 0.11
- CQTNet 0.76 0.83 0.57 0.73
Co-CQT CQTNet (best) 0.75 0.83 0.56 0.74
Co-CQT CQTNet (last) 0.51 0.55 0.32 0.35
Learning Rates. In comparison, Ditto seems to learn rather slowly while the audio-based models overfit. We believe that different learning rates for the two models could help to prevent this imbalance in model convergence. One potential improvement could be a grid search over different learning rates across the models, as proposed by Likhosherstov et al. [29]. Alternatively, one could apply different learning rate schedulers as done by Yang et al. [11]. Our observations also suggest continuing the pretraining of Ditto, possibly with pseudo labels generated by the audio model. Co-Training with both models could then be done afterwards to avoid the apparently different starting conditions of the two models.
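A minimal sketch of this idea with separate optimizers and different learning rates per model; the concrete values and model handles are placeholders for illustration, not the paper's setup.

import torch

def make_optimizers(audio_model: torch.nn.Module, text_model: torch.nn.Module):
    opt_audio = torch.optim.SGD(audio_model.parameters(), lr=1e-3, momentum=0.9)  # slow down the audio model
    opt_text = torch.optim.SGD(text_model.parameters(), lr=1e-2, momentum=0.9)    # let the text model learn faster
    return opt_audio, opt_text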
Hard Triplet Mining. We sample triplets during training based on the hard triplet mining strategy found in metric learning. In the context of Co-Training, adversarial examples can be used as an alternative [14, 19, 20, 21], which encourage view difference. In contrast, hard triplet mining solely ensures that the most difficult triplets in the batch are utilized for training.
Losses. Some state-of-the-art CSI models rely on multiloss approaches [5, 6, 7] which combine
triplet loss with a softmax loss. While triplet loss encourages intra-class compactness, the
latter encourages inter-class discrimination [30]. Thus, our approach might neglect inter-class
discrimination. Another alternative to the triplet loss is the utilization of the prototypical triplet
loss [31] which considers distances between centroids of positive and negative classes instead
of distances to individual samples.
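A hedged sketch of a prototypical triplet loss in this spirit, computing distances to class centroids instead of individual samples; this is our simplified reading of [31], not its reference implementation.

import torch

def prototypical_triplet_loss(anchor: torch.Tensor, pos_emb: torch.Tensor,
                              neg_emb: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """anchor: (d,) embedding; pos_emb/neg_emb: (k, d) embeddings of the positive/negative class."""
    pos_proto = pos_emb.mean(dim=0)   # centroid (prototype) of the positive class
    neg_proto = neg_emb.mean(dim=0)   # centroid (prototype) of the negative class
    d_pos = torch.norm(anchor - pos_proto)
    d_neg = torch.norm(anchor - neg_proto)
    return torch.clamp(d_pos - d_neg + margin, min=0.0)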
Batch Size. We tested different threshold configurations. However, the numbers of labeled and unlabeled items per batch were fixed for all experiments, and the number of items from both input datasets was equal. We believe that increasing the number of unlabeled items per batch relative to the labeled items could ensure that more interesting items are used during training, because these items stem from our crawl rather than from the widely used academic dataset SHS100K, which is based on the platform SecondHandSongs. That platform relies on manual labour by volunteers who follow policies to determine the boundaries between cover songs, whereas our crawl is solely subject to the creative spectrum on YouTube.
Label Confidence Estimation. As outlined in Section 2, other label confidence estimation methods can be applied to Co-Training. In this study, we only experimented with a threshold-based method. Ranking-based or possibly more sophisticated methods could further improve our proposed algorithm.
In future experiments, we plan to test the impact of the factors discussed. We hope that we
can find configurations of ensembles which can effectively leverage both views to improve the
task of multimodal CSI.
References
[1] Z. Yu, X. Xu, X. Chen, D. Yang, Learning a Representation for Cover Song Identifica-
tion Using Convolutional Neural Network, in: ICASSP 2020 - 2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 541–545.
doi:10.1109/ICASSP40776.2020.9053839.
[2] Z. Yu, X. Xu, X. Chen, D. Yang, Temporal pyramid pooling convolutional neural network
for cover song identification, in: Proceedings of the Twenty-Eighth International Joint
Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial
Intelligence Organization, 2019, pp. 4846–4852. URL: https://doi.org/10.24963/ijcai.2019/673.
doi:10.24963/ijcai.2019/673.
[3] J. Serrà, F. Yesiler, E. Gómez, Less is more: Faster and better music version identification with embedding distillation, in: Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montréal, Canada, 2020, pp. 884–892.
[4] F. Yesiler, J. Serrà, E. Gómez, Accurate and scalable version identification using musically-
motivated embeddings, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2020, pp. 21–25. doi:10.1109/ICASSP40776.
2020.9053793.
[5] X. Du, Z. Yu, B. Zhu, X. Chen, Z. Ma, Bytecover: Cover song identification via multi-loss
training, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) (2020) 551–555.
[6] X. Du, K. Chen, Z. Wang, B. Zhu, Z. Ma, Bytecover2: Towards dimensionality reduction
of latent embedding for efficient cover song identification, in: ICASSP 2022-2022 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022,
pp. 616–620.
[7] S. Hu, B. Zhang, J. Lu, Y. Jiang, W. Wang, L. Kong, W. Zhao, T. Jiang, WideResNet with Joint
Representation Learning and Data Augmentation for Cover Song Identification, in: Proc.
Interspeech 2022, 2022, pp. 4187–4191. doi:10.21437/Interspeech.2022-10600.
[8] J. B. L. Smith, M. Hamasaki, M. Goto, Classifying derivative works with search, text,
audio and video features, in: 2017 IEEE International Conference on Multimedia and Expo
(ICME), 2017, pp. 1422–1427. doi:10.1109/ICME.2017.8019444.
[9] A. A. Correya, R. Hennequin, M. Arcos, Large-scale cover song detection in digital music
libraries using metadata, lyrics and audio features, CoRR abs/1808.10351 (2018). URL:
http://arxiv.org/abs/1808.10351. arXiv:1808.10351.
[10] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT '98), Association for Computing Machinery, New York, NY, USA, 1998, pp. 92–100.
[11] L. Yang, Y. Wang, M. Gao, A. Shrivastava, K. Q. Weinberger, W.-L. Chao, S.-N. Lim, Deep co-
training with task decomposition for semi-supervised domain adaptation, in: Proceedings
of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8906–8916.
[12] Y. Xian, H. Hu, Enhanced multi-dataset transfer learning method for unsupervised person
re-identification using co-training strategy, IET Computer Vision 12 (2018) 1219–1227.
[13] H. Lang, M. Agrawal, Y. Kim, D. A. Sontag, Co-training improves prompt-based learning
for large language models, in: International Conference on Machine Learning, 2022.
[14] J. Peng, G. Estrada, M. Pedersoli, C. Desrosiers, Deep co-training for semi-supervised image segmentation, Pattern Recognition 107 (2020) 107269. doi:10.1016/j.patcog.2020.107269.
[15] R. Hinami, J. Liang, S. Satoh, A. G. Hauptmann, Multimodal co-training for selecting good
examples from webly labeled video, CoRR abs/1804.06057 (2018). URL: http://arxiv.org/
abs/1804.06057. arXiv:1804.06057.
[16] J. Wu, L. Li, W. Y. Wang, Reinforced co-training, in: Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Papers), Association for Computational Linguis-
tics, New Orleans, Louisiana, 2018, pp. 1252–1262. URL: https://aclanthology.org/N18-1113.
doi:10.18653/v1/N18-1113.
[17] T. Han, W. Xie, A. Zisserman, Self-supervised co-training for video representation learning,
in: Proceedings of the 34th International Conference on Neural Information Processing
Systems, NIPS’20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[18] Semi-supervised learning combining co-training with active learning, Expert Systems with Applications 41 (2014) 2372–2378. doi:10.1016/j.eswa.2013.09.035.
[19] S. Qiao, W. Shen, Z. Zhang, B. Wang, A. Yuille, Deep co-training for semi-supervised
image recognition, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer
Vision – ECCV 2018, Springer International Publishing, Cham, 2018, pp. 142–159.
[20] H. Xie, C. Fu, X. Zheng, Y. Zheng, C.-W. Sham, X. Wang, Adversarial co-training for
semantic segmentation over medical images, Computers in biology and medicine 157
(2023) 106736.
[21] Y. Wang, Y. Zhang, Y. Liu, Z. Lin, J. Tian, C. Zhong, Z. Shi, J. Fan, Z. He, Acn: Adversarial
co-training network for brain tumor segmentation with missing modalities, in: Medical
Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International
Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VII 24,
Springer, 2021, pp. 410–420.
[22] S. D. Bhattacharjee, J. Yuan, Multimodal co-training for fake news identification using attention-aware fusion, Springer-Verlag, Berlin, Heidelberg, 2021.
[23] H. Xuan, A. Stylianou, X. Liu, R. Pless, Hard negative examples are hard, but useful, in:
A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, Springer
International Publishing, Cham, 2020, pp. 126–142.
[24] F. Yesiler, J. Serrà, E. Gómez, Accurate and scalable version identification using musically-
motivated embeddings, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2020, pp. 21–25. doi:10.1109/ICASSP40776.
2020.9053793.
[25] F. Yesiler, J. Serrà, E. Gómez, Less is more: Faster and better music version identification
with embedding distillation, in: International Society for Music Information Retrieval
Conference, 2020.
[26] F. Yesiler, C. J. Tralie, A. A. Correya, D. F. Silva, P. Tovstogan, E. Gómez, X. Serra, Da-tacos:
A dataset for cover song identification and understanding, in: ISMIR, 2019.
[27] S. Hachmeier, R. Jäschke, H. Saadatdoorabi, Music version retrieval from youtube: How to
formulate effective search queries?, in: P. Reuss, V. Eisenstadt, J. M. Schönborn, J. Schäfer
(Eds.), Proceedings of the LWDA 2022 Workshops: FGWM, FGKD, and FGDB, Hildesheim
(Germany), Oktober 5-7th, 2022, volume 3341 of CEUR Workshop Proceedings, CEUR-
WS.org, 2022, pp. 213–226. URL: https://ceur-ws.org/Vol-3341/WM-LWDA_2022_CRC_
7142.pdf.
[28] Y. Li, J. Li, Y. Suhara, A. Doan, W.-C. Tan, Deep entity matching with pre-trained language
models, Proceedings of the VLDB Endowment 14 (2020) 50–60. URL: https://doi.org/10.
14778%2F3421424.3421431. doi:10.14778/3421424.3421431.
[29] V. Likhosherstov, A. Arnab, K. Choromanski, M. Lucic, Y. Tay, A. Weller, M. Dehghani,
Polyvit: Co-training vision transformers on images, videos and audio, CoRR abs/2111.12993
(2021). URL: https://arxiv.org/abs/2111.12993. arXiv:2111.12993.
[30] A. Taha, Y.-T. Chen, T. Misu, A. Shrivastava, L. Davis, Boosting standard classification
architectures through a ranking regularizer, in: Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision, 2020, pp. 758–766.
[31] G. Doras, G. Peeters, A prototypical triplet loss for cover detection, in: ICASSP 2020 -
2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2020, pp. 3797–3801. doi:10.1109/ICASSP40776.2020.9054619.