<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Segmentation with Chord Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicolas Lazzari</string-name>
          <email>nicolas.lazzari2@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Poltronieri</string-name>
          <email>andrea.poltronieri2@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentina Presutti</string-name>
          <email>valentina.presutti@unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, University of Bologna</institution>
          ,
          <addr-line>Mura Anteo Zamboni, 7, Bologna 40126</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LILEC, University of Bologna</institution>
          ,
          <addr-line>Via Cartoleria, 5, Bologna 40124</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Structure perception is a fundamental aspect of music cognition in humans. Historically, the hierarchical organization of music into structures served as a narrative device for conveying meaning, creating expectancy, and evoking emotions in the listener. Musical structures thereby play an essential role in music composition, as they shape the musical discourse through which the composer organises their ideas. In this paper, we present a novel music segmentation method, pitchclass2vec, based on symbolic chord annotations, which are embedded into continuous vector representations using both natural language processing techniques and custom-made encodings. Our algorithm is based on a long short-term memory (LSTM) neural network and outperforms the state-of-the-art techniques based on symbolic chord annotations in the field.</p>
      </abstract>
      <kwd-group>
        <kwd>Chord Embeddings</kwd>
        <kwd>music structure analysis</kwd>
        <kwd>structural segmentation</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        One of the main factors that influence music perception is the hierarchical structure of music
compositions. Regardless of their level of musical knowledge and harmonic sensitivity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
or their cultural origins [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], listeners are able to use intuitive knowledge to organize their
perception of musical structures [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Indeed, there is empirical evidence that neural activity
correlates with musical structure in listeners’ perception [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The structuring and predictability
of musical compositions are also recognised as a viable therapy in the treatment and assessment
of children and adolescents with autistic spectrum disorder [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Music structuring is one of the tools used by composers to tell a story. According to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
“Music-making is, to a large degree, the manipulation of structural elements through the use
of repetition and change”. The repetition of harmonic progressions (sequences of chords), in
particular in the context of western tonal music, gives artists the ability to guide listeners
through a journey that creates dramatic narratives, conveying a sense of conflict that demands
a solution [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Figure 1 shows the structure of Helter Skelter by The Beatles, highlighting the musical chords
of the song. In the example, by means of the alternation between verse and refrain the artist
establishes a common repetitive pattern. The addition of an instrumental section after the
second refrain and the repetition of the intro reinforces the repetitive aspect of the composition.
The upcoming outro section denies the expectation of a new verse, right before the song ends.
Expectation and the way it is fulfilled or denied is an essential part of musical enjoyment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In
fact, it has been shown empirically that the emotional response to a musical composition varies
as the degree of repetition changes [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Understanding musical structures is hence fundamental
in music analysis and composition. Artists can benefit from the feedback provided by a system
able to highlight possible hierarchical structures in their compositions.
      </p>
      <p>
        Music structure segmentation is a broad term related to the study of musical form, which
describes how musical pieces are structured. In particular, it can be divided into two main
categories: phrase-structure segmentation and global segmentation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Phrase-structure segmentation
consists in detecting sections from the melodic information of a piece. While the aim of
phrase-structure is not to obtain a global segmentation, the detected sections provide valuable
insights in the task of global segmentation. In the following, we will refer to music structure
segmentation as the task of global segmentation. Music structure segmentation is a music
information retrieval (MIR) task that consists in identifying and labelling key music segments
(e.g. chorus, verse, bridge) of a music piece [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Given a musical composition, its musical
segmentation consists in the identification of non-overlapping segments, which we will refer to
as sections. Each section is characterized by a label that classifies its function such as intro or
verse in Figure 1. A correct segmentation does not necessarily assign the correct labels to each
section of the composition, but rather focuses on the correct estimation of the boundaries of
each section. Once boundaries have been accurately predicted, a labeling process is performed to
obtain the final annotation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Most of the recent methods and research approaches are based on audio analysis techniques
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], nonetheless harmonic information, isolated from tempo and rhythm, has been successfully
used in several tasks in the field of music information retrieval (MIR) (e.g. [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]).
      </p>
      <p>In this paper, we focus on the music structure segmentation task by only taking into account
harmonic information extracted from symbolic notations (music chord annotations). The
assumption behind this approach is that the identification of harmonic sub-sequences (harmonic
patterns) can be influential in defining the structure of a song and the sections of which it is
composed. For instance, by taking a closer look at Figure 1 it is easy to notice how harmonic
information can provide valuable information in the structure segmentation task: all verses
are roughly based on the same harmonic progression (E, G, A, E) while refrains are based on a
different harmonic progression (A, E, A, E, E). A segmentation strongly based on those recurrent
patterns is likely to be coherent with the way the composer shaped the progression in the first
place.</p>
      <p>The objective of this paper is threefold: (i) we propose pitchclass2vec, a novel chord embedding
method; (ii) we use this encoding with a recurrent neural network on a corpus of musical chords;
and (iii) we compare the performance of the encoding with the state-of-the-art methods in the
field.</p>
      <p>
        The chord embedding method proposed, pitchclass2vec, encodes a chord using a one-hot
encoding of the notes that compose it by making use of word embedding techniques. Each
embedded chord is defined to be similar to the embeddings of its neighbouring chords in a
harmonic progression. This formalization is supposed to approximate the semantic meaning of
a chord [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and has been widely used in the natural language processing field [
        <xref ref-type="bibr" rid="ref16">16, 17</xref>
        ]. We use
pitchclass2vec embeddings to train an LSTM neural network that predicts the section of each
chord. Through its recurrent layers the neural network is able to learn relationships between
the elements of a sequence. This allows the model to detect repetitive patterns of the harmonic
progression and predict a segmentation of the whole composition. The model provides a
baseline to test the efficacy of the proposed chord embedding method. State-of-the-art results
are achieved in the task of music structure segmentation on symbolic harmonic data, providing
evidence that pitchclass2vec is able to provide accurate chord representations. Moreover, the
embedding method employed here for the music segmentation task can be used in a variety of
applications in the field of Music Information Retrieval, such as retrieving harmonically related
pieces [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], automatic chord recognition [18] and music genre classification [19].
      </p>
      <p>The paper is organised as follows: Section 2 introduces the related work; Section 3 describes
the novel chord embedding method and the recurrent neural network used for the segmentation
task. Section 4 presents the experiments performed and Section 5 gives an overview of the
obtained results. Finally in Section 6 we discuss the results and new research directions to be
explored.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Music segmentation: state of the art</title>
      <p>
        Automatic segmentation of the audio signal is a prolific research field in which many different
solutions have been presented, ranging from self-similarity matrices [20, 21] to neural network
based methods [22, 23, 24]. Harmonic content has been used to improve those methods both
using probabilistic models [25] and transformer based models [26]. Significant research has
been performed on phrase-level structural segmentation based on melodic [27, 28, 29] as well as
polyphonic content [30]. However, to the best of our knowledge, the only approach proposed
in the literature for global music segmentation on symbolic harmonic content is FORM [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        FORM performs structural segmentation by exploiting repeated patterns extracted from
harmonic progressions encoded as sequences of strings. Each string represents a chord. In the
original work chord labels are transformed into 24 classes of chords, 12 major chords and 12 minor
chords, while every other chord feature is removed. In this paper, FORM is re-implemented
in order to compare the results of the proposed method with the current state of the art (see
Section 5). FORM's pattern detection algorithm is based on suffix trees. Each node in a suffix
tree represents a (possibly recurrent) sub-sequence of a string. FORM extracts sub-sequences
appearing in at least two positions in the analyzed harmonic progression. A partial segmentation
is obtained by labeling each sub-sequence as a new section. The final segmentation is obtained
by labeling the remaining sub-sequences as their preceding neighbouring section. The results are
compared with a random baseline that generates arbitrarily long structures and with a heuristic
that assigns to each composition the typical pop song structure ABBBBCCBBBBCCDCCE [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
in which each different label represents a structure in the chord progression and is stretched to
fit the whole sequence. The main issue with FORM is in the way chord labels are compared. The
string representation does not take into account semantic similarity between chords, nor is the
algorithm able to detect near-similar patterns, i.e. patterns whose differences can be ignored
in the context of music structure segmentation.
      </p>
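      <p>
        A minimal brute-force sketch of this repeated-pattern idea (FORM itself uses suffix trees for efficiency; the function name below is ours, not from the original work):
      </p>

```python
from collections import defaultdict

def repeated_subsequences(chords, min_len=2):
    """Collect every sub-sequence of length >= min_len that occurs at
    least twice (brute force; FORM uses a suffix tree for this step)."""
    positions = defaultdict(list)
    n = len(chords)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            positions[tuple(chords[i:j])].append(i)
    return {pat: pos for pat, pos in positions.items() if len(pos) >= 2}

# two repetitions of the verse-like progression from Figure 1
progression = ["E", "G", "A", "E", "E", "G", "A", "E"]
patterns = repeated_subsequences(progression)
longest = max(patterns, key=len)   # -> ("E", "G", "A", "E"), at indices 0 and 4
```

      <p>
        In a FORM-style segmentation, each such recurrent sub-sequence would then be labeled as a new section.
      </p>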
      <p>
        Our structure segmentation method is based on our novel chord representation method,
pitchclass2vec, which builds on continuous word representations. The core idea of continuous
word representation is the Distributional Hypothesis [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: the semantic meaning of a word w
can be approximated from the distribution of the words that appear within the context of w. The
objective of continuous word representation is the maximization of the following log-likelihood:
      </p>
      <p>∑_{t=1..T} ∑_{c ∈ C_t} log p(w_c | w_t),
where C_t represents the indices of the words that appear in the context of the word w_t, and the
function p(w_c | w_t) is parameterized using d-dimensional vectors in ℝ^d, respectively u_{w_t}
and v_{w_c}.</p>
      <p>The problem can be framed as a binary classification task in which words are predicted to be
present (or absent) in the context of w_t. A similarity function s(w_i, w_j) between two words w_i
and w_j can be computed as the scalar product u_{w_i} · v_{w_j}. The representations obtained by training
the described method on a large corpus correctly approximate the semantic meaning of words.
In the last few years, continuous word representation has been applied in a growing number
of application areas, achieving state-of-the-art results in the natural language processing field
[31] in tasks such as part-of-speech tagging [32], named entity recognition [33] and document
classification [34].</p>
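      <p>
        The binary-classification framing can be sketched with toy vectors (the sizes and variable names below are illustrative, not trained parameters):
      </p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 10, 6                  # toy embedding size and vocabulary
U = rng.normal(size=(vocab, d))   # "input" vectors u_w, one per word
V = rng.normal(size=(vocab, d))   # context ("output") vectors v_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_prob(target, context):
    """Probability that `context` appears near `target`:
    sigma(u_target . v_context), i.e. a squashed scalar product."""
    return sigmoid(U[target] @ V[context])

# training pushes this value towards 1 for observed (target, context)
# pairs and towards 0 for negatively sampled pairs
p = context_prob(0, 1)
```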
      <p>
        The described approach was first proposed in the
word2vec skipgram model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>Word2vec, however, is limited by its lack of morphological knowledge of a word. When
computing the representation of a word, none of its morphological components are taken into
account. Take for instance two morphologically similar words, house and housing. Their
computation does not share any common element, and the final representation of the words will
not be influenced by their similarities. Fasttext [17] was presented as a solution to this issue and
has proven to be more effective in the representation of a word using continuous representations.
The novel aspect is in the way representations are computed. At first, the n-grams that compose
a word are extracted. For each n-gram a continuous vector representation is computed, using
the same methodology as word2vec. The representation of the original word is finally obtained
as the sum of its n-gram components. Using this technique, the final representation of a word
is conditioned by its morphological structure. When two words share one or more n-grams,
their vectors will be the sum of at least one common element, which biases both vectors
towards being more similar to each other. An additional advantage of the fasttext approach is
the way out-of-vocabulary words (words that never appear in the training corpus, and whose
representation is hence unknown) are handled. When using a fixed-vocabulary approach such as word2vec,
out-of-vocabulary words are represented as a static vector, usually randomly sampled from a
normal distribution. Fasttext instead is able to compute the representation in a meaningful
way, given that at least one of the n-grams in the out-of-vocabulary term has been observed
in the training corpus. Figure 2 shows a visual comparison between word2vec (Figure
2a) and fasttext (Figure 2b).</p>
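      <p>
        The n-gram decomposition can be sketched as follows, reusing the house/housing example (the padding characters follow the fasttext convention of marking word boundaries):
      </p>

```python
def ngrams(word, n=3):
    """Character n-grams of a word, with fasttext-style boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# house and housing share several trigrams, so the summed representations
# of their n-gram components are biased towards each other
shared = set(ngrams("house")) & set(ngrams("housing"))
```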
      <p>Continuous word representations have already been applied to chord symbols with promising
results. In chord2vec [35] the authors obtain state-of-the-art results on the log-likelihood
estimation task. Log-likelihood estimation is the task of correctly estimating, given one element
in a sequence, the probability of another element being the upcoming element in the sequence.
Chord2vec is inspired by the word2vec method, with chords represented by the notes
they are composed of. The representation model proposed by chord2vec is similar to
pitchclass2vec. However, instead of computing chord representations from the notes that compose a chord,
pitchclass2vec takes into account the relationship between the notes that
compose each chord. An in-depth discussion is presented in Section 3.</p>
      <p>More recently, a word2vec-based approach on symbolic chord annotations has been analyzed
by [36]. Chord representations are based on the chord label, without taking into account the
notes that compose it. The encoding is then used on two different tasks: chord clustering and
log-likelihood estimation. The log-likelihood estimation task is used to investigate the historical
harmonic style of different composers. The log-likelihood results strongly correlate with
current musicological knowledge. For instance, the model finds it difficult to predict chords from
artists that make only sporadic use of common harmonic progressions [36]. The chord clustering
task highlights how it is possible to observe similarities between functionally equivalent chords
(chords that share notes with each other) and a well-defined difference between functionally
different chords. Continuous word representations are hence adequate to encode chords in the
first place, and more importantly they are able to autonomously internalize relations between
chords that have been previously observed by domain experts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Pitchclass2vec model</title>
      <p>Embedding approaches used in natural language processing have obvious limitations when
it comes to dealing with musical content, such as musical chords. While relying on purely
syntactical representations has been shown to correctly encapsulate some forms of domain
knowledge [36], more advanced representations are needed to obtain accurate results when
dealing with harmonic progressions [35].</p>
      <p>There are, however, some ambiguous cases in which both vector representations might
introduce wrong similarities between chords. Let us take for instance the chords C:maj and
C:maj13¹, whose notes are respectively N(C:maj) = {C, E, G}
and N(C:maj13) = {C, E, G, B, D, A}. Both
chords' labels only differ by two characters, however the difference between the notes they
are composed of can't be neglected. A method exclusively based on syntactical information
would wrongly represent the vectors as similar to each other. Conversely, only relying
on the notes that compose a chord results in ambiguous representations of some particular
classes of chords, called enharmonic chords. For instance, the enharmonic chords C:dim and
Eb:dim share the exact same set of notes, N(C:dim) = N(Eb:dim) = {C, Eb, Gb, A}, but need to be
represented as different chords as they serve different harmonic purposes. Chord2vec would
wrongly represent both chords as the same exact vector.</p>
      <p>In order to overcome the aforementioned limitations, we propose an encoding whose
requirements can be summarised as follows: 1. it has to be based on the constituent notes of a chord,
rather than its label; and 2. it must take into account the relations between those notes instead
of the notes themselves.</p>
      <p>The proposed encoding is grounded in tonal music theory: each chord c is composed of a
set of notes N(c) ⊂ N, where N is the set of all notes, and N(c) is called the pitch class of the chord.
An important distinction is represented by the root note, which names the chord and plays an
important role in its harmonic function.</p>
      <p>We encode each chord as the Cartesian product I(c) = {root(c)} × N(c) between the root note and
the pitch class of the chord. The vector representation u_c of a chord c is computed as
u_c = ∑_{i ∈ I(c)} u_i,
where u_i is the vector representation of the tuple i ∈ I(c). See Figure 3b for a visual reference
on how pitchclass2vec handles enharmonic chords and Figure 3a for how chords with common
components are handled. This formalization can be seen as an extension of the chord2vec [35]
method, in which the chord's inner structure is taken into consideration as well.</p>
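      <p>
        A minimal sketch of this encoding (the lookup table below is randomly initialised purely for illustration; in pitchclass2vec its entries are learned during training):
      </p>

```python
import numpy as np

NOTES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

def index_set(root, pitch_class):
    """I(c) = {root} x N(c): one (root, note) tuple per note of the chord."""
    return {(root, note) for note in pitch_class}

rng = np.random.default_rng(0)
# one d-dimensional vector per (root, note) tuple; learned in the real model
table = {(r, n): rng.normal(size=10) for r in NOTES for n in NOTES}

def embed(root, pitch_class):
    """u_c = sum of the vectors of the tuples in I(c)."""
    return sum(table[i] for i in index_set(root, pitch_class))

# enharmonic chords share the same pitch class but have different roots,
# hence different index sets and non-identical embeddings
u_c_dim = embed("C", {"C", "Eb", "Gb", "A"})
u_eb_dim = embed("Eb", {"C", "Eb", "Gb", "A"})
```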
      <p>
        Nevertheless, the label of a chord has a well-defined semantics. Chords composed of the same
set of notes may have different harmonic functions. For example, the chords G:min7 and Bb:6,
despite their different labels, contain the exact same notes: N(G:min7) = N(Bb:6) = {G, Bb, D, F}. This
problem is particularly evident in datasets containing annotations made by experts, where the
choice of label is the result of a meticulous analysis. For this reason, we have implemented two
different variants of pitchclass2vec: (i) a variant combining the approach proposed by word2vec
with pitchclass2vec; and (ii) a variant combining fasttext with pitchclass2vec.
      </p>
      <p>
        1 In Harte [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] notation.
      </p>
      <p>Figure 3: (a) C:maj and C:maj9 chord embeddings: the final representation is computed from
common elements and will hence share some aspects. (b) C:dim and Eb:dim chord embeddings: both
chords are composed of the same notes but use mostly different components.</p>
      <p>In order to obtain mixed embeddings we test different hybrid combinations before passing
the new representation to the LSTM model:
(i) concatenating the embeddings;
(ii) concatenating the embeddings and projecting the result into a d-dimensional vector, using
a fully connected layer;
(iii) projecting the embeddings into the same d-dimensional space by using two different fully
connected layers and summing the d-dimensional vectors;
(iv) computing a new representation of each embedding by using two separate LSTM layers
and summing the resulting vectors;
(v) computing a new representation of each embedding by using two separate LSTM layers
and concatenating the resulting vectors.</p>
      <p>
        None of the combinations proved able to outperform the others, so we decided to
stick with the first, simpler and faster approach.
      </p>
      <p>
        3.1. Implementation details
      </p>
      <p>
        The model is implemented using pytorch. We train the model on a set of ≈ 16,000 chord
progressions (with a total of over 1 million chord instances), taken from the Chord Corpus
(ChoCo) dataset [37]. ChoCo is a chord dataset consisting of more than 20,000 tracks taken
from 18 different professionally curated datasets. All datasets have been parsed into the JAMS [38]
format and converted into Harte notation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We train the model for at most 10 epochs on an
NVIDIA RTX 3090 with batch sizes of 512 harmonic progressions. We manually tune the batch
size to efficiently train the model on our available resources. For each chord we take a window
of 4 context chords as positive examples, 2 preceding and 2 succeeding, as done
in the original fasttext implementation [17]. Then, we sample 20 random chords as negative
examples. Even though it has been shown that windows of different sizes yield different results
depending on the task they are applied to [
        <xref ref-type="bibr" rid="ref17">39</xref>
        ], here we rely on a fixed-size window to
better compare with the related works. We subsample our corpus to obtain a more balanced
one by removing some of the most frequent chord instances. We use a factor of t = 10^−5 as
suggested by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to allow a faster and more accurate training phase. The model is trained
using a standard training procedure where a binary cross entropy loss between a chord and its
positive and negative examples is minimized using the Adam optimizer, with a fixed learning rate of
0.025. We set the embedding dimension to 10 as the result of manual trials.
      </p>
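      <p>
        The exact subsampling variant is not spelled out above; assuming the formulation suggested in [16], an occurrence of an item with relative frequency f is discarded with probability 1 − sqrt(t / f):
      </p>

```python
import math

def discard_prob(freq, t=1e-5):
    """word2vec-style subsampling: an occurrence of an item with relative
    frequency `freq` is dropped with probability 1 - sqrt(t / freq)."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# a chord covering 10% of the corpus is dropped 99% of the time,
# while a very rare chord is always kept
common = discard_prob(0.10)   # 0.99
rare = discard_prob(1e-6)     # 0.0
```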
    </sec>
    <sec id="sec-4">
      <title>4. Experimental setup</title>
      <p>
        This section shows how the proposed model compares to FORM [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the state-of-the-art
approach in the field of music segmentation. We develop a baseline model using a stacked
LSTM-based neural network, depicted in Figure 4. The model's objective is to predict the
section to which each chord belongs. We train our model on the Billboard dataset [
        <xref ref-type="bibr" rid="ref18">40</xref>
        ] provided by
mirdata [
        <xref ref-type="bibr" rid="ref19">41</xref>
        ]. The dataset is composed of 889 expert-annotated tracks. Each track is composed of
a sequence of chords in Harte format [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and a sequence of structure labels. Labels are provided
in a format similar to the one presented by SALAMI [
        <xref ref-type="bibr" rid="ref20">42</xref>
        ]. 80 unique section labels are present in
the whole dataset. We preprocess each label and reduce the number of unique labels to 11 by
combining all the labels that fall under the same definition given by [
        <xref ref-type="bibr" rid="ref20">42</xref>
        ]. A complete reference
for the label conversion step is given in Table 1.
      </p>
      <p>
        Although in the literature there are neural network architectures that have proven to perform
better in similar tasks [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">43, 44, 45</xref>
        ], we deliberately decided to use a very straightforward
architecture. This is due to the fact that the aim of this study is to compare different types of embedding,
rather than to achieve the best performance.
      </p>
      <p>
        Table 1: label conversion (original labels → converted label):
[verse] → verse;
[prechorus, pre chorus] → prechorus;
[chorus] → chorus;
[fadein, fade in, intro] → intro;
[outro, coda, fadeout, fade-out, ending] → outro;
[applause, bass, choir, clarinet, drums, flute, harmonica, harpsichord, instrumental, instrumental break, noise, oboe, organ, piano, rap, saxophone, solo, spoken, strings, synth, synthesizer, talking, trumpet, vocal, voice, guitar] → instrumental;
[main theme, theme, secondary theme] → theme;
[transition, tran] → transition;
[modulation, key change] → other.
      </p>
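      <p>
        The label conversion of Table 1 amounts to a simple lookup; a condensed sketch (the helper name is ours, and only a subset of the instrumental group is shown):
      </p>

```python
# target label -> raw Billboard labels it absorbs (condensed from Table 1)
LABEL_GROUPS = {
    "verse": {"verse"},
    "prechorus": {"prechorus", "pre chorus"},
    "chorus": {"chorus"},
    "intro": {"fadein", "fade in", "intro"},
    "outro": {"outro", "coda", "fadeout", "fade-out", "ending"},
    "instrumental": {"instrumental", "solo", "guitar", "piano", "drums"},
    "theme": {"main theme", "theme", "secondary theme"},
    "transition": {"transition", "tran"},
}

def convert(raw_label):
    """Map a raw section label onto its reduced class; anything that
    matches no group falls back to the "other" class."""
    for target, raw_labels in LABEL_GROUPS.items():
        if raw_label in raw_labels:
            return target
    return "other"
```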
      <p>We split our dataset into the usual training, validation and test splits (respectively 800, 178 and
89 elements) and fine-tune each model's hyper-parameters (number of stacked LSTM layers,
LSTM hidden size, dropout probability) to obtain the best results on the validation set. The final
configuration of each model is summarised in Table 2. Training is performed using an NVIDIA
RTX 3090 with a batch size of 128. Each model takes at most a few minutes to train and averages
fewer than 2 million parameters.</p>
      <p>
        We also compare the proposed embedding model to fasttext and word2vec, with both
methods trained on the string labels of chords in Harte format. Both models are trained
using the highly optimized gensim [
        <xref ref-type="bibr" rid="ref24">46</xref>
        ] implementation. The hyperparameters used are the
same as those described in Section 3.1, except for the embedding dimension, which is set to
300.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
        The results of the experiments are summarised in Table 3. We evaluate the segmentation results
by computing pairwise precision, recall and F1-score (P, R and F1 in Table 3) [
        <xref ref-type="bibr" rid="ref25">47</xref>
        ] along with
under-segmentation, over-segmentation and normalized conditional entropy F1 (Su, So and Sf1 in
Table 3) [49]. Under-segmentation and over-segmentation are two metrics specific to the evaluation
of automatic music segmentation methods. When a method has a high over-segmentation
measure, the final prediction accuracy is influenced mostly by false fragmentation. Conversely,
a high under-segmentation measure means that the prediction’s segments are the result of
ground-truth segments being merged together [
        <xref ref-type="bibr" rid="ref26">48</xref>
        ].
      </p>
      <p>
        Pairwise metrics are computed as the usual precision, recall and F1 scores on the set of
identically labeled pairs in the sequence. Precision and recall can be interpreted as the amount
of accuracy that is influenced respectively by under-segmentation and over-segmentation. On
the other hand, the under- and over-segmentation scores are computed by taking into account the
normalized conditional entropy of the segmentation. In short, Su gives a measure of how much
information is missing from the predicted segmentation given the ground truth segmentation, while
So gives a measure of how much noisy information results from the predicted segmentation
[
        <xref ref-type="bibr" rid="ref26">48</xref>
        ]. A graphical explanation of these concepts is provided in Figure 5 (all the examples are
taken from [
        <xref ref-type="bibr" rid="ref26">48</xref>
        ]).
      </p>
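      <p>
        The pairwise metrics can be sketched as follows (toy labels; note how an over-segmented prediction keeps precision high while recall drops, since every same-section pair it asserts is correct but many ground-truth pairs are missed):
      </p>

```python
from itertools import combinations

def label_pairs(labels):
    """Index pairs (i, j), i < j, annotated with the same section label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def pairwise_scores(truth, pred):
    t, p = label_pairs(truth), label_pairs(pred)
    hits = len(t & p)
    precision = hits / len(p) if p else 0.0
    recall = hits / len(t) if t else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# an over-segmented prediction: the true "A" section is split in two
truth = ["A", "A", "A", "A", "B", "B"]
pred  = ["A", "A", "C", "C", "B", "B"]
precision, recall, f1 = pairwise_scores(truth, pred)
```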
      <p>We evaluate our models based on the F1 and Sf1 scores of Table 3, since both metrics give a
balanced measure of over- and under-segmentation.</p>
      <p>
        FORMsimple detects repetitive patterns from simplified chord labels, as shown in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The
chord simplification process extracts the root note from the chord and classifies it either as
major or minor. FORMraw uses the same labels used by the neural approaches. The former
performs better than the latter. This is not surprising, as more patterns between strings
can be uncovered by only taking into account 24 labels (12 root notes, each of which can be
either minor or major). Su and So, however, suggest that the over-segmentation-prone approach
of FORM ends up correctly segmenting only some particular portions of the whole composition,
while the other ones are wrongly classified. This is an expected behaviour, since FORM only
relies on label-based repeated patterns. Subtle differences between harmonic progressions
that belong to the same section, such as the first and second verse in Figure 1, are enough
to prevent the corresponding repeated pattern from being detected.
      </p>
      <p>
        Table 3: evaluation metrics on the test set. FORMraw is computed on the same chord labels that are used for
all the neural approaches. FORMsimple is computed on the simplified chord representation as done in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]:
only the chord root and its quality (major or minor) are kept.
      </p>
      <p>(a) High over-segmentation example. The pairwise recall equals 1, since if we take each chord in the
sequence pairwise, then every pair of chords that should be in the same section is indeed in the same
section. S<sub>o</sub> = 1, since the accuracy of the prediction can easily be explained by the
over-segmentation phenomenon. Conversely, a pairwise precision of 0.53
and S<sub>u</sub> = 0.53 clearly show that the prediction is not able to capture all the needed segments but
rather merges ground-truth segments together.
(b) High under-segmentation example. The exact opposite of Figure (a) is displayed. The prediction is
not able to capture segments and instead places each chord in its own segment.
(c) Pairwise precision and recall compared with S<sub>o</sub> and S<sub>u</sub>. In this edge case the main difference between the two kinds of measures is
highlighted. While the pairwise scores suggest a decent segmentation, S<sub>o</sub> and S<sub>u</sub> clearly indicate a completely wrong
segmentation. Pairwise metrics can be misleading in the absence of S<sub>o</sub> and S<sub>u</sub>.</p>
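      <p>
To make the pairwise metrics concrete, they can be computed directly from per-chord section labels; the following is a simplified sketch (the label sequences and helper names are ours, not the actual evaluation code):

```python
from itertools import combinations

def pairwise_scores(reference, estimate):
    """Pairwise precision/recall/F1 over per-chord section labels."""
    idx = range(len(reference))
    ref_pairs = {p for p in combinations(idx, 2) if reference[p[0]] == reference[p[1]]}
    est_pairs = {p for p in combinations(idx, 2) if estimate[p[0]] == estimate[p[1]]}
    hits = len(ref_pairs & est_pairs)
    precision = hits / len(est_pairs) if est_pairs else 0.0
    recall = hits / len(ref_pairs) if ref_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# An under-segmenting estimate that merges verse and chorus into one section:
ref = ["verse"] * 4 + ["chorus"] * 4
est = ["all"] * 8
p, r, f1 = pairwise_scores(ref, est)   # recall is 1, precision only 12/28
```

Merging sections preserves every reference pair (perfect recall) while introducing many spurious estimated pairs (low precision), which is exactly the asymmetry the figure illustrates.
      </p>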
      <p>All the neural models in Table 3 outperform FORM. Even though each neural model shows
some differences in terms of metrics, the clear trend is that the syntax-based models (fasttext
and word2vec) yield overly segmented results, as the low S<sub>o</sub> score suggests, while the approach
taken by pitchclass2vec produces a more balanced segmentation, as suggested by its over-under-segmentation F1 score.
Surprisingly, fasttext and word2vec under-perform the FORM baseline on the over-under-segmentation
F1 metric. The model is not able to generalize enough over the representation and cannot
detect patterns that are detected by FORM. Finally, the combination of pitchclass2vec with either
fasttext or word2vec does not bring any remarkable benefit to our novel representation. Even
though a higher pairwise F1 score is obtained by the hybrid approach, its lower over-under-segmentation
F1 score suggests an underlying, less accurate segmentation, similar to example (c) in Figure 5.</p>
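      <p>
The over- and under-segmentation scores used above are normalized conditional entropies; a simplified per-chord sketch (our own illustration, which ignores the time-weighted framing of the original metric) is:

```python
from collections import Counter
from math import log2

def nce_scores(reference, estimate):
    """Simplified over-/under-segmentation scores from conditional entropy."""
    n = len(reference)
    joint = Counter(zip(reference, estimate))
    ref_sizes = Counter(reference)
    est_sizes = Counter(estimate)
    # H(E|R): uncertainty about the estimated label given the reference label.
    h_e_given_r = sum((c / n) * log2(ref_sizes[r] / c) for (r, e), c in joint.items())
    # H(R|E): uncertainty about the reference label given the estimated label.
    h_r_given_e = sum((c / n) * log2(est_sizes[e] / c) for (r, e), c in joint.items())
    s_over = 1 - h_e_given_r / log2(len(est_sizes)) if len(est_sizes) > 1 else 1.0
    s_under = 1 - h_r_given_e / log2(len(ref_sizes)) if len(ref_sizes) > 1 else 1.0
    return s_over, s_under

# A maximally over-segmented estimate: every chord gets its own section.
ref = ["verse"] * 4 + ["chorus"] * 4
est = [f"s{i}" for i in range(8)]
s_over, s_under = nce_scores(ref, est)   # s_over drops to 1/3, s_under stays 1
```

An estimate that splits every section leaves the reference fully determined given the prediction (so one score stays at 1), while the prediction is highly uncertain given the reference (so the other score drops), which is why a low score on a single side flags over- or under-segmentation.
      </p>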
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this article, we presented a new embedding method for musical chords, pitchclass2vec,
that considers the component notes of the chord (its pitch classes) instead of the chord
label used by embedding methods from the natural language processing field. In addition, we
proposed hybrid embedding forms, which combine embeddings of the chord label with the novel
pitchclass2vec. We compared different embedding models, including pitchclass2vec, with the
state-of-the-art approach in the field of music structural segmentation. We used ChoCo, a
dataset of chord annotations, for training the embeddings, and Billboard, a dataset of structurally
annotated tracks, for the music segmentation task. We fed the different types of embeddings to
a recurrent neural network (LSTM). The results obtained by using our embeddings outperform
the state of the art in every case, with the best result obtained by pitchclass2vec, achieving a
pairwise F1 score of 0.548 and an over-under-segmentation F1 score of 0.538.</p>
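      <p>
The note-level representation mentioned above can be illustrated as follows; this is a minimal sketch (the interval table and helper names are ours) of mapping a chord to its 12-dimensional pitch-class vector:

```python
# Pitch class of each root note (enharmonic spellings share a class).
NOTES = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
         "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10,
         "Bb": 10, "B": 11}
INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7)}   # triads only, for brevity

def pitch_classes(root, quality):
    """Sorted pitch classes of a triad, e.g. C:maj -> [0, 4, 7]."""
    return sorted((NOTES[root] + i) % 12 for i in INTERVALS[quality])

def chroma(pcs):
    """12-dimensional binary vector marking the chord's pitch classes."""
    return [1 if pc in pcs else 0 for pc in range(12)]

cmaj = pitch_classes("C", "maj")   # [0, 4, 7]  (C, E, G)
amin = pitch_classes("A", "min")   # [0, 4, 9]  (A, C, E)
```

Note how C:maj and A:min share two pitch classes, a relationship that is invisible when the strings "C:maj" and "A:min" are compared as opaque labels.
      </p>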
      <p>Pitchclass2vec is effectively able to learn the harmonic relationships that tie different chords
together. Even though the experiments based on fasttext and word2vec prove to be effective as
well, the music-theoretical approach upon which we base pitchclass2vec is an essential factor
that needs to be taken into account. The presented embedding model proves to be a promising
method to improve results in MIR tasks that can be complemented with harmonic information.
Moreover, it provides a valuable tool to better understand and analyse harmonic progressions,
since it allows a richer comparison between chords and chord sequences than string labels do.</p>
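      <p>
Such comparisons typically reduce to vector similarity; a small sketch with cosine similarity (the 4-dimensional embeddings below are invented purely for illustration):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical embeddings: in a well-trained space, harmonically related
# chords (C:maj and its relative minor A:min) should lie closer together
# than unrelated ones (C:maj and F#:maj).
emb = {
    "C:maj":  [0.9, 0.1, 0.3, 0.0],
    "A:min":  [0.8, 0.2, 0.4, 0.1],
    "F#:maj": [-0.7, 0.6, -0.2, 0.5],
}
related = cosine(emb["C:maj"], emb["A:min"])
unrelated = cosine(emb["C:maj"], emb["F#:maj"])
```

Chord-sequence comparisons follow the same pattern, e.g. by aggregating or aligning the per-chord vectors before measuring similarity.
      </p>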
      <p>There is additional information that we plan to integrate into pitchclass2vec to obtain a
richer and more accurate representation. For instance, one of the main limitations of our
approach stems from the fact that we do not take temporal information into account. We plan
to test this possibility by using the temporal information directly in the embedding process
and by further modifying the LSTM model to condition the classification of the section of a chord
on its duration. As discussed in Section 3, chord labels have their own semantics as
well. Since the proposed hybrid models did not directly result in more accurate results, we
plan on expanding the pitchclass2vec method to take the label of a chord into account
directly in its embedding model. To obtain a semantically richer representation, we plan on
enhancing pitchclass2vec with deep contextual word embeddings [50] along with knowledge
enhancement techniques [51] that combine domain-specific ontologies, such as [52], with deep
contextual word embeddings.</p>
      <p>
        Moreover, we plan an in-depth analysis of the pitchclass2vec training parameters described in
Section 3.1, since in [
        <xref ref-type="bibr" rid="ref17">39</xref>
        ] the authors showed that, on a product recommendation task, carefully
optimized hyper-parameters nearly doubled the final accuracy across all their experiments.
      </p>
      <p>
        It is worth mentioning that the LSTM model that we implemented for the structure
segmentation task does not take advantage of a fundamental aspect of the task itself: the segmentation
of a musical piece should be conditioned on its musical genre. In fact, the annotation guidelines
provided in [
        <xref ref-type="bibr" rid="ref20">42</xref>
        ] define some genre-specific labels and encourage their use whenever applicable.
Even though a relabeling process can partially solve this issue, the need for genre-specific labels
proves that the use cases of specific structure labels, for instance theme, might differ across
musical genres. Finally, as already discussed in Section 4, many recent techniques have been
shown to be effective in increasing the accuracy of different tasks in the NLP field [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">43, 44, 45</xref>
        ]
when using continuous word representations. We plan to address these issues in future work.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 101004746.</p>
      <p>[16] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations
in vector space, in: Y. Bengio, Y. LeCun (Eds.), 1st International Conference on Learning
Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track
Proceedings, 2013. URL: http://arxiv.org/abs/1301.3781.
[17] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword
information, Trans. Assoc. Comput. Linguistics 5 (2017) 135–146. URL: https://doi.org/10.
1162/tacl_a_00051. doi:10.1162/tacl\_a\_00051.
[18] K. O’Hanlon, M. B. Sandler, Fifthnet: Structured compact neural networks for automatic
chord recognition, IEEE ACM Trans. Audio Speech Lang. Process. 29 (2021) 2671–2682.</p>
      <p>URL: https://doi.org/10.1109/TASLP.2021.3070158. doi:10.1109/TASLP.2021.3070158.
[19] H. Liang, W. Lei, P. Y. Chan, Z. Yang, M. Sun, T. Chua, Pirhdy: Learning pitch-, rhythm-,
and dynamics-aware embeddings for symbolic music, in: C. W. Chen, R. Cucchiara, X. Hua,
G. Qi, E. Ricci, Z. Zhang, R. Zimmermann (Eds.), MM ’20: The 28th ACM International
Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, ACM,
2020, pp. 574–582. URL: https://doi.org/10.1145/3394171.3414032. doi:10.1145/3394171.
3414032.
[20] R. J. Weiss, J. P. Bello, Unsupervised discovery of temporal structure in music, IEEE J. Sel.</p>
      <p>Top. Signal Process. 5 (2011) 1240–1251. URL: https://doi.org/10.1109/JSTSP.2011.2145356.
doi:10.1109/JSTSP.2011.2145356.
[21] R. J. Weiss, J. P. Bello, Identifying repeated patterns in music using sparse convolutive
non-negative matrix factorization, in: J. S. Downie, R. C. Veltkamp (Eds.), Proceedings
of the 11th International Society for Music Information Retrieval Conference, ISMIR
2010, Utrecht, Netherlands, August 9-13, 2010, International Society for Music Information
Retrieval, 2010, pp. 123–128. URL: http://ismir2010.ismir.net/proceedings/ismir2010-23.pdf.
[22] G. Shibata, R. Nishikimi, K. Yoshii, Music structure analysis based on an LSTM-HSMM
hybrid model, in: J. Cumming, J. H. Lee, B. McFee, M. Schedl, J. Devaney, C. McKay,
E. Zangerle, T. de Reuse (Eds.), Proceedings of the 21st International Society for Music
Information Retrieval Conference, ISMIR 2020, Montreal, Canada, October 11-16, 2020,
2020, pp. 23–29. URL: http://archives.ismir.net/ismir2020/paper/000005.pdf.
[23] J. Wang, J. B. L. Smith, W. T. Lu, X. Song, Supervised metric learning for music structure
features, in: J. H. Lee, A. Lerch, Z. Duan, J. Nam, P. Rao, P. van Kranenburg, A.
Srinivasamurthy (Eds.), Proceedings of the 22nd International Society for Music Information
Retrieval Conference, ISMIR 2021, Online, November 7-12, 2021, 2021, pp. 730–737. URL:
https://archives.ismir.net/ismir2021/paper/000091.pdf.
[24] A. Marmoret, J. E. Cohen, N. Bertin, F. Bimbot, Uncovering audio patterns in music with
nonnegative tucker decomposition for structural segmentation, CoRR abs/2104.08580
(2021). URL: https://arxiv.org/abs/2104.08580. arXiv:2104.08580.
[25] J. Pauwels, F. Kaiser, G. Peeters, Combining harmony-based and novelty-based
approaches for structural segmentation, in: A. de Souza Britto Jr., F. Gouyon, S. Dixon
(Eds.), Proceedings of the 14th International Society for Music Information Retrieval
Conference, ISMIR 2013, Curitiba, Brazil, November 4-8, 2013, 2013, pp. 601–606. URL:
http://www.ppgia.pucpr.br/ismir2013/wp-content/uploads/2013/09/138_Paper.pdf.
[26] T. Chen, L. Su, Harmony transformer: Incorporating chord segmentation into harmony
recognition, in: A. Flexer, G. Peeters, J. Urbano, A. Volk (Eds.), Proceedings of the 20th
International Society for Music Information Retrieval Conference, ISMIR 2019, Delft,
The Netherlands, November 4-8, 2019, 2019, pp. 259–267. URL: http://archives.ismir.net/
ismir2019/paper/000030.pdf.
[27] B. W. Frankland, A. J. Cohen, Parsing of melody: Quantification and testing of the local
grouping rules of Lerdahl and Jackendoff's A Generative Theory of Tonal Music, Music
Perception 21 (2004) 499–543.
[28] E. Cambouropoulos, The local boundary detection model (LBDM) and its application in
the study of expressive timing, in: Proceedings of the 2001 International Computer Music
Conference, ICMC 2001, Havana, Cuba, September 17-22, 2001, Michigan Publishing, 2001.</p>
      <p>URL: https://hdl.handle.net/2027/spo.bbp2372.2001.021.
[29] G. Velarde, T. Weyde, D. Meredith, An approach to melodic segmentation and classification
based on filtering with the haar-wavelet, Journal of New Music Research 42 (2013) 325–345.
[30] D. Meredith, K. Lemström, G. A. Wiggins, Algorithms for discovering repeated patterns in
multidimensional representations of polyphonic music, Journal of New Music Research 31
(2002) 321–345.
[31] Q. Jiao, S. Zhang, A brief survey of word embedding and its recent development, in:
2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control
Conference (IAEAC), volume 5, 2021, pp. 1697–1701. doi:10.1109/IAEAC50856.2021.
9390956.
[32] S. Meftah, N. Semmar, A neural network model for part-of-speech tagging of social media
texts, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara,
B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.),
Proceedings of the Eleventh International Conference on Language Resources and Evaluation,
LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association
(ELRA), 2018. URL: http://www.lrec-conf.org/proceedings/lrec2018/summaries/913.html.
[33] J. P. C. Chiu, E. Nichols, Named entity recognition with bidirectional lstm-cnns, Trans.</p>
      <p>Assoc. Comput. Linguistics 4 (2016) 357–370. URL: https://doi.org/10.1162/tacl_a_00104.
doi:10.1162/tacl\_a\_00104.
[34] J. Lilleberg, Y. Zhu, Y. Zhang, Support vector machines and word2vec for text classification
with semantic features, in: N. Ge, J. Lu, Y. Wang, N. Howard, P. Chen, X. Tao, B. Zhang, L. A.
Zadeh (Eds.), 14th IEEE International Conference on Cognitive Informatics &amp; Cognitive
Computing, ICCI*CC 2015, Beijing, China, July 6-8, 2015, IEEE Computer Society, 2015,
pp. 136–140. URL: https://doi.org/10.1109/ICCI-CC.2015.7259377. doi:10.1109/ICCI- CC.
2015.7259377.
[35] S. Madjiheurem, L. Qu, C. Walder, Chord2vec: Learning musical chord embeddings, in:
Proceedings of the constructive machine learning workshop at 30th conference on neural
information processing systems (NIPS2016), Barcelona, Spain, 2016.
[36] E. Anzuoni, S. Ayhan, F. Dutto, A. McLeod, M. Rohrmeier, A historical analysis of
harmonic progressions using chord embeddings, in: Proceedings of the 18th Sound and
Music Computing Conference, 2021, pp. 284–291.
[37] J. de Berardinis, A. Meroño-Peñuela, A. Poltronieri, V. Presutti, Choco: a chord corpus and
a data transformation workflow for musical harmony knowledge graphs, in: Manuscript
under review, 2022.
[38] E. J. Humphrey, J. Salamon, O. Nieto, J. Forsyth, R. M. Bittner, J. P. Bello, JAMS: A JSON
annotated music specification for reproducible MIR research, in: H. Wang, Y. Yang, J. H. Lee
(Eds.), Proceedings of the 15th International Society for Music Information Retrieval Conference,
ISMIR 2014, Taipei, Taiwan, October 27-31, 2014, 2014, pp. 591–596. URL:
http://www.terasoft.com.tw/conf/ismir2014/proceedings/T106_355_Paper.pdf.
… Information Retrieval, Drexel University, Philadelphia, PA, USA, September 14-18, 2008,
2008, pp. 375–380. URL: http://ismir2008.ismir.net/papers/ISMIR2008_219.pdf.
[49] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. W. Ellis, mir_eval:
A transparent implementation of common MIR metrics, in: H. Wang, Y. Yang, J. H. Lee
(Eds.), Proceedings of the 15th International Society for Music Information Retrieval
Conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014, 2014, pp. 367–372. URL:
http://www.terasoft.com.tw/conf/ismir2014/proceedings/T066_320_Paper.pdf.
[50] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep
contextualized word representations, in: Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New
Orleans, Louisiana, 2018, pp. 2227–2237. URL: https://aclanthology.org/N18-1202. doi:10.
18653/v1/N18-1202.
[51] M. E. Peters, M. Neumann, R. Logan, R. Schwartz, V. Joshi, S. Singh, N. A. Smith, Knowledge
enhanced contextual word representations, in: Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational
Linguistics, Hong Kong, China, 2019, pp. 43–54. URL: https://aclanthology.org/D19-1005.
doi:10.18653/v1/D19-1005.
[52] S. Kantarelis, E. Dervakos, N. Kotsani, G. Stamou, Functional harmony ontology: Musical
harmony analysis with description logics, Journal of Web Semantics 75 (2023) 100754.
URL: https://www.sciencedirect.com/science/article/pii/S1570826822000385. doi:https:
//doi.org/10.1016/j.websem.2022.100754.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aiello</surname>
          </string-name>
          , T. Bever,
          <article-title>Harmonic structure as a determinant of melodic organization</article-title>
          ,
          <source>Memory &amp; cognition 9</source>
          (
          <year>1981</year>
          )
          <fpage>533</fpage>
          -
          <lpage>9</lpage>
          . doi:
          <volume>10</volume>
          .3758/BF03202347.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <article-title>Music perception and cognition: A review of recent cross-cultural research</article-title>
          ,
          <source>Top. Cogn. Sci. 4</source>
          (
          <year>2012</year>
          )
          <fpage>653</fpage>
          -
          <lpage>667</lpage>
          . URL: https://doi.org/10.1111/j.1756-
          <fpage>8765</fpage>
          .
          <year>2012</year>
          .
          <volume>01215</volume>
          .x. doi:
          <volume>10</volume>
          .1111/j.1756-
          <fpage>8765</fpage>
          .
          <year>2012</year>
          .
          <volume>01215</volume>
          .x.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmuckler</surname>
          </string-name>
          ,
          <article-title>Phrasing influences the recognition of melodies</article-title>
          ,
          <source>Psychonomic bulletin &amp; review 4</source>
          (
          <year>1997</year>
          )
          <fpage>254</fpage>
          -
          <lpage>9</lpage>
          . doi:
          <volume>10</volume>
          .3758/BF03209402.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Krumhansl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Jusczyk</surname>
          </string-name>
          ,
          <article-title>Infants' perception of phrase structure in music</article-title>
          ,
          <source>Psychological Science</source>
          <volume>1</volume>
          (
          <year>1990</year>
          )
          <fpage>70</fpage>
          -
          <lpage>73</lpage>
          . URL: http://www.jstor.org/stable/40062394.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wigram</surname>
          </string-name>
          , C. Gold,
          <article-title>Music therapy in the assessment and treatment of autistic spectrum disorder: clinical application and research evidence</article-title>
          ,
          <source>Child: care, health and development</source>
          <volume>32</volume>
          (
          <year>2006</year>
          )
          <fpage>535</fpage>
          -
          <lpage>542</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <article-title>A typology of 'hooks' in popular records</article-title>
          ,
          <source>Popular Music</source>
          <volume>6</volume>
          (
          <year>1987</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . doi:
          <volume>10</volume>
          . 1017/S0261143000006577.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Temperley</surname>
          </string-name>
          ,
          <article-title>The cognition of basic musical structures</article-title>
          , MIT press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Harte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Abdallah</surname>
          </string-name>
          , E. Gómez,
          <article-title>Symbolic representation of musical chords: A proposed syntax for text annotations</article-title>
          ,
          <source>in: ISMIR</source>
          <year>2005</year>
          , 6th International Conference on Music Information Retrieval, London, UK,
          <fpage>11</fpage>
          -15
          <source>September</source>
          <year>2005</year>
          , Proceedings,
          <year>2005</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>71</lpage>
          . URL: http://ismir2005.ismir.net/proceedings/1080.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Livingstone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Palmer</surname>
          </string-name>
          , E. Schubert,
          <article-title>Emotional response to musical repetition</article-title>
          .,
          <source>Emotion</source>
          <volume>12</volume>
          (
          <year>2012</year>
          )
          <fpage>552</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Giraud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Groult</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Levé</surname>
          </string-name>
          ,
          <article-title>Computational analysis of musical form</article-title>
          , in: D.
          <string-name>
            <surname>Meredith</surname>
          </string-name>
          (Ed.),
          <source>Computational Music Analysis</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>136</lpage>
          . URL: https://doi.org/10. 1007/978-3-
          <fpage>319</fpage>
          -25931-
          <issue>4</issue>
          _5. doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -25931-4\_5.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>M. C. McCallum</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning of deep features for music segmentation</article-title>
          , in: IEEE International Conference on Acoustics,
          <source>Speech and Signal Processing</source>
          ,
          <string-name>
            <surname>ICASSP</surname>
          </string-name>
          <year>2019</year>
          , Brighton, United Kingdom, May
          <volume>12</volume>
          -17,
          <year>2019</year>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>346</fpage>
          -
          <lpage>350</lpage>
          . URL: https://doi.org/ 10.1109/ICASSP.
          <year>2019</year>
          .
          <volume>8683407</volume>
          . doi:
          <volume>10</volume>
          .1109/ICASSP.
          <year>2019</year>
          .
          <volume>8683407</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Mysore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B. L.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlüter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <article-title>Audio-based music structure analysis: Current trends, open challenges, and applications</article-title>
          ,
          <source>Trans. Int. Soc. Music. Inf. Retr</source>
          .
          <volume>3</volume>
          (
          <year>2020</year>
          )
          <fpage>246</fpage>
          -
          <lpage>263</lpage>
          . URL: https://doi.org/10.5334/tismir.54. doi:
          <volume>10</volume>
          .5334/ tismir.54.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>W. B. de Haas</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Wiering</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          <string-name>
            <surname>Veltkamp</surname>
          </string-name>
          ,
          <article-title>A geometrical distance measure for determining the similarity of musical harmony</article-title>
          ,
          <source>Int. J. Multim. Inf. Retr</source>
          .
          <volume>2</volume>
          (
          <year>2013</year>
          )
          <fpage>189</fpage>
          -
          <lpage>202</lpage>
          . URL: https: //doi.org/10.1007/s13735-013-0036-6. doi:
          <volume>10</volume>
          .1007/s13735-013-0036-6.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>W. B. de Haas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Volk</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Wiering</surname>
          </string-name>
          ,
          <article-title>Structural segmentation of music based on repeated harmonies</article-title>
          ,
          <source>in: 2013 IEEE International Symposium on Multimedia, ISM</source>
          <year>2013</year>
          , Anaheim, CA, USA, December 9-
          <issue>11</issue>
          ,
          <year>2013</year>
          , IEEE Computer Society,
          <year>2013</year>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>258</lpage>
          . URL: https: //doi.org/10.1109/ISM.
          <year>2013</year>
          .
          <volume>48</volume>
          . doi:
          <volume>10</volume>
          .1109/ISM.
          <year>2013</year>
          .
          <volume>48</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <article-title>The distributional hypothesis</article-title>
          ,
          <source>Rivista di Linguistica (Italian Journal of Linguistics)</source>
          <volume>20</volume>
          (
          <year>2008</year>
          )
          <fpage>33</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          , in: Y. Bengio, Y. LeCun (Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings,
          <year>2013</year>
          . URL: http://arxiv.org/abs/1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>H.</given-names>
            <surname>Caselles-Dupré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lesaint</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Royo-Letelier</surname>
          </string-name>
          ,
          <article-title>Word2vec applied to recommendation: hyperparameters matter</article-title>
          , in: S. Pera,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ekstrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Amatriain</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. O'Donovan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 12th ACM Conference on Recommender Systems, RecSys</source>
          <year>2018</year>
          , Vancouver, BC, Canada, October 2-
          <issue>7</issue>
          ,
          <year>2018</year>
          , ACM,
          <year>2018</year>
          , pp.
          <fpage>352</fpage>
          -
          <lpage>356</lpage>
          . URL: https://doi.org/10.1145/3240323.3240377. doi:
          <volume>10</volume>
          .1145/3240323.3240377.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Burgoyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wild</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Fujinaga</surname>
          </string-name>
          ,
          <article-title>An expert ground truth set for audio chord recognition and music analysis</article-title>
          , in: A.
          <string-name>
            <surname>Klapuri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Leider (Eds.),
          <source>Proceedings of the 12th International Society for Music Information Retrieval Conference</source>
          ,
          <string-name>
            <surname>ISMIR</surname>
          </string-name>
          <year>2011</year>
          , Miami, Florida, USA, October
          <volume>24</volume>
          -
          <issue>28</issue>
          ,
          <year>2011</year>
          , University of Miami,
          <year>2011</year>
          , pp.
          <fpage>633</fpage>
          -
          <lpage>638</lpage>
          . URL: http://ismir2011.ismir. net/papers/OS8-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Bittner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rubinstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jansson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kell</surname>
          </string-name>
          ,
          <article-title>mirdata: Software for reproducible usage of datasets</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Flexer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Peeters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urbano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Volk</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019</source>
          , Delft, The Netherlands, November 4-8, 2019,
          <year>2019</year>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
          . URL: http://archives.ismir.net/ismir2019/paper/000009.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>J. B. L.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Burgoyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fujinaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Roure</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Downie</surname>
          </string-name>
          ,
          <article-title>Design and creation of a large-scale database of structural annotations</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Klapuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leider</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011</source>
          , Miami, Florida, USA, October 24-28, 2011, University of Miami,
          <year>2011</year>
          , pp.
          <fpage>555</fpage>
          -
          <lpage>560</lpage>
          . URL: http://ismir2011.ismir.net/papers/PS4-14.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          ,
          <source>CoRR abs/1508.01991</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1508.01991. arXiv:1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Woo</surname>
          </string-name>
          ,
          <article-title>Convolutional LSTM network: A machine learning approach for precipitation nowcasting</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015</source>
          , December 7-12, 2015, Montreal, Quebec, Canada,
          <year>2015</year>
          , pp.
          <fpage>802</fpage>
          -
          <lpage>810</lpage>
          . URL: https://proceedings.neurips.cc/paper/2015/hash/07563a3fe3bbe7e3ba84431ad9d055af-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Attention-based LSTM for aspect-level sentiment classification</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Carreras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Duh</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016</source>
          , Austin, Texas, USA, November 1-4, 2016, The Association for Computational Linguistics,
          <year>2016</year>
          , pp.
          <fpage>606</fpage>
          -
          <lpage>615</lpage>
          . URL: https://doi.org/10.18653/v1/d16-1058. doi:10.18653/v1/d16-1058.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>R.</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sojka</surname>
          </string-name>
          ,
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          , in:
          <source>Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          , ELRA, Valletta, Malta,
          <year>2010</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . URL: http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>M.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <article-title>Structural segmentation of musical audio by constrained clustering</article-title>
          ,
          <source>IEEE Trans. Speech Audio Process.</source>
          <volume>16</volume>
          (
          <year>2008</year>
          )
          <fpage>318</fpage>
          -
          <lpage>326</lpage>
          . URL: https://doi.org/10.1109/TASL.2007.910781. doi:10.1109/TASL.2007.910781.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Lukashevich</surname>
          </string-name>
          ,
          <article-title>Towards quantitative measures of evaluating song segmentation</article-title>
          , in:
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Turnbull</surname>
          </string-name>
          (Eds.),
          <source>ISMIR 2008, 9th International Conference on Music Information Retrieval</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>