=Paper= {{Paper |id=Vol-3810/paper4 |storemode=property |title=Expert-MusiComb: Injective Domain Knowledge in a Neuro-Symbolic Approach for Music Generation |pdfUrl=https://ceur-ws.org/Vol-3810/paper4.pdf |volume=Vol-3810 |authors=Lorenzo Tribuiani,Luca Giuliani,Allegra De Filippo,Andrea Borghesi |dblpUrl=https://dblp.org/rec/conf/creai/TribuianiGFB24 }} ==Expert-MusiComb: Injective Domain Knowledge in a Neuro-Symbolic Approach for Music Generation== https://ceur-ws.org/Vol-3810/paper4.pdf
                         Expert-MusiComb: Injective Domain Knowledge in a
                         Neuro-Symbolic Approach for Music Generation
                         Lorenzo Tribuiani* , Luca Giuliani* , Allegra De Filippo and Andrea Borghesi
                         Department of Computer Science and Engineering (DISI), University of Bologna


                                      Abstract
                                      The significant expansion of data-driven technologies in the past decade has highlighted the crucial role of
                                      structured data, given the more relevant and meaningful informative content that they can provide to artificial
                                      intelligence (AI) applications. Nonetheless, there are domains based on inherently unstructured data, such as the
                                      audio domain. In those cases, the possibility of relying on an automated system capable of extracting structured
                                      features from raw data could serve as a pivotal element in enhancing and strengthening the capabilities of an AI
                                      system. In this work, we propose an automated feature extractor which leverages machine and deep learning
                                      methodologies to retrieve two higher-level musical attributes from short MIDI samples, namely the harmonic
                                      content of the sample – through its chords progression – and the role that such sample could have within a
                                      multi-track composition – i.e., melody, bass, or accompaniment. We perform our tests on a dataset containing
                                      ground truth information to assess quantitative results and later integrate our models within the state-of-the-art
                                      framework for combinatorial music generation MusiComb to check for harmonic and melodic consonance on the
                                      downstream generative task.

                                      Keywords
                                      Music Generation Systems, Generative AI, Chord Prediction, Machine Learning, Constraint Programming




                         1. Introduction
                         Computer-aided music generation combines computer science, machine learning, and music theory to
                         compose, produce, or assist in creating music. This interdisciplinary field poses a significant challenge
                         due to the complex mix of creativity, emotion, and technical requirements involved, making it one of
                         the most demanding tasks for AI to undertake. MusiComb, originally conceived as an implementation
                         of the work theorized by Hyun et al. [1], emerges as a framework for combinatorial music
                         generation. It employs Constraint Programming to generate the final piece, while utilizing deep learning
                         and machine learning techniques for data preparation and generation. Through the fusion of short MIDI
                         samples, this system excels at crafting well-structured compositions and empowers users by allowing
                         them to shape the creative process through the customization of various music-related parameters.
                         As summarized in the related paper [2], it represents a novel music generation approach aimed at
                         overcoming the limitations of generative models by combining a set of samples under user-defined
                         constraints.
                            MusiComb, alongside ComMU, the MIDI sample dataset introduced in [1] and primarily utilized
                         during the initial development of the framework, has established a standardized set of significant
                         features of harmonic and structural attributes of each sample within the dataset. These features serve
                         as fundamental components utilized by the framework for sample combination and music generation.
                         While ComMU serves as an exemplary dataset for the tasks, the imperative for new datasets has become
                         evident. This necessity arises not only to incorporate fresh samples but also to furnish MusiComb with
                         a broader array of potential features, such as music genre, enabling users to explore a wider spectrum of
                         potential outcomes. This, coupled with the challenge of locating MIDI datasets labeled consistently
                         with ComMU, underscored the necessity for an automated feature extractor. Our primary focus lies in

                           CREAI 2024 – Workshop on Artificial Intelligence and Creativity, Oct.19 – Oct.24, 2024, Santiago de Compostela, Spain
                         *
                           Corresponding author.
                          $ lorenzo.tribuiani@studio.unibo.it (L. Tribuiani); luca.giuliani13@unibo.it (L. Giuliani); allegra.defilippo@unibo.it (A. De
                          Filippo); andrea.borghesi3@unibo.it (A. Borghesi)
                           0000-0002-9120-6949 (L. Giuliani); 0000-0002-1954-7271 (A. De Filippo); 0000-0002-2298-2944 (A. Borghesi)
                                   © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


estimating a subset of features that are not readily accessible or discernible through the MIDI protocol.
These features, such as track roles and chord progressions, pose greater complexity in estimation due to
their indirect nature. They hold significant relationships with the harmonic and structural attributes
of the samples. Establishing correlations between known properties or elements of the sample and
these desired features necessitates the use of machine learning and deep learning systems. Moreover,
evaluating the results, particularly in the chord progression domain, presents challenges as existing
systems often rely on human intervention.
   The concluding phase will focus primarily on the modifications and additions implemented within
the MusiComb framework, aiming to address various aspects and limitations inherent in the system
itself. A key addition involves an automatic sample extraction algorithm designed to extract small,
repetitive sequences (samples) from complex and lengthy MIDI files, facilitating the integration of new
datasets. Additionally, to enhance the flexibility of sample selection mechanisms and mitigate their
inherent rigidity based on user-selected parameters, minor adjustments have been incorporated into
the pipeline. The introduction of new MIDI datasets, coupled with these modifications, is intended to
make the framework more adaptable and flexible in operation.
   This approach enables a broader range of possible combinations, thereby reducing the deterministic
nature of the overall process relative to the initial set of user parameters. Further elaboration on the
rationale for those approaches will be provided in subsequent discussions, specifically: Section 2 will
offer a comprehensive overview of relevant existing works and the main rationale behind this work;
Section 3 will delve deeply into the challenges and complexities associated with feature extraction and
the main problematic aspects and difficulties encountered in addressing these challenges; Section 4
will introduce the primary integrations and modifications made to the MusiComb framework, offering
a detailed explanation of the sample extraction algorithm and the key advantages resulting from the
adjustments to the sample selection pipeline, particularly in relaxing selection criteria. Some of the
compositions generated by Expert-MusiComb can be heard at the following link: https://soundcloud.
com/lorenzo-tribuiani/sets/musicomb.


2. Background and Motivation
The field of computer-aided music generation has recently experienced a significant advancement
in the use of end-to-end neural systems. Transformer-based models such as Music Transformer [3]
paved the way to more successful projects like MusicLM [4], Jukebox [5], Noise2Music [6], and even
professional tools like Suno. However, the adoption of these models introduces several drawbacks.
Most notably, these systems are still subject to abrupt timbre changes and noisy outputs, which restrict
their professional use; they also offer very limited user control and impose high computational demands,
preventing real-time use [7]. On top of that, the inherently opaque nature
of neural architectures has been proven to lead to unintended plagiarism, with a consequent lack of
recognition of the human artists whose compositions have been used to train the models.
    Driven by the goal of addressing these challenges, there is renewed interest in symbolic-based
generative models within the research community. Traditionally, Probabilistic and Hidden Markov
Models (HMMs) have been widely used for both chord and melody generation. For example, [8] uses
an HMM for Bach-inspired chorale harmonies, [9] involves pattern recognition and recombination
techniques to create compositions that replicate the style of various classical composers, while [10]
and [11] apply complex graphical models to melody harmonization. More recently, systems such as
Morpheus [12] and GEDMAS [13] employed explicit rules and probabilistic methods to generate melodies
or entire tracks according to certain constraints, while Pachet et al. [14] combined symbolic models with
neural architectures to exploit the power of both frameworks. Our former work, MusiComb [2], also
falls within this research area: it employs a combinatorial approach to music generation, although it
focuses on the arrangement of predefined segments of notes (samples) rather than generating music
on a note-by-note basis.
    The main strength of sample-based music composition comes from its higher compatibility with
contemporary pop music compositional and production standards since, over the past thirty years, the
introduction of samplers and Digital Audio Workstations (DAWs) has significantly shifted the music
industry’s workflow towards extensive use of sample libraries [15]. However, given that the arranged
samples must meet specific properties and constraints to harmonically integrate into longer sequences,
the task of locating suitable samples rapidly became problematic and labour-intensive, especially as
the size of the databases started to expand [16]. Current sample libraries often provide some metadata
such as key signature and tempo, along with additional high-level labelling that can be used to create
filters, but information about chord progressions, instrument type, and track role is most of the time
lacking, hence preventing a fully automated procedure. Similarly, research in sample-based music
generation is restricted by technical limitations, as major datasets for synthetic music generation such
as the Lakh MIDI Dataset [17] or MuseDB18 [18] lack this critical information. For this reason, our aim
is to build an automated pipeline that could work as well with larger datasets by employing machine
learning models to extract this kind of metadata whenever it is missing.


3. MIDI Feature Estimation
The core challenge of this work revolved around MIDI feature estimation. As previously mentioned,
MusiComb established a standardized set of features for each sample, essential for the proper functioning
of the framework. Specifically, eight distinct features were identified: Beats Per Minute (BPM), number
of measures, key signature, genre, track role, chord progression, time signature, and rhythm. These eight
features have been categorized into two main groups for clarity:
Direct features including BPM, number of measures, key signature, genre, time signature, and rhythm,
     refer to those characteristics whose values are either explicitly written or easily extractable from
     the MIDI data itself. Key signature can indeed present a minor obstacle in estimation, as it may not
     always be explicitly encoded within the MIDI data. However, there are existing algorithms capable
     of estimating the key signature with a reasonable level of confidence, such as the Krumhansl-
     Schmuckler algorithm utilized in this study. Also, it is commonly assumed that genre can be
     implicitly inferred from the properties of the dataset itself.
Indirect features such as track-role and chord progression, are attributes that are unlikely to be explic-
     itly encoded within the MIDI protocol of the sample.
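For illustration, the Krumhansl-Schmuckler approach can be sketched as a correlation between the sample's pitch-class duration profile and rotated key profiles. The snippet below is a minimal sketch, not the exact implementation used in this study: the profile weights are the standard Krumhansl-Kessler values, while the `estimate_key` helper and its duration-vector input format are illustrative assumptions.

```python
import math

# Standard Krumhansl-Kessler profiles: perceived fit of each pitch class
# (relative to the tonic) in major and minor keys.
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR_PROFILE = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def correlation(x, y):
    # Pearson correlation between two equal-length vectors.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def estimate_key(durations):
    # durations[p] = total duration of pitch class p in the MIDI sample.
    # The estimated key is the rotation/profile pair with the highest correlation.
    best_r, best_key = -2.0, None
    for tonic in range(12):
        rotated = durations[tonic:] + durations[:tonic]
        for profile, mode in ((MAJOR_PROFILE, "major"), (MINOR_PROFILE, "minor")):
            r = correlation(rotated, profile)
            if r > best_r:
                best_r, best_key = r, f"{NOTE_NAMES[tonic]} {mode}"
    return best_key
```

For example, a duration vector dominated by the pitch classes C, E, and G correlates most strongly with the C major profile.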
  The primary challenge in this study is to handle indirect features. While track-role estimation can
be addressed using classical classification techniques (e.g., SVMs), chord progression estimation differs
fundamentally. This calls for alternative methods, such as using GRU layers typically employed for
Natural Language Processing tasks, and different evaluation metrics to effectively tackle these tasks.

3.1. Track Role Estimation
The track role (i.e., the function of a sample within a music piece) is challenging to estimate due to its
contextual nature. Defining track roles often relies on human interpretation and overall musical context,
posing challenges in establishing clear class boundaries without nuanced distinctions. Another challenge
arises from the similarity between track roles, with MusiComb standardizing six distinct classes: main
melody, sub melody, riff, accompaniment, pad, and bass. Some roles, like main and sub-melody, share
similar concepts and structures, making the classification process harder. The six track-role classes
can be grouped into three primary macro-groups: Melody, Accompaniment, and Bass, as shown in
Table 1. This grouping highlights structural similarities within each macro-group, complicating their
differentiation. Additionally, the riff class exhibits similarities with both melody and accompaniment,
as illustrated in Fig. 1.
   In our study, significant effort was devoted to identifying a feature set suitable for training a Support
Vector Machine, chosen after a preliminary empirical evaluation of different ML algorithms, for
classifying the six distinct MusiComb track roles. We focus on structural musical elements that vary across
classes while maintaining consistency or similarity within each class. The feature set for classification
                                        Macro-group              Track role

                                        Melody                   Main melody
                                                                 Sub melody

                                                                 Riff

                                        Accompaniment            Accompaniment
                                                                 Pad

                                        Bass                     Bass
Table 1
Track role classes. Elements within the same macro-group (melody, accompaniment, and bass) exhibit similarities.




            (a) Chord related features            (b) Notes related features           (c) Octave related features


Figure 1: In the feature space related to chords, as shown in Figure 1a, there is a clear boundary separating
Accompaniment and Melody-like tracks. However, no distinct boundary can be observed for Riffs. Conversely,
Figure 1b demonstrates that note-related features exhibit a less pronounced but more discernible separation
among the three classes. Lastly, Figure 1c illustrates that octave-related features provide a good distinction
between Accompaniment-like and Melody-like tracks, yet they do not offer a clear differentiation between
Melody and Riff.


was selected from those directly accessible from the MIDI protocol or modified versions thereof, ensuring
their availability in external datasets. Since track role definition is independent of harmonic properties,
only structural features were used for classification. Following a data-informed approach that combines
personal domain knowledge with the primary characteristics of each group, a set of eleven independent
features was chosen for classification.

1-2. Mean chords number & Mean notes number: The number of chords and individual notes is
     crucial for the track role. We normalize them to mitigate the impact of sample length differences,
     ensuring a consistent representation of chord and note densities across samples.
                                      mean_chords_number = chords_number / measures_number

                                      mean_notes_number = notes_number / measures_number
3-4. Chords duration & notes duration: Longer durations indicate greater importance in the score,
      potentially influencing classification. To maintain consistency across samples of different lengths,
      durations are normalized based on the number of measures.

5. Chords note distance: Not all simultaneous note sets adhere to conventional chord definitions1 .
     Hence, we consider the distances between notes within these groups, leveraging the consistent
     ratios between note distances in chord modes well-known in music theory. We use a modulo
     12 representation for note distances, disregarding octave information, to maintain consistent
     distance measurements across octaves.
1
    Some instances, like those with fewer than three notes, serve to reinforce a melody or enrich harmony.

6. Notes distance: In accompaniments, chords can be played individually or as arpeggios, complicating
     classification. To address this, we use the mean distance between individual notes. This metric,
     expressed modulo 12, helps differentiate chord arpeggios based on their mean distances.

7. Number of chord’s notes: A single chord provides insights into the sample’s role. While traditionally
      three notes define a chord, fewer notes may indicate bichords, and three or more suggest various
      structural elements. This value is represented as the mean number of notes per chord.

8-9-10. Minimum octave, maximum octave and mean octave: Minimum and maximum octaves
      set pitch variability boundaries within the track. The mean octave offers a central reference,
      emphasizing the most relevant octave in the piece. Recognizing these boundaries aids in excluding
      certain track roles based on expected octave characteristics.

11. Instrument: Instrument features are included to account for their association with specific track
      roles or octave ranges.
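As an illustration of how such features could be computed, the following sketch implements two of them (mean notes number and the modulo-12 note distance). The function names and the input representation (lists of MIDI pitch numbers) are hypothetical, not taken from the paper's codebase.

```python
def mean_notes_number(notes, measures_number):
    # Feature 2: note count normalized by the sample length in measures,
    # so that short and long samples are comparable.
    return len(notes) / measures_number

def mean_note_distance_mod12(pitches):
    # Feature 6: mean distance between consecutive MIDI pitches, taken
    # modulo 12 so that the same arpeggio looks alike across octaves.
    distances = [abs(b - a) % 12 for a, b in zip(pitches, pitches[1:])]
    return sum(distances) / len(distances) if distances else 0.0
```

For instance, a C major arpeggio (C4, E4, G4, C5) has consecutive distances 4, 3, and 5 semitones, giving a mean of 4.0 regardless of which octave it is played in.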

3.1.1. Preprocessing
The ComMU dataset, the only dataset including all the metadata needed for the functioning of MusiComb,
was used for training. Preprocessing involved extracting the 11 relevant features and making adjustments
for robustness against minor variations. All data vectors were scaled using the z-score equation2 to
standardize the range, preventing larger-ranged data points from disproportionately influencing the
SVM model.
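The z-score scaling step can be sketched as follows; this is a minimal stdlib-only version, whereas the actual pipeline presumably standardizes whole feature matrices column by column:

```python
import statistics

def z_score(values):
    # Standardize a feature vector: z = (x - mean) / std, so that every
    # feature contributes on a comparable scale to the SVM.
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]
```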

3.1.2. Training and results
The SVM model was trained on the ComMU dataset and fine-tuned using Grid Search with cross-
validation to identify the best hyperparameters. Grid search was performed on three kernel types:
RBF, polynomial, and linear, with five C values (0.5, 1, 5, 10, 100) and two tolerance values (0.01 and
0.001). The dataset was split into five subsets for cross-validation. Table 2 shows the top 5 SVM models,
including their parameters and performance on each dataset subdivision, together with a K-Nearest
Neighbours baseline.
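A hedged sketch of this grid search with scikit-learn, using the kernels, C values, tolerance values, and 5-fold cross-validation stated above; the iris dataset merely stands in for the 11-feature ComMU vectors, which are not reproduced here:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data: iris replaces the (unavailable) 11-feature ComMU vectors.
X, y = load_iris(return_X_y=True)

param_grid = {
    "kernel": ["rbf", "poly", "linear"],
    "C": [0.5, 1, 5, 10, 100],
    "tol": [0.01, 0.001],
}
# 5-fold cross-validation over the 30 hyperparameter combinations.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
```

After fitting, `search.best_params_` and `search.best_score_` expose the winning configuration and its mean cross-validation accuracy.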

    Model     Kernel     C    Tolerance     TAS1* TAS2        TAS3     TAS4    TAS5     Mean test score   Mean F1 score

     SVM        RBF      10      0.001       0.81    0.81     0.81     0.81     0.82           81.1%         78.6%
     SVM        RBF      10       0.01       0.81    0.81     0.82     0.81     0.81           81.08%        78.6%
     SVM        RBF      5       0.001       0.79     0.8     0.81     0.81     0.81           80.53%        77.9%
     SVM        RBF      5        0.01       0.79     0.8     0.81     0.81     0.81           80.53%        77.6%
     SVM        Poly     10      0.001        0.8    0.79     0.81     0.79     0.81           79.89%        76.8%
     KNN                                                                                        57%           50%

*
    Test Accuracy Score for the given dataset subdivision
Table 2
Results from dataset splits and averages for the top 5 scoring results and KNN baseline.

  The SVM model outperformed the KNN baseline, achieving an overall accuracy of 81.1% and an F1
score of 78.9%. The analysis of the confusion matrix indicates that the lower F1 score is mainly due to
the sub-melody and riff classes. Discriminating the riff class is challenging, especially in distinguishing

2
    z = (X − μ) / σ, where X is the current data point, μ is the mean and σ the standard deviation.
it from the main melody and accompaniment. The sub-melody class is frequently misclassified as the
main melody, while the main melody is sometimes incorrectly classified as sub-melody. This unexpected
behaviour could be influenced by the class distribution in the training set. Figure 2 illustrates that the




                      (a) Train                            (b) Test                       (c) Test predictions



Figure 2: Distribution of classes in the training set, test set, and SVM predictions on the test set.

main melody is more prevalent in the training dataset (2a), and the model has replicated this distribution
in its predictions (2c) on the test set3 .

3.2. Chord Progression Estimation
Chord progression estimation involves finding the ordered sequence of chords that accompany a melody.
It relies heavily on harmonic elements such as notes and pitches, leading to variations based on the
sample type.
    1. Chord progressions can be non-unique, influenced by the track role of a sample. Melody-like
        samples present challenges as a single melody can harmonically match multiple chord progres-
        sions, each producing a unique sound. In contrast, accompaniment and bass-like samples are
        typically based on specific chord progressions, requiring a stricter classification approach. For
        chord progression estimation, the macro groups from Table 1 will be used to define the three
        primary classification domains.
    2. The classification task has an extensive solution space defined by the unique possible chord names,
        resulting in thousands of potential chord combinations based on harmonic rules. Balancing the
        retention of harmonic information by simplifying the space is essential for ease of classification.
    3. A harmonic metric tailored for chord progression estimation in melody samples is missing from
        the literature. While classification accuracy is important, ensuring harmonic soundness in the
        final results is crucial. A metric emphasizing harmonic aspects over mere accuracy is required.
Point 1 suggests training a unique model for each of the three macro groups (Melody, Accompaniment,
and Bass) using a Shared Model Split Weights approach. Although there are significant differences
between track roles, the estimation tasks are similar. Points 2 and 3 lead, respectively, to a simplification
of the labels based on their names, described in Section 3.2.1, and to the adoption of a metric emphasizing
harmonic soundness, inspired by the work on Mathematical Harmony Analysis [19]. Moreover, under-
standing the temporal or positional context of the sample representation is crucial: both samples and
chord progressions can be viewed as time series, so a time-aware representation is needed. Building
on the work of Hyungui et al. [20], recurrent neural networks based on GRU layers combined with an
autoregressive pipeline have been selected to address this task.

3.2.1. Preprocessing
We adopted a modified version of the Pitch Class Vector (PCV) representation from [20] for feature
preprocessing in our study. The PCV consists of 12 bins representing the 12 possible notes; we scale
these values using normalized Velocity4 to reintroduce accent information by emphasizing notes played
3
    The dataset split is the same adopted for ComMU and indicated in the dataset’s metadata.
4
    Velocity is a MIDI parameter that indicates the force applied when pressing a key on a MIDI keyboard. Normalized velocity
    is a scaled representation in the [0,1] range.
with greater force. Fig. 3 illustrates the modified PCV for a MIDI sample.
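A velocity-weighted PCV of the kind described above might be computed as follows; the exact scaling and normalization used in the paper are not specified, so both the per-note weighting and the final sum-to-one normalization are illustrative assumptions:

```python
def pitch_class_vector(notes):
    # notes: list of (midi_pitch, velocity) pairs for one measure.
    # Each of the 12 bins accumulates the normalized velocity ([0, 1] range)
    # of its notes; the vector is then normalized to sum to one.
    pcv = [0.0] * 12
    for pitch, velocity in notes:
        pcv[pitch % 12] += velocity / 127.0
    total = sum(pcv)
    return [v / total for v in pcv] if total else pcv
```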

Figure 3: Modified PCV of a MIDI sample.

Table 3: Chords name simplification rules.

                                                                  original                    simplification
                                                                   major                          major
                                                                   minor                          minor
                                                                diminished                          /*
                                                                augmented                         major
                                                          alterations (7th, 8th, ...)          major/minor

                                                          * Due to their dissonant nature, diminished
                                                          chords have been excluded from the dataset.


   Conversely, a classical approach using one-hot encoding has been employed for representing chord
names. As illustrated in Table 3, all chord names present in the ComMU dataset have been mapped to just
two categories: Major and Minor chords. This reduction narrows down the solution space to 24 possible
elements. Finally, data augmentation was applied to the dataset by transposing samples and their
corresponding chord progressions across various note intervals, resulting in 11 additional entries for
each sample. This technique has been fundamental in developing the chord progression estimator. This
is primarily because transpositions of the same sample should result in corresponding transpositions of
the chord progression output, enhancing the system’s robustness. Injecting this knowledge is neither
direct nor easy; it is something the model must learn during training. This approach offers a dual
benefit: it increases the dataset size by augmenting the existing samples, a technique known to improve
generalization, and it incorporates transposition invariance into the model during training.
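The transposition augmentation can be sketched as below; representing chords as (root pitch class, quality) pairs is an illustrative choice for this sketch, not the paper's actual encoding:

```python
def augment_by_transposition(pitches, chords):
    # pitches: MIDI note numbers of the sample.
    # chords: chord progression as (root_pitch_class, quality) pairs,
    # e.g. (0, "maj") for C major.
    augmented = []
    for shift in range(1, 12):  # 11 extra entries per sample
        new_pitches = [p + shift for p in pitches]
        new_chords = [((root + shift) % 12, quality) for root, quality in chords]
        augmented.append((new_pitches, new_chords))
    return augmented
```

Each transposed sample keeps its chord labels consistent with the shift, which is precisely the invariance the model is expected to learn.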

3.2.2. Model Architecture
The model architecture uses an Autoregressive approach with GRU layers to capture temporal patterns.
For a sample with 𝑁 measures, each measure is analyzed independently to reconstruct temporal
information focusing on chords. The input is composed of three parts: the initial conditions, i.e., the
encoding of the three previous chords at time steps t − 1, t − 2, and t − 3, all set to zero for the first
measure (START token); measure n, the Pitch Class Vector (PCV) of the considered measure; and
measure n + 1, the features of the next measure, providing information on the harmonic direction.
Individual learned embeddings are used for each feature, and the final model input is a tensor
obtained by concatenating these embeddings, resulting in a shape of (5, 64).




                           (a)                                               (b)

Figure 4: Visual representation of model inputs (4a) and the model’s autoregressive pipeline (4b).


  The main model architecture features a modular design, with each module consisting of a BiGRU
layer with 128 cells and hyperbolic tangent activation, followed by Layer Normalization and Dropout
with a probability of 0.3. In this specific application, two modules were utilized with a final classification
head included in the model. Figure 4b illustrates the model’s autoregressive pipeline. The one-hot
encoding of each chord is added to the initial conditions as the window shifts forward, allowing the
model to use current and past chord information.
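A possible PyTorch rendering of this modular design (BiGRU with 128 cells, Layer Normalization, Dropout with probability 0.3, two modules, and a classification head over the 24 chord classes); the layer sizes follow the text, but everything else is a sketch rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class BiGRUModule(nn.Module):
    # One module: BiGRU (128 cells, tanh inside the GRU), LayerNorm, Dropout(0.3).
    def __init__(self, input_size, hidden_size=128, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden_size)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.drop(self.norm(out))

class ChordModel(nn.Module):
    # Two stacked modules plus a classification head over 24 chord classes.
    def __init__(self, input_size=64, num_classes=24):
        super().__init__()
        self.block1 = BiGRUModule(input_size)
        self.block2 = BiGRUModule(256)  # bidirectional output of block1
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):  # x: (batch, 5, 64) concatenated embeddings
        h = self.block2(self.block1(x))
        return self.head(h[:, -1])  # logits for the predicted chord
```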

3.2.3. Training & Results
The model predicts chords for each 4-sized window of samples. The ComMU dataset was augmented
and split into three datasets based on track-role macro groups: melody, accompaniment, and bass. The
model was trained thrice on each dataset with consistent hyperparameters5 . Table 4 presents the results
of the best model weights evaluated on the corresponding test set for the specific task.

                                                        Test Accuracy     Test F1 Score

                             Accompaniment Model             0.7511            0.7243
                                 Melody Model                0.6297            0.6134
                                  Bass Model                 0.7514            0.7252

Table 4
Optimal results for the chord progression estimation task.

   The accompaniment and bass models performed better than the melody model, which still achieved
a solid 63% accuracy and 61% F1 score. The confusion matrix highlighted frequent misclassifications,
especially between minor and specific major chord sequences, known as relative minors. This emphasizes
the importance of a metric focusing on harmonic aspects. In [19], a general rule describing the
pleasantness of a note interval is introduced: two notes played together sound harmonious (consonant)
when their frequency ratio uses small whole numbers. Building upon this concept, we developed a more
robust metric by examining the distribution of frequency ratio values across the entire dataset for both
labels and predictions. In particular, letting n be the number of measures in a sample, notes(i) the set of
frequencies of the notes in measure i, and chord(i) the set of frequencies of the notes of the chord of that
measure, we define the set of frequency ratios for a measure as:

             ratios(i) = { a/b | ∀ a ∈ notes(i), ∀ b ∈ chord(i) }      for i ∈ [0, . . . , n]
These two distributions, represented as matrices of numerator-denominator values, were then evaluated
in terms of Cosine Similarity.
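The FRD computation can be sketched as follows: each note/chord frequency ratio is approximated by a small whole-number fraction and accumulated into a numerator/denominator count matrix, and two such matrices are compared via cosine similarity. The matrix size and the use of `Fraction.limit_denominator` are illustrative assumptions:

```python
from fractions import Fraction

def midi_to_freq(m):
    # Equal-temperament frequency of MIDI note m (A4 = 69 = 440 Hz).
    return 440.0 * 2 ** ((m - 69) / 12)

def ratio_matrix(note_pitches, chord_pitches, size=16):
    # Count matrix M[numerator][denominator] of note/chord frequency
    # ratios, each approximated by a small whole-number fraction.
    M = [[0] * size for _ in range(size)]
    for a in note_pitches:
        for b in chord_pitches:
            frac = Fraction(midi_to_freq(a) / midi_to_freq(b)).limit_denominator(size - 1)
            if frac.numerator < size and frac.denominator < size:
                M[frac.numerator][frac.denominator] += 1
    return M

def cosine_similarity(A, B):
    a = [v for row in A for v in row]
    b = [v for row in B for v in row]
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sum(x * x for x in a) ** 0.5, sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

A perfect fifth (G above C), for instance, falls into the 3/2 bin, in line with the small-whole-number consonance rule from [19].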




                                   (a)                                           (b)
Figure 5: Example of frequencies-ratio distribution in matrix representation (Melody class) for both labels (5a)
and predictions (5b).


   Figure 5 displays the Frequency Ratio Distribution (FRD) matrices for both labels and predictions,
while Table 5 presents the cosine similarities between these distributions for all individual tasks.

5
    Initial LR: 10−3 , minimum LR: 10−7 , LR scheduler: OnPlateau, Optimizer: AdamW, Epochs: 100, Batch size: 1024.

                               Model                    FRDs Cosine Similarity

                         Accompaniment Model                    0.994
                             Melody Model                       0.986
                              Bass Model                        0.990

Table 5
FRDs cosine similarity between labels and predictions for each model's classification task.


   As observed, the models have adjusted their weights to closely replicate the frequency ratio distribution
of the original labels. Under the assumption that the chord progressions labelled in the dataset sound
good, mimicking this distribution indicates that the quality of the model's predictions matches that of
the labels.


4. Expert-MusiComb
All the technologies discussed in Section 3 were applied to extend MusiComb, especially for dataset
expansion. Additionally, new features were incorporated for automatic sample extraction and flexible
data querying, enhancing the framework's original functionalities in the newly established Expert-MusiComb.

4.1. Automatic Sample Extraction
The automated sample extraction algorithm employs a maximization strategy to identify the longest
uninterrupted sequences of silence in the composition after replacing all detected samples with empty
elements. Given the function:

  Function __inner__(sequence, min_ws, max_ws, initial_index):
     if min_ws > ⌈len(sequence)/2⌉ then
         return [], 0
     else
         subseq ← [];
         for ws in min_ws, . . . , max_ws do
             j ← initial_index;
             while j ≤ len(sequence) − 2 · ws do
                 if sequence[j : j + ws] == sequence[j + ws : j + 2ws] then
                     append sequence[j : j + ws] to subseq;
                     sequence[j : j + 2ws] ← ∅;
                     j ← j + 2ws
                 else
                     j ← j + 1
                 end
             end
         end
         silence_length ← length of the longest uninterrupted run of ∅ in sequence;
         return subseq, silence_length
     end
  end

  where min_ws and max_ws represent the size range of potential samples, and initial_index denotes
the starting position for the sliding windows within the sequence. By varying initial_index, different
sample collections are generated depending on their starting positions. The goal is to identify the
collection of samples, specified by initial_index, that satisfies the following procedure:

  for i in 0, . . . , max_ws do
      subseq, max_silence_length ← __inner__(sequence, min_ws, max_ws, i);
      return subseq if max_silence_length == len(sequence) ∨ max_silence_length is maximal;
  end


  Indeed, if the minimum and maximum sample sizes remain constant, as is typically the case since
excessively long samples are avoided, the algorithm operates in linear time.
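The procedure above can be transcribed directly into Python. This is a sketch under two small assumptions not stated in the paper: `None` plays the role of the empty element ∅, and a guard prevents already-blanked regions from being re-detected as samples.

```python
def _inner(sequence, min_ws, max_ws, initial_index):
    """Detect adjacent repeated windows, blank them out, and return the
    collected samples plus the longest run of silence left behind."""
    seq = list(sequence)
    if min_ws > (len(seq) + 1) // 2:  # ceil(len/2): window cannot repeat
        return [], 0
    subseq = []
    for ws in range(min_ws, max_ws + 1):
        j = initial_index
        while j <= len(seq) - 2 * ws:
            window = seq[j:j + ws]
            # guard: a window of pure silence must not count as a sample
            if window == seq[j + ws:j + 2 * ws] and any(x is not None for x in window):
                subseq.append(window)
                seq[j:j + 2 * ws] = [None] * (2 * ws)  # replace with silence
                j += 2 * ws
            else:
                j += 1
    # longest uninterrupted run of empty elements
    longest = run = 0
    for x in seq:
        run = run + 1 if x is None else 0
        longest = max(longest, run)
    return subseq, longest

def extract_samples(sequence, min_ws, max_ws):
    """Vary the initial offset and return the sample collection that best
    explains the sequence, exiting early when it is explained completely."""
    best, best_silence = [], -1
    for i in range(max_ws + 1):
        subseq, silence = _inner(sequence, min_ws, max_ws, i)
        if silence == len(sequence):
            return subseq
        if silence > best_silence:
            best, best_silence = subseq, silence
    return best
```

On the toy sequence `[1, 2, 1, 2, 3, 3]` with window sizes 1 to 2, the offset 0 pass first extracts the repeated `[3]`, then the repeated `[1, 2]`, leaving the whole sequence silent.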

4.2. Dataset Enlargement
Expanding the dataset is crucial for advancing MusiComb. Integrating additional datasets into the
framework enhances its capacity by augmenting the sample pool. This, in turn, broadens the spectrum
of potential outcomes during generation. Moreover, it introduces fresh values for user-selectable
parameters, such as a wider array of genres and an extended range of chord progressions. The dataset
expansion employed two primary approaches:

    • Incorporating additional MIDI datasets and extracting the required features using the methods
      previously described in Section 3.
    • Creating multiple chord progressions for each melody sample by adjusting the initial conditions
      of the chord estimation model.

Introducing multiple chord progressions for the same melody sample offers a significant advantage: it
expands the dataset without requiring additional samples, while simultaneously empowering the
framework to merge the same melody sample with various chord progressions, fostering greater variety
in music generation. Because of the recurrent nature of the chord progression estimation model, a
slight alteration of the initial condition beyond the START token results in different outputs for the
same input sample. To implement this effectively, a set of varied initial conditions that guarantee
distinctiveness and harmonic consistency in the predictions needs to be identified. To ensure harmonic
consistency, the initial conditions are chosen from a subset of the conditions encountered by the model
during training; leveraging the FRDs similarity discussed in Section 3, we can trust that the model is
familiar with them. To reduce the likelihood of repetitions, a small subset, chosen as the nine most
common initial conditions in the training set, is selected.

                          sample            initial conditions*   estimated chord progression

                     commu00001.mid                []              F-C-Dm-A#-F-C-Em-D
                     commu00001.mid             vi-IV-V            Dm-Am-Dm-A#-Gm-C-Am-D
                     commu00001.mid             V-vi-IV            Am-F-C-F-Am-C-Em-D
                     commu00001.mid              I-IV-V            Am-C-Dm-F-A#-C-Dm-D
                     commu00001.mid             V-IV-III           Am-G-Dm-F-A#-C-Dm-D
                     commu00001.mid             vi-iii-IV          F-Am-Em-F-F-Am-Em-D
                     commu00001.mid              ii-V-I            Am-Am-G-F-F-C-Em-D
                     commu00001.mid             I-VIIb-I           Dm-Am-A#-A#-Dm-Am-A#-D
                     commu00001.mid              IV-I-V            Am-F-C-F-Am-C-Em-D

                 *
                     Initial conditions are represented as Roman numerals.
Table 6
Different chord progression predictions for the same melody sample under different initial conditions.
The progression obtained for V-vi-IV and IV-I-V is the only repeating element.

  Table 6 presents an example of multiple chord progression estimations for the same MIDI sample,
together with the set of chosen initial conditions. This method enables the framework to effectively
capture the relationship between a melody and its potential chord progressions. Consequently, each
melody sample can be paired with various chord progressions, enriching the overall music generation
process without limiting each sample to a single chord progression.
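The seeding mechanism can be illustrated with a generic greedy decoder. Here `next_chord` is a hypothetical stand-in for the trained recurrent model: in practice it would also condition on the melody features and carry a hidden state, details the paper does not expose.

```python
def decode_with_seeds(next_chord, melody, seeds, n_chords=8):
    """For each initial condition, warm up the predictor history with
    START plus the seed chords, then greedily decode n_chords chords.
    Returns one estimated progression per seed."""
    progressions = []
    for seed in seeds:
        history = ["START", *seed]  # the seed alters the decoding context
        progression = []
        for _ in range(n_chords):
            chord = next_chord(history, melody)
            progression.append(chord)
            history.append(chord)
        progressions.append(progression)
    return progressions
```

Even with a toy first-order predictor, seeds ending in different chords immediately steer the decoder onto different progressions, which is the behaviour exploited to populate Table 6.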

4.3. Non-strict Data Query
In each generation cycle, once the user has specified their desired parameters, the framework seeks
out samples that match the specified parameters and then employs a subset of these samples for music
generation. Following dataset enlargement, the subsequent step to enhance the framework’s capabilities
involves loosening the criteria used for sample selection. The goal is to increase the number of selected
samples for a given set of user-defined parameters (BPMs, genre, time signature, chord progression and
key) while maintaining the harmonic and structural consistency ensured by strict selection rules. Due
to the nature of the parameters, not all selection rules can be relaxed; the possible modifications are
detailed in the following subsections.

4.3.1. BPMs
The BPM parameter allows for a more flexible selection approach. By accepting values close to the
desired BPM, rather than fixing a specific one, the set of eligible samples can be enlarged without
significantly altering the samples' structure. For a desired BPM represented by 𝑥, the sample's BPM by
𝑏𝑝𝑚(𝑛), and a selection indicator 𝑐ℎ𝑜𝑜𝑠𝑒(𝑛), the selection rule is relaxed to accommodate minor BPM
variations by changing from (1) to (2).

                                       𝑐ℎ𝑜𝑜𝑠𝑒(𝑛) ⇐⇒ 𝑏𝑝𝑚(𝑛) = 𝑥                                         (1)

                             𝑐ℎ𝑜𝑜𝑠𝑒(𝑛) ⇐⇒ 𝑥 − 𝛼𝐾 ≤ 𝑏𝑝𝑚(𝑛) ≤ 𝑥 + 𝛼𝐾                                     (2)
where 𝐾 represents half of the maximum interval of the neighbourhood, and 𝛼 is a user-selectable
parameter ranging from 0 to 1. Table 7 demonstrates that neighbourhood selection rules enable a

                                  Selection rule              Mean samples per BPM value

                                  rigid selection                        506
                                  neighbourhood selection*              2204

                         *
                             neighbourhood selection parameters: 𝐾 = 20 and 𝛼 = 0.5
Table 7
Mean number of samples per BPM value in the ComMU dataset under different selection rules.

broader range of samples to be included for the same BPM value. To prevent the neighbourhood interval
from becoming excessively large and causing matching issues with the samples, the parameter 𝐾
remains constant. Additionally, with the neighbourhood selection rule, BPM values that are not present
in the dataset can be chosen as long as they fall within the interval of an existing BPM value. This
marks another advancement compared to MusiComb, where only BPM values existing in the dataset
were available for selection.
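Rules (1) and (2) can be expressed as a single predicate; a minimal sketch, in which setting 𝛼 = 0 collapses the neighbourhood rule back to the rigid equality of rule (1):

```python
def choose(sample_bpm, desired_bpm, k=20, alpha=0.5):
    """Neighbourhood selection rule (2): accept a sample whose BPM lies
    within alpha * K of the desired value x; alpha = 0 recovers rule (1)."""
    return desired_bpm - alpha * k <= sample_bpm <= desired_bpm + alpha * k
```

With the parameters of Table 7 (K = 20, alpha = 0.5), a request for 120 BPM accepts every sample between 110 and 130 BPM, including desired BPM values that never occur in the dataset.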

4.3.2. Harmonic Key
While the harmonic key is typically considered a less flexible parameter, since samples with different
keys may not sound harmonious together, it is still possible to transpose a music piece to the desired
harmonic key. We maintain consistency in the dataset by transposing all samples to the same harmonic
key (A minor/C major) during the feature extraction process. The user can then freely choose the desired
harmonic key. This ensures that most samples in the dataset can be selected, contingent on whether
the user’s chosen key is minor or major. Subsequently, a straightforward transposition to the desired
key is performed after generating the final music piece.
                     Harmonic Key       Rigid selection (ComMU)    Transposition (ComMU)

                        A minor                  4729                      4729
                        C major                  6415                      6415
                    other minor keys               /*                      4729
                    other major keys               /*                      6415

                 *
                     The original ComMU dataset only contains A minor and C major keys.
Table 8
Number of samples that can be chosen for a given harmonic key, for both rigid selection and selection
with transposition.


4.3.3. Chord Progression & Time Signature
We also study how chord progression and time signature can be treated less strictly. The incorporation of
multiple chord progressions for each melody sample, as described in Section 4.2, relaxes the selection rule
for chord progressions. This expands the range of possible combinations, enabling a single melody to
harmonize with various chord progressions. Moreover, while two samples with different time signatures
cannot be seamlessly layered together, users can be allowed to choose among multiple time signatures;
this leads to music samples with time signature changes.


5. Conclusions
In our efforts to expand and enhance the existing MusiComb framework, we introduced a series of
methodologies and techniques that are crucial for its future development. Specifically, we presented
two models for Track Role and Chord Progression estimation, along with a set of new rules for the sample
selection process, which significantly bolster the framework’s capabilities, and an automatic sample
extraction algorithm. Our work underscores the importance of data quality in modern Generative AI
systems, demonstrating that it plays a pivotal role alongside the capabilities of the framework itself.
Combinatorial systems, such as MusiComb, heavily depend on the quality of the information available
about the involved elements. It is evident that even a state-of-the-art framework, when operating under
incorrect preconditions (such as inaccurate sample information), may produce low-quality outputs.
Conversely, datasets annotated by humans may be limited in size and require substantial time to expand.
Striking a balance between data quality and the time needed to acquire it is therefore crucial for a
continually growing and evolving framework like MusiComb. Our systems have demonstrated robust
consistency with the data labels, serving not merely as tools for feature extraction but also introducing
new degrees of freedom. Even by solely utilizing the original ComMU dataset, we can explore new and
diverse generations while maintaining a high level of reliability in the quality of output.
   Furthermore, there is ample opportunity for new introductions and future advancements. As previ-
ously mentioned, MusiComb is an ever-expanding framework capable of further expression beyond its
current capabilities. Among the potential areas for research, significant developments may involve: (1)
modifying the combinatorial backbone to enable the framework to integrate various chord progressions
and time signatures within the same musical piece; and (2) utilizing the Transformer-XL, as presented
in [1], to generate missing samples or roles for specific generations, allowing the framework to
seamlessly incorporate both dataset and generated samples within the same piece.


Acknowledgments
This work has been supported by the project TAILOR (funded by European Union’s Horizon 2020
research and innovation programme, GA No. 952215).
References
 [1] L. Hyun, T. Kim, H. Kang, M. Ki, H. Hwang, K. Park, S. Han, S. J. Kim, ComMU: Dataset for
     combinatorial music generation, ArXiv abs/2211.09385 (2022).
 [2] L. Giuliani, F. Ballerini, A. De Filippo, A. Borghesi, MusiComb: a sample-based approach to music
     generation through constraints, in: 2023 IEEE 35th International Conference on Tools with
     Artificial Intelligence (ICTAI), IEEE Computer Society, Los Alamitos, CA, USA, 2023, pp. 194–198.
 [3] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D.
     Hoffman, M. Dinculescu, D. Eck, Music transformer, 2018. arXiv:1809.04281.
 [4] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts,
     M. Tagliasacchi, M. Sharifi, N. Zeghidour, C. Frank, MusicLM: Generating music from text, 2023.
     arXiv:2301.11325.
 [5] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, I. Sutskever, Jukebox: A generative model for
     music, ArXiv abs/2005.00341 (2020).
 [6] Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. H. Frank,
     J. Engel, Q. V. Le, W. Chan, W. Han, Noise2music: Text-conditioned music generation with diffusion
     models, ArXiv abs/2302.03917 (2023).
 [7] S. Dadman, B. A. Bremdal, B. Bang, R. Dalmo, Toward interactive music generation: A position
     paper, IEEE Access 10 (2022) 125679–125695.
 [8] M. Allan, C. Williams, Harmonising chorales by probabilistic inference, in: L. Saul, Y. Weiss,
     L. Bottou (Eds.), Advances in Neural Information Processing Systems, volume 17, MIT Press, 2004.
 [9] D. Cope, Computer modeling of musical intelligence in EMI, Computer Music Journal 16 (1992)
     69–83. URL: http://www.jstor.org/stable/3680717.
[10] S. A. Raczyński, S. Fukayama, E. Vincent, Melody harmonization with interpolated probabilistic
     models, Journal of New Music Research 42 (2013) 223–235.
[11] J.-F. Paiement, D. Eck, S. Bengio, Probabilistic melodic harmonization, in: L. Lamontagne, M. Marc-
     hand (Eds.), Advances in Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg,
     2006, pp. 218–229.
[12] D. Herremans, E. Chew, Morpheus: Generating structured music with constrained patterns and
     tension, IEEE Transactions on Affective Computing 10 (2019) 510–523.
[13] C. Anderson, A. Eigenfeldt, P. Pasquier, The generative electronic dance music algorithmic system
     (GEDMAS), Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital
     Entertainment 9 (2021) 5–8. URL: https://doi.org/10.1609%2Faiide.v9i5.12649. doi:10.1609/aiide.
     v9i5.12649.
[14] F. Pachet, A. Papadopoulos, P. Roy, Sampling variations of sequences for structured music
     generation, in: International Society for Music Information Retrieval Conference, 2017.
[15] C. Nardi, Library music: technology, copyright and authorship, Current Issues in Music Research:
     Copyright, Power and Transnational Musical Processes. Lisboa: Edições Colibri (2012) 73–83.
[16] A. Zils, F. Pachet, Musical mosaicing, in: Digital Audio Effects (DAFx), volume 2, 2001, p. 135.
[17] C. Raffel, Learning-based methods for comparing sequences, with applications to audio-to-midi
     alignment and matching, Columbia University, 2016.
[18] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, R. Bittner, The MUSDB18 corpus for music
     separation, 2017. URL: https://doi.org/10.5281/zenodo.1117372. doi:10.5281/zenodo.1117372.
[19] D. Ryan, Mathematical harmony analysis (2016).
[20] H. Lim, S. Rhyu, K. Lee, Chord generation from symbolic melody using BLSTM networks, 18th
     International Society for Music Information Retrieval Conference (ISMIR 2017) (2017).