                         Hybrid Symbolic-Waveform
                         Modeling of Music – Opportunities and Challenges
                         Jens Johannsmeier1,* , Sebastian Stober1
1 Otto-von-Guericke University Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany


                                        Abstract
                                        Generative modeling of music is a challenging task due to the hierarchical structure and long-term dependencies
                                        inherent to musical data. Many existing methods, especially powerful deep learning models, operate exclusively on
                                        one of two levels: The symbolic level, e.g. notes, or the waveform level, i.e. outputting raw audio without reference
                                        to any musical symbols. In this position paper, we argue that both approaches have fundamental issues which limit
                                        their potential for applications in computational creativity, particularly open-ended creative processes. Namely,
                                        symbolic models lack grounding in the reality of sound, whereas waveform models lack a level of abstraction
                                        beyond sound, such as notes. We argue that hybrid models, encompassing components at both levels, combine the
                                        best of both, while circumventing their respective disadvantages. While such models already exist, they generally
                                        consist of separate components that are combined only after training has finished. In contrast, we advocate for
                                        fully integrating both levels from the start. We discuss the opportunities afforded by this approach, as well as
                                        ensuing challenges, along with possible solutions. Our belief is that end-to-end hybrid modeling of musical data
                                        can substantially advance the quality of generative models as well as our understanding of musical creativity.

                                        Keywords
                                        computational creativity, generative models, music, hybrid models, position paper




                         1. Introduction
Music, besides images and text, is currently one of the most researched domains, both in the field of (deep)
generative modeling and in the computational creativity community. It is also exceptionally challenging
                         due to factors such as the enormous volume of data, hierarchical and repetitive structure, and extremely
                         long-term dependencies. Successfully generating pieces of music with sensible structure spanning
minutes has only recently become possible. A main driving factor for this success has been generative
                         models using deep neural networks trained on vast amounts of data [1]. Existing approaches for modeling
                         musical data often work on one of two levels [2]: At the symbolic level, we regard music as a sequence of
                         symbols. Examples are sheet music, ABC notation, piano rolls or MIDI. These symbols can be turned into
                         sound via pre-built digital instruments, or by a human performer. At the audio, or waveform level, we
                         directly model the realization of music as sound, without reference to any symbols. We may also include
                         spectrogram-based models in this definition. Both levels have respective advantages, disadvantages and
                         limitations, which are to some extent complementary. While symbols are generally easier to handle due
                         to their smaller vocabulary and high degree of abstraction, they are also limited with regard to what they
                         can express. On the other hand, modeling waveforms is conceptually straightforward, but difficult in
                         practice due to the high resolution and extreme repetitiveness inherent to oscillating audio waves.
                            Aside from these more technical aspects, we argue that neither approach is ideal for many applications
                         in computational creativity. These often go beyond merely modeling a fixed dataset via straightforward
                         reduction of some loss function, as is commonly done in deep generative modeling. One such task that
                         we want to pay special attention to is that of open-ended learning and creation [3]. This requires models
                         to continuously adapt within an ongoing, ever-changing process, necessarily moving beyond any fixed
                         dataset eventually. In this paper, we present arguments for why neither purely symbolic nor waveform

                         International Workshop on Artificial Intelligence and Creativity, co-located with ECAI, October 19–24, 2024, Santiago de Compostela,
                         Spain
* Corresponding author.
jens.johannsmeier@ovgu.de (J. Johannsmeier); stober@ovgu.de (S. Stober)
https://ai.ovgu.de/Staff/Johannsmeier.html (J. Johannsmeier); https://ai.ovgu.de/Staff/Stober.html (S. Stober)
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


models are appropriate for such open-ended tasks. We base our arguments not on properties of specific
model architectures or symbolic systems, but rather on fundamental properties of either level of modeling.
   Instead, we argue for a hybrid approach which encompasses aspects of both symbolic and waveform
levels. Specifically, we envision a setup where the model generates symbolic data, which is then carried
into the waveform level and evaluated there. Unlike in waveform models, these two aspects are
explicitly separated in the hybrid approach. At the same time, we do not completely lose access to the
waveform level, as do purely symbolic models. We believe that this combines the advantages of both
levels, while circumventing the fundamental limitations of either. While the idea of hybrid modeling is
not new per se (e.g. [4]), we put emphasis on combining it in an end-to-end fashion with the usage of deep
neural networks and gradient-based learning, as these have been shown to produce impressive results
for generative modeling in recent years (e.g. [5]). Previous works using hybrid models generally do not
train or evolve both components jointly, but rather combine them after the fact. Such joint end-to-end
training leads to unique challenges, to which we also discuss potential solutions. Our main goal with
this work is to stimulate investigations into solving outstanding issues with hybrid modeling.


2. The Symbolic Level and its Shortcomings
As previously discussed, symbolic approaches model music as sequences of symbols with some fixed mu-
sical interpretation. To date, most systems for so-called metacreation [6] in music operate at this level.
Most symbolic systems are fundamentally discrete, which imposes a certain maximum granularity on the
representations. In addition, many aspects of music, such as timbre, are not represented at all. Depending
on the context, these limitations can become problematic. In particular, it can be difficult to model aspects of
performance, such as variations in tempo or dynamics. Music lacking these factors may sound robotic and
uninspired. It can also be argued that the dimension of sound is a crucial part of musical creativity itself [7].
   On the other hand, a limited vocabulary generally simplifies modeling – if there are not that many
choices, it is easier to make the correct one. Also, symbols generally represent sensible musical concepts.
As such, the model only needs to choose between these concepts at any given time, making it less
likely (or even impossible) to only produce unstructured noise, for example. Furthermore, the fact that
we can simply use instruments (real or digital) to turn them into sound removes the need to model
fine-grained harmonic oscillations, as is the case for waveform models. Another major advantage of
common symbolic representations is that they are relatively compact in time: A handful of notes can
represent several seconds of audio. This presents a reduction by orders of magnitude over modeling
at the waveform level, where a few seconds of audio may already require tens of thousands of values.
This, in turn, makes it simpler for models to represent the long-term dependencies we see in music.
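To make the scale of this reduction concrete, here is a rough back-of-the-envelope comparison; the sampling rate and note count are purely illustrative assumptions.

```python
# Rough comparison of sequence lengths at the two levels (illustrative values).
seconds = 5
sample_rate = 44_100                      # CD-quality samples per second
notes = 10                                # plausible number of note events in 5 s
waveform_values = seconds * sample_rate   # 220500 raw audio samples
print(waveform_values, waveform_values / notes)   # ~22000x more values than notes
```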
   And yet, we want to highlight a fundamental issue that all symbolic representations share: They are,
themselves, meaningless, with any musical meaning imposed by human interpretation of the symbols.
Before we expand on this, to understand why it is an issue we need to discuss the relation between
data-driven modeling and computational creativity.

2.1. On Data-Driven Modeling
Arguably, the most (superficially) impressive computer-generated artworks are produced by contempo-
rary generative models based on deep learning, such as Stable Diffusion [8] for images, ChatGPT [9] for text,
or MusicGen [10] for music. These models are trained to approximate the underlying probability distribu-
tions of vast amounts of already existing data. While it is impossible to deny the massive improvements in
this field in recent years, the computational creativity community tends to concern itself with different is-
sues. For example, in the taxonomy of Ventura [11], even these impressive models lack important capacities
such as judging and filtering their own results, referring to explicit knowledge bases and/or previous gen-
erations, as well as any sort of intentionality. Berns and Colton [12] put forward similar arguments specif-
ically for deep generative models, noting they are “currently only good at producing more of the same”.1

1 It should be noted that both cited papers were written before the recent wave of large generative models.
   A fundamental limit of all these approaches is that they have a fixed target distribution that they
attempt to reach, and once they have fulfilled this goal, they are essentially static. We believe that a more
fruitful basis for computational creativity is the idea of open-endedness [13]. Here, we model the creative
process itself: a process that is never finished, but rather continually evolving with reference to itself.
   The challenge is to design this process in such a fashion that it does not evolve into meaningless chaos.
A necessary condition to achieve this, we argue, is that the data have some degree of inherent meaning,
a grounding in something real and tangible. To see this, consider asking someone to compose pieces
of music in a symbolic system they are entirely unfamiliar with. They are informed about the vocabulary,
but not what any of it means. Also, they are not allowed to listen to any renditions of their pieces. As
such, they are unable to judge what any of their compositions sound like. Clearly, this will not work.2
There are two solutions: First, show them many examples of compositions in that system, so they may
learn from and copy (aspects of) them. This is what straightforward generative modeling does. Second,
inform them about the meaning of the symbols, so they may use them skillfully and with intention. In
a similar vein, they could be allowed to listen to their compositions, and thus improve by trial-and-error.
As we want to avoid (or at least go beyond) option 1, we necessarily end up with option 2 – grounding
the symbols with meaning, by connecting them with what they actually sound like.
   A full treatment of open-endedness is beyond the scope of this work. We discuss some approaches in
Section 5. For now, we simply observe that symbols require grounding, and that this grounding must be made
available to the models in some fashion, if we are to achieve meaningful (or even sensible) open-ended creative output.

2.2. Back to the Symbolic Level
In light of these circumstances, we argue that purely symbolic data does not meet this grounding
condition. Let us take western sheet music and twelve-tone equal temperament as an example. Within
this context, certain tonal intervals are viewed as consonant (low tension) or dissonant (high tension). For
example, the interval of a fifth (seven semitones) is often judged as a consonant interval. However, this is
obviously not due to the fact that the notes are seven semitones apart, but rather that the base frequencies
of the two tones are in a certain relation (approximately 3:2 in this case) which people generally find
pleasant [14, 15]. Accordingly, if one were to play in a microtonal system with more than twelve tones
per octave, this correspondence would no longer hold, and an interval of seven steps may or may not
be consonant. This shows that rules on the symbolic level cannot stand on their own; they are
rather shaped by the actual tones that the symbols represent.
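As a minimal numeric sketch of this argument (the 19-tone system below is an arbitrary, hypothetical choice of microtonal tuning, used only for illustration):

```python
# In twelve-tone equal temperament, seven steps approximate the 3:2 ratio of a
# just fifth; seven steps of a hypothetical 19-tone equal temperament do not.
def step_ratio(steps: int, steps_per_octave: int) -> float:
    """Frequency ratio spanned by `steps` in an equal temperament."""
    return 2 ** (steps / steps_per_octave)

print(step_ratio(7, 12))   # ~1.498, close to 3:2 -> the familiar consonant fifth
print(step_ratio(7, 19))   # ~1.291, not close to 3:2 -> no longer a fifth
```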
So how is it that contemporary symbolic models can create pleasant music? This is because they
essentially learn to mimic existing data, i.e. compositions by humans, which have been created with
sound in mind. A human may use the interval of a fifth because they want to achieve a certain sound.
A symbolic model will use the interval of a fifth because it has a high probability given what it has
learned from the training data. Thus, a symbolic model cannot sensibly create new symbolic rules, since
it does not have access to the meaning of the symbols. It follows that such models are inappropriate
for data-independent learning tasks. That this is actually an issue can be seen, for instance, in existing
approaches using genetic algorithms to evolve music. These tend to hard-code human assumptions into
their fitness function, for example rewarding consonant intervals and punishing undesirable melodic
sequences [16, 17]. Alternatively, human-in-the-loop methods employ a human “fitness function”
to evaluate outputs [18, 19, 20]. While an important field of study in its own right, this approach is slow and
cumbersome to scale for automatic music generation.
   A more flexible approach is used by Mitrano et al. [21]: They use pre-trained Recurrent Neural Net-
works (RNNs) to replace human critics in fitness functions. The networks are trained on symbolic datasets
of human-composed music. In a similar vein, Ostermann et al. [22] use the discriminator of a Generative
Adversarial Network (GAN) to differentiate between real and artificial compositions. Since neural net-
works can express highly complex functions, the fitness functions they represent can be similarly complex.

2 We disregard the possibility that a person may be able to guess the meaning of symbols, as this clearly cannot be done by a straightforward generative computer system without any world knowledge.
Despite these advances, none of these approaches tackles the fundamental issue, as they still formulate purely
symbolic rules: in all these cases, the musical meaning is imposed from the outside, since there is no way
for it to arise from the symbols themselves. We want to emphasize that we do not mean to imply that using human
preconceptions, or training on a fixed dataset, is somehow less desirable than open-ended processes.
Also, the problem of grounding goes far beyond the use of symbolic or waveform data; see e.g. [23, 24]
for more detailed discussions.


3. The Waveform Level and its Shortcomings
Many state-of-the-art music generation models approach the task directly at the waveform level. Due to
the high temporal resolution of high-quality audio, this requires very large models capable of modeling
extremely long-term dependencies. The main advantage here is that such models can, in principle, produce
arbitrary audio waveforms, without relying on limited symbolic systems and digital instruments.
In particular, even complex polyphonic pieces of music can be modeled as a single stream of samples.
   And yet, humans would likely be completely overwhelmed if asked to create a musical piece sample
by sample. Rather, we operate at a more abstract, quasi-symbolic level, where we manipulate pre-built
instruments via a set of gestures. While this obviously imposes limitations, it also makes “making music”
feasible in the first place. For example, when playing the piano, we do not need to worry about being
off-tune, or even about being tonal at all; we simply hit a key to produce a fixed pitch. When playing the violin,
we do not need to model the correct timbre, nor is there any risk of accidentally producing, say, a flute
sound or an undesirable constant buzz in the background. Thus, we argue that waveform models have
two main disadvantages:

   1. They spend enormous capacity on learning what audio is in the first place (oscillations) and how
      to produce the correct kind of audio (specific instruments). Further capacity is spent on having
      to model each sample individually.
   2. The lack of well-defined symbolic controls makes them hardly useful for non-data-driven
      computational creativity tasks. This is because emergent behavior and new rules should be much
      simpler to produce in the restricted lower-dimensional symbolic space.


4. Hybrid Approach
As we have seen, neither purely symbolic nor waveform modeling seems satisfactory for open-ended
creativity. We argue that a hybrid modeling approach can provide all the components necessary for
creative music systems. We envision a setup that includes aspects of both symbolic and waveform levels:
In a first step, musical symbols are created; these symbols have a specific meaning and are thus inherently
interpretable. Then, these symbols are sonified through dedicated instruments, which may be fixed
or themselves part of the model. Crucially, the evaluation of the produced musical sequences happens
at the waveform level. This guarantees that the symbolic sequences are grounded through the sound
they represent. With the hybrid setup, we have removed the main disadvantages of both levels: Symbols
are no longer meaningless, as they are connected to the audio level within the model itself. On the other
hand, the model no longer needs to produce musical audio sample by sample, instead being able to rely
on symbolic abstractions and instruments.
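To make the envisioned setup concrete, the following is a minimal sketch of such a pipeline; the module names, the fixed sinusoidal “instrument” and all sizes are illustrative assumptions on our part, not a reference implementation.

```python
import math
import torch
import torch.nn as nn

class SymbolicGenerator(nn.Module):
    """Maps a latent code to soft note activations over a small set of pitches."""
    def __init__(self, latent_dim=16, steps=8, n_pitches=12):
        super().__init__()
        self.net = nn.Linear(latent_dim, steps * n_pitches)
        self.steps, self.n_pitches = steps, n_pitches

    def forward(self, z):
        logits = self.net(z).view(-1, self.steps, self.n_pitches)
        return torch.sigmoid(logits)            # activations in [0, 1] per step/pitch

class SineInstrument(nn.Module):
    """Fixed additive-sine instrument: one oscillator per pitch, gated per step."""
    def __init__(self, sample_rate=16000, step_seconds=0.25, base_freq=220.0):
        super().__init__()
        n = int(sample_rate * step_seconds)
        freqs = base_freq * 2 ** (torch.arange(12) / 12)       # equal temperament
        t = torch.arange(n) / sample_rate
        self.register_buffer("bank", torch.sin(2 * math.pi * freqs[:, None] * t))

    def forward(self, activations):             # (batch, steps, n_pitches)
        frames = torch.einsum("bsp,pn->bsn", activations, self.bank)
        return frames.reshape(activations.shape[0], -1)        # concatenate steps

generator, instrument = SymbolicGenerator(), SineInstrument()
audio = instrument(generator(torch.randn(4, 16)))              # (4, 32000) samples
# Any waveform-level objective (e.g. a spectral loss, or the score of a learned
# audio critic) can now back-propagate through the instrument into the generator.
```

In this sketch the instrument is fixed and fully differentiable; making it learnable, or replacing the soft activations with discrete choices, leads directly to the challenges discussed in Section 4.2.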

4.1. Revisiting the Grounding Problem
We have argued that a symbolic approach is not suitable for data-independent learning, as the symbols
are not grounded. We claimed that this grounding can be achieved by linking symbols to sound within the
model itself. But without any data, how would the audio be grounded? If there is no reference, there is no
reason for a model to prefer musical audio to, say, random noise, or some more structured but non-musical
patterns. Thus, we still require some kind of grounding even at the waveform level. We argue that, if one
wants to investigate the process of creativity from the ground up, this should not take the form of musical
knowledge, e.g. hardcoding a preference for certain pitches, or even harmonic oscillations at all. Rather,
a more fundamental grounding may be sufficient. Heath and Ventura [25] argue that “before a computer
can draw, it must first learn to see”. We believe a similar statement can be made for music and hearing.
    As such, our hypothesis is that training a deep neural network as a general feature extractor for (not
necessarily musical) audio can provide enough grounding for open-ended generation to be feasible. Such
a model should learn to efficiently encode perceptually relevant features, such as harmonic oscillations,
if these are found in the data [26]. This could provide sufficient grounding without forcing any “musical
ideas” onto the model, which in turn may allow for investigation of computational creativity “from the
ground up”. We do not see any comparable opportunity for models operating purely at the symbolic
level without any grounding.
    Perhaps a similar direction is afforded by the idea of using hallucinations as creative outputs [27]. Meth-
ods such as Deep Dream [28] certainly produce aesthetically pleasing outputs, and the same framework
can be leveraged for audio, as well. Wyse [29] similarly puts emphasis on creative behaviors emerging
from networks that are purely intended for perception (i.e. classifiers). Now, one might counter that such
a model is unlikely to hallucinate music specifically, rather than arbitrary sounds. However, if we limit
the expressiveness of the model to symbolic control of pre-made instruments, we believe we have created
the necessary preconditions for expression that is musical, and yet less constrained by human biases.
    Finally, we need to emphasize that this is conjecture. We intend to put these hypotheses to the test
in future work, and we hope to convince other researchers that this is a worthwhile direction of inquiry.
In particular, general computational theories for the rise of creativity in artificial systems have been
proposed in the past [30, 31, 32]. We believe it would be exciting to apply such models of creative
evolution within the context of hybrid symbolic-waveform models for music.

4.2. Challenges
Jointly modeling symbolic and waveform levels comes with its own issues. Chief among them may be
gradient propagation. Most state-of-the-art generative audio models are trained using gradient descent
to minimize a specific loss function [33]. Given how successful this approach has been, it makes sense
to want to adopt it for training hybrid models.
   Gradient descent requires the entire model, as well as the loss, to be differentiable all the way through.
However, many operations at the symbolic level are fundamentally discrete. For example, a specific
piano key is either pressed, or not. Usually, only a few keys, at most, are pressed down at any given time.
Symbolic generative models, however, most often return a soft probability distribution over keys (or
symbols, more generally), which can then be sampled from to choose a key. This sampling operation
is not differentiable. This is fine for a purely symbolic model, as during training, we can directly compare
the soft outputs to the targets, so no sampling is necessary. Conversely, when using the trained model
to generate pieces of music, sampling is fine as no gradients are required.
   However, when training a hybrid model, we most likely do not want to keep the soft distribution, as
this corresponds to pressing all keys at the same time, at varying strengths (which may be likened to MIDI
velocity). Thus, we arrive at the main dilemma of the hybrid approach: To sensibly transform symbols
into audio, we require discrete operations, but these do not allow for backpropagation of gradients,
seemingly preventing use of the most successful learning method of our time, gradient descent.
There are several ways to tackle this issue. First, the Gumbel-Softmax, or concrete, distribution [34, 35]
uses the reparameterization trick to draw samples from a soft distribution in a differentiable
manner. Furthermore, these samples can be smoothly interpolated towards being approximately discrete
using a temperature hyperparameter. Usually, this temperature is annealed during training, such that samples are
soft initially, and become progressively closer to the discrete ideal over time. After training is finished,
one can simply draw discrete samples instead. While powerful and simple, this approach has limits;
for some actions it simply does not make sense to have a soft choice. Also, we are strictly speaking still
training on soft distributions, incurring a gap between training and deployment of the model.
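As a brief sketch of what this looks like in practice: PyTorch ships a Gumbel-Softmax implementation, and the shapes and temperature values below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 88, requires_grad=True)   # e.g. scores over 88 piano keys

# Early in training: soft, fully differentiable samples at high temperature.
soft = F.gumbel_softmax(logits, tau=5.0, hard=False)

# Later (or with hard=True): nearly discrete one-hot samples whose backward pass
# uses the straight-through estimator, so gradients still reach the logits.
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)

hard.sum().backward()
print(hard[0].argmax().item(), logits.grad is not None)
```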
   Aside from that, there are methods that do not require gradients. Genetic or evolutionary algorithms
are such approaches. These have been used extensively in computational creativity research. However,
such methods are most often used for relatively constrained models with small search spaces. In contrast,
random mutations are unlikely to succeed in the context of deep neural networks. This is due to their very
large number of parameters; the bigger the search space, the less likely it is that unprincipled random
changes to the weights will lead to improvements. As such, a purely genetic approach seems feasible
only for very small networks. One example would be the NEAT algorithm [36] applied to Compositional
Pattern Producing Networks (CPPNs) [37]. These networks have been shown to be able to produce
complex structures with only a handful of neurons. It is also possible to evolve network structures and
then further train the weights [38].
   Alternatively, methods from reinforcement learning could be employed. In particular, policy gradi-
ents [39] are used in deep reinforcement learning to compute neural network gradients despite making
discrete choices among a set of possible actions. This is done by rewriting the gradients such that they can
be approximated by sampling actions. The downside of this method is that the sampling can give imprecise
approximations of the true expected gradient. To offset this, we would need to sample many actions
independently for each gradient step. Combined with the high sampling rates common in high-quality
audio, this can quickly become unmanageable in terms of computational resource requirements.
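A minimal REINFORCE-style sketch of this idea follows; the reward is a random placeholder standing in for a waveform-level score of the rendered audio, and all shapes are illustrative assumptions.

```python
import torch

logits = torch.randn(4, 8, 88, requires_grad=True)     # batch, time steps, keys
dist = torch.distributions.Categorical(logits=logits)

keys = dist.sample()                   # discrete choices; not differentiable
reward = torch.randn(4)                # placeholder for a score of the audio
                                       # rendered from `keys` (not implemented here)

# Policy-gradient surrogate: its gradient approximates the gradient of E[reward].
log_prob = dist.log_prob(keys).sum(dim=1)               # sum over time steps
loss = -(reward * log_prob).mean()
loss.backward()                        # gradients reach `logits` despite sampling
```

In practice this estimate is noisy, which is exactly the variance issue described above.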
   Finally, there is already a kind of hybrid symbolic model in use in state-of-the-art generative models,
namely the Vector-Quantized Variational Autoencoder (VQ-VAE) [40]. When applied to audio, this will
superficially act like a regular autoencoder, outputting audio directly, with no reference to the symbolic
level. However, the latent space is discretized through the method of vector quantization. Strictly speaking,
this limits the autoencoder to producing realizations of these discrete vectors, which could be classified
as symbols. However, these symbols are not pre-determined, like an instrument with specific controls,
for example. Rather, the quantization is learned along with the autoencoder. Furthermore, the vectors
need not map to any interpretable or discernible concepts. Rather, they tend to simply perform a kind of
clustering of the latent space. Next, the codes are usually processed through many layers of convolutions,
which intermingles the effects of the different codes and makes it difficult to ascribe a specific meaning to
any one of them. Still, given the issues outlined with other approaches, this could be a promising starting
point if the discrete codes could somehow be forced towards a more meaningful representation. In general,
VAE latent spaces have been judged as appropriate to learn conceptual spaces for musical data [41].
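The quantization step itself is simple to sketch; the codebook size, dimensions and straight-through gradient below are a minimal illustration of the general technique, not the full VQ-VAE training procedure (which adds codebook and commitment losses).

```python
import torch

codebook = torch.randn(64, 32)                  # 64 learned code vectors ("symbols")
z = torch.randn(16, 32, requires_grad=True)     # continuous encoder outputs

codes = torch.cdist(z, codebook).argmin(dim=1)  # discrete code index per latent
quantized = codebook[codes]                     # nearest code vector

# Straight-through estimator: the forward pass uses the discrete codes, while the
# backward pass copies gradients from `quantized` back to the encoder output `z`.
quantized_st = z + (quantized - z).detach()
quantized_st.sum().backward()
print(codes[:5], z.grad is not None)
```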
   As we can see, there are a number of established methods that may be leveraged for hybrid models.
Future work needs to compare how the different methods work in practice. After a well-performing
method has been found, hybrid models can then be evaluated in open-ended creative learning tasks.


5. Related Work
In this section, we will first review exemplary state-of-the-art models at the symbolic and waveform levels,
before relating our arguments to more integrative works from the computational creativity community.
   There is a long history of modeling at the level of musical symbols. Some early examples are Mozer [42]
or Eck and Schmidhuber [43]. Modern approaches tend to use powerful Transformer [44] models,
trained on large datasets, to generate symbolic sequences autoregressively. Examples include Music
Transformer [45] and MuseNet [46]. This approach is conceptually very simple, but is the same idea
that has been shown to be extremely effective in the recent surge of Large Language Models [47].
   Other works have incorporated hierarchical structures into the models [48, 49, 50]. These are clearly
present in music [51]. For example, modern popular music generally consists of a sequence of structures
such as verse, chorus, bridge, etc., which in turn consist of repeating motives, and so on. Still, it seems
more common to ignore any such inductive biases and simply model musical data as a flat sequence.
   For raw audio, early successful work in modeling at scale relied on autoregressive models to generate
samples one by one. A key breakthrough here was WaveNet [52]. While originally developed for
text-to-speech applications, it was also shown to work for generating music at the waveform level. The
main disadvantage of WaveNet is that it is extremely slow due to the sample-by-sample generation,
taking minutes to generate just one second of audio, although this was improved by later work [53, 54].
   A different approach is taken by the Differentiable Digital Signal Processing (DDSP) model [55]: Here,
audio is modeled via a fundamental frequency and a distribution of overtones of that frequency, which is
turned into waveforms via sinusoidal oscillators. Alternatively, more flexible learnable wavetable synthe-
sizers can be used. Non-tonal audio can be created via filtered noise. Other effects, such as reverb, can be
added in a differentiable fashion. These choices mean that the model does not rely on autoregressive gen-
eration of samples, and does not need to learn to generate oscillations in the first place. There is, however,
a loss of flexibility: In particular, polyphonic audio is difficult to model with only a single F0 contour.
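A stripped-down sketch of this synthesis idea is given below; the constant controls and all numeric values are chosen purely for illustration, whereas the actual DDSP model predicts time-varying controls with a neural network and adds filtered noise.

```python
import math
import torch

sample_rate, n_harmonics = 16000, 8
t = torch.arange(sample_rate) / sample_rate          # one second of time stamps

f0 = torch.tensor(220.0, requires_grad=True)         # fundamental frequency in Hz
amps = torch.softmax(torch.randn(n_harmonics, requires_grad=True), dim=0)
freqs = f0 * torch.arange(1, n_harmonics + 1)        # overtones at integer multiples

audio = (amps[:, None] * torch.sin(2 * math.pi * freqs[:, None] * t)).sum(dim=0)
# Every operation is differentiable, so a waveform- or spectrogram-level loss on
# `audio` can back-propagate into f0 and the parameters behind the amplitudes.
```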
   Recent models use yet another method: JukeBox [56] was the first model to tackle generation of diverse
multi-genre, polyphonic music directly at the waveform level, including various conditioning options. The
model uses a hierarchical VQ-VAE to downsample the high-resolution audio data; an autoregressive Trans-
former model is trained to generate sequences of codes which are then decoded back to audio space. This
is somewhat close to our proposed approach, but as we mentioned, the learned code “symbols” are rarely
interpretable. Furthermore, the codes still have a relatively high sampling rate (on the order of 100s of Hz).
Beyond JukeBox, the recent generation of text-to-music models, such as MusicLM [5] or MusicGen [10],
use similar approaches, although the hierarchical VQ-VAE is replaced by a single-level residual one.
   As for hybrid models, we are not aware of previous works that employ these for open-ended music gener-
ation. Instead, we review works that have used hybrid models in some capacity. Manzelli et al. [4] con-
dition a WaveNet model on symbolic (MIDI) sequences generated by a biaxial LSTM [57]. They raise many
of the same concerns we have with either purely symbolic or waveform models. However, in their setup the
symbolic model is trained first, and its outputs are used as-is for the waveform model. There is no real inter-
action between the levels, with the symbolic model receiving no further learning signal from the waveform
level. This once again removes the possibility of “learning to play the instrument” from audio feedback.
   Wu et al. [58] propose MIDI-DDSP, an extension of the original DDSP architecture. This is a three-level
model that maps notes to expression parameters, and expression parameters to synthesis parameters,
which are in turn used by a DDSP model to produce audio. This allows for intuitive control of the DDSP
model by manipulating notes and/or how those notes are expressed. However, the three components of this
model are trained separately, so MIDI-DDSP is not an end-to-end hybrid model in the sense we envision.
   Prang and Esling [59] propose mapping symbolic data to a signal-like space to improve embeddings
learned with a Variational Autoencoder [60]. They show that their representation results in better
reconstructions than symbolic data, as well as more musically sensible structure in the latent space. This
presents yet more evidence that disregarding the waveform level may be harmful to any model working
on musical data, including creative output and generation.
   Finally, we can also look outside the musical domain. Colton et al. [61] propose Stable Evolusion, where
text prompts to a Stable Diffusion model are evolved, instead of evolving images directly. This is somewhat
analogous to the proposed symbolic-waveform hybrids: Instead of searching the high-dimensional,
semantically unstructured image space, the search is relegated to the more meaningful textual level.
   Regarding open-endedness, a detailed discussion of why this is itself a desirable goal is beyond the scope
of this paper. As such, we will only provide pointers to other works which have previously made the case
for this direction of research. Lehman and Stanley [3] propose novelty search as a method to solve complex
machine learning problems (see also [62]). Here, the “fitness function” rewards new emerging behaviors,
rather than progress on an objective function. Complex patterns emerge automatically, as there are
simply not many options to do something novel with simple behaviors. We believe such a paradigm to be
particularly well-suited for creative systems. Later work proposes minimal criterion coevolution [63]; here,
two populations evolve together with minimal evolutionary pressure. Any organism that is capable of
solving some minimal task is allowed to evolve. Once again, we believe this to fit particularly well with a
two-sided musician-listener model, and in fact similar setups have been applied to music in the past [64, 65].
For a more philosophical discussion focused on applications in music, see [66]. They argue against seeing
human evaluation, or objective evaluation against human standards (e.g. by computing loss metrics relative
to human-composed music), as the ideal way of evaluating artificial creative systems. Guckelsberger
et al. [67] offer a different perspective, suggesting a framework where agents strive to increase their (and
others’) empowerment. The authors put emphasis on multi-agent settings, where agents may also strive to
minimize or maximize the empowerment of other individuals. As such, their coupled empowerment
maximization also includes the social dimension, which is evidently important to many creative activities.
   Few works have tackled open-ended generation with large neural networks. Elgammal et al. [68]
propose Creative Adversarial Networks (CANs), a framework similar to Generative Adversarial
Networks. The generator is trained to fool a discriminator into classifying its outputs as real art, as with
standard GANs. However, a second loss term encourages the generator to produce outputs that cannot
be classified into any style known by the discriminator, essentially creating new styles that still conform
to the overall art distribution. Chemla-Romeu-Santos and Esling [69] propose a framework of divergence
maximization, where a pre-trained generative model is purposefully made to extrapolate beyond its
learned distribution in order to generate novel outputs. The authors also implemented prototypes
for simple image data [70]. Here, a VAE is first trained as normal; afterwards it is trained to produce
outputs diverging from the given class distributions, while regularized to still remain in the overall data
distribution. Similar to CANs, this objective is best fulfilled by new types of outputs that do not conform
to any known class, yet are still similar to the given data overall.
   As we can see, there are powerful models available both for symbolic as well as waveform generation.
Furthermore, the computational creativity community has recently provided several examples of
hybrid models, incorporating both levels of modeling, producing favorable results. Finally, open-ended
generation and the search for novelty has been a long-running topic both in computational creativity
and in areas such as artificial life. The ingredients are there; it is time to start cooking.


6. Conclusion
In this paper, we have presented arguments for the limitations of modeling music purely at either the
symbolic or waveform level. A purely symbolic representation lacks grounding, limiting its usefulness
especially with regards to open-ended creativity beyond human-imposed rules and preferences. Modeling
waveforms directly, on the other hand, makes the task significantly more complex due to the high
sampling rate and the need to learn concepts such as harmonic oscillations from scratch. It also does not
provide a good fit to human musical activity, which generally revolves around manipulating instruments
in a (quasi-)symbolic fashion. Instead, we make a case for hybrid models that work at the symbolic level,
but additionally perform the transformation to the waveform level, including a learning signal at that
level, as well. This combines the best aspects of both levels into one, and is particularly well-suited for
open-ended learning tasks with much creative potential. Systems built in such a way could learn to
play instruments based on the sound they actually create, rather than purely symbolic notions. They may
even modify existing instruments, or create entirely new ones to suit their “needs”, given the capacity.
    We also presented challenges with this approach, however. Chief among these is the transition from
symbols to waveforms, as this needs to be carried out in a fashion that allows learning signals to pass
through. This is particularly troublesome for gradient-based learning approaches. We believe that such
approaches, used in powerful deep neural networks, are the method of choice for difficult modeling
tasks due to their unparalleled potential to model very complex data distributions with long-term
dependencies. As we have presented, though, several methods to tackle this issue are already available.
A proper investigation of these options is needed next.
    For a long time, symbolic modeling of music has been the dominant paradigm in the computational
creativity community. However, we have seen that a variety of authors have recently made the case
for including information from the raw waveforms, as well. Such models require significantly more
computation, and this is true for hybrid models, too: While we may not need to learn how to produce
proper musical audio in this framework, we still have to represent and work with this high-dimensional
data. However, with the recent advances in computing hardware, and the software necessary to use
it [71, 72], this has become far more achievable.
    As such, we hope to inspire further research into technical solutions to the unique challenges hybrid
modeling of music brings with it. We believe that such an approach could power new insights into
open-ended creative processes for generating music, bring such systems closer to creative autonomy,
and teach us about the roots of our own creativity, as well.
References
 [1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (2015) 436–444.
 [2] G. Wiggins, E. Miranda, A. Smaill, M. Harris, A framework for the evaluation of music representation
     systems, Computer Music Journal 17 (1993) 31–42.
 [3] J. Lehman, K. O. Stanley, Exploiting open-endedness to solve problems through the search for
     novelty., in: ALIFE, 2008, pp. 329–336.
 [4] R. Manzelli, V. Thakkar, A. Siahkamari, B. Kulis, An end to end model for automatic music generation:
     Combining deep raw and symbolic audio networks, in: Musical metacreation workshop, 2018.
 [5] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts,
     M. Tagliasacchi, et al., Musiclm: Generating music from text, arXiv preprint arXiv:2301.11325 (2023).
 [6] P. M. Bodily, D. Ventura, Musical metacreation: past, present, and future, in: Proceedings of the
     sixth international workshop on musical metacreation, 2018.
 [7] M. Windsor, Using raw audio neural network systems to define musical creativity, in: AIMC, 2022.
     URL: https://doi.org/10.5281/zenodo.7088438. doi:10.5281/zenodo.7088438.
 [8] StabilityAI, Stable diffusion, 2022. URL: https://stability.ai/stable-image.
 [9] OpenAI, Chatgpt, 2022. URL: https://chat.openai.com/.
[10] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, A. Défossez, Simple and controllable
     music generation, arXiv preprint arXiv:2306.05284 (2023).
[11] D. Ventura, Mere generation: Essential barometer or dated concept, in: ICCC, 2016, pp. 17–24.
[12] S. Berns, S. Colton, Bridging generative deep learning and computational creativity., in: ICCC,
     2020, pp. 406–409.
[13] K. O. Stanley, Why open-endedness matters, Artificial life 25 (2019) 232–235.
[14] E. G. Schellenberg, S. E. Trehub, Natural musical intervals: Evidence from infant listeners,
     Psychological science 7 (1996) 272–277.
[15] C. L. Krumhansl, The cognition of tonality–as we know it today, Journal of New Music Research
     33 (2004) 253–268.
[16] R. A. McIntyre, Bach in a box: The evolution of four part baroque harmony using the genetic
     algorithm, in: Proceedings of the first ieee conference on evolutionary computation. ieee world
     congress on computational intelligence, IEEE, 1994, pp. 852–857.
[17] Z. Ren, Style composition with an evolutionary algorithm, in: AIMC, 2020.
[18] A. Jordanous, A fitness function for creativity in jazz improvisation and beyond, in: ICCC, 2010.
[19] Y. Zhou, Y. Koyama, M. Goto, T. Igarashi, Generative melody composition with human-in-the-loop
     bayesian optimization, arXiv preprint arXiv:2010.03190 (2020).
[20] S. Dasari, J. Freeman, Directed evolution in live coding music performance, 2020.
[21] P. Mitrano, A. Lockman, J. Honicker, S. Barton, Using recurrent neural networks to judge fitness
     in musical genetic algorithms, in: Musical metacreation workshop, 2017.
[22] F. Ostermann, I. Vatolkin, G. Rudolph, Artificial Music Producer: Filtering Music Compo-
     sitions by Artificial Taste, in: AIMC, 2022. URL: https://doi.org/10.5281/zenodo.7088395.
     doi:10.5281/zenodo.7088395.
[23] M. M. Al-Rifaie, M. Bishop, Weak and strong computational creativity, in: Computational creativity
     research: Towards creative machines, Springer, 2014, pp. 37–49.
[24] C. Guckelsberger, C. Salge, S. Colton, Addressing the “why?” in computational creativity: A
     non-anthropocentric, minimal model of intentional creative agency, in: ICCC, 2017.
[25] D. Heath, D. Ventura, Before a computer can draw, it must first learn to see, in: ICCC, 2016, pp.
     172–179.
[26] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE
     transactions on pattern analysis and machine intelligence 35 (2013) 1798–1828.
[27] L. Berov, K.-U. Kühnberger, Visual hallucination for computational creation, in: ICCC, 2016, pp.
     107–114.
[28] A. Mordvintsev, C. Olah, M. Tyka, Inceptionism: Going deeper into neural networks, Google Research Blog, 2015.
[29] L. Wyse, Mechanisms of artistic creativity in deep learning neural networks, in: ICCC, 2019.
[30] R. Saunders, Artificial creative systems and the evolution of language, in: ICCC, 2011, pp. 36–41.
[31] L. Gabora, S. DiPaola, How did humans become so creative? a computational approach, in: ICCC,
     2012.
[32] O. Bown, A model of runaway evolution of creative domains., in: ICCC, 2014, pp. 247–253.
[33] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors,
     nature 323 (1986) 533–536.
[34] E. Jang, S. Gu, B. Poole, Categorical reparameterization with gumbel-softmax, arXiv preprint
     arXiv:1611.01144 (2016).
[35] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution: A continuous relaxation of discrete
     random variables, arXiv preprint arXiv:1611.00712 (2016).
[36] K. O. Stanley, R. Miikkulainen, Evolving neural networks through augmenting topologies,
     Evolutionary computation 10 (2002) 99–127.
[37] K. O. Stanley, Exploiting regularity without development., in: AAAI Fall Symposium: Developmental
     Systems, 2006, p. 49.
[38] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, D. Wierstra,
     Convolution by evolution: Differentiable pattern producing networks, in: Proceedings of the
     Genetic and Evolutionary Computation Conference 2016, 2016, pp. 109–116.
[39] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement
     learning, Machine learning 8 (1992) 229–256.
[40] A. v. d. Oord, O. Vinyals, K. Kavukcuoglu, Neural discrete representation learning, in: NeurIPS,
     2017, pp. 6309–6318.
[41] M. Peeperkorn, R. Saunders, O. Bown, A. Jordanous, Mechanising conceptual spaces using
     variational autoencoders, in: ICCC, 2022.
[42] M. C. Mozer, Neural network music composition by prediction: Exploring the benefits of
     psychoacoustic constraints and multi-scale processing, Connection Science 6 (1994) 247–280.
[43] D. Eck, J. Schmidhuber, Learning the long-term structure of the blues, in: International Conference
     on Artificial Neural Networks, Springer, 2002, pp. 284–289.
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
     Attention is all you need, Advances in neural information processing systems 30 (2017).
[45] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D.
     Hoffman, M. Dinculescu, D. Eck, Music transformer: Generating music with long-term structure,
     in: ICLR, 2018.
[46] OpenAI, Musenet, 2019. URL: https://openai.com/research/musenet.
[47] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu,
     D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020).
[48] B. Smith, G. Garnett, Improvising musical structure with hierarchical neural nets, in: Proceedings
     of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 8,
     2012, pp. 63–67.
[49] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, D. Eck, A hierarchical latent vector model for learning
     long-term structure in music, in: ICML, 2018, pp. 4364–4373.
[50] G. Zixun, D. Makris, D. Herremans, Hierarchical recurrent neural networks for conditional melody
     generation with long-term structure, in: 2021 International Joint Conference on Neural Networks
     (IJCNN), IEEE, 2021, pp. 1–8.
[51] F. Lerdahl, R. Jackendoff, An overview of hierarchical structure in music, Music Perception (1983)
     229–252.
[52] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior,
     K. Kavukcuoglu, Wavenet: A generative model for raw audio, arXiv preprint arXiv:1609.03499 (2016).
[53] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart,
     L. Cobo, F. Stimberg, et al., Parallel wavenet: Fast high-fidelity speech synthesis, in: ICML, 2018,
     pp. 3918–3926.
[54] L. Hantrakul, J. H. Engel, A. Roberts, C. Gu, Fast and flexible neural audio synthesis., in: ISMIR,
     2019, pp. 524–530.
[55] J. Engel, C. Gu, A. Roberts, et al., Ddsp: Differentiable digital signal processing, in: ICLR, 2019.
[56] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, I. Sutskever, Jukebox: A generative model
     for music, arXiv preprint arXiv:2005.00341 (2020).
[57] D. D. Johnson, Generating polyphonic music using tied parallel networks, in: International
     conference on evolutionary and biologically inspired music and art, Springer, 2017, pp. 128–143.
[58] Y. Wu, E. Manilow, Y. Deng, R. Swavely, K. Kastner, T. Cooijmans, A. Courville, C.-Z. A. Huang, J. En-
     gel, Midi-ddsp: Detailed control of musical performance via hierarchical modeling, in: ICLR, 2021.
[59] M. Prang, P. Esling, Signal-domain representation of symbolic music for learning embedding spaces,
     in: AIMC, 2020. URL: https://doi.org/10.5281/zenodo.4285386. doi:10.5281/zenodo.4285386.
[60] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013).
[61] S. Colton, A. Smith, B. Pérez Ferrer, S. Berns, Artist discovery with stable evolusion, in: ICCC, 2023.
[62] J. Schmidhuber, Art & science as by-products of the search for novel patterns, or data compressible
     in unknown yet learnable ways, Multiple ways to design research. Research cases that reshape
     the design discipline, Swiss Design Network-Et al. Edizioni (2009) 98–112.
[63] J. C. Brant, K. O. Stanley, Minimal criterion coevolution: a new approach to open-ended search,
     in: GECCO, 2017, pp. 67–74.
[64] P. M. Todd, G. M. Werner, Frankensteinian methods for evolutionary music composition, Musical
     networks: Parallel distributed perception and performance 3 (1999) 7.
[65] B. Andrus, N. Fulda, A data-driven architecture for social behavior in creator networks, in: ICCC,
     2022, pp. 339–348.
[66] R. Loughran, M. O’Neill, Generative music evaluation: why do we limit to ‘human’, in: Proceedings
     of the first Conference on Computer Simulation of Musical Creativity (CSMC), 2016.
[67] C. Guckelsberger, C. Salge, R. Saunders, S. Colton, Supportive and antagonistic behaviour in dis-
     tributed computational creativity via coupled empowerment maximisation, in: ICCC, 2016, pp. 9–16.
[68] A. Elgammal, B. Liu, M. Elhoseiny, M. Mazzone, Can: Creative adversarial networks, generating "art"
     by learning about styles and deviating from style norms, arXiv preprint arXiv:1706.07068 (2017).
[69] A. Chemla–Romeu-Santos, P. Esling, Challenges in creative generative models for music: a
     divergence maximization perspective, in: AIMC, 2022. URL: https://doi.org/10.5281/zenodo.7088272.
     doi:10.5281/zenodo.7088272.
[70] A. Chemla–Romeu-Santos, P. Esling, Creative divergent synthesis with generative models, arXiv
     preprint arXiv:2211.08861 (2022).
[71] M. Abadi, Tensorflow: learning functions at scale, in: Proceedings of the 21st ACM SIGPLAN
     international conference on functional programming, 2016, pp. 1–1.
[72] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
     L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, Advances
     in neural information processing systems 32 (2019).