TunesFormer: Forming Irish Tunes with Control Codes by Bar Patching

Shangda Wu (shangda@mail.ccom.edu.cn)

Xiaobing Li (lxiaobing@ccom.edu.cn)

Feng Yu (yufeng@ccom.edu.cn)

Maosong Sun

Department of Computer Science and Technology, Tsinghua University, Beijing, China

Department of Music AI and Information Technology, Central Conservatory of Music, Beijing, China

HCMIR23: 2nd Workshop on Human-Centric Music Information Research

This paper introduces TunesFormer, an efficient Transformer-based dual-decoder model specifically designed for the generation of melodies that adhere to user-defined musical forms. Trained on 214,122 Irish tunes, TunesFormer utilizes techniques including bar patching and control codes. Bar patching reduces sequence length and generation time, while control codes guide TunesFormer in producing melodies that conform to desired musical forms. Our evaluation demonstrates TunesFormer's superior efficiency, being 3.22 times faster than GPT-2 and 1.79 times faster than a model with linear complexity of equal scale while offering comparable performance in controllability and other metrics. TunesFormer provides a novel tool for musicians, composers, and music enthusiasts alike to explore the vast landscape of Irish music. Our model and code are available at GitHub.

Keywords: Irish music, melody generation, control codes, bar patching, dual-decoder architecture

1. Introduction

[Figure 1: TunesFormer's dual-decoder design. Bar patches from an example tune (L:1/8, M:4/4, K:Emin, |: F | G2 FG BGFE |]) are flattened and linearly projected, combined with position embeddings, and passed to the patch-level Transformer decoder. The resulting patch features are passed to the character-level Transformer decoder, which produces the shifted character outputs.]

The key contributions of this paper are as follows:
• As a dual-decoder model based on bar patching, TunesFormer significantly accelerates generation while maintaining the quality of the generated music.
• TunesFormer enables users to generate melodies with diverse musical forms, providing flexibility and alignment with artistic vision through control codes.
• To support future research, we release the Irish Massive ABC Notation (IrishMAN) dataset, an open-source collection of 216,284 Irish tunes in the ABC notation format.

2. Methodology

2.1. TunesFormer

TunesFormer uses bar patching [14] for melody generation, building on the ABC notation format, which is well suited to representing Irish music. Bar patching divides scores into segments, such as bars, which shortens sequences and improves efficiency without sacrificing musical integrity.
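As a rough illustration of how a tune could be divided into bar patches, the Python sketch below splits an ABC string into one patch per header field and one patch per bar. The function names, the rule that an opening symbol such as `|:` stays attached to the bar it opens, and the treatment of each header line as its own patch are assumptions made for this example. They are not taken from the released implementation.

```python
import re

# Bar-line symbols in ABC notation. Longer symbols come first so that
# "|]" is matched as one symbol rather than "|" followed by "]".
BARLINE = re.compile(r"\[\||\|\]|\|\||\|:|::|:\||\|")


def split_body_into_bars(body: str) -> list[str]:
    """Split one line of tune body into bars (hypothetical helper)."""
    bars, start = [], 0
    for match in BARLINE.finditer(body):
        # Close a patch only when something precedes this bar line, so an
        # opening symbol such as "|:" stays attached to the bar it opens.
        if body[start:match.start()].strip():
            bars.append(body[start:match.end()].strip())
            start = match.end()
    tail = body[start:].strip()
    if tail:
        bars.append(tail)
    return bars


def to_bar_patches(abc: str) -> list[str]:
    """Assumed patching rule: each header field and each bar is one patch."""
    patches = []
    for line in abc.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.match(r"[A-Za-z]:", line):  # header field such as "M:4/4"
            patches.append(line)
        else:
            patches.extend(split_body_into_bars(line))
    return patches


tune = "L:1/8\nM:4/4\nK:Emin\n|: F | G2 FG BGFE |]"
patches = to_bar_patches(tune)
print(patches)
print(len(tune), "characters ->", len(patches), "patches")
```

The point of the sketch is the change in sequence length: the patch-level model sees one position per patch instead of one position per character.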

Fig. 1 shows TunesFormer's dual-decoder design. Bar patches are converted into embeddings and fed to the patch-level decoder, which produces patch features. These features are then fed to the character-level decoder, which translates them into ABC notation sequences.

Given $T$ as sequence length and $P$ as patch size, bar patching reduces the patch-level decoder complexity from $O(T^2)$ to $O(T^2/P^2)$. Meanwhile, the character-level decoder complexity becomes $O(TP)$. Considering $N_p$ and $N_c$ as parameter sizes for the patch-level and character-level decoders respectively, the computational need shifts from $(N_p + N_c) \cdot T^2$ to $N_p \cdot (T^2/P^2) + N_c \cdot TP$. This is particularly advantageous for large sequences, high $N_p$ to $N_c$ ratios, and optimal choices of $P$.
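To make the scaling argument concrete, the short Python sketch below evaluates the three attention terms for the sequence length (4096) and patch size (32) reported for the implementation. The function and variable names are chosen for this example, and the parameter-size weights of the two decoders are left out because they are not stated separately. The numbers are therefore term counts, not a prediction of measured speed.

```python
def attention_costs(seq_len: int, patch_size: int) -> tuple[int, int, int]:
    """Return the quadratic attention terms discussed in the text.

    full:        one decoder attending over every character
    patch_level: attention over seq_len / patch_size patches
    char_level:  attention inside each patch, summed over all patches
    """
    num_patches = seq_len // patch_size
    full = seq_len ** 2
    patch_level = num_patches ** 2
    char_level = num_patches * patch_size ** 2  # equals seq_len * patch_size
    return full, patch_level, char_level


full, patch_level, char_level = attention_costs(seq_len=4096, patch_size=32)
print("patches per sequence:", 4096 // 32)
print("full attention term: ", full)
print("patch-level term:    ", patch_level)
print("character-level term:", char_level)
```

With these values the character-level term is the larger of the two, which is one reason the choice of patch size matters.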

In our implementation, the sequence length is $T = 4096$ and the patch size is $P = 32$, yielding a patch-level sequence length of 128. The patch-level decoder has 9 layers and the character-level decoder has 3, both with a hidden size of 768.

2.2. Control Codes

Inspired by CTRL [15], TunesFormer integrates control codes to denote musical forms. These codes precede the ABC notation, letting users dictate tune structures. The codes are:
• S:number of sections - Sets the number of sections in the melody, from 1 to 8 (e.g., S:1 for a single-section melody, and S:8 for a melody with eight sections). Sections are counted using the symbols that mark section boundaries: [|, ||, |], |:, ::, and :|.
• B:number of bars - Sets the number of bars within a section, counted using the bar symbol |. The range is 1 to 32 (e.g., B:1 for a one-bar section, and B:32 for a section with 32 bars).
• E:edit distance similarity - Controls the similarity between the current section $c$ and a previous section $p$. It is derived from the Levenshtein distance [16] $\mathrm{lev}(c, p)$ and measures how much the two sections differ:

$$\mathrm{eds}(c, p) = 1 - \frac{\mathrm{lev}(c, p)}{\max(|c|, |p|)} \tag{1}$$

where $|c|$ and $|p|$ are the string lengths of the two sections. It is discretized into 11 levels, ranging from 0 to 10 (e.g., E:0 for no similarity, and E:10 for an exact match). For the $n$-th section, there are $n - 1$ previous sections to compare with.
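The edit distance similarity can be sketched in a few lines of Python. The code below computes the Levenshtein distance with the standard dynamic-programming recurrence, normalizes it by the longer section length, and maps the result to one of 11 levels. The function names and the rounding rule used for discretization are assumptions for illustration, since the text does not state how the levels are assigned.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def edit_distance_similarity(section: str, previous: str) -> float:
    """1 minus the Levenshtein distance over the longer section length."""
    longest = max(len(section), len(previous))
    if longest == 0:
        return 1.0  # two empty sections are treated as identical
    return 1 - levenshtein(section, previous) / longest


def similarity_level(section: str, previous: str) -> int:
    """Assumed discretization: round the similarity to a level in 0..10."""
    return round(10 * edit_distance_similarity(section, previous))


first = "G2 FG BGFE"
second = "G2 FG BGFD"  # differs from `first` in the last note only
print(levenshtein(first, second))
print(f"E:{similarity_level(first, second)}")
```

An identical pair of sections maps to E:10 and a pair with no characters in common maps to E:0, matching the end points described above.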

While earlier methods relied on hand-crafted rules or limited training data [8, 17], our control codes extract precise musical form information directly from ABC notation, which makes it possible to leverage large datasets to improve understanding of musical structures.

2.3. Dataset

The IrishMAN dataset² contains 216,284 Irish tunes in ABC notation, sourced from thesession.org and abcnotation.com. Of these, 99% (214,122) are used for training and 1% (2,162) for validation. To keep the format uniform, tunes were converted to XML and back using scripts³, and natural language fields were removed.

Each tune carries control codes, derived from its ABC symbols (Section 2.2), that indicate its musical form. A subset filtered with music21 [18] contains 34,211 human-annotated lead sheets. This subset helped TunesFormer generate harmonized melodies. In addition, all tunes are in the public domain, ensuring ethical and legal use for research and creative projects.

3. Experiments

In the experiments, we used the following baselines: LSTM [9], which has been applied to generating ABC notation; GPT-2 [19], which has been applied to music generation [12, 13]; and RWKV [20], which rivals Transformers in performance. All models were trained on the same IrishMAN dataset split with character-level ABC tokenization, and random sampling was used for decoding. The evaluation involved two objective metrics computed on 1,000 tunes generated from scratch per model.

² https://huggingface.co/datasets/sander-wood/irishman
³ https://wim.vree.org/svgParse/

We used comparative evaluation because individual human ratings are inconsistent. Thirteen Irish musicians compared pairs of melodies: one taken from thesession.org with chord symbols, and one generated by a model as a continuation of the initial two bars. Tune choice and order were randomized to avoid bias. Participants selected the melody that best matched each of the following descriptions:
• Engagement: Captivating to the ear, evokes emotional resonance, and maintains the listener’s interest.
• Authenticity: Representing the distinctive characteristics of Irish traditional music.
• Harmoniousness: Creating a natural flow that unifies melody and harmony into a cohesive and pleasing musical experience.
• Playability: Well-suited for performance and offers a wide range of playing techniques.

Participants chose between three options for each melody pair: 0 for human-composed, 1 for model-generated, and 0.5 for no preference. Thus, scores ranged from 0 to 1. Participants were instructed to skip melodies they were already familiar with to avoid bias.
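The scoring scheme can be illustrated with a small Python sketch. It assumes that the per-criterion score of a model is the plain average of the participants' choices, which the text does not state explicitly. The choices in the example are made up purely to show the arithmetic.

```python
from statistics import mean

# Choice values described in the text.
HUMAN, NO_PREFERENCE, MODEL = 0.0, 0.5, 1.0


def preference_score(choices: list[float]) -> float:
    """Average the per-pair choices (assumed aggregation rule).

    A score of 0.5 means the model's melodies were preferred as often
    as the human-composed ones.
    """
    for choice in choices:
        if choice not in (HUMAN, NO_PREFERENCE, MODEL):
            raise ValueError(f"unexpected choice value: {choice}")
    return mean(choices)


# Hypothetical choices for one criterion, not data from the study.
example_choices = [MODEL, HUMAN, NO_PREFERENCE, MODEL]
print(preference_score(example_choices))
```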

Table 1 shows the evaluation of music generation models. TunesFormer, with 88,425,984 parameters and a Transformer base, is 3.22 times faster than GPT-2 and 1.79 times faster than RWKV. Its dual-decoder architecture focuses on character generation, explaining its efficiency despite its large size. It is worth highlighting that TunesFormer’s efficiency does not come at the expense of its performance. Particularly noteworthy is its remarkable controllability, matching the highest scores achieved in authenticity and playability. The performance is enhanced by the interaction between the patch-level and character-level decoders, where the former contextualizes bar features, enabling the latter to create coherent compositions. In essence, TunesFormer’s dual-decoder design boosts efficiency in melody generation without sacrificing quality, and shows a significant advantage over its competitors in the field.

4. Conclusions

This paper presents TunesFormer, a model that generates melodies using control codes and bar patching. The use of control codes enhances user interaction, enabling personalized and customizable music generation. The dual-decoder architecture employed by TunesFormer, combined with its bar patching mechanism, yields significant improvements in generation speed without compromising the quality of the generated music. Future directions include incorporating more musical features and applying TunesFormer to various cultural traditions.

Acknowledgments

The authors gratefully acknowledge the financial support from the Special Program of National Natural Science Foundation of China (Grant No. T2341003), the Advanced Discipline Construction Project of Beijing Universities, the Major Program of National Social Science Fund of China (Grant No. 21ZD19), and the Nation Culture and Tourism Technological Innovation Engineering Project (Research and Application of 3D Music).

References

[1] Chen, Zhang, S. Dubnov, G. Xia, The effect of explicit structure encoding of deep neural networks for symbolic music generation, CoRR (2018). arXiv:1811.08380.

[2] Guo, Makris, D. Herremans, Hierarchical recurrent neural networks for conditional melody generation with long-term structure, in: International Joint Conference on Neural Networks, IJCNN 2021, Shenzhen, China, July 18-22, 2021, IEEE, 2021. doi:10.1109/IJCNN52387.2021.9533493.

[3] Naruse, Takahata, Mukuta, T. Harada, Pop music generation with controllable phrase lengths, in: Proc. of the 23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022.

[4] Zhang, J. Zhang, Qiu, Wang, Zhou, Structure-enhanced pop music generation via harmony-aware learning, in: MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10-14, 2022, ACM, 2022. doi:10.1145/3503161.3548084.

[5] Wu, Liu, Hu, Zhu, Popmnet: Generating structured pop music melodies using neural networks, Artif. Intell. (2020). doi:10.1016/j.artint.2020.103303.

[6] Zou, Zhao, Zhang, Wang, Melons: generating melody with long-term structure using transformers and structure graph, 2021. URL: https://arxiv.org/abs/2110.05020. doi:10.48550/ARXIV.2110.05020.

[7] Dai, Jin, Gomes, R. B. Dannenberg, Controllable deep melody generation via hierarchical music structure representation, in: Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7-12, 2021, 2021.

[8] Lu, Tan, Yu, Qin, Zhao, T. Liu, Meloform: Generating melody with musical form based on expert systems and neural networks, CoRR (2022). arXiv:2208.14345.

[9] B. L. Sturm, J. F. Santos, Ben-Tal, I. Korshunova, Music transcription modelling and composition using deep learning, CoRR (2016). arXiv:1604.08723.

[10] Geerlings, Merono-Penuela, Interacting with gpt-2 to generate controlled and believable musical sequences in abc notation, in: Proceedings of the 1st Workshop on NLP for Music and Audio (NLP4MusA), 2020.

[11] Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A. N. Gomez, Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017.

[12] C. A. Huang, Vaswani, Uszkoreit, I. Simon, Hawthorne, Shazeer, A. M. Dai, M. D. Hoffman, Dinculescu, Eck, Music transformer: Generating music with long-term structure, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019.

[13] Wu, M. Sun, Exploring the efficacy of pre-trained checkpoints in text-to-music generation task, in: The AAAI-23 Workshop on Creative AI Across Modalities, 2023. URL: https://openreview.net/forum?id=QmWXskBhesn.

[14] Wu, Yu, Tan, Sun, Clamp: Contrastive language-music pre-training for cross-modal symbolic music information retrieval, CoRR (2023). URL: https://doi.org/10.48550/arXiv.2304.11029. doi:10.48550/arXiv.2304.11029. arXiv:2304.11029.

[15] N. S. Keskar, McCann, L. R. Varshney, Xiong, Socher, CTRL: A conditional transformer language model for controllable generation, CoRR (2019). URL: http://arxiv.org/abs/1909.05858. arXiv:1909.05858.

[16] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet physics doklady, Soviet Union, 1966.

[17] Jiang, Chin, Zhang, G. Xia, Learning hierarchical metrical structure beyond measures, in: Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022, Bengaluru, India, December 4-8, 2022, 2022. URL: https://archives.ismir.net/ismir2022/paper/000023.pdf.

[18] M. S. Cuthbert, C. Ariza, Music21: A toolkit for computer-aided musicology and symbolic music data, International Society for Music Information Retrieval, 2010.

[19] Radford, J. Wu, Child, Luan, Amodei, Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog (2019).

[20] Peng, Alcaide, Anthony, Albalak, Arcadinho, Cao, X. Cheng, M. Chung, Grella, K. K. G. V., X. He, H. Hou, P. Kazienko, J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom, A. Saito, X. Tang, B. Wang, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, J. Zhu, R. Zhu, RWKV: reinventing rnns for the transformer era, CoRR (2023). URL: https://doi.org/10.48550/arXiv.2305.13048. doi:10.48550/arXiv.2305.13048. arXiv:2305.13048.