The C@merata Task at MediaEval 2014: Natural Language Queries on Classical Music Scores

Richard Sutcliffe, School of CSEE, University of Essex, Colchester, UK, rsutcl@essex.ac.uk
Tim Crawford, Department of Computing, Goldsmiths, University of London, London, UK, t.crawford@gold.ac.uk
Chris Fox, School of CSEE, University of Essex, Colchester, UK, foxcj@essex.ac.uk
Deane L. Root, Department of Music, University of Pittsburgh, Pittsburgh, PA, USA, dlr@pitt.edu
Eduard Hovy, Language Technologies Institute, Carnegie-Mellon University, Pittsburgh, PA, USA, hovy@cmu.edu

ABSTRACT
This paper summarises the C@merata task, in which participants built systems to answer short natural language queries about classical music scores in MusicXML. The task thus combined natural language processing with music information retrieval. Five groups from four countries submitted eight runs. The best submission scored Beat Precision 0.713 and Beat Recall 0.904.

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

1. INTRODUCTION
A text-based Question Answering (QA) system takes as input a short natural language query together with a document collection, and produces in response an exact answer [1]. There has been considerable progress in the development of such systems over the last ten years. At the same time, Music Information Retrieval (MIR) has been an active field for more than a decade. However, until now, there has been little or no work which draws these two fields together. The key aim of the C@merata evaluation, therefore, was to formulate a task which combines simple QA with MIR, working with Western classical art music.

In C@merata (Cl@ssical Music Extraction of Relevant Aspects by Text Analysis), participants were provided with a series of short questions referring to musical features of a corresponding score in MusicXML. The task was to identify the locations of all such features. Five groups participated. Submitted runs were evaluated automatically by reference to a gold standard prepared by the organisers.

2. APPROACH

2.1 The C@merata Task
The task consists of a series of questions with required answers:

Provided Question:
• A short noun phrase in English referring to musical features in a score,
• A short classical music score in MusicXML.

Required Answer:
• The location(s) in the score of the requested musical feature.

t: followed_by
q: G sharp followed by B
s: corelli_allegro_tr_clementi.xml
[ 4/4, 4, 2:5-2:6 ]
[ 4/4, 4, 5:15-5:16 ]
[ 4/4, 4, 8:11-8:12 ]
[ 4/4, 4, 19:1-19:2 ]
Figure 1. Score Extract and Example Question

Figure 1 shows a score extract and a corresponding question and answer. The type of the query is followed_by, which in this case requires G# to be followed by B. There are four answer passages, the first being [ 4/4, 4, 2:5-2:6 ]. The time signature is 4/4. The following 4 is the divisions value, meaning that we count in semiquavers (sixteenth notes). The passage starts in bar (measure) 2 at the fifth semiquaver and ends after the sixth semiquaver. In the task, participants are provided with the question, the score and the divisions value. They must return the answer passages. For full details of the task, see [2].

2.2 Music Scores
The music for the task was chosen from works by well-known composers active in the Renaissance and Baroque periods. The MusicXML format was chosen because it is widely used, it is relatively simple and it can capture most important aspects of a score.

For the test collection there were twenty scores, with ten questions being set for each. Scores were on one, two, three, four or five staves according to a prescribed distribution. Instrumentation was typically voices (e.g. SATB, SSA, etc.), Harpsichord, Lute, or Violin and Harpsichord, etc.

2.3 Evaluation Metrics
We adapted the well-known Precision and Recall metrics of Cyril Cleverdon, which are universally used in NLP and IR. We say a passage is Beat Correct if it starts in the correct bar (measure) and at the right beat offset and it ends in the correct bar and at the right beat offset. Conversely, a passage is Measure Correct if it starts in the correct bar and ends in the correct bar.

We define Beat Precision as the number of beat-correct passages returned by a system divided by the number of passages (correct or incorrect) returned. Similarly, Beat Recall is the number of beat-correct passages returned by a system divided by the total number of answer passages in the Gold Standard. Correspondingly, Measure Precision is the number of measure-correct passages returned by a system divided by the number of passages (correct or incorrect) returned, and Measure Recall is the number of measure-correct passages returned by a system divided by the total number of answer passages.
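To make the relationship between these four measures concrete, the following minimal Python sketch scores a hypothetical run against a hypothetical Gold Standard. It is not the official evaluation script: answer passages such as [ 4/4, 4, 2:5-2:6 ] are simplified here to (start bar, start offset, end bar, end offset) tuples, the time signature and divisions value are ignored, and duplicate matches are not handled.

    # Minimal sketch of the metrics in Section 2.3 (not the official scorer).
    # A passage such as [ 4/4, 4, 2:5-2:6 ] is reduced to the tuple
    # (2, 5, 2, 6): start bar, start offset, end bar, end offset.

    def beat_correct(sys_p, gold_p):
        # Beat Correct: right bars AND right beat offsets at both ends.
        return sys_p == gold_p

    def measure_correct(sys_p, gold_p):
        # Measure Correct: right start bar and right end bar only.
        return (sys_p[0], sys_p[2]) == (gold_p[0], gold_p[2])

    def precision_recall(sys_passages, gold_passages, correct):
        # Precision = correct passages returned / all passages returned.
        # Recall    = correct passages returned / all Gold Standard passages.
        hits = [s for s in sys_passages
                if any(correct(s, g) for g in gold_passages)]
        p = len(hits) / len(sys_passages) if sys_passages else 0.0
        r = len(hits) / len(gold_passages) if gold_passages else 0.0
        return p, r

    # Hypothetical system output and Gold Standard for one question.
    system = [(2, 5, 2, 6), (5, 15, 5, 16), (8, 10, 8, 12)]
    gold   = [(2, 5, 2, 6), (5, 15, 5, 16), (8, 11, 8, 12), (19, 1, 19, 2)]

    print(precision_recall(system, gold, beat_correct))     # (0.667, 0.5)
    print(precision_recall(system, gold, measure_correct))  # (1.0, 0.75)

The third system passage illustrates the difference between the two notions of correctness: its start offset is wrong, so it counts towards the Measure figures but not towards the Beat figures.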
2.4 Test Queries
200 test queries were drawn up, based on twenty scores with ten questions being asked on each. American terminology (e.g. quarter note) was used for ten scores and English terminology (e.g. crotchet) for the other ten. Queries were devised in twelve different types according to a prescribed distribution, as shown in Table 1, which also gives an example of each type. The Gold Standard answers were drawn up by the first author and then each file was carefully checked by one of the other authors.

Table 1. Query Types
Type                No    Example
simple_pitch        30    G5
simple_length       30    dotted quarter note
pitch_and_length    30    D# crotchet
perf_spec           10    D sharp trill
stave_spec          20    D4 in the right hand
word_spec            5    word "Se" on an A flat
followed_by         30    crotchet followed by semibreve
melodic_interval    19    melodic octave
harmonic_interval   11    harmonic major sixth
cadence_spec         5    perfect cadence
triad_spec           5    tonic triad
texture_spec         5    polyphony
All                200

3. RESULTS AND DISCUSSION

3.1 Participation and Runs
Five groups from four countries (Table 2) submitted eight runs (Table 3), which were evaluated automatically using Beat Precision (BP), Beat Recall (BR), Measure Precision (MP) and Measure Recall (MR). BP and BR are much stricter, since the exact passage must be specified. However, MP and MR are also included because in practical contexts it is often sufficient to know the bar numbers: the required feature can usually be spotted very quickly by an expert.

Table 2. C@merata Participants
Runtag    Leader             Affiliation                  Country
CLAS      Stephen Wan        CSIRO                        Australia
DMUN      Tom Collins        De Montfort University       England
OMDN      Donncha Ó Maidín   University of Limerick       Ireland
TCSL      Nikhil Kini        Tata Consultancy Services    India
UNLP      Kartik Asooja      NUI Galway                   Ireland

Table 3. Results: CLAS01 is the best run, LACG01 is the baseline run
Run       BP      BR      MP      MR
CLAS01    0.713   0.904   0.764   0.967
DMUN01    0.372   0.712   0.409   0.784
DMUN02    0.380   0.748   0.417   0.820
DMUN03    0.440   0.868   0.462   0.910
LACG01    0.135   0.101   0.188   0.142
OMDN01    0.415   0.150   0.424   0.154
TCSL01    0.633   0.821   0.652   0.845
UNLP01    0.113   0.516   0.155   0.703
UNLP02    0.290   0.512   0.393   0.692

Results were generally very good. The best was CLAS01, with Beat Precision 0.713 and Beat Recall 0.904. Moreover, almost all runs beat the baseline run LACG01, which was prepared with the Baseline System distributed to all participants at the start [3]. Questions were intentionally easy, as there were many unknown aspects of the task which had to be worked out by participants and organisers alike.

3.2 How the Task was Approached
Most participants used hand-crafted dictionaries and string processing to analyse the queries, rather than parsing. Generally, participants converted the score into feature information, extracted the required features from the query, and then matched the two. Some worked with Python and Music21, while others adapted their own pre-existing systems in C++ and Common Lisp.
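As a concrete illustration of that recipe, the sketch below uses the music21 toolkit to turn a MusicXML file into a flat list of note features and then answer a simple_pitch query. It is only a sketch under assumptions: it is not any participant's system, nor the distributed Baseline System; the function names, file name and query are illustrative; chords are skipped; and offsets are left in music21 quarter lengths rather than converted into the divisions-based offsets of the answer format in Section 2.1.

    # Sketch of the feature-extraction-and-matching recipe (illustrative only).
    from music21 import converter

    def note_features(musicxml_path):
        """Return one (bar, offset, pitch, duration) tuple per note in the score."""
        score = converter.parse(musicxml_path)
        features = []
        for part in score.parts:
            for measure in part.getElementsByClass('Measure'):
                for n in measure.notes:
                    if n.isNote:  # skip chords for simplicity
                        features.append((measure.number,
                                         n.offset,                # quarter lengths from the barline
                                         n.pitch.nameWithOctave,  # e.g. 'G5'
                                         n.duration.quarterLength))
        return features

    def find_pitch(features, pitch_name):
        """Answer a simple_pitch query such as 'G5': (bar, offset) of each match."""
        return [(bar, offset) for bar, offset, pitch, _ in features
                if pitch == pitch_name]

    # Hypothetical usage; the score file name is taken from Figure 1.
    # matches = find_pitch(note_features('corelli_allegro_tr_clementi.xml'), 'G5')

Handling the other query types amounts to matching richer patterns over the same feature list (a followed_by query, for instance, inspects two consecutive tuples) and converting the matched positions into the bar-and-offset answer format described in Section 2.1.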
4. CONCLUSIONS
This was a new task at MediaEval and indeed we know of no other work combining NLP and MIR in this way. Many technical details had to be resolved, which sometimes took us to the limits of Western classical music notation. A lot was learned from the exercise, both about evaluation (e.g. in devising versions of Precision and Recall to use) and about music (e.g. where does a cadence begin and end?). A future task could tackle a wider range of questions involving more complicated natural language structures, as well as addressing some loose ends in the task design.

5. REFERENCES
[1] Sutcliffe, R., Peñas, A., Hovy, E., Forner, P., Rodrigo, A., Forascu, C., Benajiba, Y. and Osenova, P. 2013. Overview of QA4MRE Main Task at CLEF 2013. Proceedings of QA4MRE-2013.
[2] Sutcliffe, R., Crawford, T., Hovy, E., Root, D.L. and Fox, C. 2014. Task Description v7: C@merata 14: Question Answering on Classical Music Scores. http://csee.essex.ac.uk/camerata.
[3] Sutcliffe, R. 2014. A Description of the C@merata Baseline System in Python 2.7 for Answering Natural Language Queries on MusicXML Scores. University of Essex Technical Report, 21st May, 2014.