The C@merata Task at MediaEval 2014: Natural Language Queries on Classical Music Scores

Richard Sutcliffe, School of CSEE, University of Essex, Colchester, UK, rsutcl@essex.ac.uk
Tim Crawford, Department of Computing, Goldsmiths, University of London, London, UK, t.crawford@gold.ac.uk
Chris Fox, School of CSEE, University of Essex, Colchester, UK, foxcj@essex.ac.uk
Deane L. Root, Department of Music, University of Pittsburgh, Pittsburgh, PA, USA, dlr@pitt.edu
Eduard Hovy, Language Technologies Institute, Carnegie-Mellon University, Pittsburgh, PA, USA, hovy@cmu.edu

ABSTRACT
This paper summarises the C@merata task, in which participants built systems to answer short natural language queries about classical music scores in MusicXML. The task thus combined natural language processing with music information retrieval. Five groups from four countries submitted eight runs. The best submission scored Beat Precision 0.713 and Beat Recall 0.904.

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

1. INTRODUCTION
A text-based Question Answering (QA) system takes as input a short natural language query together with a document collection, and produces in response an exact answer [1]. There has been considerable progress in the development of such systems over the last ten years. At the same time, Music Information Retrieval (MIR) has been an active field for more than a decade. However, until now, there has been little or no work which draws these two fields together. The key aim of the C@merata evaluation, therefore, was to formulate a task which combines simple QA with MIR, working with Western classical art music.

In C@merata (Cl@ssical Music Extraction of Relevant Aspects by Text Analysis), participants were provided with a series of short questions referring to musical features of a corresponding score in MusicXML. The task was to identify the locations of all such features. Five groups participated. Submitted runs were evaluated automatically by reference to a gold standard prepared by the organisers.

2. APPROACH

2.1 The C@merata Task
The task consists of a series of questions with required answers:

Provided Question:
• A short noun phrase in English referring to musical features in a score,
• A short classical music score in MusicXML.

Required Answer:
• The location(s) in the score of the requested musical feature.

t: followed_by
q: G sharp followed by B
s: corelli_allegro_tr_clementi.xml
[ 4/4, 4, 2:5-2:6 ]
[ 4/4, 4, 5:15-5:16 ]
[ 4/4, 4, 8:11-8:12 ]
[ 4/4, 4, 19:1-19:2 ]
Figure 1. Score Extract and Example Question

Figure 1 shows a score extract and a corresponding question and answer. The type of the query is followed_by, which in this case requires G# to be followed by B. There are four answer passages, the first being [ 4/4, 4, 2:5-2:6 ]. The time signature is 4/4. The following 4 is the divisions value, meaning that we count in semiquavers (sixteenth notes). The passage starts in bar (measure) 2 at the fifth semiquaver and ends after the sixth semiquaver. In the task, participants are provided with the question, the score and the divisions value. They must return the answer passages. For full details of the task, see [2].

2.2 Music Scores
The music for the task was chosen from works by well-known composers active in the Renaissance and Baroque periods. The MusicXML format was chosen because it is widely used, it is relatively simple and it can capture most important aspects of a score.

For the test collection there were twenty scores, with ten questions being set for each. Scores were on one, two, three, four or five staves according to a prescribed distribution. Instrumentation was typically voices (e.g. SATB, SSA, etc.), Harpsichord, Lute, or Violin and Harpsichord, etc.

2.3 Evaluation Metrics
We adapted the well-known Precision and Recall metrics of Cyril Cleverdon, which are universally used in NLP and IR. We say a passage is Beat Correct if it starts in the correct bar (measure) and at the right beat offset and it ends in the correct bar and at the right beat offset. Conversely, a passage is Measure Correct if it starts in the correct bar and ends in the correct bar.

We define Beat Precision as the number of beat-correct passages returned by a system divided by the number of passages (correct or incorrect) returned. Similarly, Beat Recall is the number of beat-correct passages returned by a system divided by the total number of answer passages in the Gold Standard. Correspondingly, Measure Precision is the number of measure-correct passages returned by a system divided by the number of passages (correct or incorrect) returned, and Measure Recall is the number of measure-correct passages returned by a system divided by the total number of answer passages.
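To make the relationship between these four measures concrete, the following minimal Python sketch scores a hypothetical run against a hypothetical Gold Standard. It is not the official evaluation script: answer passages such as [ 4/4, 4, 2:5-2:6 ] are simplified here to (start bar, start offset, end bar, end offset) tuples, the time signature and divisions value are ignored, and duplicate matches are not handled.

    # Minimal sketch of the metrics in Section 2.3 (not the official scorer).
    # A passage such as [ 4/4, 4, 2:5-2:6 ] is reduced to the tuple
    # (2, 5, 2, 6): start bar, start offset, end bar, end offset.

    def beat_correct(sys_p, gold_p):
        # Beat Correct: right bars AND right beat offsets at both ends.
        return sys_p == gold_p

    def measure_correct(sys_p, gold_p):
        # Measure Correct: right start bar and right end bar only.
        return (sys_p[0], sys_p[2]) == (gold_p[0], gold_p[2])

    def precision_recall(sys_passages, gold_passages, correct):
        # Precision = correct passages returned / all passages returned.
        # Recall    = correct passages returned / all Gold Standard passages.
        hits = [s for s in sys_passages
                if any(correct(s, g) for g in gold_passages)]
        p = len(hits) / len(sys_passages) if sys_passages else 0.0
        r = len(hits) / len(gold_passages) if gold_passages else 0.0
        return p, r

    # Hypothetical system output and Gold Standard for one question.
    system = [(2, 5, 2, 6), (5, 15, 5, 16), (8, 10, 8, 12)]
    gold   = [(2, 5, 2, 6), (5, 15, 5, 16), (8, 11, 8, 12), (19, 1, 19, 2)]

    print(precision_recall(system, gold, beat_correct))     # (0.667, 0.5)
    print(precision_recall(system, gold, measure_correct))  # (1.0, 0.75)

The third system passage illustrates the difference between the two notions of correctness: its start offset is wrong, so it counts towards the Measure figures but not towards the Beat figures.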
2.4 Test Queries
200 test queries were drawn up, based on twenty scores with ten questions being asked on each. American terminology (e.g. quarter note) was used for ten scores and English terminology (e.g. crotchet) for the other ten. Queries were devised in twelve different types according to a prescribed distribution, as shown in Table 1, which also gives an example of each type. The Gold Standard answers were drawn up by the first author and then each file was carefully checked by one of the other authors.

Table 1. Query Types
Type                No    Example
simple_pitch        30    G5
simple_length       30    dotted quarter note
pitch_and_length    30    D# crotchet
perf_spec           10    D sharp trill
stave_spec          20    D4 in the right hand
word_spec            5    word "Se" on an A flat
followed_by         30    crotchet followed by semibreve
melodic_interval    19    melodic octave
harmonic_interval   11    harmonic major sixth
cadence_spec         5    perfect cadence
triad_spec           5    tonic triad
texture_spec         5    polyphony
All                200

3. RESULTS AND DISCUSSION

3.1 Participation and Runs
Five groups from four countries (Table 2) submitted eight runs (Table 3), which were evaluated automatically using Beat Precision (BP), Beat Recall (BR), Measure Precision (MP) and Measure Recall (MR). BP and BR are much stricter, since the exact passage must be specified. However, MP and MR are also included because in practical contexts it is often sufficient to know the bar numbers: the required feature can usually be spotted very quickly by an expert.

Table 2. C@merata Participants
Runtag    Leader             Affiliation                  Country
CLAS      Stephen Wan        CSIRO                        Australia
DMUN      Tom Collins        De Montfort University       England
OMDN      Donncha Ó Maidín   University of Limerick       Ireland
TCSL      Nikhil Kini        Tata Consultancy Services    India
UNLP      Kartik Asooja      NUI Galway                   Ireland

Table 3. Results: CLAS01 is the best run, LACG01 is the baseline run
Run       BP      BR      MP      MR
CLAS01    0.713   0.904   0.764   0.967
DMUN01    0.372   0.712   0.409   0.784
DMUN02    0.380   0.748   0.417   0.820
DMUN03    0.440   0.868   0.462   0.910
LACG01    0.135   0.101   0.188   0.142
OMDN01    0.415   0.150   0.424   0.154
TCSL01    0.633   0.821   0.652   0.845
UNLP01    0.113   0.516   0.155   0.703
UNLP02    0.290   0.512   0.393   0.692

Results were generally very good. The best was CLAS01, with Beat Precision 0.713 and Beat Recall 0.904. Moreover, almost all runs beat the baseline run LACG01, which was prepared with the Baseline System distributed to all participants at the start [3]. Questions were intentionally easy, as there were many unknown aspects of the task which had to be worked out by participants and organisers alike.

3.2 How the Task was Approached
Most participants used hand-crafted dictionaries and string processing to analyse the queries, rather than parsing. Generally, participants converted the score into feature information, extracted the required features from the query, and then matched the two. Some worked with Python and Music21, while others adapted their own pre-existing systems in C++ and Common Lisp.
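As a concrete illustration of that recipe, the sketch below uses the music21 toolkit to turn a MusicXML file into a flat list of note features and then answer a simple_pitch query. It is only a sketch under assumptions: it is not any participant's system, nor the distributed Baseline System; the function names, file name and query are illustrative; chords are skipped; and offsets are left in music21 quarter lengths rather than converted into the divisions-based offsets of the answer format in Section 2.1.

    # Sketch of the feature-extraction-and-matching recipe (illustrative only).
    from music21 import converter

    def note_features(musicxml_path):
        """Return one (bar, offset, pitch, duration) tuple per note in the score."""
        score = converter.parse(musicxml_path)
        features = []
        for part in score.parts:
            for measure in part.getElementsByClass('Measure'):
                for n in measure.notes:
                    if n.isNote:  # skip chords for simplicity
                        features.append((measure.number,
                                         n.offset,                # quarter lengths from the barline
                                         n.pitch.nameWithOctave,  # e.g. 'G5'
                                         n.duration.quarterLength))
        return features

    def find_pitch(features, pitch_name):
        """Answer a simple_pitch query such as 'G5': (bar, offset) of each match."""
        return [(bar, offset) for bar, offset, pitch, _ in features
                if pitch == pitch_name]

    # Hypothetical usage; the score file name is taken from Figure 1.
    # matches = find_pitch(note_features('corelli_allegro_tr_clementi.xml'), 'G5')

Handling the other query types amounts to matching richer patterns over the same feature list (a followed_by query, for instance, inspects two consecutive tuples) and converting the matched positions into the bar-and-offset answer format described in Section 2.1.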
4. CONCLUSIONS
This was a new task at MediaEval and indeed we know of no other work combining NLP and MIR in this way. Many technical details had to be resolved, which sometimes took us to the limits of Western classical music notation. A lot was learned from the exercise, both about evaluation (e.g. in devising versions of Precision and Recall to use) and about music (e.g. where does a cadence begin and end?). A future task could tackle a wider range of questions involving more complicated natural language structures, as well as addressing some loose ends in the task design.

5. REFERENCES
[1] Sutcliffe, R., Peñas, A., Hovy, E., Forner, P., Rodrigo, A., Forascu, C., Benajiba, Y. and Osenova, P. 2013. Overview of QA4MRE Main Task at CLEF 2013. Proceedings of QA4MRE-2013.
[2] Sutcliffe, R., Crawford, T., Hovy, E., Root, D.L. and Fox, C. 2014. Task Description v7: C@merata 14: Question Answering on Classical Music Scores. http://csee.essex.ac.uk/camerata.
[3] Sutcliffe, R. 2014. A Description of the C@merata Baseline System in Python 2.7 for Answering Natural Language Queries on MusicXML Scores. University of Essex Technical Report, 21st May, 2014.