=Paper= {{Paper |id=Vol-1263/paper49 |storemode=property |title=The CLAS System at the MediaEval 2014 C@merata Task |pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_49.pdf |volume=Vol-1263 |dblpUrl=https://dblp.org/rec/conf/mediaeval/Wan14 }} ==The CLAS System at the MediaEval 2014 C@merata Task== https://ceur-ws.org/Vol-1263/mediaeval2014_submission_49.pdf
     The CLAS system at the MediaEval 2014 C@merata Task
                                                             Stephen Wan
                                                                CSIRO
                                                            Sydney, Australia
                                                     Stephen.Wan@csiro.au

ABSTRACT                                                               2.   scans the CR and consumes concepts if they define the scope
     This paper describes the CLAS system which accepts natural             of the answer.
language queries in the domain of music theory to perform              3.   parses the remaining CR list to construct the query
passage retrieval from a musical score. This system was produced            representation (QR), a sequence of feature structures (FSs) that
for participation in the C@merata MediaEval 2014 shared task.               indicate the type of answer required, using handwritten
The system uses a domain-specific parser to interpret the query             parsing rules which implicitly capture the domain-specific
and answer generation methods based on feature unification.                 interpretation of the NLQ.
Performance on this task was encouraging with 0.76 precision and       4.   compares the QR with a subset of the data in the XML,
0.96 recall.                                                                referred to as the Scoped Data (SD), represented as a list of
                                                                            FSs, from which candidate answers can be found using feature
1. INTRODUCTION                                                             unification.
     This paper describes the CLAS system, which selects,
processes and retrieves potentially relevant answers from              2.1 Mapping Query Terms to Concepts
structured data given a natural language query. In this work, the           The system uses a handcrafted lexicon that maps from terms
queries and the structured data are in the domain of music theory,     in the NLQ to concepts in the music theory domain, using the
as defined by the C@merata 2014 task [1]. The CLAS system              following five steps.
produces candidate answers by selecting passages from a                     In Step 1, multi-word entities such as “down bow” are
musical score (in XML). Answers may be any consecutive time            mapped to a single token “down_bow” to allow correct
points spanning multiple whole and partial bars.                       tokenisation. In Step 2, tokens such as “Vb”, denoting the
     For example, a query “4 crotchets” should retrieve any            dominant chord (“V”) in the first inversion (“b”), are separated
sequence of four consecutive elements in the score where each          into the two components. In Step 3, quotation marks are used to tag
element is a note and each note has the time duration of a crotchet    quoted words as being lyrics (Note: the lexicon used here is
(one quarter of a whole note). In such a system, expert knowledge      limited to music theory terms only and does not include the wider
is needed to interpret the query. However, this is not limited to      language from which lyrics may originate). In Step 4, tokens are
definitions of musical concepts (e.g., “crotchet”). For example,       separated using whitespace as a delimiter. Finally, in Step 5,
the query “4 crotchets” should be interpreted not just as any four     tokens are mapped to their conceptual form using the lexicon.
notes with crotchet duration within the music (compare this to a       Non-contentful words that are not used to construct the QR (e.g.,
general knowledge query “4 composers” requiring any four               the article “a” or redundant information about sequence order like
musical composers to be provided) but specifically four notes in       “followed by”) are mapped to a null token and are thus ignored.
sequence. Furthermore, these four notes would typically be                  For example, the word "crotchet" is mapped to
expected to be in the same voice or part; for example, if it were a    "_note:length.1", indicating that the word relates to a “note” FS,
piano score for two hands, the four crotchets might be a sequence      where the feature “length” takes the value “1”. Similarly, the
written in the treble clef, played by the right hand.                  word "quarter" (as in “quarter note”) is also mapped to this sense
     In this paper, we describe a system that processes the input      "_note:length.1".
query, mapping from words in English to music metadata                      Words can have multiple meanings. For example, the word
corresponding to the search criteria, or features, represented as a    "perfect" is mapped to "_sequence:int_quality.PERFECT;
set of attribute-value pairs. An exhaustive search of an XML           _chord_sequence:cadence.PERFECT", indicating two senses: one
score is performed, note by note, for candidate answers using          referring to the quality of an interval (e.g., “a perfect fifth”), the other to a
feature unification.                                                   type of chord sequence (e.g., “a perfect cadence”).
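     The mapping steps of Section 2.1 can be illustrated with a minimal Python sketch. The lexicon entries for “crotchet”, “quarter” and “perfect” are the ones quoted in that section, and those for “C”, “sharp”, “D” and “minim” follow the example CR in Section 2.3. The null entry for “and”, the regular expression for inverted chords and all function names are assumptions made for illustration; this is not the CLAS implementation. Lyric tagging (Step 3) is omitted, and Step 2 is applied per token after whitespace splitting for brevity.

```python
import re

# Step 1 table: multi-word entities are joined into single tokens.
MULTIWORD = {"down bow": "down_bow", "followed by": "followed_by"}

# Step 2 pattern (assumed): a Roman-numeral chord plus an inversion letter.
CHORD_INVERSION = re.compile(r"^(VII|VI|V|IV|III|II|I)([a-d])$")

# Hypothetical mini-lexicon. None marks a null token that is ignored.
# Multiple senses of one word are separated by ";".
LEXICON = {
    "crotchet": "_note:length.1",
    "quarter": "_note:length.1",
    "minim": "_note:length.2",
    "C": "_note:name.C",
    "D": "_note:name.D",
    "sharp": "_note:accidental.SHARP",
    "perfect": "_sequence:int_quality.PERFECT;_chord_sequence:cadence.PERFECT",
    "a": None,
    "and": None,  # assumed to be a null token
    "followed_by": None,
}


def tokenise(query):
    """Steps 1, 2 and 4: join multi-word entities, split, separate 'Vb'."""
    for phrase, joined in MULTIWORD.items():
        query = query.replace(phrase, joined)
    tokens = []
    for raw in query.split():
        match = CHORD_INVERSION.match(raw)
        if match:
            tokens.extend(match.groups())  # e.g. "Vb" -> "V", "b"
        else:
            tokens.append(raw)
    return tokens


def to_concepts(query):
    """Step 5: map tokens to concepts, dropping null and unknown tokens."""
    concepts = []
    for token in tokenise(query):
        concept = LEXICON.get(token)  # skipping unknown tokens is an assumption
        if concept is not None:
            concepts.append(concept)
    return concepts


cr = to_concepts("a C sharp crotchet and a D minim")
```

     The lookup is case-sensitive so that the article “a” and a note name such as “A” can receive different entries.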
     This system achieved an overall performance of 0.76
precision and 0.96 recall. The remainder of the paper outlines the
                                                                       2.2 Building Scoped Data
                                                                            The system labels each NLQ with a type T specifying the
system in more detail and presents the C@merata evaluation
                                                                       type of answer required and the scope of the XML data to be
results.
                                                                       examined for an answer (i.e., the SD). In this work, we defined
2. APPROACH                                                            four types: (i) harmonic, (ii) cadence, (iii) style, and (iv) note.
     The CLAS system interprets the natural language query             Each type specifies: (1) rules for converting the XML
(NLQ) to find candidate answer passages from the score. Briefly,       representation into an SD; (2) parsing rules to convert the CR into
the system:                                                            a QR; and (3) candidate generation rules.
                                                                            A scan of the CR is used to determine the type T by
1.   pre-processes tokens and maps these to a list of concepts, or     searching for concepts specifying the data “granularity”. If any
     the concept representation (CR).                                  are found, these are removed from CR and used to set the type.
                                                                       For example, “simultaneous”, as in “simultaneous second”
Copyright is held by the author/owner(s).
                                                                       (referring to an interval of a second where both notes are sounded
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.
                                                                       concurrently),       is       mapped      to      the       concept
"_data:granularity.HARMONIC", indicating the harmonic type.            (e.g., “half note C”), “expression” (e.g., “fermata A natural”),
In this case, the SD is defined as a list of chordal notes, taken      precision and recall are above 0.86. Indeed, in some cases, recall
from a block chord view of the score.¹                                 and precision are 1.0.
     The cadence and style types also scope the data as a list of            The general approach of creating sequences of feature
chords. If no other type is indicated by a concept in CR, the          structures (the “followed by” query type, e.g., “quaver C#
default note type is used, defining the SD as the concatenation of     followed by crotchet B”) performed reasonably, with precision of
the sequence of notes in each voice.                                   0.748 and recall of 0.859 for the beat answer type (performance
     For queries where the voice or clef is specified, for example     increases for the bar answer type). From this, we infer that the
“treble clef” or “soprano part”, the corresponding concepts are        general assumptions underpinning the way noun phrases about
used to filter the data to include just that voice.                    notes are transformed into the query representations using the
                                                                       reduction process are sound.
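     The scoping step of Section 2.2 can be sketched as follows. The concept "_data:granularity.HARMONIC" is taken from the text; the note dictionaries, the "NOTE" label for the default type and both function names are hypothetical. Grouping notes by onset is only a crude stand-in for the block chord view that the system obtains from music21, since it ignores notes that are sustained across onsets.

```python
GRANULARITY_PREFIX = "_data:granularity."


def split_scope(cr):
    """Consume granularity concepts from the CR and return (type, rest)."""
    answer_type = "NOTE"  # default type when no granularity concept is found
    remaining = []
    for concept in cr:
        if concept.startswith(GRANULARITY_PREFIX):
            answer_type = concept[len(GRANULARITY_PREFIX):]
        else:
            remaining.append(concept)
    return answer_type, remaining


def build_scoped_data(notes, answer_type, voice=None):
    """Build the SD from note dicts with 'voice' and 'onset' keys (assumed)."""
    if voice is not None:
        # A voice or clef named in the query restricts the data to that voice.
        notes = [n for n in notes if n["voice"] == voice]
    if answer_type == "NOTE":
        # Concatenate the note sequences of the voices, one voice at a time.
        return sorted(notes, key=lambda n: (n["voice"], n["onset"]))
    # Other types: one list of simultaneous notes per onset.
    chords = {}
    for n in notes:
        chords.setdefault(n["onset"], []).append(n)
    return [chords[onset] for onset in sorted(chords)]


notes = [
    {"voice": "soprano", "onset": 0, "name": "E"},
    {"voice": "bass", "onset": 0, "name": "C"},
    {"voice": "soprano", "onset": 1, "name": "D"},
    {"voice": "bass", "onset": 1, "name": "B"},
]
answer_type, rest = split_scope(["_data:granularity.HARMONIC", "_note:name.C"])
scoped = build_scoped_data(notes, answer_type)
```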
2.3 Building a Query Representation (QR)
      The remaining tokens in CR are used to create a list of FSs of                     Granularity Precision     Recall
type T following a bespoke rule-based parsing process. The CR is                         Beat         0.713        0.904
processed in reverse order (assuming head-final noun phrases)                            Bar          0.764        0.967
and FSs are constructed in a process loosely based on reduction in                           Table 1. Overall Results
a shift-reduce parser.
      For example, the query “a C sharp crotchet and a D minim”
is mapped to the CR “[_note:name.C, _note:accidental.SHARP,
                                                                       3.2 Future Work
                                                                             In this work, time constraints affected the choice of methods
_note:length.1, _note:name.D, _note:length.2]”. The concepts
                                                                       used in the CLAS system. For example, instead of the bespoke
“[_note:name.D, _note:length.2]” are consumed first and used to
                                                                       parsing process used here to map from the query tokens to the
populate a FS. At this point, the “_note:length.1” concept is
                                                                       feature structures in the Query Representation, an alternative
encountered. Because the current FS already has a note length
                                                                       method might be to create a context-free grammar for the domain
value (a “minim”), the FS is popped off and pushed onto the QR
                                                                       sublanguage and to use a tool like NLTK² to parse the tokens,
list. A new FS is then used to consume the remaining tokens:
                                                                       resulting in a syntactic parse. This linguistic structure can then be
“[_note:name.C, _note:accidental.SHARP, _note:length.1]”. The
                                                                       mapped to the feature structures. In future work, we will examine
CR is now empty and the QR is a list of two FSs corresponding to
                                                                       the parsing of noun phrase structures in which the features for
the notes. Parsing works similarly for the other types. For
                                                                       matching are propagated up to an appropriate node in the tree.
example, cadences are sequences of chord FSs.
                                                                       These can then be collected to form the Query Representation.
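     The reverse-order reduction described in Section 2.3 can be sketched as below, using the CR of the “a C sharp crotchet and a D minim” example. Flat Python dictionaries stand in for feature structures, and the function names are hypothetical; the actual CLAS rules are type-specific and are not reproduced here.

```python
def parse_concept(concept):
    """Split a concept such as '_note:length.1' into (type, feature, value)."""
    fs_type, _, rest = concept.lstrip("_").partition(":")
    feature, _, value = rest.partition(".")
    return fs_type, feature, value


def build_qr(cr):
    """Consume the CR from the end, starting a new FS on a feature clash."""
    qr = []
    current = {}
    for concept in reversed(cr):
        _, feature, value = parse_concept(concept)
        if feature in current:
            # The current FS already has this feature, so it is complete.
            # Completed FSs are prepended to keep the original query order.
            qr.insert(0, current)
            current = {}
        current[feature] = value
    if current:
        qr.insert(0, current)
    return qr


cr = [
    "_note:name.C",
    "_note:accidental.SHARP",
    "_note:length.1",
    "_note:name.D",
    "_note:length.2",
]
qr = build_qr(cr)
```

     For this CR the result is a list of two FSs, one for the C sharp crotchet and one for the D minim, matching the walk-through in Section 2.3.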
2.4 Matching a Query Representation to                                       Finally, instead of enumerating exhaustively through all
                                                                       notes, in future work, we will examine the use of search engines
Scoped Data                                                            to find candidate starting positions, from which feature unification
      Once a QR is generated, the SD sequence is then iterated         processes can then start. In this approach, notes might be treated
through and at each position a match to the QR is attempted using      as quasi-documents, allowing them to be indexed by metadata
feature unification. If a match is found, then a candidate answer      based on musical properties.
passage is stored.
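     A minimal sketch of this matching loop is given below. It assumes flat feature structures represented as dictionaries, for which unification succeeds when no shared feature carries conflicting values. The function names, the "bar" feature and the example data are hypothetical. The sketch also assumes that every FS in the SD is fully specified (for instance, natural notes carry an explicit accidental value); otherwise a query for C sharp would also unify with a note that has no accidental feature.

```python
def unify(a, b):
    """Unify two flat feature structures; return None on a value clash."""
    merged = dict(a)
    for feature, value in b.items():
        if feature in merged and merged[feature] != value:
            return None
        merged[feature] = value
    return merged


def find_matches(qr, sd):
    """Return (start, end) index pairs of SD spans that unify with the QR."""
    matches = []
    if not qr:
        return matches
    for start in range(len(sd) - len(qr) + 1):
        if all(unify(q, sd[start + j]) is not None for j, q in enumerate(qr)):
            matches.append((start, start + len(qr) - 1))
    return matches


qr = [
    {"name": "C", "accidental": "SHARP", "length": "1"},
    {"name": "D", "length": "2"},
]
sd = [
    {"name": "C", "accidental": "SHARP", "length": "1", "bar": 1},
    {"name": "D", "accidental": "NATURAL", "length": "2", "bar": 1},
    {"name": "E", "accidental": "NATURAL", "length": "1", "bar": 2},
    {"name": "C", "accidental": "SHARP", "length": "1", "bar": 2},
    {"name": "D", "accidental": "NATURAL", "length": "2", "bar": 3},
]
passages = find_matches(qr, sd)
```

     Each returned pair gives the first and last SD positions of a candidate answer passage.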
      For style answers, a different process is used based on simple   4. CONCLUSION
heuristics. For example, the homophony and polyphony answer                 In this work, expert knowledge in music theory was directly
generation processes consider chords for passing notes, indicated      incorporated into a bespoke parser and lexicon. These were used
by implicit ties. Consequently, the QR for this type is an empty       to interpret a music NLQ, alongside a scoping process that reduces the
list since no feature unification takes place.                         space of candidate answers. Parsing was performed using a
                                                                       reduce-style process. Matches were performed using feature
3. RESULTS AND DISCUSSION                                              unification. Performance on this task was encouraging with 0.76
                                                                       precision and 0.96 recall.
3.1 Results
      Performance for this system is encouraging. The overall          5. REFERENCES
results are presented in Table 1, which lists the recall and           [1] Sutcliffe, R., Crawford, T., Fox, C., Root, D.L., and Hovy, E.
precision for answers at two granularities: the correct                    2014. The C@merata Task at MediaEval 2014: Natural
bars and also the correct beats. Considering the hand-crafted              language queries on classical music scores. In MediaEval
lexicon and the bespoke parsing mechanism, the system performs             2014 Workshop, Barcelona, Spain, October 16-17 2014.
reasonably well at both answer granularities, with precision of
0.71 to 0.76 and recall of 0.90 to 0.97. At the time of writing, the
average performance of systems participating in the C@merata
task is not available.
      The C@merata evaluation also provides additional statistics
regarding performance based on the type of query. The system
does well with queries related to the properties of notes in a
sequence. For these categories, “simple pitch” (e.g., “G”),
“simple length” (e.g., “quarter note rest”), “pitch and length”

1
      The method chordify from the music21 package
                                                                       2
    (http://web.mit.edu/music21/) is used to produce this view.            http://www.nltk.org/