An environment for machine pedagogy: Learning how to teach computers to read music

Gabriel Vigliensoni, Jorge Calvo-Zaragoza, and Ichiro Fujinaga
Schulich School of Music, McGill University, CIRMMT
Montréal, QC, Canada
{gabriel.vigliensonimartin, jorge.calvozaragoza, ichiro.fujinaga}@mcgill.ca


ABSTRACT
We believe that in many machine learning systems it would be effective to create a pedagogical environment in which both the machines and the humans can incrementally learn to solve problems through interaction and adaptation.

We are designing an optical music recognition (OMR) workflow system in which human operators can intervene at certain stages to correct and teach the system, so that it can learn from its errors and its overall performance can improve progressively as more music scores are processed.

In order to instantiate this pedagogical process, we have developed a series of browser-based interfaces for the different stages of our OMR workflow: image preprocessing, music symbol recognition, musical notation recognition, and final representation construction. In most of these stages we integrate human input with the aim of teaching the computer to improve its performance.

ACM Classification Keywords
H.5.5. Information interfaces and presentation (e.g., HCI): Sound and Music Computing—Systems; H.5.2. Information interfaces and presentation (e.g., HCI): User Interfaces—User-centered design; I.5.5. Pattern recognition: Implementation—Interactive systems

Author Keywords
Optical music recognition; interactive machine learning; artificial pedagogy

©2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. MILC '18, March 11, 2018, Tokyo, Japan

INTRODUCTION
The idea of achieving the intellectual development of a machine, or of making computers smarter by creating algorithmic models, is not new. Alan Turing stated in the middle of the last century that the interaction of machines with humans would be necessary to adapt machines to the human standard and to achieve intellectual or performance parity with humans [14]. He envisioned that human guidance and feedback would be desirable at various points of the machine's learning process. However, Turing also anticipated that humans can act as a "brake" on fast machine computation, and so the places and levels of interaction between machines and humans should be studied and considered carefully.

One of the strengths of current learning machines lies in their ability to recognize complex patterns, provided that there is a large amount of labeled training data (ground truth). In cases where massive ground-truth datasets are not readily available, one solution is to incrementally and interactively train an adaptive system, with gradual exposure to new data. We argue that in these supervised adaptive learning environments it is important to study how humans impart their knowledge to the machine: what are the different teaching methods (pedagogy) by which the machine can achieve a desired performance, and how do humans learn these effective strategies?

A Pedagogy for "Learning Machines"
In this paper, we propose the idea of a pedagogy for learning machines as the study of the methods and activities of teaching machines. This pedagogy is about creating an environment in which humans can learn the art of teaching machines that run learning algorithms in an incremental learning process. Turing also anticipated [14, p. 472] that learning machines

    will make mistakes at times, and at times they may make new and very interesting statements, and on the whole the output of them will be worth attention to the same sort of extent as the output of a human mind.

Following Turing's vision, we propose to exploit human skills and knowledge to teach machines to optimize their performance. In order to achieve this, we first need to understand how humans interact with a machine-learning component, and then we need to build a clever workflow that takes advantage of the intelligence of the human and the fast computation of the computer.

Bieger et al. proposed a conceptual framework for teaching intelligent systems [1]. They identified the constituent elements of that framework and stated that the interaction between teachers (e.g., a human actor) and learners (e.g., a computer system) has the goal of teaching the learning system to gain knowledge about something or about a specific task. As a pedagogical strategy, we hypothesize that by knowing the learner, and how the learner reacts to correction and new input, teachers can adapt their teaching tactics to improve the pedagogy.

The impact of human supervision in the loop of supervised machine learning workflows has also been studied empirically. For example, Fails and Olsen built a system for creating image
classifiers and proposed the concept of interactive machine learning [7] for those environments in which human teachers evaluate a model created by a learning machine, then edit the training data and retrain the model according to their expert judgment to improve the performance of the system on the given task. Also, Fiebrink et al. studied the evaluation practices of human actors interactively building supervised learning systems for gesture analysis [8].

Figure 1. Our end-to-end optical music recognition (OMR) workflow. Places where human intervention or human-entered data is needed are indicated by a human icon. The interfaces that humans use to visualize intermediate outputs of the system, as well as to teach the system, are shaded in grey.

In the next section we will detail how we have incorporated interactive checkpoints between human teachers and learning-machine systems in the development of an intelligent interface for encoding symbolic music, so that people can access cultural music heritage in an unprecedented manner.

TEACHING MACHINES HOW TO READ MUSIC SCORES
Our aim is to read and extract the content from digitized images of music documents. This process is called optical music recognition (OMR) and, despite more than 30 years of research, it remains a difficult problem. The slow progress in OMR, particularly when dealing with older music documents, is mainly due to the large variability of musical sources (e.g., degradation, bleed-through, and handwriting and notation styles, among others). Since most approaches for extracting the musical content in the different layers of these manuscripts (e.g., musical notes, lyrics, staff lines, and ornamental letters) have been developed using heuristics, they rely on specific characteristics of the documents, and so these methods usually do not generalize well to music documents of a different type or era.

Fully manual OMR projects have been developed to overcome the large degree of variability in handwritten music scores. Allegro, for example, is a recently developed web-based crowdsourcing tool for transcribing and encoding scores of a corpus of folk songs in Common Western Music Notation [2].

In order to work at a larger scale, we have taken a different route to OMR of Medieval and Renaissance music by using a machine learning-based approach. Instead of using heuristics and features that take advantage of specific characteristics of the documents, we teach the computer to classify the different elements in a music score by training it with a large number of examples for each category to be classified. The computer learns the regularities in these examples and creates a model of the data. Once a model is created, it is used to classify new examples that the computer has not yet seen. In other words, the computer learns by example from the teacher.

In the standard OMR workflow, human intervention is required to correct the errors generated by the automated process. We can take advantage of this by incorporating the previously corrected scores, as ground truth, into subsequent processing in an adaptive OMR system [9]. Pugin et al. experimented with this idea by building book-adaptive OMR models for music from microfilms [12]. Their experiments showed that human editing costs were substantially reduced and that the approach was especially well suited to handling the various degradation levels of music documents from typographic prints.

Our entire OMR workflow is depicted in Figure 1. This process is divided into four stages: image preprocessing, music symbol recognition, musical notation recognition, and final representation construction. Digitized music scores are the input to the system, and image preprocessing is applied to segment the constituent parts of the music document into layers. The recognition of the music symbols and the analysis of their relationships is achieved once the symbols are isolated and classified in the found layers. Finally, the retrieved musical information is encoded into a machine-readable format. We want to automate the process of extracting and digitizing the content of music scores. However, since we know that this
process is not error free, and the errors generated in previous steps are carried forward to the next ones, we want to learn about the type of errors that the computer makes in each stage in order to: (i) provide better ground-truth data to improve the performance of the computer, and (ii) let users (teachers) of the system understand where computers make mistakes so that they can modify their teaching behavior. To facilitate these tasks, we have implemented interactive checkpoints in the OMR workflow.

In the next two subsections we present the interactive interfaces we have developed for teaching the machine how to perform tasks in the first two stages of the OMR workflow.

Teaching machines for image segmentation
The first stage in our OMR workflow is image preprocessing. In this step, all pixels of the music score image are classified into different, pre-defined layers. Since we need training data as examples for recognizing the different layers within an image, and creating ground truth from scratch is onerous and expensive, we have tested a few approaches for teaching the computer to perform the image preprocessing. So far, we have found that we can drastically reduce the time and effort needed to build ground truth by preprocessing a small number of images with a pre-existing model, usually a model learned on pages with similar characteristics. If no model achieves a meaningful result (i.e., if the output is not significantly better than random), we use a heuristic method instead. Then, we correct the coarse errors in the output of the previous stage with a pixel-level editor. In this step, we spend only the amount of time required to correct the major errors, in order to obtain a reasonable, though not perfect, set of corrected data. Finally, we iterate over the two previous steps until the desired performance is achieved. We assume that perfect performance cannot be achieved because, at the pixel level, it is hard even for humans to discriminate which layer a pixel belongs to, especially at the boundaries.

Most image preprocessing techniques (whether based on heuristics or on machine learning) output a non-negligible number of misclassified pixels, and so we developed Pixel.js, an open-source, web-based, pixel-level classification application for correcting the output of image segmentation processes [13]. We use this tool interactively with a convolutional neural network-based classifier [4] to create ground-truth data incrementally. A conventional machine learning approach would work under the assumption that training and tuning will be performed a few times and need not be interactive. Hence, one reasonable strategy for improving supervised learning systems through human interaction is to enable the user to evaluate a model and then edit its training dataset based on his or her judgment of how the model should improve.

In our approach to image segmentation, the output of a learning system is used by a human teacher to further inform the system about its performance on the task. As a result, we are implementing an incremental and adaptive workflow based on tactics and strategies by which human teachers modify their actions depending on the outcome of a task given to learning machines. Preliminary implementations of these pedagogical strategies and actions have allowed us to reduce the effort of creating ground truth for OMR image preprocessing by 40 percent. Importantly, we have not only obtained performance similar to that achieved with ground truth created from scratch, but we have also achieved higher user satisfaction [5]. We are currently increasing the iteration rate between training, correction, and retraining to see if even better results can be obtained.

Once the image preprocessing step has been performed, our OMR system outputs a number of image files per original score image, where each file contains a layer representing a different type of musical information. For example, these layers may contain notes, staff lines, lyrics, annotations, or ornamental letters.

Teaching machines to recognize musical symbols
Our application for the second stage of the OMR workflow, music symbol recognition, is called Interactive Classifier (IC). IC is a web-based version of the Gamera classifier [6]. In this stage, the connected components of a specific layer of the original image are automatically grouped into glyphs. Then, a human teacher manually labels the classes of a number of musical glyphs. IC extracts a set of features describing each of the glyphs and classifies the data using a k-nearest neighbors classifier.

An attractive aspect of IC is that it can be used in an incremental learning fashion [11]. That is, as new data is entered into the system by a human teacher, IC learns from the new information and accommodates the classes while preserving previously acquired knowledge, without building a new classifier. In other words, the IC module for music symbol recognition is designed so that human teachers do not have to start over from scratch whenever new data or classes are entered into the learning system. Instead, they can use a previously trained classifier of glyphs and labels for the initial classification. Then, they can manually correct the glyphs that were misclassified and perform a reclassification. By repeating this process, IC learns the corrections at each iteration and builds a better classifier until the teacher is satisfied with the results.

An interesting characteristic of IC is that how well the machine learns depends on how well the human teaches it. In fact, the human, through interaction, can gradually learn how to teach the machine better. Furthermore, human teachers do not need to know the intricacies of machine learning or to be domain experts because, for humans, these are simple visual tasks. We strongly believe that this interaction is important for developing a pedagogy for machines that learn.

Non-pedagogical OMR stages
The last two stages of our OMR workflow, musical notation recognition and final representation construction, share a common interactive breakpoint for visualizing and correcting the output of the automated OMR process. This human-driven checkpoint is embedded in a web-based interface called Neume Editor Online (Neon) [3]. Neon allows a user to inspect differences between the original music score image and the rendered version of the output of the OMR process. By
visual inspection of the two overlaid scores, the user can observe their differences and manually add, edit, or delete music symbols in the browser. So far, however, corrections entered by the user are not fed back into the learning system; they only change the encoded music file output.

Our OMR workflow management system
Since our workflow requires a human operator to teach the learning system, we need to be able to create interactive checkpoints where the system stops a process and waits for user input. As a result, all the constituent parts of our OMR workflow are handled by Rodan, a distributed, collaborative, and networked adaptive workflow management system [10] that allows users to specify both interactive and non-interactive tasks.

FINAL REMARKS AND FUTURE WORK
The end goal of our project is not only to segment images and to recognize music symbols, but also to create a final music representation that is browsable and searchable by humans and computers through many different means. We envision this interface as an intelligent music-score-searching tool for the 21st century. We are currently investigating the available infrastructure for creating this interface. Among other components, we are making use of the International Image Interoperability Framework (IIIF) and IIIF manifests, which allow for the display of high-resolution images directly from the institutions holding the rights to these images. We also make use of visualization interfaces (e.g., the Diva.js document image viewer) that take advantage of IIIF, and of the Music Encoding Initiative (MEI) encoding format (e.g., through the Verovio music notation engraving library). We hope that this infrastructure, in combination with the proper teaching strategies and tactics developed by human teachers in the interfaces for training the OMR system, will enable the end-to-end recognition and encoding of music from music score images.

ACKNOWLEDGMENTS
This research has been supported by the Social Sciences and Humanities Research Council of Canada. Important parts of this work used Compute Canada's High Performance Computing resources.

REFERENCES
1. Jordi Bieger, Kristinn R. Thórisson, and Bas R. Steunebrink. 2017. The pedagogical pentagon: A conceptual framework for artificial pedagogy. In International Conference on Artificial General Intelligence (Lecture Notes in Computer Science, vol. 10414), Tom Everitt, Ben Goertzel, and Alexey Potapov (Eds.). Springer, Cham, 212–222.
2. Manuel Burghardt and Sebastian Spanner. 2017. Allegro: User-centered design of a tool for the crowdsourced transcription of handwritten music scores. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage. 15–20.
3. Gregory Burlet, Alastair Porter, Andrew Hankinson, and Ichiro Fujinaga. 2012. Neon.js: Neume Editor Online. In Proceedings of the 13th International Society for Music Information Retrieval Conference. 121–126.
4. Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017a. Pixel-wise binarization of musical documents with convolutional neural networks. In Proceedings of the 15th IAPR International Conference on Machine Vision Applications. Nagoya, Japan, 362–365.
5. Jorge Calvo-Zaragoza, Ké Zhang, Zeyad Saleh, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017b. Music document layout analysis through machine learning and human feedback. In Proceedings of the 12th IAPR International Workshop on Graphics Recognition. Kyoto, Japan.
6. Michael Droettboom, Karl MacMillan, and Ichiro Fujinaga. 2003. The Gamera framework for building custom recognition systems. In Proceedings of the 2003 Symposium on Document Image Understanding Technologies. Greenbelt, MD, 275–286.
7. Jerry Alan Fails and Dan R. Olsen Jr. 2003. Interactive machine learning. In Proceedings of the 8th International Conference on Intelligent User Interfaces. Miami, FL, 39–45.
8. Rebecca Fiebrink, Perry R. Cook, and Dan Trueman. 2011. Human model evaluation in interactive supervised learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 147–156.
9. Ichiro Fujinaga. 1996. Adaptive optical music recognition. Ph.D. Dissertation. McGill University, Montréal, QC.
10. Andrew Hankinson. 2015. Optical Music Recognition Infrastructure for Large-scale Music Document Analysis. Ph.D. Dissertation. McGill University, Montréal, QC.
11. Robi Polikar, Lalita Udpa, Satish S. Udpa, and Vasant Honavar. 2001. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews 31, 4 (2001), 497–508.
12. Laurent Pugin, John Ashley Burgoyne, Douglas Eck, and Ichiro Fujinaga. 2007. Book-adaptive and book-dependent models to accelerate digitization of early music. In Proceedings of the NIPS Workshop on Music, Brain, and Cognition. Whistler, BC, 1–8.
13. Zeyad Saleh, Ké Zhang, Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017. Pixel.js: Web-based pixel classification correction platform for ground truth creation. In Proceedings of the 12th IAPR International Workshop on Graphics Recognition. Kyoto, Japan.
14. Alan M. Turing. 2004. Intelligent machinery, a heretical theory. In The Essential Turing: Seminal Writings in Computing, Logic, Philosophy, Artificial Intelligence, and Artificial Life: Plus The Secrets of Enigma, B. Jack Copeland (Ed.). Oxford University Press, Oxford, United Kingdom, Chapter 12, 472–475.