=Paper=
{{Paper
|id=Vol-2068/milc4
|storemode=property
|title=An Environment for Machine Pedagogy: Learning How to Teach Computers to Read Music
|pdfUrl=https://ceur-ws.org/Vol-2068/milc4.pdf
|volume=Vol-2068
|authors=Gabriel Vigliensoni,Jorge Calvo-Zaragoza,Ichiro Fujinaga
|dblpUrl=https://dblp.org/rec/conf/iui/VigliensoniCF18
}}
==An Environment for Machine Pedagogy: Learning How to Teach Computers to Read Music==
Gabriel Vigliensoni, Jorge Calvo-Zaragoza, and Ichiro Fujinaga
Schulich School of Music, McGill University, CIRMMT
Montréal, QC, Canada
{gabriel.vigliensonimartin, jorge.calvozaragoza, ichiro.fujinaga}@mcgill.ca

©2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. MILC '18, March 11, 2018, Tokyo, Japan.

ABSTRACT

One of the strengths of current learning machines lies in their ability to recognize complex patterns, provided that there is a large amount of labeled training data (ground truth). In cases where massive ground-truth datasets are not readily available, one solution is to incrementally and interactively train an adaptive system, with gradual exposure to new data. We argue that in these supervised adaptive learning environments, it is important to study how humans impart their knowledge to the machine: what are the different teaching methods (pedagogy) for the machine to achieve a desired performance, and how do humans learn these effective strategies. In order to instantiate this pedagogical process, we have developed a series of browser-based interfaces for the different stages of our OMR workflow: image preprocessing, music symbol recognition, musical notation recognition, and final representation construction. In most of these stages we integrate human input with the aim of teaching the computers to improve their performance.

ACM Classification Keywords

H.5.5. Information interfaces and presentation (e.g., HCI): Sound and Music Computing—Systems; H.5.2. Information interfaces and presentation (e.g., HCI): User Interfaces—User-centered design; I.5.5. Pattern recognition: Implementation—Interactive systems

Author Keywords

Optical music recognition; interactive machine learning; artificial pedagogy

INTRODUCTION

The idea of achieving the intellectual development of a machine, or making computers smarter when creating algorithmic models, is not new. Alan Turing stated in the middle of the last century that the interaction of machines with humans would be necessary to adapt machines to the human standard and to achieve intellectual or performance parity with humans [14]. He envisioned that human guidance and feedback are desirable at various points of the machine's process of learning. However, Turing also anticipated that humans can act as a "brake" on fast machine computational processes, and so the places and levels of interaction between machines and humans should be studied and considered carefully.

We believe that in many machine learning systems it would be effective to create a pedagogical environment where both the machines and the humans can incrementally learn to solve problems through interaction and adaptation. We are designing an optical music recognition (OMR) workflow system where human operators can intervene to correct and teach the system at certain stages, so that it can learn from its errors and its overall performance can improve progressively as more music scores are processed.

A Pedagogy for "Learning Machines"

In this paper, we propose the idea of a pedagogy for learning machines as the study of the methods and activities of teaching machines. This pedagogy is about creating an environment where humans can learn the art of how to teach machines running learning algorithms in an incremental learning process.

Turing also anticipated that learning machines "will make mistakes at times, and at times they may make new and very interesting statements, and on the whole the output of them will be worth attention to the same sort of extent as the output of a human mind" [14, p. 472].

Following Turing's vision, we propose to exploit human skills and knowledge to teach machines to optimize their performance. In order to achieve this, we first need to understand how humans interact with a machine-learning component, and then we need to build a clever workflow that takes advantage of the intelligence of the human and the computer's ability to perform fast calculations.

Bieger et al. proposed a conceptual framework for teaching intelligent systems [1]. They identified the constituent elements of that framework and stated that the interaction between teachers (e.g., a human actor) and learners (e.g., a computer system) has the goal of teaching the learning system to gain knowledge about something or about a specific task. As a pedagogical strategy, we hypothesize that by knowing the learner, and how the learner reacts to correction and new input, teachers can adapt their teaching tactics to improve the pedagogy.

The impact of human supervision in the loop of supervised machine learning workflows has also been studied empirically. For example, Fails and Olsen built a system for creating image classifiers and proposed the concept of interactive machine learning [7] for those environments where human teachers evaluate a model created by a learning machine, then edit the training data and retrain the model according to their expert judgment to improve the performance of the system on the given task. Also, Fiebrink et al. studied the evaluation practices of human actors interactively building supervised learning systems for gesture analysis [8].

In the next section we detail how we have incorporated interactive checkpoints between human teachers and learning-machine systems in the development of an intelligent interface for encoding symbolic music, so that people can access cultural music heritage in an unprecedented manner.

TEACHING MACHINES HOW TO READ MUSIC SCORES

Our aim is to read and extract the content from digitized images of music documents. This process is called optical music recognition (OMR) and, despite more than 30 years of research, it remains a difficult problem. The slow development in OMR, particularly when dealing with older music documents, lies mainly in the large variability of musical sources (i.e., degradation, bleed-through, handwriting and notation style, among others). Since most approaches for extracting the musical content in the different layers of these manuscripts (e.g., musical notes, lyrics, staff lines, ornamental letters, etc.) have been developed using heuristics, they rely on the specific characteristics of the documents, and so these methods usually do not generalize well to music documents of a different type or era.

Fully manual OMR projects have been developed to overcome the large degree of variability in handwritten music scores. Allegro, for example, is a recently developed web-based crowdsourcing tool to transcribe and encode the scores of a corpus of folk songs in Common Western Music Notation [2].

In order to work at a larger scale, we have taken a different route to OMR of Medieval and Renaissance music by using a machine learning-based approach. Instead of using heuristics and features that take advantage of specific characteristics of the documents, we teach the computer to classify the different elements in a music score by training it with a large number of examples for each category to be classified. The computer learns the regularities in these examples and creates a model of the data. Once a model is created, it is used to classify new examples that the computer has not yet seen. In other words, the computer learns by examples from the teacher.

In the standard OMR workflow, human intervention is required to correct the errors generated by the automated process. Hence, we can take advantage of this by incorporating the previously corrected scores, as ground truth, into subsequent processing in an adaptive OMR system [9]. Pugin et al. experimented with this idea by building book-adaptive OMR models for music from microfilms [12]. Their experiments showed that human editing costs were substantially reduced and that the approach was especially well suited to handling the various degradation levels of music documents from typographic prints.

Figure 1. Our end-to-end optical music recognition (OMR) workflow. Places where human intervention or human-entered data is needed are indicated by a human icon. The interfaces that humans use to visualize intermediate outputs of the system as well as to teach the system are shaded in grey.

Our entire OMR workflow is depicted in Figure 1. This process is divided into four stages: image preprocessing, music symbol recognition, musical notation recognition, and final representation construction. Digitized music scores are the input to the system, and image preprocessing is applied to segment the constituent parts of the music document into layers. The recognition of the music symbols and the analysis of their relationships is achieved once the symbols are isolated and classified in the found layers. Finally, the retrieved musical information is encoded into a machine-readable format. We want to automate the process of extracting and digitizing the content of music scores. However, since we know that this process is not error free, and the errors generated in previous steps are carried forward to the next ones, we want to learn about the type of errors that the computer makes in each stage in order to: (i) provide better ground-truth data to improve the performance of the computer, and (ii) let users (teachers) of the system understand and know where computers make mistakes in order to modify their behavior. To facilitate these tasks, we have implemented interactive checkpoints in the OMR workflow.

In the next two subsections we present the interactive interfaces we have developed for teaching the machine how to perform tasks in the first two stages of the OMR workflow.

Teaching machines for image segmentation

The first stage in our OMR workflow is image preprocessing. In this step, all pixels of the music score image are classified into different, pre-defined layers. Since we need training data as examples for recognizing the different layers within an image, and creating ground truth from scratch is onerous and expensive, we have tested a few approaches for teaching the computer to perform the image preprocessing. So far, we have found that we can drastically reduce the time and effort needed to build ground truth by preprocessing a small number of images with a pre-existing model, usually a model learned on pages with similar characteristics. If no model achieves a meaningful result (i.e., if the output is not significantly better than random), we use a heuristic method. Then, we correct the coarse errors in the output of the previous stage with a pixel-level editor. In this step, we only spend the amount of time required to correct the major errors in order to have a reasonable, but not perfect, set of corrected data. Finally, we iterate over the two previous steps until the desired performance is achieved. We assume that perfect performance cannot be achieved because, at the pixel level, it is hard even for humans to discriminate which layer a pixel belongs to, especially at the boundaries.

Most image preprocessing techniques (whether heuristic or machine learning-based) output a non-negligible amount of misclassified pixels, and so we developed Pixel.js, an open source, web-based, pixel-level classification application to correct the output of image segmentation processes [13]. We use this tool interactively with a convolutional neural network-based classifier [4] to create ground-truth data incrementally. A conventional machine learning approach would work under the assumption that training and tuning will be performed a few times and need not be interactive. Hence, one reasonable strategy for improving supervised learning systems using human interaction is enabling the user to evaluate a model, then edit its training dataset based on his or her judgments of how the model should improve.

In our approach to image segmentation, the output of a learning system is used by a human teacher to further inform the system about its performance on the task. As a result, we are implementing an incremental and adaptive workflow based on tactics and strategies by which human teachers modify their actions depending on the outcome of a task given to learning machines. Preliminary implementations of these pedagogical strategies and actions have permitted us to reduce the effort of creating ground truth for image preprocessing for OMR by 40 percent. Importantly, we have not only obtained performance similar to that achieved with ground truth created from scratch, but we have also achieved higher user satisfaction [5]. We are currently increasing the iteration rate between training, correction, and retraining to see if even better results can be obtained.

Once the image preprocessing step has been performed, our OMR system outputs a number of image files per original score image, where each file contains a layer representing a different type of musical information. For example, these layers may contain notes, staff lines, lyrics, annotations, or ornamental letters.

Teaching machines to recognize musical symbols

Our application for the second stage of the OMR workflow, music symbol recognition, is called Interactive Classifier (IC). IC is a web-based version of the Gamera classifier [6]. In this stage, the connected components of a specific layer of the original image are automatically grouped into glyphs. Then, a human teacher has to manually label the classes of a number of musical glyphs. IC extracts a set of features describing each of the glyphs, and classifies the data using a k-nearest neighbors classifier.

An attractive aspect of IC is that it can be used in an incremental learning fashion [11]. That is, as new data is entered by a human teacher into the system, IC learns from the new information and accommodates the classes while preserving previously acquired knowledge, without building a new classifier. In other words, the IC module for music symbol recognition is designed in a way that human teachers do not have to start over from scratch when new data or classes are entered into the learning system. Instead, they can use a previously trained classifier of glyphs and labels for the initial classification. Then, they can manually correct the glyphs that were misclassified and perform a reclassification. By repeating this process, IC learns the corrections at each iteration and builds a better classifier until the teacher is satisfied with the results.

An interesting characteristic of IC is that how well the machine learns depends on how well the human teaches it. In fact, the human, through interaction, can gradually learn how to teach the machine better. Furthermore, human teachers do not need to know the intricacies of machine learning, nor do they need to be domain experts, because, for humans, these are simple visual tasks. We strongly believe that this interaction is important for developing a pedagogy for machines that learn.

Non-pedagogical OMR stages

The last two stages of our OMR workflow, musical notation recognition and final representation construction, have a common interactive breakpoint for visualizing and correcting the output of the automated OMR process. This human-driven checkpoint is embedded as a web-based interface called Neume Editor Online (Neon) [3]. Neon allows a user to inspect differences between the original music score image and the rendered version of the output of the OMR process. By visual inspection of the two overlaid scores, the user can observe their differences and manually add, edit, or delete music symbols in the browser. So far, however, corrections entered by the user are not fed back into the learning system; they only change the encoded music file output.

Our OMR workflow management system

Since our workflow requires a human operator to teach the learning system, we need to be able to create interactive checkpoints where the system stops a process and waits for user input. As a result, all the constituent parts of our OMR workflow are handled by Rodan, a distributed, collaborative, and networked adaptive workflow management system [10] that allows users to specify interactive and non-interactive tasks.

FINAL REMARKS AND FUTURE WORK

The end goal of our project is not only to segment images and to recognize music symbols, but to create a final music representation that can be browsed and searched by humans and computers in many different ways. We envision this interface as an intelligent, music-score-searching tool for the 21st century. We are currently investigating the available infrastructure for creating this interface. Among other components, we are making use of the International Image Interoperability Framework (IIIF) and IIIF manifests, which allow for the display of high-resolution images directly from the institutions holding the rights to these images. We also make use of visualization interfaces (e.g., the Diva.js document image viewer) that take advantage of IIIF, and of the Music Encoding Initiative (MEI) music encoding format (e.g., the Verovio music notation engraving library). We hope that this infrastructure, in combination with the proper teaching strategies and tactics developed by human teachers in the interfaces for training the OMR system, will enable the end-to-end recognition and encoding of music from music score images.

ACKNOWLEDGMENTS

This research has been supported by the Social Sciences and Humanities Research Council of Canada. Important parts of this work used Compute Canada's High Performance Computing resources.

REFERENCES

1. Jordi Bieger, Kristinn R. Thórisson, and Bas R. Steunebrink. 2017. The pedagogical pentagon: A conceptual framework for artificial pedagogy. In International Conference on Artificial General Intelligence (Lecture Notes in Computer Science, vol. 10414), Tom Everitt, Ben Goertzel, and Alexey Potapov (Eds.). Springer, Cham, 212–222.

2. Manuel Burghardt and Sebastian Spanner. 2017. Allegro: User-centered design of a tool for the crowdsourced transcription of handwritten music scores. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage. 15–20.

3. Gregory Burlet, Alastair Porter, Andrew Hankinson, and Ichiro Fujinaga. 2012. Neon.js: Neume Editor Online. In Proceedings of the 13th International Society for Music Information Retrieval Conference. 121–126.

4. Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017a. Pixel-wise binarization of musical documents with convolutional neural networks. In Proceedings of the 15th IAPR International Conference on Machine Vision Applications. Nagoya, Japan, 362–365.

5. Jorge Calvo-Zaragoza, Ké Zhang, Zeyad Saleh, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017b. Music document layout analysis through machine learning and human feedback. In Proceedings of the 12th IAPR International Workshop on Graphics Recognition. Kyoto, Japan.

6. Michael Droettboom, Karl MacMillan, and Ichiro Fujinaga. 2003. The Gamera framework for building custom recognition systems. In Proceedings of the 2003 Symposium on Document Image Understanding Technologies. Greenbelt, MD, 275–286.

7. Jerry Alan Fails and Dan R. Olsen Jr. 2003. Interactive machine learning. In Proceedings of the 8th International Conference on Intelligent User Interfaces. Miami, FL, 39–45.

8. Rebecca Fiebrink, Perry R. Cook, and Dan Trueman. 2011. Human model evaluation in interactive supervised learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 147–156.

9. Ichiro Fujinaga. 1996. Adaptive optical music recognition. Ph.D. Dissertation. McGill University, Montréal, QC.

10. Andrew Hankinson. 2015. Optical Music Recognition Infrastructure for Large-scale Music Document Analysis. Ph.D. Dissertation. McGill University, Montréal, QC.

11. Robi Polikar, Lalita Upda, Satish S. Upda, and Vasant Honavar. 2001. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews 31, 4 (2001), 497–508.

12. Laurent Pugin, John Ashley Burgoyne, Douglas Eck, and Ichiro Fujinaga. 2007. Book-adaptive and book-dependent models to accelerate digitization of early music. In Proceedings of the NIPS Workshop on Music, Brain, and Cognition. Whistler, BC, 1–8.

13. Zeyad Saleh, Ké Zhang, Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017. Pixel.js: Web-based pixel classification correction platform for ground truth creation. In Proceedings of the 12th IAPR International Workshop on Graphics Recognition. Kyoto, Japan.

14. Alan M. Turing. 2004. Intelligent machinery, a heretical theory. In The Essential Turing: Seminal Writings in Computing, Logic, Philosophy, Artificial Intelligence, and Artificial Life: Plus The Secrets of Enigma, B. Jack Copeland (Ed.). Oxford University Press, Oxford, United Kingdom, Chapter 12, 472–475.
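
The correct-and-reclassify loop behind the Interactive Classifier described in the paper (label a few glyphs, classify the rest with k-nearest neighbors, let the teacher correct errors, fold the corrections back into the training set) can be sketched as follows. This is a minimal illustration only, not IC's implementation: the two-value feature vectors, the class names, and the `knn_predict` helper are hypothetical stand-ins for Gamera's feature extraction and classifier machinery.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify one feature vector by majority vote among its k nearest
    labelled neighbours (Euclidean distance)."""
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical glyph feature vectors (e.g., normalized width and height)
# whose labels were previously confirmed by the human teacher.
training_set = [
    ((1.0, 1.0), "punctum"), ((1.1, 0.9), "punctum"),
    ((3.0, 0.2), "staff_line"), ((2.8, 0.3), "staff_line"),
]

unlabelled = [(1.05, 1.1), (2.9, 0.25)]

# One teaching iteration: classify, let the teacher correct, then retrain
# by appending the (corrected) examples. Previously acquired knowledge is
# preserved because the confirmed examples stay in the training set.
predictions = [(x, knn_predict(training_set, x)) for x in unlabelled]
corrections = {}  # the teacher found no errors this round
for x, label in predictions:
    training_set.append((x, corrections.get(x, label)))
```

Repeating the loop with new pages grows the training set monotonically, which is the sense in which the teacher never starts over from scratch.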
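
The interactive checkpoints that the paper attributes to Rodan (the system stops a process and waits for user input) can be illustrated with a toy scheduler: non-interactive tasks run to completion, while an interactive task suspends the workflow until a human supplies input. This is a sketch under stated assumptions, not Rodan's actual API; the `Task` class, the stage names, and the "waiting" status are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    name: str
    interactive: bool = False
    run: Optional[Callable[[str], str]] = None  # unused for interactive tasks

def execute(workflow, data, user_input=None):
    """Run tasks in order; pause at the first interactive task that has not
    yet received human input, returning ('waiting', task_name, data)."""
    for task in workflow:
        if task.interactive:
            if user_input is None:
                return ("waiting", task.name, data)
            data, user_input = user_input(data), None
        else:
            data = task.run(data)
    return ("done", None, data)

# Illustrative pipeline mirroring part of the OMR workflow.
workflow = [
    Task("image_preprocessing", run=lambda d: d + " -> layers"),
    Task("symbol_correction", interactive=True),
    Task("notation_recognition", run=lambda d: d + " -> neumes"),
]

# Without human input the workflow blocks at the checkpoint...
status, where, partial = execute(workflow, "score.tiff")
# ...and resumes once the operator's corrections arrive.
status2, _, out = execute(workflow, "score.tiff",
                          user_input=lambda d: d + " [corrected]")
```

The design point this illustrates is that the human checkpoint is a first-class node in the workflow graph, not an out-of-band editing step.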