AUDIO-VISUAL GUITAR TRANSCRIPTION

Marco Paleari, Benoit Huet∗                  Antony Schutz, Dirk Slock
Multimedia Department                        Mobile Communication Department
Eurecom Institute                            Eurecom Institute
Sophia Antipolis, France                     Sophia Antipolis, France

ABSTRACT

Music transcription refers to the extraction of a human-readable and interpretable description from a recording of a music performance. Automatic music transcription remains, nowadays, a challenging research problem when dealing with polyphonic sounds or when removing certain constraints. Some instruments, like guitars and violins, add ambiguity to the problem, as the same note can be played at different positions. When dealing with guitar music, tablatures are often preferred to the usual music score, as they present information in a more accessible way. Here, we address this issue with a system which uses the visual modality to support traditional audio transcription techniques. The system is composed of four modules which have been implemented and evaluated: a system which tracks the position of the fretboard on a video stream, a system which automatically detects the position of the guitar on the first frame to initialize the first system, a system which detects the position of the hand on the guitar, and finally a system which fuses the visual and audio information to extract a tablature. Results show that this kind of multimodal approach can easily disambiguate 89% of notes in a deterministic way.

1. INTRODUCTION

Written music is traditionally presented as a score, a musical notation which includes the attack times, durations, and pitches of the notes that constitute the song. When dealing with the guitar this task is usually more complex. In fact, the pitch of a note alone is not always enough to represent the movements and positions that the performer has to execute to play a piece. A guitar can indeed chime the same note at different positions of the fretboard, on different strings (see Fig. 1). This is why the musical transcription of a guitar usually takes the form of a tablature. A tablature is a musical notation which includes six lines (one for each guitar string) and numbers representing the positions at which the strings are to be pressed.

Fig. 1. Notes on a guitar fretboard

Burns and Wanderley [1] report the few attempts that have been made to automatically extract fingering information through computer algorithms: real-time processing using a MIDI guitar, post-processing using sound analysis, and post-processing using score analysis. Verner [2] retrieves fingering information through the use of a MIDI guitar, with a different MIDI channel associated with each string. Traube [3] suggests a solution based on timbre: two notes with the same pitch can still have different timbres. Common issues are precision, the need for a-priori knowledge, and the limitation to monophonic operation. Another possibility is to analyze the produced score and to extract the tablature by applying a set of rules based on the physical constraints of the instrument, biomechanical limitations, and other philological analyses. This kind of method can result [4] in tablatures which are similar to those generated by humans, but it hardly deals with situations in which the artistic intention or skill limitations are more important than the biomechanical movement. Last but not least, Burns and Wanderley [1] propose to use the visual modality to extract the fingering information. Their approach makes use of a camera mounted on the head of the guitar and extracts fingering information on the first 5 frets, but it is not applicable to all cases because it needs ad hoc equipment and configuration, and it only returns information about the first 5 frets.

This paper presents a multimodal approach to address this issue. The proposed approach combines information from video (webcam quality) and audio analysis in order to resolve ambiguous situations.

∗ Eurecom Institute's research is partially supported by its industrial members: BMW, Bouygues Télécom, Systems, France Télécom, Hitachi Europe, SFR, Sharp, STMicroelectronics, Swisscom, Thales. The research reported herein was also partially supported by the European Commission under contract FP6-027026, Knowledge Space of semantic inference for automatic annotation and retrieval of multimedia content - K-Space.

2. GUITAR TRANSCRIPTION

The typical scenario considered in this paper involves one guitarist playing a guitar in front of a webcam (XviD, 640x480 pixels at 25 fps). In the work presented here, the entire fretboard of the guitar needs to be completely visible in the video.

Fig. 2. Interface of the Automatic Transcription System

2.1. Automatic Fretboard Detection

The first frame of the video is analyzed to detect the guitar and its position. The current version of our system presents a few constraints: the guitarist is assumed to play a right-handed guitar (i.e. the guitar faces the right side) and to trace an angle with the horizontal which does not exceed 90°. The background is assumed to be less textured than the guitar. As a final result, this module returns the coordinates of the corner points defining the position of the guitar fretboard on the video (the two outermost points for each detected fret).

2.2. Fretboard Tracking

We have described how the fretboard position is detected on the first frame of the video. We make use of the Tomasi-Lucas-Kanade algorithm to follow the points along the video.

The coordinates of the end points of each fret are influenced by the movement of the hand. Therefore, some template matching techniques are applied to force the points to stick to the fretboard. Two constraints were chosen to be invariant to scale, translation, or 3D rotations of the guitar: 1) all the points defining the upper (as well as the lower) bound of the fretboard must be aligned; 2) the lengths of the frets must comply with the rule L_i = L_(i-1) * 2^(-1/12), where L_i represents the length of the i-th fret.

To enforce the first constraint, a first line is computed that matches the highest possible number of points. The points far from the line are filtered out and a linear regression (least squares) is computed. All points far from this second line are filtered out and the regression is recomputed.

The second constraint is applied by comparing the positions of the points with a template representing the distances of all the frets from the nut (i.e. the fret at the head of the guitar). Every twenty seconds the tracking is re-initialized to resolve any issues which may arise from drift of the Lucas-Kanade point tracking.

2.3. Hand Detection

In Section 2.2 the methodology employed to follow the position of the frets along the video has been described. Thanks to these coordinates it is possible to separate the region belonging to the fretboard into n_strings × n_frets cells, corresponding to each string/fret intersection. Filtering is done on the frame to detect the skin color, and the number of "hand" pixels is counted. A threshold can be applied to detect the presence of the hand (see Fig. 2).

3. CONCLUSIONS

In this paper we have overviewed a complete, quasi-unconstrained guitar tablature transcription system which uses low-cost video cameras to solve string ambiguities in guitar pieces. A prototype was developed as a proof of concept demonstrating the feasibility of the system with today's technologies. The results of our studies are positive and encourage further study of many aspects of guitar playing.

4. REFERENCES

[1] A. Burns and M. M. Wanderley, "Visual methods for the retrieval of guitarist fingering," in NIME '06: Proceedings of the 2006 Conference on New Interfaces for Musical Expression, Paris, France, 2006, pp. 196–199.

[2] J. A. Verner, "MIDI Guitar Synthesis: Yesterday, Today and Tomorrow," Recording Magazine, vol. 8, no. 9, pp. 52–57, 1995.

[3] C. Traube, An Interdisciplinary Study of the Timbre of the Classical Guitar, Ph.D. thesis, McGill University, 2004.

[4] D. P. Radicioni, L. Anselma, and V. Lombardo, "A Segmentation-Based Prototype to Compute String Instruments Fingering," in CIM04: Proceedings of the 1st Conference on Interdisciplinary Musicology, 2004.
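The string ambiguity motivating this work (Fig. 1) is easy to make concrete: under standard tuning, one pitch usually maps to several string/fret pairs. A minimal sketch, assuming standard EADGBE tuning and a 19-fret instrument (neither is specified in the paper):

```python
# Enumerate every (string, fret) position that produces a given pitch.
# Assumptions not taken from the paper: standard EADGBE tuning, 19 frets.

# Open-string pitches as MIDI note numbers, 6th (low E) to 1st (high E) string.
OPEN_STRINGS = {6: 40, 5: 45, 4: 50, 3: 55, 2: 59, 1: 64}
N_FRETS = 19

def positions(midi_note):
    """Return all (string, fret) pairs that sound the requested pitch."""
    return [(s, midi_note - open_pitch)
            for s, open_pitch in OPEN_STRINGS.items()
            if 0 <= midi_note - open_pitch <= N_FRETS]

# E4 (MIDI 64) can be played in five different places:
print(positions(64))  # [(5, 19), (4, 14), (3, 9), (2, 5), (1, 0)]
```

The audio front end alone recovers only `midi_note`; the visual modality is what selects one element of this list.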
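The tablature format described in the introduction (six lines, one per string, with fret numbers) can be emitted by a few lines of code. A sketch under illustrative assumptions: high E string drawn on top, single-digit fret numbers, and a hypothetical `(time_step, string, fret)` event format not defined in the paper:

```python
def render_tab(events, width=16):
    """events: list of (time_step, string 1..6, fret); string 1 = high E."""
    names = {1: "e", 2: "B", 3: "G", 4: "D", 5: "A", 6: "E"}
    lines = {s: ["-"] * width for s in range(1, 7)}
    for t, s, fret in events:
        lines[s][t] = str(fret)
    return "\n".join(names[s] + "|" + "".join(lines[s]) for s in range(1, 7))

# The same pitch (E4) written at two different positions: open 1st string,
# then 5th fret on the 2nd string -- identical sound, different tablature.
print(render_tab([(2, 1, 0), (6, 2, 5)]))
```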
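The second tracking constraint of Section 2.2 relies on fret lengths following L_i = L_(i-1) * 2^(-1/12), which fixes the template of fret distances from the nut up to scale. A minimal sketch; the 650 mm scale length is an illustrative choice (a common classical-guitar value), not a figure from the paper, and the ratios are scale-invariant anyway:

```python
# Template of fret distances from the nut (cf. Section 2.2, constraint 2).
# SCALE_MM is an assumed scale length, not a value from the paper.
SCALE_MM = 650.0

def fret_distances(n_frets, scale=SCALE_MM):
    """Distance of fret n from the nut: d_n = scale * (1 - 2**(-n/12))."""
    return [scale * (1.0 - 2.0 ** (-n / 12.0)) for n in range(1, n_frets + 1)]

d = fret_distances(12)
lengths = [d[0]] + [d[i] - d[i - 1] for i in range(1, len(d))]

# Each fret is 2**(-1/12) times as long as the previous one ...
ratio = lengths[1] / lengths[0]
# ... and the 12th fret (the octave) sits at exactly half the scale length.
print(round(d[11], 3))  # 325.0
```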
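The first constraint of Section 2.2 is enforced by fitting a line to the boundary points, discarding points far from it, and refitting with least squares. The filter-and-refit loop can be sketched in pure Python; the distance threshold and iteration count below are illustrative choices, not values from the paper:

```python
def fit_line(points):
    """Ordinary least-squares fit y = a*x + b through (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def robust_fit(points, threshold=2.0, iterations=2):
    """Fit, drop points far from the line, refit (cf. Section 2.2)."""
    kept = list(points)
    a, b = fit_line(kept)
    for _ in range(iterations):
        kept = [(x, y) for x, y in kept if abs(y - (a * x + b)) <= threshold]
        a, b = fit_line(kept)
    return a, b, kept

# Collinear fret end points plus one hand-induced outlier at x = 4:
pts = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 10), (5, 5), (6, 6), (7, 7)]
a, b, kept = robust_fit(pts)
print(len(kept))  # 7 -- the outlier (4, 10) was filtered out
```

One filtering pass suppresses the hand-induced point; the refit then recovers the true fretboard boundary.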
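Section 2.3 counts skin-colored pixels inside each string/fret cell and thresholds the count to decide whether the hand is present. A minimal sketch of that decision; the simple RGB skin rule and the `MIN_PIXELS` threshold are illustrative stand-ins, since the paper does not specify its skin-color filter:

```python
# Per-cell hand detection (cf. Section 2.3).  The skin test and the
# MIN_PIXELS threshold are assumptions, not the paper's exact filter.

MIN_PIXELS = 4

def is_skin(rgb):
    """Crude RGB skin heuristic: bright, reddish pixels."""
    r, g, b = rgb
    return r > 95 and g > 40 and b > 20 and r > g and r > b

def hand_present(cell_pixels):
    """cell_pixels: iterable of (r, g, b) tuples for one string/fret cell."""
    return sum(is_skin(p) for p in cell_pixels) >= MIN_PIXELS

skin = (180, 120, 90)   # skin-like sample
wood = (120, 130, 60)   # fretboard-like sample
cell_with_finger = [skin] * 6 + [wood] * 10
empty_cell = [wood] * 16

print(hand_present(cell_with_finger), hand_present(empty_cell))  # True False
```

Running this test over all n_strings × n_frets cells yields, per frame, the set of candidate string/fret positions that the fusion module can intersect with the audio pitch estimate.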