Improving music composition through peer feedback: experiment and preliminary results

Daniel Martín, Benjamin Frantz and François Pachet
Sony CSL Paris
{daniel.martin,pachet}@csl.sony.fr

Abstract

To what extent can peer feedback affect the quality of a music composition? How does musical experience influence the quality of feedback given during the song composition process? To answer these questions we designed and conducted an experiment in which participants compose short songs using an online lead sheet editor, are given the possibility to give feedback on other participants' songs, and can either accept or reject the feedback received on their own compositions. The experiment aims at collecting quantitative data relating the intrinsic quality of songs (estimated by peer evaluation) to the nature of the feedback. Preliminary results show that peer feedback can indeed improve both the quality of a song and the composer's satisfaction with it. Also, composers tend to prefer compositions from other musicians with a similar level of musical experience.

1 Introduction

Peer feedback has become a ubiquitous feature of online education systems. It consists in letting students or participants in a class revise, assess and more generally comment on the work of other students. This model is opposed to the traditional one, in which students' work is evaluated only by a teacher. Peer feedback is acknowledged to bring many benefits [Rollinson, 2005], such as saving teachers' time, as well as other positive pedagogical effects [Sadler and Good, 2006]. With the rise of online learning communities and MOOCs [September, 2013], peer feedback is becoming more and more popular.
Peer feedback is not only useful in pedagogical contexts; it can also be used in creative tasks. In music composition, collaborative composition has been addressed in several studies [Donin, forthcoming 2016], and there are online creative communities in which music is composed collaboratively by several users [Settles and Dow, 2013].

In these creative contexts, the following questions are legitimate: to what extent can peer feedback affect the quality of a musical composition? What is the influence of the musical experience of the composers involved in this process? To address these questions we have designed a music composition experiment based on anonymous one-way feedback with no dialogue. Such a scenario differs from typical collaborative composition contexts, in which composers work hand in hand on a composition. The experiment does not aim at being realistic or at proposing a new tool for collaborative composition, but specifically at collecting quantitative data regarding the relation between feedback, skills and song quality.

We focus on the role of peer feedback in music composition, specifically in lead sheet composition. A lead sheet is a representation of a simple song, consisting of a melody and a corresponding chord grid. We propose an experiment in which peer feedback consists of suggested changes to certain parts of the lead sheet: specific notes, groups of notes, or chords. These musical suggestions can be accompanied by a textual explanation. Once a feedback is posted by a participant, it can be reviewed by the composer, who then decides to either accept it (and modify the lead sheet accordingly) or discard it.

In addition to the sheer effect of feedback, we also examine the characteristics of the participants as composers, commentators and judges. Indeed, extended experience in music composition might be seen as a prerequisite for writing a good song or for giving useful suggestions. However, previous research has shown that expertise might not be as critical as one could expect [Frese et al., 1999].

2 Description of the experiment

Participants are instructed to write a short composition using an online lead sheet editor [Martín et al., 2015]. They are then asked to give feedback on another participant's composition, and finally they are asked to improve their own original composition using the feedback posted on it. Participants are divided randomly into two groups: participants in the control group (G1) do not receive any feedback and try to improve their song by themselves, whereas participants in the experimental group (G2) may use the feedback received to improve their own song. Participants are not aware of the existence of these two groups, so that the results are not biased.

As we are trying to assess the impact of feedback on the quality of a music composition, we need to estimate the quality of all compositions, as well as of the variations they undergo during the experiment. To do so, we use social consensus to determine the quality of a song: participants listen to other participants' compositions and are given the possibility to "like" them. The quality of a song is then simply determined by the number of likes it obtains. In the next sections we describe each phase of the experiment in detail.

2.1 Questionnaire

Participants start the experiment by answering 15 questions about their experience in music, and more specifically in music composition. For example, they are asked how many years they have studied music theory, how many years they have been playing in a band, which styles of music they prefer, how often they compose, etc.

2.2 Original composition

Participants then write a short composition using the online lead sheet editor. A lead sheet is a particular type of music score, widely used in jazz, bossa nova and song-writing, consisting of a monophonic melody and a chord grid. All compositions have a fixed length of 8 bars; participants are not able to add or delete bars, but they can choose the tempo and the time signature of the song. Participants fill the 8 bars with a melody and chord labels (e.g. Dmaj7, Em7, etc.). Figure 1 shows a screenshot of the lead sheet editor.

Figure 1: Screenshot of a composition being entered with the lead sheet editor.
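To make the representation concrete, the sketch below models a lead sheet as constrained in the experiment: 8 fixed bars, each carrying an optional chord label and a monophonic melody fragment, plus a tempo and time signature. This is a minimal illustrative model in Python; the class and field names are hypothetical, and it is not the actual data model of the LeadsheetJS editor [Martín et al., 2015].

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical minimal model of a lead sheet as used in the experiment:
# 8 bars, each holding a chord label and a monophonic melody fragment.
# Illustrative sketch only; not the actual LeadsheetJS data model.

@dataclass
class Note:
    pitch: str       # e.g. "C4", "F#5"
    duration: float  # in beats, e.g. 1.0 = one quarter note in 4/4

@dataclass
class Bar:
    chord: Optional[str] = None             # chord label, e.g. "Dmaj7", "Em7"
    melody: List[Note] = field(default_factory=list)

@dataclass
class LeadSheet:
    tempo: int = 120                        # chosen by the participant
    time_signature: str = "4/4"             # chosen by the participant
    # Fixed length of 8 bars: participants cannot add or delete bars.
    bars: List[Bar] = field(default_factory=lambda: [Bar() for _ in range(8)])

sheet = LeadSheet(tempo=100, time_signature="3/4")
sheet.bars[0].chord = "Dmaj7"
sheet.bars[0].melody = [Note("F#4", 1.0), Note("A4", 1.0), Note("D5", 1.0)]
```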
Participants can listen to their composition with a basic MIDI player. When they are done, they click on "Save and Finish". Next, they answer a questionnaire about their confidence in the quality and complexity of their composition, and their satisfaction with it.

2.3 Feedback posting

Once they have finished their composition, participants are asked to give feedback to another participant by suggesting improvements to that participant's composition. Each suggestion can be at most two bars long. Participants can make as many suggestions as they want, as long as the suggestions do not overlap; each participant can therefore make a maximum of 8 suggestions (one per bar). To make a suggestion, participants must choose the bar(s) to modify; they can then change the notes and the chord symbols. Optionally, they can also leave a text comment explaining their changes. Figure 2 shows a composition in which a participant is entering suggestions with an explanation. When they are finished, they answer a short questionnaire about their confidence in the suggestions they just made, as well as their opinion of the original song they modified.

Figure 2: Screenshot showing a participant entering an explanation of the suggestion.

2.4 Improvement: final composition

Next, participants are asked to reconsider their own composition and to try to improve it. Participants from G1 (the control group) are told that they unfortunately did not receive suggestions, and are encouraged to try to improve their composition by themselves. Participants from G2 see the suggestions they received from two other participants, and can listen to each of them. If they like a suggestion they can accept it, so that it is kept and the song is automatically updated accordingly. In addition to integrating suggestions, they can freely modify their composition. Once they are finished, they answer a questionnaire about their confidence in their own improvement and their opinion of the suggestions received.

2.5 Evaluation phase

The last step of the experiment is to evaluate pairs of compositions from other participants. Each pair consists of an original song and the corresponding improved song. Participants are asked to evaluate each song by placing it on a vertical scale with a legend ranging from 0 ("I don't like it") to 100 ("I like it very much"). Participants do not know which is the original and which is the improved song when evaluating: one of the versions is presented as song A and the other as song B, and this assignment is performed randomly. Participants have to evaluate at least 5 pairs of songs in order to finish the experiment.

3 Results

In this section we describe in detail the results obtained from each phase of the experiment.

3.1 Population

The experiment was conducted between February and July 2015. 66 participants completed the experiment (68% men and 32% women). Mean age was 29.2 years, ranging from 19 to 61. Musical experience was measured through a questionnaire with 7 items. The scale has a satisfactory sensitivity, with an observed range from 7 to 41 (on a scale from 0 to 42), a mean of 28.7 and a standard deviation (SD) of 8.9. The internal consistency is satisfactory (Cronbach's alpha=.82).

Composition experience was measured through a questionnaire with 5 items. The results show an overall low level of composition experience in our sample, with a mean of 6.9 (SD=6.1) on a scale ranging from 0 to 30. The internal consistency is satisfactory (Cronbach's alpha=.85).
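For reference, the internal consistencies reported above follow the standard Cronbach's alpha formula: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). A minimal sketch of this computation is given below; the NumPy implementation and the synthetic responses are illustrative only, not the actual analysis code.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_participants x k_items) response matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Example: synthetic 7-item questionnaire answered by 66 participants.
rng = np.random.default_rng(0)
responses = rng.integers(0, 7, size=(66, 7)).astype(float)
print(cronbach_alpha(responses))
```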
3.2 Composition effects

Each participant was randomly assigned to either the control group (G1) or the experimental group (G2). No significant differences were observed between the two groups in relation to age, gender, musical experience or composition experience.

Composition evaluations

During the evaluation step, we checked whether participants had listened to the songs before evaluating them. Of the 1195 evaluations made, 219 were made without listening to the song; we removed those evaluations.

The songs were evaluated by an average of 8.8 different judges. The mean score of the evaluations made during the evaluation phase is 53.25 (SD=13.26) on a scale ranging from 0 to 100. However, judges might be more or less strict, and some songs might have been evaluated by a particularly strict or generous participant. To take the severity of the judges into account, we standardized the evaluations to z-scores, where the mean and standard deviation used are those of all the evaluations made by a given participant. As a result, the mean of the standardized scores is approximately zero, with a standard deviation of approximately .50. It should be noted that this final score correlates strongly with the raw score (r=.84), which indicates that we had enough evaluations per song to avoid any severity bias.
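This per-judge standardization can be expressed compactly. The sketch below, using pandas (an illustrative choice of tooling, not necessarily the one used in the actual analysis), z-scores each evaluation against the mean and SD of all evaluations made by the same judge, then averages the standardized scores per song to obtain its consensual quality.

```python
import pandas as pd

# Each row is one evaluation: which judge scored which song, raw 0-100 score.
evals = pd.DataFrame({
    "judge": ["j1", "j1", "j1", "j2", "j2", "j3", "j3", "j3"],
    "song":  ["s1", "s2", "s3", "s1", "s3", "s1", "s2", "s3"],
    "score": [80, 60, 70, 30, 20, 55, 45, 50],
})

# Standardize each score against the mean/SD of *that judge's* evaluations,
# which removes judge severity (a strict judge's scores are re-centered).
by_judge = evals.groupby("judge")["score"]
evals["z"] = (evals["score"] - by_judge.transform("mean")) / by_judge.transform("std")

# A song's consensual quality is the mean standardized score over its judges.
song_quality = evals.groupby("song")["z"].mean()
print(song_quality)
```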
Original composition

The questionnaire that participants completed after finishing the original composition included self-estimation questions about the quality of, the complexity of, and their satisfaction with their composition, on scales ranging from very bad/simple/unsatisfied (0) to very good/complex/satisfied (6). We also asked them to estimate the time they had spent on their composition and whether they had used an instrument to help them compose (and, if so, which one).

Results show a mean quality of 2.8 (SD=1.5), a mean complexity of 1.9 (SD=1.6) and a mean satisfaction of 3.2 (SD=1.6). Only the complexity differs significantly from the center of the scale, which is 3 (T(65)=-5.27; p<.0001). This means that participants tend to judge their work as rather simple (low complexity). We also observed positive and significant correlations between these three measures, ranging from r=.41 to r=.80.

During the suggestion step, we asked participants to also rate the quality and complexity of the songs they had to comment on. Each composition from the experimental group (G2) was commented on by two different participants, so in the end we obtained one score from the author and two scores from two different commentators. Interestingly, there was no correlation between the scores from the original composer and those from the commentators (r<.10), but the two commentators did agree with each other on the quality (r=.80) and on the complexity (r=.70).

Moreover, from the judgments made during the evaluation phase (in which participants evaluate pairs of songs from other participants), the measured quality of each original song (standardized to z-scores) allows us to estimate the composition skill of its author. Surprisingly, we observed that the quality of the original song is only marginally related to composition experience (r=.18, p=.15) or to musical experience (r=.19, p=.12).

We also asked participants whether they had used an instrument to help them compose. Results show no significant effect of instrument use on the mean quality score (T(64)=-0.87, p=.38).

The mean composition time, as estimated by the participants, is 30 minutes (SD=32 min), ranging from 1 minute to 240 minutes. Participants largely underestimate this duration: the real duration, computed from the time spent in the composition software, is significantly longer (m=67 min; T(65)=4.20, p<.001). The correlation between the two durations is not very high, but significant (r=.46, p<.001), indicating that the estimation error is not the same for everyone. Interestingly, we observed that the quality of the original songs (from the evaluation phase) is not linked to the time spent composing, whether subjective (r=.04) or objective (r=.03). This result suggests that, in a situation with no time constraint, the amount of time devoted to composing has no effect on quality.

Finally, there is a difference in the consensual quality of the original songs, obtained from the evaluations of several participants (0.07 in G1 vs. -0.15 in G2). This could be due to differences in the groups of judges evaluating each song.

Suggestions

In the questionnaire filled in after making their suggestions, participants were asked how much they thought the song they were revising would be improved by their modifications (on a 7-point Likert scale ranging from 0, "very little", to 6, "very much").

The participants from G2, the experimental group (N=30), each received two suggestions for their final composition. Once they had finished, we asked them whether the suggestions received were interesting (on the same 7-point Likert scale). Additionally, we recorded the number of suggestions they received and the number of text comments received.

We ran a series of correlations between these measures and the improvement effect (the difference between the original song and the final song on the quality judgment score). None were significant, suggesting that neither the number of suggestions received nor the number of explanations accompanying those suggestions has an impact on the improvement of a song.

Final composition

Overall, the control group, G1, does not improve significantly between the original song (m=.07) and the final song (m=.12) (improvement effect = .05, T(35)=0.94, p=.35). However, we do see a significant improvement for the experimental group, G2, between the original song (m=-.15) and the final song (m=.08) (improvement effect = .23, T(29)=2.47, p=.02). See Figure 3.

Figure 3: Difference between the original song and the final song on the quality judgment score, for the group without feedback (G1) and the group with feedback (G2).
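The group-level tests above are paired comparisons between each composer's original and final quality scores, with df = N - 1 (hence T(35) for G1, N=36, and T(29) for G2, N=30). A minimal sketch of this computation with SciPy follows; the data are synthetic stand-ins, not the experimental scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for the standardized quality scores of each composer's
# original and final song (one pair per participant in a group of 30).
original = rng.normal(loc=-0.15, scale=0.5, size=30)
final = original + rng.normal(loc=0.23, scale=0.4, size=30)

# Improvement effect: mean within-composer difference (final - original).
improvement = (final - original).mean()

# Paired t-test, as in the reported T(29) statistic for G2 (df = N - 1).
t_stat, p_value = stats.ttest_rel(final, original)
print(f"improvement={improvement:.2f}, t={t_stat:.2f}, p={p_value:.3f}")
```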
We also examined the participants' subjective evaluation of the improvement of their song. We constructed two composite scores: one from the self-evaluation scales of the original song (quality, complexity and satisfaction), and one from the self-evaluation scales of the final song. The internal consistencies of these composite scores are satisfactory (both Cronbach's alphas are above .81). We then conducted a mixed between-participants (control vs. experimental group) x within-participants (original vs. final song) analysis of variance, and observed a significant interaction between groups and songs (F(1,64)=7.07, p=.01). To explore this interaction, we used post-hoc Tukey HSD tests. Results show that participants who received suggestions improved significantly between the original and the final song (p<.001), while the control group showed no improvement (p=.49). See Figure 4.

Figure 4: Self-estimated quality of the original and final songs for the group without feedback (G1) and the group with feedback (G2).

When evaluating songs, users did not know which song was the original and which was the final one, as the order of the songs was determined randomly. This was a design decision, to avoid participants tending to rate the final song higher simply because it is supposed to be improved. Additionally, we wanted to ensure that songs were not rated higher just because they had been modified more. To check this point, we used a melodic similarity algorithm [Urbano et al., 2011] to estimate the similarity between each original and final song. The correlations between the percentage of similarity and the improvement effect, based both on the composer's subjective opinion and on the scores from the judges, are low (r=-.36, p=.003 and r=-.19, p=.13), which suggests that the improvement is not linked to the dissimilarity between the two versions.

Lead sheet editor

The software used was developed specifically for the experiment, and we asked participants whether it was frustrating (0) or helpful (6) to compose with. Results show a mean of 3.13 after the first composition and 3.41 after the final composition (the difference is not significant), which means that even if the online editor was not especially helpful, it did not hinder the composition process.

Experience effect on evaluations

To find out whether musical experience has an impact on the way participants judge songs from other participants, we divided our sample of participants into two groups according to their experience as musicians (based on the median). We also divided our sample of songs according to the musical experience of their authors. We then ran a two-way ANOVA to explore the effect of the experience of the judges according to the experience of the composer. Results show a crossed interaction between these two variables (F(1,61)=7.63, p=.007), as illustrated in Figure 5. These results indicate that experienced judges give high scores to songs by experienced authors and low scores to songs by non-experienced authors, while exactly the opposite holds for non-experienced judges. This means that participants tend to prefer compositions from other participants with similar experience. This could explain the difference in the evaluation of the original songs in G1 and G2: the groups of judges evaluating each song could have different levels of expertise.

Figure 5: Interaction between the experience of the author and the experience of the judges on the quality score.
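The median split and the cell means behind this crossed interaction can be sketched as follows; the column names and the pandas tabulation are illustrative assumptions, not the actual analysis code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# One row per evaluation: the judge's musical experience score, the author's
# musical experience score, and the standardized quality score given.
n = 200
df = pd.DataFrame({
    "judge_exp": rng.integers(7, 42, size=n),
    "author_exp": rng.integers(7, 42, size=n),
    "z_score": rng.normal(size=n),
})

# Median split: label judges and authors as low/high experience.
df["judge_group"] = np.where(df["judge_exp"] >= df["judge_exp"].median(), "high", "low")
df["author_group"] = np.where(df["author_exp"] >= df["author_exp"].median(), "high", "low")

# Cell means of the 2x2 design; a crossed interaction shows up as high scores
# on the diagonal (similar experience) and low scores off the diagonal.
print(df.pivot_table(values="z_score", index="judge_group", columns="author_group"))
```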
4 Conclusion

The aim of this experiment was primarily to examine quantitatively the impact of peer feedback in music composition, and secondly to assess how important the participants' experience as musicians or composers is in the whole process. Before any improvement or suggestions, participants had to write their first song. Interestingly, the results show that participants' previous experience in composition did not impact the quality of their song, and the same pattern was found for their previous experience as musicians. These two results suggest that the quality of a song (based on social consensus) does not really tap into musicality but into something else, presumably creativity; as noted above, creativity might play an important role [Frese et al., 1999].

Results show that composers who received feedback (G2) clearly rated the improved song higher than the original, meaning that they were satisfied with the improvement they had made. Further, the evaluation based on social consensus also showed a larger improvement for G2. Hence, participants who received feedback not only felt that they had composed a better song after the improvement step: they actually did. This basic finding suggests that improvements in music may be achieved even without real collaboration involving dialogue and active interaction, but through simple suggestions made on a single occasion.

Since there is a difference in the evaluation of the original songs between G1 and G2, we wanted to verify whether experience can make a difference when evaluating songs, and we found that participants tend to prefer songs composed by other participants with similar musical experience.

Future work may go deeper into determining the influence of the participants' experience, for example by checking when songs are improved the most, taking into account the experience of composers, commentators and judges. Further, we could assess more precisely which suggestions were actually used (or accepted) by the original composer, so as to rank commentators by how often their suggestions are accepted, as a measure of how good they are as commentators. We could also check whether suggestions from experienced commentators are more likely to be used by inexperienced composers, or whether experienced composers usually accept suggestions from other composers, and how this affects the improvement of the song.

Acknowledgments

This work is supported by the Praise project (EU FP7 number 388770), a collaborative project funded by the European Commission under programme FP7-ICT-2011-8.

References

[Donin, forthcoming 2016] Nicolas Donin. Domesticating gesture: the collaborative creative process of Florence Baschet's StreicherKreis for 'augmented' string quartet (2006-2008). In Eric Clarke and Mark Doffman, editors, Creativity, Improvisation and Collaboration: Perspectives on the Performance of Contemporary Music. Oxford University Press, New York, forthcoming 2016.

[Frese et al., 1999] Michael Frese, Eric Teng, and Cees J. D. Wijnen. Helping to improve suggestion systems: Predictors of making suggestions in companies. Journal of Organizational Behavior, 20(7):1139-1155, 1999.

[Martín et al., 2015] Daniel Martín, Timotée Neullas, and François Pachet. LeadsheetJS: A JavaScript library for online lead sheet editing. In First International Conference on Technologies for Music Notation and Representation (TENOR), Paris, France, 2015.

[Rollinson, 2005] Paul Rollinson. Using peer feedback in the ESL writing class. ELT Journal, 59(1):23-30, 2005.

[Sadler and Good, 2006] Philip M. Sadler and Eddie Good. The impact of self- and peer-grading on student learning. Educational Assessment, 11(1):1-31, 2006.

[September, 2013] On September. Behind the scenes with MOOCs: Berklee College of Music's experience developing, running, and evaluating. Continuing Higher Education Review, 77:137, 2013.

[Settles and Dow, 2013] Burr Settles and Steven Dow. Let's get together: the formation and success of online creative collaborations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2009-2018. ACM, 2013.

[Urbano et al., 2011] Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado. Melodic similarity through shape similarity. In Exploring Music Contents, pages 338-355. Springer, 2011.

[Van den Berg et al., 2006] Ineke Van den Berg, Wilfried Admiraal, and Albert Pilot. Designing student peer assessment in higher education: Analysis of written and oral peer feedback. Teaching in Higher Education, 11(2):135-147, 2006.