Improving music composition through peer feedback: experiment and preliminary results

Daniel Martín, Benjamin Frantz and François Pachet
Sony CSL Paris
{daniel.martin,pachet}@csl.sony.fr

Abstract

To what extent can peer feedback affect the quality of a music composition? How does musical experience influence the quality of feedback given during the song composition process? To answer these questions we designed and conducted an experiment in which participants compose short songs using an online lead sheet editor, are given the possibility to give feedback on other participants' songs, and can either accept or reject the feedback received on their own compositions. The experiment aims at collecting quantitative data relating the intrinsic quality of songs (estimated by peer evaluation) to the nature of the feedback. Preliminary results show that peer feedback can indeed improve both the quality of a song and the composer's satisfaction with it. Also, composers tend to prefer compositions from other musicians with a similar level of musical experience.

1 Introduction

Peer feedback has become a ubiquitous feature of online education systems. It consists in letting students or participants in a class revise, assess and more generally comment on the work of other students. This model is opposed to the traditional one, in which students' work is evaluated only by a teacher. Peer feedback is acknowledged to bring many benefits [Rollinson, 2005], such as saving teachers' time, as well as other positive pedagogical effects [Sadler and Good, 2006]. With the rise of online learning communities and MOOCs [September, 2013], peer feedback is becoming more and more popular.
Peer feedback is not only useful in pedagogical contexts; it can also be used in creative tasks. In music composition, collaborative composition has been addressed in several studies [Donin, forthcoming 2016], and there are online creative communities in which music is composed collaboratively by several users [Settles and Dow, 2013].

In these creative contexts, the following questions are legitimate: to what extent can peer feedback affect the quality of a musical composition? What is the influence of the musical experience of the composers involved in this process? To address these questions we have designed a music composition experiment based on anonymous one-way feedback with no dialogue. Such a scenario differs from typical collaborative composition contexts, in which composers work hand in hand on a composition. The experiment does not aim at being realistic or at proposing a new tool for collaborative composition, but specifically at collecting quantitative data regarding the relation between feedback, skills and song quality.

We focus on the role of peer feedback in music composition, specifically in lead sheet composition. A lead sheet is a representation of a simple song, consisting of a melody and a corresponding chord grid. We propose an experiment in which peer feedback consists of suggested changes to certain parts of the lead sheet: specific notes, groups of notes, or chords. These musical suggestions can be accompanied by a textual explanation. Once a feedback is posted by a participant, it can be reviewed by the composer, who then decides to either accept it (and modify the lead sheet accordingly) or discard it.

In addition to the sheer effect of feedback, we also examine the characteristics of the participants as composers, commentators and judges. Indeed, extended experience in music composition might be seen as a prerequisite for writing a good song or for giving useful suggestions. However, previous research has shown that expertise might not be as critical as one could expect [Frese et al., 1999].

2 Description of the experiment

Participants are instructed to write a short composition using an online lead sheet editor [Martín et al., 2015]. They are then asked to give feedback on another participant's composition, and finally they are asked to improve their own original composition using the feedback posted on it. Participants are divided randomly into two groups: participants in the control group (G1) do not receive any feedback and try to improve their song by themselves, whereas participants in the experimental group (G2) may use the feedback received to improve their own song. Participants are not aware of the existence of these two groups, so that the results are not biased.

As we are trying to assess the impact of feedback on the quality of a music composition, we need to estimate the quality of all compositions, as well as of the variations they undergo during the experiment. To do so, we use social consensus to determine the quality of a song: participants listen to other participants' compositions and are given the possibility to "like" them. The quality of a song is then simply determined by the number of likes it obtains. In the next sections we describe each phase of the experiment in detail.

2.1 Questionnaire

Participants start the experiment by answering 15 questions about their experience in music, and more specifically in music composition. For example, they are asked how many years they have studied music theory, how many years they have been playing in a band, which styles of music they prefer, how often they compose, etc.

2.2 Original composition

Participants then write a short composition using the online lead sheet editor. A lead sheet is a particular type of music score, widely used in jazz, bossa nova and song-writing, consisting of a monophonic melody and a chord grid. All compositions have a fixed length of 8 bars; participants are not able to add or delete bars, but they can choose the tempo and the time signature of the song. Participants fill the 8 bars with a melody and chord labels (e.g. Dmaj7, Em7, etc.). Figure 1 shows a screenshot of the lead sheet editor.

Figure 1: Screenshot of a composition being entered with the lead sheet editor.
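To make the representation concrete, the sketch below models a lead sheet as constrained in the experiment: 8 fixed bars, each carrying an optional chord label and a monophonic melody fragment, plus a tempo and time signature. This is a minimal illustrative model in Python; the class and field names are hypothetical, and it is not the actual data model of the LeadsheetJS editor [Martín et al., 2015].

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical minimal model of a lead sheet as used in the experiment:
# 8 bars, each holding a chord label and a monophonic melody fragment.
# Illustrative sketch only; not the actual LeadsheetJS data model.

@dataclass
class Note:
    pitch: str       # e.g. "C4", "F#5"
    duration: float  # in beats, e.g. 1.0 = one quarter note in 4/4

@dataclass
class Bar:
    chord: Optional[str] = None             # chord label, e.g. "Dmaj7", "Em7"
    melody: List[Note] = field(default_factory=list)

@dataclass
class LeadSheet:
    tempo: int = 120                        # chosen by the participant
    time_signature: str = "4/4"             # chosen by the participant
    # Fixed length of 8 bars: participants cannot add or delete bars.
    bars: List[Bar] = field(default_factory=lambda: [Bar() for _ in range(8)])

sheet = LeadSheet(tempo=100, time_signature="3/4")
sheet.bars[0].chord = "Dmaj7"
sheet.bars[0].melody = [Note("F#4", 1.0), Note("A4", 1.0), Note("D5", 1.0)]
```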
Participants can listen to their composition with a basic MIDI player. When they are done, they click on "Save and Finish". Next, they answer a questionnaire about their confidence in the quality and complexity of their composition, and their satisfaction with it.

2.3 Feedback posting

Once they have finished their composition, participants are asked to give feedback to another participant by suggesting improvements to that participant's composition. Each suggestion can be at most two bars long. Participants can make as many suggestions as they want, as long as the suggestions do not overlap; each participant can therefore make a maximum of 8 suggestions (one per bar). To make a suggestion, participants must choose the bar(s) to modify; they can then change the notes and the chord symbols. Optionally, they can also leave a text comment explaining their changes. Figure 2 shows a composition in which a participant is entering suggestions with an explanation. When they are finished, they answer a short questionnaire about their confidence in the suggestions they just made, as well as their opinion of the original song they modified.

Figure 2: Screenshot showing a participant entering an explanation of the suggestion.

2.4 Improvement: final composition

Next, participants are asked to reconsider their own composition and to try to improve it. Participants from G1 (the control group) are told that they unfortunately did not receive suggestions, and are encouraged to try to improve their composition by themselves. Participants from G2 see the suggestions they received from two other participants, and can listen to each of them. If they like a suggestion they can accept it, so that it is kept and the song is automatically updated accordingly. In addition to integrating suggestions, they can freely modify their composition. Once they are finished, they answer a questionnaire about their confidence in their own improvement and their opinion of the suggestions received.

2.5 Evaluation phase

The last step of the experiment is to evaluate pairs of compositions from other participants. Each pair consists of an original song and the corresponding improved song. Participants are asked to evaluate each song by placing it on a vertical scale with a legend ranging from 0 ("I don't like it") to 100 ("I like it very much"). Participants do not know which is the original and which is the improved song when evaluating: one of the versions is presented as song A and the other as song B, and this assignment is performed randomly. Participants have to evaluate at least 5 pairs of songs in order to finish the experiment.

3 Results

In this section we describe in detail the results obtained from each phase of the experiment.

3.1 Population

The experiment was conducted between February and July 2015. 66 participants completed the experiment (68% men and 32% women). Mean age was 29.2 years, ranging from 19 to 61. Musical experience was measured through a questionnaire with 7 items. The scale has a satisfactory sensitivity, with an observed range from 7 to 41 (on a scale from 0 to 42), a mean of 28.7 and a standard deviation (SD) of 8.9. The internal consistency is satisfactory (Cronbach's alpha=.82).

Composition experience was measured through a questionnaire with 5 items. The results show an overall low level of composition experience in our sample, with a mean of 6.9 (SD=6.1) on a scale ranging from 0 to 30. The internal consistency is satisfactory (Cronbach's alpha=.85).
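For reference, the internal consistencies reported above follow the standard Cronbach's alpha formula: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). A minimal sketch of this computation is given below; the NumPy implementation and the synthetic responses are illustrative only, not the actual analysis code.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_participants x k_items) response matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Example: synthetic 7-item questionnaire answered by 66 participants.
rng = np.random.default_rng(0)
responses = rng.integers(0, 7, size=(66, 7)).astype(float)
print(cronbach_alpha(responses))
```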
3.2 Composition effects

Each participant was randomly assigned to either the control group (G1) or the experimental group (G2). No significant differences were observed between the two groups in relation to age, gender, musical experience or composition experience.

Composition evaluations

During the evaluation step, we checked whether participants had listened to the songs before evaluating them. Of the 1195 evaluations made, 219 were made without listening to the song; we removed those evaluations.

The songs were evaluated by an average of 8.8 different judges. The mean score of the evaluations made during the evaluation phase is 53.25 (SD=13.26) on a scale ranging from 0 to 100. However, judges might be more or less strict, and some songs might have been evaluated by a particularly strict or generous participant. To take the severity of the judges into account, we standardized the evaluations to z-scores, where the mean and standard deviation used are those of all the evaluations made by a given participant. As a result, the mean of the standardized scores is approximately zero, with a standard deviation of approximately .50. It should be noted that this final score correlates strongly with the raw score (r=.84), which indicates that we had enough evaluations per song to avoid any severity bias.
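This per-judge standardization can be expressed compactly. The sketch below, using pandas (an illustrative choice of tooling, not necessarily the one used in the actual analysis), z-scores each evaluation against the mean and SD of all evaluations made by the same judge, then averages the standardized scores per song to obtain its consensual quality.

```python
import pandas as pd

# Each row is one evaluation: which judge scored which song, raw 0-100 score.
evals = pd.DataFrame({
    "judge": ["j1", "j1", "j1", "j2", "j2", "j3", "j3", "j3"],
    "song":  ["s1", "s2", "s3", "s1", "s3", "s1", "s2", "s3"],
    "score": [80, 60, 70, 30, 20, 55, 45, 50],
})

# Standardize each score against the mean/SD of *that judge's* evaluations,
# which removes judge severity (a strict judge's scores are re-centered).
by_judge = evals.groupby("judge")["score"]
evals["z"] = (evals["score"] - by_judge.transform("mean")) / by_judge.transform("std")

# A song's consensual quality is the mean standardized score over its judges.
song_quality = evals.groupby("song")["z"].mean()
print(song_quality)
```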
Original composition

The questionnaire that participants completed after finishing the original composition included self-estimation questions about the quality of, the complexity of, and their satisfaction with their composition, on scales ranging from very bad/simple/unsatisfied (0) to very good/complex/satisfied (6). We also asked them to estimate the time they had spent on their composition and whether they had used an instrument to help them compose (and, if so, which one).

Results show a mean quality of 2.8 (SD=1.5), a mean complexity of 1.9 (SD=1.6) and a mean satisfaction of 3.2 (SD=1.6). Only the complexity differs significantly from the center of the scale, which is 3 (T(65)=-5.27; p<.0001). This means that participants tend to judge their work as rather simple (low complexity). We also observed positive and significant correlations between these three measures, ranging from r=.41 to r=.80.

During the suggestion step, we asked participants to also rate the quality and complexity of the songs they had to comment on. Each composition from the experimental group (G2) was commented on by two different participants, so in the end we obtained one score from the author and two scores from two different commentators. Interestingly, there was no correlation between the scores from the original composer and those from the commentators (r<.10), but the two commentators did agree with each other on the quality (r=.80) and on the complexity (r=.70).

Moreover, from the judgments made during the evaluation phase (in which participants evaluate pairs of songs from other participants), the measured quality of each original song (standardized to z-scores) allows us to estimate the composition skill of its author. Surprisingly, we observed that the quality of the original song is only marginally related to composition experience (r=.18, p=.15) or to musical experience (r=.19, p=.12).

We also asked participants whether they had used an instrument to help them compose. Results show no significant effect of instrument use on the mean quality score (T(64)=-0.87, p=.38).

The mean composition time, as estimated by the participants, is 30 minutes (SD=32 min), ranging from 1 minute to 240 minutes. Participants largely underestimate this duration: the real duration, computed from the time spent in the composition software, is significantly longer (m=67 min; T(65)=4.20, p<.001). The correlation between the two durations is not very high, but significant (r=.46, p<.001), indicating that the estimation error is not the same for everyone. Interestingly, we observed that the quality of the original songs (from the evaluation phase) is not linked to the time spent composing, whether subjective (r=.04) or objective (r=.03). This result suggests that, in a situation with no time constraint, the amount of time devoted to composing has no effect on quality.

Finally, there is a difference in the consensual quality of the original songs, obtained from the evaluations of several participants (0.07 in G1 vs. -0.15 in G2). This could be due to differences in the groups of judges evaluating each song.

Suggestions

In the questionnaire filled in after making their suggestions, participants were asked how much they thought the song they were revising would be improved by their modifications (on a 7-point Likert scale ranging from 0, "very little", to 6, "very much").

The participants from G2, the experimental group (N=30), each received two suggestions for their final composition. Once they had finished, we asked them whether the suggestions received were interesting (on the same 7-point Likert scale). Additionally, we recorded the number of suggestions they received and the number of text comments received.

We ran a series of correlations between these measures and the improvement effect (the difference between the original song and the final song on the quality judgment score). None were significant, suggesting that neither the number of suggestions received nor the number of explanations accompanying those suggestions has an impact on the improvement of a song.

Final composition

Overall, the control group, G1, does not improve significantly between the original song (m=.07) and the final song (m=.12) (improvement effect = .05, T(35)=0.94, p=.35). However, we do see a significant improvement for the experimental group, G2, between the original song (m=-.15) and the final song (m=.08) (improvement effect = .23, T(29)=2.47, p=.02). See Figure 3.

Figure 3: Difference between the original song and the final song on the quality judgment score, for the group without feedback (G1) and the group with feedback (G2).
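The group-level tests above are paired comparisons between each composer's original and final quality scores, with df = N - 1 (hence T(35) for G1, N=36, and T(29) for G2, N=30). A minimal sketch of this computation with SciPy follows; the data are synthetic stand-ins, not the experimental scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for the standardized quality scores of each composer's
# original and final song (one pair per participant in a group of 30).
original = rng.normal(loc=-0.15, scale=0.5, size=30)
final = original + rng.normal(loc=0.23, scale=0.4, size=30)

# Improvement effect: mean within-composer difference (final - original).
improvement = (final - original).mean()

# Paired t-test, as in the reported T(29) statistic for G2 (df = N - 1).
t_stat, p_value = stats.ttest_rel(final, original)
print(f"improvement={improvement:.2f}, t={t_stat:.2f}, p={p_value:.3f}")
```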
We also examined the participants' subjective evaluation of the improvement of their song. We constructed two composite scores: one from the self-evaluation scales of the original song (quality, complexity and satisfaction), and one from the self-evaluation scales of the final song. The internal consistencies of these composite scores are satisfactory (both Cronbach's alphas are above .81). We then conducted a mixed between-participants (control vs. experimental group) x within-participants (original vs. final song) analysis of variance, and observed a significant interaction between groups and songs (F(1,64)=7.07, p=.01). To explore this interaction, we used post-hoc Tukey HSD tests. Results show that participants who received suggestions improved significantly between the original and the final song (p<.001), while the control group showed no improvement (p=.49). See Figure 4.

Figure 4: Self-estimated quality of the original and final songs for the group without feedback (G1) and the group with feedback (G2).

When evaluating songs, users did not know which song was the original and which was the final one, as the order of the songs was determined randomly. This was a design decision, to avoid participants tending to rate the final song higher simply because it is supposed to be improved. Additionally, we wanted to ensure that songs were not rated higher just because they had been modified more. To check this point, we used a melodic similarity algorithm [Urbano et al., 2011] to estimate the similarity between each original and final song. The correlations between the percentage of similarity and the improvement effect, based both on the composer's subjective opinion and on the scores from the judges, are low (r=-.36, p=.003 and r=-.19, p=.13), which suggests that the improvement is not linked to the dissimilarity between the two versions.

Lead sheet editor

The software used was developed specifically for the experiment, and we asked participants whether it was frustrating (0) or helpful (6) to compose with. Results show a mean of 3.13 after the first composition and 3.41 after the final composition (the difference is not significant), which means that even if the online editor was not especially helpful, it did not hinder the composition process.

Experience effect on evaluations

To find out whether musical experience has an impact on the way participants judge songs from other participants, we divided our sample of participants into two groups according to their experience as musicians (based on the median). We also divided our sample of songs according to the musical experience of their authors. We then ran a two-way ANOVA to explore the effect of the experience of the judges according to the experience of the composer. Results show a crossed interaction between these two variables (F(1,61)=7.63, p=.007), as illustrated in Figure 5. These results indicate that experienced judges give high scores to songs by experienced authors and low scores to songs by non-experienced authors, while exactly the opposite holds for non-experienced judges. This means that participants tend to prefer compositions from other participants with similar experience. This could explain the difference in the evaluation of the original songs in G1 and G2: the groups of judges evaluating each song could have different levels of expertise.

Figure 5: Interaction between the experience of the author and the experience of the judges on the quality score.
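The median split and the cell means behind this crossed interaction can be sketched as follows; the column names and the pandas tabulation are illustrative assumptions, not the actual analysis code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# One row per evaluation: the judge's musical experience score, the author's
# musical experience score, and the standardized quality score given.
n = 200
df = pd.DataFrame({
    "judge_exp": rng.integers(7, 42, size=n),
    "author_exp": rng.integers(7, 42, size=n),
    "z_score": rng.normal(size=n),
})

# Median split: label judges and authors as low/high experience.
df["judge_group"] = np.where(df["judge_exp"] >= df["judge_exp"].median(), "high", "low")
df["author_group"] = np.where(df["author_exp"] >= df["author_exp"].median(), "high", "low")

# Cell means of the 2x2 design; a crossed interaction shows up as high scores
# on the diagonal (similar experience) and low scores off the diagonal.
print(df.pivot_table(values="z_score", index="judge_group", columns="author_group"))
```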
4 Conclusion

The aim of this experiment was primarily to examine quantitatively the impact of peer feedback in music composition, and secondly to assess how important the participants' experience as musicians or composers is in the whole process. Before any improvement or suggestions, participants had to write their first song. Interestingly, the results show that participants' previous experience in composition did not impact the quality of their song, and the same pattern was found for their previous experience as musicians. These two results suggest that the quality of a song (based on social consensus) does not really tap into musicality but into something else, presumably creativity; as noted above, creativity might play an important role [Frese et al., 1999].

Results show that composers who received feedback (G2) clearly rated the improved song higher than the original, meaning that they were satisfied with the improvement they had made. Further, the evaluation based on social consensus also showed a larger improvement for G2. Hence, participants who received feedback not only felt that they had composed a better song after the improvement step: they actually did. This basic finding suggests that improvements in music may be achieved even without real collaboration involving dialogue and active interaction, but through simple suggestions made on a single occasion.

Since there is a difference in the evaluation of the original songs between G1 and G2, we wanted to verify whether experience can make a difference when evaluating songs, and we found that participants tend to prefer songs composed by other participants with similar musical experience.

Future work may go deeper into determining the influence of the participants' experience, for example by checking when songs are improved the most, taking into account the experience of composers, commentators and judges. Further, we could assess more precisely which suggestions were actually used (or accepted) by the original composer, so as to rank commentators by how often their suggestions are accepted, as a measure of how good they are as commentators. We could also check whether suggestions from experienced commentators are more likely to be used by inexperienced composers, or whether experienced composers usually accept suggestions from other composers, and how this affects the improvement of the song.

Acknowledgments

This work is supported by the Praise project (EU FP7 number 388770), a collaborative project funded by the European Commission under programme FP7-ICT-2011-8.

References

[Donin, forthcoming 2016] Nicolas Donin. Domesticating gesture: the collaborative creative process of Florence Baschet's StreicherKreis for 'augmented' string quartet (2006-2008). In Eric Clarke and Mark Doffman, editors, Creativity, Improvisation and Collaboration: Perspectives on the Performance of Contemporary Music. Oxford University Press, New York, forthcoming 2016.

[Frese et al., 1999] Michael Frese, Eric Teng, and Cees J. D. Wijnen. Helping to improve suggestion systems: Predictors of making suggestions in companies. Journal of Organizational Behavior, 20(7):1139-1155, 1999.

[Martín et al., 2015] Daniel Martín, Timotée Neullas, and François Pachet. LeadsheetJS: A JavaScript library for online lead sheet editing. In First International Conference on Technologies for Music Notation and Representation (TENOR), Paris, France, 2015.

[Rollinson, 2005] Paul Rollinson. Using peer feedback in the ESL writing class. ELT Journal, 59(1):23-30, 2005.

[Sadler and Good, 2006] Philip M. Sadler and Eddie Good. The impact of self- and peer-grading on student learning. Educational Assessment, 11(1):1-31, 2006.

[September, 2013] On September. Behind the scenes with MOOCs: Berklee College of Music's experience developing, running, and evaluating. Continuing Higher Education Review, 77:137, 2013.

[Settles and Dow, 2013] Burr Settles and Steven Dow. Let's get together: the formation and success of online creative collaborations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2009-2018. ACM, 2013.

[Urbano et al., 2011] Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado. Melodic similarity through shape similarity. In Exploring Music Contents, pages 338-355. Springer, 2011.

[Van den Berg et al., 2006] Ineke Van den Berg, Wilfried Admiraal, and Albert Pilot. Designing student peer assessment in higher education: Analysis of written and oral peer feedback. Teaching in Higher Education, 11(2):135-147, 2006.