Quality Agreement on Learnersourced Multiple
Choice Questions⋆
Richard Glassey1,∗,†, Olle Bälter1,†
1 KTH Royal Institute of Technology, Stockholm, Sweden


Abstract
Learnersourcing presents an efficient and economical pathway to producing more learning content,
whilst engaging students more actively and deeply in their learning. However, it also presents new
challenges, most of all how best to manage the variance in the quality of the content produced. In
this work, we focus on the extent to which students and teachers agree on the quality of
learnersourced multiple choice questions. Students (n=30) were tasked with producing six questions
over three weeks of an introductory programming course as part of their assessment. They also had
to review 12 questions authored by their peers over the same period, using a set of principles for
good questions. After this period, four teaching staff involved with the course reviewed the
student questions using the same process and principles. Inter-rater reliability statistics showed
overall positive agreement across principles; however, agreement was weaker for principles aimed
at the more subjective, higher-order concerns of question quality and the quality of question
feedback.

Keywords
Learnersourcing, PeerWise, Inter-rater Reliability




1. Introduction
Learnersourcing can be succinctly defined as crowdsourcing content from students in a learning
context [1]. Different types of learnersourced content include: content annotation, resource
recommendation, explanation of misconceptions, content creation, and aspects of evaluation,
reflection and regulation [2]. In all cases, students are active in adding value to content and
creating new content for the consumption of other students.
   Creating learning content is a higher-order activity for students [3]. However, as with any
student activity, there will be a spectrum of engagement and quality. This creates a management
challenge: ideally, students are exposed to the highest quality content, whilst lower quality
content is filtered out of the system. Both PeerWise [4] and RiPPLE [5] have adopted content
quality management strategies as platform features. However, one question that bubbles up is:
do students and teachers agree on what quality is, given its subjective nature?
   Here, we approach this question by studying the agreement gaps that emerge when students
and teachers are asked to give their opinions about learnersourced multiple choice questions (MCQs).

Woodstock’22: Symposium on the irreproducible science, June 07–11, 2022, Woodstock, NY
∗ Corresponding author.
† These authors contributed equally.
glassey@kth.se (R. Glassey); ob1@kth.se (O. Bälter)
ORCID: 0000-0002-8996-0221 (R. Glassey); 0000-0001-5626-1187 (O. Bälter)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
To give structure to these opinions, we use a set of principles for producing better MCQs, which
were presented during the production and review of questions [6]. Finally, we calculate inter-rater
reliability statistics [7] to uncover differences both within the student and teacher reviews and
between them.


2. Context of Study
In response to the 2015 European Refugee Crisis, academics at KTH Royal Institute of Technology,
Stockholm, Sweden, developed an intensive three-month training programme to integrate newly arrived
people into the local IT workforce [8]. As this training was not attached to traditional course delivery,
there was much more freedom to innovate and try novel pedagogical interventions [9]. In the
first three weeks, when students were covering introductory programming, assessment was
achieved by having students create multiple choice questions on the topics they were learning.
The main motivation was to quickly generate a lot of MCQs that the students could then answer
to increase their opportunities to practice. This was achieved: it was found that 50% of students
answered 100 questions or more, without any demand from teachers to do so [6].
   Early iterations of this training found that, whilst students could produce good MCQs, there
was a cold-start problem in MCQ quality: students only gradually got better over time. Furthermore,
students were good at writing questions and answer alternatives; however, when it came to the
explanation or feedback that accompanied the MCQ, students struggled to provide similar quality [6].
In response, we developed a set of 12 principles for writing good MCQs, reflecting the patterns of
quality issues we detected when reviewing student MCQs [6], rather than drawing on the more
established guides developed for academics [10]. The principles are listed in Table 1.
   In the most recent iteration, to help students understand and apply the principles, students
first worked in groups to learn about writing good questions before they undertook the task
individually. The principles were used to create a question and then to review a question created
by another group. In this way, students had a chance to discuss their interpretation of the
principles and to apply them both in creating and in reviewing MCQs. For each week of the
three-week course, students were tasked with creating two questions and reviewing four questions.
Once the course had ended, four teachers independently reviewed 96 questions using the same
principles and review process as the students. For each question, there were three student reviews
and four teacher reviews.


3. Findings and Discussion
Table 1 shows the inter-rater reliability (IRR) results for both the student and the teacher
reviewers, broken down by principle and sorted by the difference between students and teachers;
the final column shows the combined IRR for students and teachers together. IRR statistics
quantify the level of agreement between independent reviewers assessing the same items [7]; in
our case, students and teachers rating MCQs according to the principles. As there were more than
two reviewers, Gwet's AC1/AC2 agreement coefficient for multiple raters was selected to
calculate IRR [7].
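   As a point of reference, the unweighted multi-rater AC1 can be sketched as follows (our
paraphrase of Gwet [7]; the symbols n, q, r_i and r_ik are notation introduced here for
illustration, not taken from the study). For n questions rated into q categories, let r_i be the
number of raters who rated question i and r_ik the number of them who selected category k. Then

    p_a = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{q} \frac{r_{ik}(r_{ik}-1)}{r_i(r_i-1)}, \qquad
    \pi_k = \frac{1}{n} \sum_{i=1}^{n} \frac{r_{ik}}{r_i},

    p_e = \frac{1}{q-1} \sum_{k=1}^{q} \pi_k (1-\pi_k), \qquad
    \mathrm{AC}_1 = \frac{p_a - p_e}{1 - p_e},

so AC1 is the observed agreement p_a corrected for the chance agreement p_e estimated from the
category prevalences \pi_k; AC2 is the weighted variant of the coefficient, used when the rating
categories are ordered.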
Table 1
Principles of good multiple choice questions, ranked by the difference between student and teacher
inter-rater reliability (IRR). Student and teacher reviewers are shown separately, then their
difference, and finally their combined IRR. Green highlighting indicates strong agreement, grey
indicates moderate agreement, and red indicates weak agreement.
Principles of Good Multiple Choice Questions | Student IRR | Teacher IRR | Diff | All IRR
Only the feedback to the correct alternative reveals the answer | 0.20 | 0.73 | 0.53 | 0.27
Question is aiming at higher order thinking | 0.21 | 0.61 | 0.40 | 0.20
Feedback is unique and provided for each answer alternative | 0.60 | 0.97 | 0.37 | 0.47
Question is formulated to ease readability | 0.72 | 0.95 | 0.23 | 0.81
Question targets a misconception | 0.57 | 0.69 | 0.12 | 0.52
All answer alternatives are formulated to ease readability | 0.81 | 0.93 | 0.12 | 0.86
All answer alternatives are plausible and related to a misconception | 0.59 | 0.69 | 0.09 | 0.46
Feedback is sufficient to understand why each alternative was incorrect or correct | 0.48 | 0.54 | 0.06 | 0.25
Question is reasonable to solve without external systems | 0.88 | 0.95 | 0.06 | 0.91
All feedback is correct | 0.78 | 0.83 | 0.05 | 0.74
Question is from the course domain | 0.94 | 0.98 | 0.04 | 0.94
Three or more answer alternatives are provided | 0.97 | 1.00 | 0.03 | 0.97
Mean (x̄) | 0.65 | 0.82 | – | 0.62


   First, taking the student and teacher IRR scores separately, teachers had a higher average
agreement over all principles (0.82) than students (0.65). Students also had a wider spread of
scores (from 0.20 to 0.97) than teachers (from 0.54 to 1.00). These findings can be partly
explained by the teachers having both generated the principles and discussed their possible
interpretations. Students, on the other hand, had far fewer opportunities to discuss the
interpretation of the principles, other than within the scheduled group creation and review
sessions, which occurred only twice in the three-week course.
   Second, looking at the largest differences in IRR scores by principle, “Only the feedback to
the correct alternative reveals the answer” showed the largest gap, at 0.53. This suggests a
difference in attitude, with teachers seeing more value in MCQ feedback that helps correct a
student's misconception without taking away the chance of a second attempt. Another large
difference relating to feedback (0.37), for “Feedback is unique and provided for each answer
alternative”, continues this theme, with students not showing similarly strong agreement. This is
interesting as it is an objective measure: each answer alternative should have its own unique
feedback, and a cursory glance is enough to confirm this. It is much like the smallest difference
(0.03), for “Three or more answer alternatives are provided”, where it is easy to count the number
of answer alternatives. One limitation of PeerWise is that there is only a single text area for
‘explanation’, and this may have contributed to this anomaly.
   Third, looking at the smallest differences among principles that are not clearly objective or
easy to determine (as “Question is from the course domain” is), the two principles “All answer
alternatives are plausible and related to a misconception” and “Feedback is sufficient to
understand why each alternative was incorrect or correct” are interesting, as these are quite
challenging aspects of quality to agree upon, even for teachers who understand the intent of each
principle. This suggests a limit to the depth of quality one can hope to measure when rating
learnersourced content; however, the IRR results here are still promising, with moderate agreement
between students (0.59 and 0.48) and between teachers (0.69 and 0.54).
   Finally, when the student and teacher reviews are combined, creating a pool of seven reviewers
per question, the highest agreement is found for perhaps the most objectively answerable
principles. This is not surprising, and it acts as a useful control for agreement on the basics of
MCQs. Of more concern is that, of the principles with weak agreement, two concern feedback
(“Feedback is sufficient to understand why each alternative was incorrect or correct” and “Only
the feedback to the correct alternative reveals the answer”) and the third concerns whether the
question targets higher-order thinking. This represents a challenge that warrants deeper
investigation: the value of feedback is well known and accepted in effective education, but what
if our different points of view on it never actually meet, and the feedback that teachers feel is
sufficient is not exactly (or even close to) what students need?


4. Conclusion
Learnersourcing generates more content, but comes with the challenge of how to determine its
quality. Part of solving that challenge is finding out where students and teachers agree (or
disagree) about quality. Agreement statistics, such as inter-rater reliability, are potentially
useful metrics, but, as noted here, caution is advised, as there are many to choose from and not
all behave as expected. The work presented here shows that there are areas where we can find
agreement; however, we need better ways to solicit impressions of quality at deeper levels, and
then ways to integrate and automate them within learnersourcing platforms.
References
 [1] J. Kim, Learnersourcing: improving learning with collective learner activity, Ph.D. thesis,
     Massachusetts Institute of Technology, 2015.
 [2] Y. Jiang, D. Schlagwein, B. Benatallah, A review on crowdsourcing for education: State of
     the art of literature and practice, PACIS (2018) 180.
 [3] S. Abdi, H. Khosravi, S. Sadiq, G. Demartini, Evaluating the quality of learning resources:
     A learnersourcing approach, IEEE Transactions on Learning Technologies 14 (2021) 81–92.
 [4] P. Denny, A. Luxton-Reilly, J. Hamer, The PeerWise system of student-contributed
     assessment questions, in: Proceedings of the Tenth Conference on Australasian Computing
     Education - Volume 78, 2008, pp. 69–74.
 [5] H. Khosravi, G. Gyamfi, B. E. Hanna, J. Lodge, Fostering and supporting empirical research
     on evaluative judgement via a crowdsourced adaptive learning system, in: Proceedings of
     the Tenth International Conference on Learning Analytics & Knowledge, 2020, pp. 83–88.
 [6] R. Glassey, O. Bälter, Put the students to work: Generating questions with constructive
     feedback, in: 2020 IEEE Frontiers in Education Conference (FIE), IEEE, 2020, pp. 1–8.
 [7] K. L. Gwet, Computing inter-rater reliability and its variance in the presence of high
     agreement, British Journal of Mathematical and Statistical Psychology 61 (2008) 29–48.
 [8] M. Wiggberg, E. Gobena, M. Kaulio, R. Glassey, O. Bälter, D. Hussain, R. Guanciale,
     P. Haller, Effective reskilling of foreign-born people at universities-the software develop-
     ment academy, IEEE Access 10 (2022) 24556–24565.
 [9] R. Glassey, O. Bälter, Sustainable approaches for accelerated learning, Sustainability 13
     (2021) 11994.
[10] T. M. Haladyna, S. M. Downing, M. C. Rodriguez, A review of multiple-choice item-writing
     guidelines for classroom assessment, Applied Measurement in Education 15 (2002) 309–333.