Assessment analytics for peer-assessment: a model and implementation

Blaženka Divjak (bdivjak@foi.hr), Darko Grabar (darko.grabar@foi.hr), Marcel Maretić (mmaretic@foi.hr)
Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 42000 Varaždin, Croatia

ABSTRACT
Learning analytics should go beyond data analysis and include approaches and algorithms that are meaningful for learner performance, that can be interpreted by the teacher, and that can be related to learning outcomes. Assessment analytics has been lagging behind other research in learning analytics. This holds true especially for peer-assessment analytics.

In this paper we present a mathematical model for peer-assessment based on the use of scoring rubrics for criteria-based assessment. We propose methods for the calculation of the final grade along with reliability measures of peer-assessment. Modeling is motivated and driven by the identified peer-assessment scenarios.

Use of peer-assessment based on a sound model provides the benefits of deeper learning while addressing the issues of validity and reliability.

Categories and Subject Descriptors
D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures; H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Design, Measurement, Reliability

Keywords
peer-assessment, assessment, analytic tools for learners, assessment learning analytics

1. BACKGROUND ON ASSESSMENT LEARNING ANALYTICS
Learning analytics (LA) is all about the usefulness of data once they have been collected and analyzed [6]. Research in LA is interdisciplinary, and it must be emphasized that LA includes aspects of human judgment and goes beyond data analysis (business analytics): it has to make sense of information, come to decisions and take action based on data [13]. This is the leitmotiv of the research presented in this paper.

LA has to be useful to the vast majority of students. The so-called average student has to be taken into account when setting the goals of LA, not only the under-performing or over-performing students. Teaching practice shows that a meaningful analysis of assessment results is of interest to all students.

Assessment is both ubiquitous and very meaningful as far as students and teachers are concerned (Ellis in [6]). It is an essential part of the teaching and learning process, especially in formal education, because assessment guides learning for the vast majority of students. Ellis at the same time claims that assessment analytics is lagging behind other types of learning analytics. There are several reasons for this. Among them, we argue that insufficient granularity of assessment data presents a difficulty for the interpretation of results.

So-called networked learning (see [12]; e.g. Massive Open Online Courses (MOOCs), social learning platforms, online learning and e-learning in general) presents a completely new playground for learning analytics. In networked learning the number of participants rapidly increases, along with the interactions between learners in the form of discussions and mutual learning. We focus here on a special type of assessment: peer-assessment. Use of peer-assessment and self-assessment is appealing and very appropriate for a task leading to a certificate in a MOOC with enrollment measured in tens of thousands.
This approach generates a huge amount of assessment data, but it also calls for sound metrics for the calculation of the final grade and for estimates of the reliability of the assessment data. Peer-assessment has additional benefits in the learning process, but also additional disadvantages (cf. [4]). Among the disadvantages are issues of reliability and validity of assessment.

To address validity, we advise the use of scoring rubrics, as they contribute to the quality of assessments by facilitating valid judgments of complex competencies [10]. Based on the analysis of 75 studies, Jonsson and Svingby [10] conclude that the use of scoring rubrics enhances the reliability of assessments, especially if the rubrics are analytic, topic-specific, and complemented with examples and/or rater training. Otherwise, scoring rubrics do not facilitate valid judgment of performance assessments. Besides this, rubrics have the potential to promote learning and/or improve instruction.

The aim of this paper is to model peer-assessment and to discuss issues of final grade calculation and reliability of raters' judgments. Jonsson and Svingby note that variations in raters' judgments can occur either across raters, known as inter-rater reliability, or in the consistency of a single rater, called intra-rater reliability. Referring to [1], Jonsson and Svingby state that a major threat to reliability is the lack of consistency of an individual grader. Reports rarely mention this measure. On the other hand, inter-rater reliability is in some form mentioned in more than half of the reports, but many of these simply use percentage as a measure of agreement. This is in line with Sadler and Good's critique [14] of the poor quality of quantitative research regarding self-assessment. The situation has improved since. Nevertheless, the majority of current research still uses overly simple statistical measures in order to determine correlations that might indicate reliability.

In the following sections we describe two major peer-assessment scenarios we have recognized and for which we have developed a mathematical model. After that we present and analyze a model for these scenarios.

2. SCENARIOS FOR PEER-ASSESSMENT
Reliability of peer-assessment depends on many factors, but the consistency of the individual evaluator was recognized very early as the most important (see [1]). On the other hand, having more assessments per assignment increases the reliability of peer-assessment with relatively inexperienced evaluators. From experienced evaluators (experts) we presume high expertise in the domain knowledge and prior experience in evaluation. Similarly, an inexperienced evaluator is an individual with a relatively high level of domain knowledge (a high baseline) but lacking experience in evaluation (e.g. peer assessment by senior undergraduates).

We analyze scenarios with respect to the experience of the evaluator, as shown in the scenario grid (Fig. 1). We have placed a continuum of possible scenarios in a grid with four quadrants. Within the four quadrants we recognize two interesting scenarios for peer-assessment and discard the other two as either unrealistic or inappropriate.

Figure 1: Scenario grid (horizontal axis: size of the peer-assessment group, small to large; vertical axis: evaluator experience, inexperienced to experienced; Scenario A: inexperienced evaluators with a large group, Scenario B: experienced evaluators with a small group; the remaining two quadrants are marked as not appropriate)

In the first scenario, let us call it Scenario A, participants are inexperienced evaluators (for example, undergraduate students with introductory domain knowledge and no experience in peer-assessment), whereas in Scenario B evaluators have higher expertise in the evaluated domain (i.e. teachers, graduate students or senior undergraduates) and prior training in assessment. In Scenario A, the lack of experience in evaluation must be compensated with a quantity of peer-assessments, i.e. a larger group size in peer-assessment. On the other hand, setting the group size too large in Scenario B is a needless waste of experts' time.

A detailed analysis is given in Table 1.
Table 1: Scenario table

Playground / use cases
  Scenario A: Networked learning (MOOCs, online learning and e-learning in general, see [12]); voting for awards where the general audience is involved.
  Scenario B: Multiple graduates/postgraduates assess complex student work [3]; peer assessment of research papers; evaluation of competitive research projects.

Evaluators' characteristics
  Scenario A: A considerable number of relatively inexperienced evaluators in the area they assess.
  Scenario B: A few experienced evaluators who are experts in the area they assess.

Resources to rely on
  Scenario A: Inexpensive evaluator workload in almost unlimited quantities.
  Scenario B: Expertise of evaluators and their judgment that can be trusted.

Reliability threat
  Scenario A: Intra-rater and inter-rater inconsistency.
  Scenario B: Experts do not have equal expertise in all evaluation criteria.

Strategy to increase reliability
  Scenario A: Quantity of assessments that might be convergent (statistically speaking).
  Scenario B: Quality of a small number of assessments, without outliers.

3. OVERVIEW OF THE PEER-ASSESSMENT ACTIVITY
The peer-assessment activity starts after the work on the assignment task has been completed. In the general case peer-assessment consists of two phases. We identify the following activities in the whole process.

Phase 1: Assessment of assignments
  i. Learners assess a (predefined) number of assigned assignments
  ii. Analysis of peer-assessments (grouped by assignment)
  iii. Calculation of the assignment grade

Phase 2: Assessment of the assessments
  i. Analysis of peer-assessments (grouped by grader)
  ii. Calculation of the assessment grade

The first phase starts with learners assessing the assignment work of their peers. We assume that each participant grades several assignments (at least 2). At the end of the first phase a reliability check has to be performed and the final grade has to be calculated. The second phase is concerned with the quality of assessments relative to the evaluator. As an outcome of the second phase, graders can receive a grade (points) for the quality of their assessments.

4. MATHEMATICAL MODEL FOR PEER-ASSESSMENT
We recognize three challenges: (1) calculation of the final grade based on different assessment scenarios, (2) measurement of the assessment's reliability, and (3) measurement of the reliability of each grader (for grading of the graders).

4.1 Overview of the assignment grading
A grading G from a scoring rubric with n criteria is a tuple of numbers G = (g_1, \ldots, g_n). We consider gradings as points in an n-dimensional space endowed with a metric d, i.e. a function that measures the distance between points (i.e. gradings) and satisfies the axioms of a metric space. In [5] we proposed the use of the non-Euclidean taxicab metric d_1, but for the purpose of this paper it is sufficient to think of d as any distance metric. Please consult [5] for additional details.

4.2 Calculation of the assignment's final grade
An assignment graded through peer-assessment will receive several peer gradings. These have to be analyzed, and if estimated as reliable they are used as input for the calculation of the final grade.

The simplest approach is to calculate the final grade of an assignment as the mean value of the received assessments. Let S = \{S_{k,1}, \ldots, S_{k,m}\} denote a set of peer gradings for assignment k; then the mean grade is

  M(S) = (a_1, \ldots, a_n), \quad a_i = \frac{1}{m} \sum_{j=1}^{m} c_{k,i}^{(j)},

where c_{k,i}^{(j)} denotes the score given to criterion i of assignment k by peer j. M(S) is the center of mass of the set S. This method of grade calculation is suitable for Scenario A. We can say that M(S) is sensitive to quantity and less sensitive to outliers (it "respects the decision of the majority").

For Scenario B, we propose an alternative grade calculation method (see [5]). In Scenario B we assume that peers are experienced evaluators. The final grade is calculated as the so-called optimal final grade O(S), defined by

  O(S) = (of_1, \ldots, of_n), \quad of_i = \frac{1}{2}(w_i + b_i), \quad \text{i.e. } O(S) = \frac{1}{2}\bigl(W(S) + B(S)\bigr),

where W(S) and B(S) represent amalgamations of the worst and best received gradings, respectively, defined by

  W(S) = (w_1, \ldots, w_n), \quad w_i = \min_j c_{k,i}^{(j)},
  B(S) = (b_1, \ldots, b_n), \quad b_i = \max_j c_{k,i}^{(j)}.

This approach is inspired by Hwang and Yoon's TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution) method of multi-criteria decision making [9]. When evaluators are trusted experts, we do not expect "wild" gradings (outliers). Here, it is expected that after just a few initial evaluations any additional gradings will have little effect on the final grade O(S).

A summary of our recommendations for the two scenarios A and B is given in Table 2.

Table 2: Grading method recommendations
  Scenario A: Mean value grading. Reliability provided by the quantity of evaluations.
  Scenario B: Optimal value grading. Reliability provided by the quality of evaluators.

With optimal value grading we have the opportunity to allow experts to skip grading for certain criteria. For example, this would be reasonable if an expert is not an expert for all the criteria. To be able to calculate O(S) it is sufficient to have every criterion covered by at least one expert.
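For illustration only, the following is a minimal Python sketch (ours, not part of the original model description) of the two grade calculations, together with the taxicab metric d_1 suggested in [5]; all names are our own, and the handling of skipped criteria mentioned above is omitted.

```python
from typing import List, Sequence

Grading = Sequence[float]  # one peer grading: a score per rubric criterion


def taxicab(g1: Grading, g2: Grading) -> float:
    """Taxicab (L1) distance d1 between two gradings."""
    return sum(abs(a - b) for a, b in zip(g1, g2))


def mean_grade(gradings: List[Grading]) -> List[float]:
    """M(S): component-wise mean of the received gradings (Scenario A)."""
    m, n = len(gradings), len(gradings[0])
    return [sum(g[i] for g in gradings) / m for i in range(n)]


def optimal_grade(gradings: List[Grading]) -> List[float]:
    """O(S): midpoint of the component-wise worst and best gradings (Scenario B)."""
    n = len(gradings[0])
    worst = [min(g[i] for g in gradings) for i in range(n)]  # W(S)
    best = [max(g[i] for g in gradings) for i in range(n)]   # B(S)
    return [(w + b) / 2 for w, b in zip(worst, best)]


# Three peer gradings of one assignment on a four-criterion rubric (made-up data)
S = [(3, 0, 2, 2), (2, 1, 2, 3), (3, 1, 3, 2)]
print(taxicab(S[0], S[1]))   # 3.0
print(mean_grade(S))         # approx. [2.67, 0.67, 2.33, 2.33]
print(optimal_grade(S))      # [2.5, 0.5, 2.5, 2.5]
```

On this made-up data the two methods differ only slightly; the difference grows when a single outlier grading is present, which is exactly the situation the reliability check below is meant to detect.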
4.3 Reliability of the peer-assessment
A prerequisite for the calculation of the assignment's final grade is determining whether a received set of peer-assessments is (sufficiently) reliable, i.e. acceptable.

For reasoning about reliability it is necessary to have granular data. The importance of granular scoring data is illustrated by the example in Table 3. Gradings S_1 and S_2 agree on the summative level, but are very distinct at the granular level. This is an example of an unreliable peer-grading set whose incoherence is not visible on the summative level.

Table 3: Highlighting the importance of granular data

        C1   C2   C3   C4   Σ
  S_1    3    0    2    2    7
  S_2    0    1    3    3    7

(The per-criterion scores C1–C4 are the granular data; the Σ column is the summative total.)

The diameter of a set of gradings S = \{S_1, \ldots, S_n\} is defined as

  \operatorname{diam} S = \max_{i,j} d(S_i, S_j).

We consider a set S of peer gradings reliable if diam S (the maximal pairwise distance between gradings) is less than 2e, where e is an acceptable error given in advance. Note that the diameter of the set S is also the diameter of an encompassing sphere, so we can say that a reliable peer-grading set fits within an encompassing e-sphere.

If a set of peer-assessments is estimated as not acceptable (unreliable) on the granular level, then the final grade cannot be calculated. A recommendation about the acceptability of a particular peer-assessment set can be given to the teacher or course designer by LA. This can be implemented in the learning management system (LMS, for example Moodle). Practical issues are discussed in Section 5.

4.4 Grading process
An assessment set can turn out to be unacceptable because of a single outlier grading. As an attempt to eliminate the outlier grading, we propose to search for a maximal acceptable subset of the received peer-assessments. If such a subset can be found, it is then used as input for the final grade calculation. As a measure of last resort, a supervisor's intervention is requested. In a course with a large student enrollment (thousands for a MOOC) this will be avoided as much as possible. However, if present, the instructor's assessment becomes the final grade (no need for calculation). This is described in Algorithm 1.

Algorithm 1: Semi-autonomous Grading Process
  input:  set of gradings S = {S^(1), ..., S^(m)},
          acceptable error e >= 0,
          grading calculation method g,
          critical size N (e.g. N = 3)
  output: final grade, or an indication that the gradings S are invalid
  1  find a maximal S' ⊆ S with acceptable error
  2  if #(S') >= N then
  3      find S'' of size #(S'') = #(S') with minimal diameter
  4      return g(S'') as a proper grade for assignment k
  5  else ask for teacher intervention (grading)
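As a sketch of how Algorithm 1 could be implemented, the Python code below (our illustration, with names of our own choosing) reads "maximal subset with acceptable error" as the largest subset whose diameter is below 2e and, among subsets of that size, picks one of minimal diameter; the brute-force search over subsets is adequate for the small per-assignment grading sets assumed here, and is not claimed to be the authors' implementation. Returning None corresponds to line 5 (ask for teacher intervention).

```python
from itertools import combinations
from typing import Callable, List, Optional, Sequence

Grading = Sequence[float]
Metric = Callable[[Grading, Grading], float]


def diameter(gradings: List[Grading], d: Metric) -> float:
    """diam S: maximal pairwise distance between gradings in S."""
    if len(gradings) < 2:
        return 0.0
    return max(d(a, b) for a, b in combinations(gradings, 2))


def semi_autonomous_grade(S: List[Grading], d: Metric, e: float,
                          g: Callable[[List[Grading]], List[float]],
                          N: int = 3) -> Optional[List[float]]:
    """Algorithm 1: grade from a reliable subset of S, or None if the teacher must step in."""
    for size in range(len(S), 0, -1):
        # All subsets of this size whose diameter stays within the acceptable error.
        acceptable = [list(sub) for sub in combinations(S, size)
                      if diameter(list(sub), d) < 2 * e]
        if acceptable:
            # A maximal acceptable subset exists at this size; take one of minimal diameter.
            best = min(acceptable, key=lambda sub: diameter(sub, d))
            return g(best) if size >= N else None
    return None
```

Combined with the previous sketch, a call such as semi_autonomous_grade(S, taxicab, e=2.0, g=mean_grade) either returns a final grade computed from a reliable subset or signals that the assignment needs the teacher's own grading.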
4.5 Normalization
The metric d can be linearly scaled to obtain a normalized metric d' with values within the interval [0, 1]. A distance of d' = 1 then corresponds to the maximal distance between the worst and the best possible gradings. This would facilitate general recommendations for setting the acceptable error e on a normalized scale (setting e' = 0.2, for example). Additionally, it could facilitate comparison of data from different tasks (within a course, or from different courses).

4.6 Evaluation of peer-assessments (awarding the graders)
The goal of the second phase of the peer-assessment process is to reward the graders for their effort. Graders (peers) who have graded consistently and accurately (near the final grade) should be rewarded more than inconsistent and inaccurate graders.

Let us assume that a maximum of A points is awarded for the peer-assessment task. Then grader k can be awarded A_i points for each of the m gradings G_i that he or she was assigned, where A_i is calculated by the following formula:

  A_i(d_i) := \begin{cases} \dfrac{A}{m\,e}\,(e - d_i), & d_i < e, \\[4pt] 0, & d_i \ge e, \end{cases}

where d_i = d(G_i, F) is the distance of the grading G_i from the final grade F of the corresponding assignment.

This has the effect that 0 points are awarded for a grading outside of the e-sphere around the final grade F, while for a grading within this e-sphere A_i is proportional to (e - d_i).

Finally, grader k is awarded a total of A(k) points for the gradings G_1, \ldots, G_m, where A(k) is calculated as the sum of the A_i(d_i).

Figure 2: Points awarded to a grader for grading G_i (A_i plotted against the distance d_i from F_i: A/m at d_i = 0, decreasing linearly to 0 at d_i = e, and 0 beyond e)
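The award formula translates directly into code. The short sketch below (again ours, with hypothetical names) assumes the final grade F of each assessed assignment has already been computed by one of the methods of Section 4.2.

```python
from typing import Callable, List, Sequence, Tuple

Grading = Sequence[float]
Metric = Callable[[Grading, Grading], float]


def award_for_grading(d_i: float, A: float, m: int, e: float) -> float:
    """A_i(d_i): points for one grading at distance d_i from the final grade F."""
    if d_i >= e:
        return 0.0                       # outside the e-sphere around F: no points
    return (A / (m * e)) * (e - d_i)     # linear: A/m at d_i = 0, down to 0 at d_i = e


def total_award(graded: List[Tuple[Grading, Grading]], d: Metric,
                A: float, e: float) -> float:
    """A(k): total points for grader k over pairs (grading G_i, final grade F_i)."""
    m = len(graded)
    return sum(award_for_grading(d(G_i, F_i), A, m, e) for G_i, F_i in graded)
```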
5. IMPLEMENTATION
Support for peer-assessment LA is lacking in assessment analytics in general. We analyze the current implementation in the Moodle LMS, where the peer-assessment activity is implemented with the Workshop plugin.

In a Workshop activity, students receive a grade for their work and another grade for the quality of their assessments of other students' assignments. Each participant in a Workshop thus gets a grade for their submission and a grade for their assessments. These grades are visible as separate grade items in the student's gradebook.

The current implementation of Workshop calculates the assignment grade as a weighted mean of the received assessment gradings. Received gradings are not analyzed for reliability. If the teacher wishes to override or influence the calculated assignment grade, he or she can (a) additionally provide his or her own assessment and set its corresponding weight to a higher value, or (b) completely override the final grade. As we have argued here and in [5], we find this method inadequate. Therefore, we have proposed alternative methods for the calculation of the final grade.

Assessment grade calculation is more complex. The goal is to estimate the quality of each assessment. One assessment is singled out as the best one – it is the assessment closest to the mean value of all assessments. This selected assessment is assigned the highest grade. Other assessments receive grades based on their distance from the selected assessment. The teacher can influence this process by setting the parameter which determines how quickly a grade should decrease relative to the distance.

We are currently developing a new Moodle plugin for peer-assessment. This plugin will address the identified problems of the current implementation according to our model.

6. CONCLUSION. FURTHER RESEARCH
Peer-assessment has many advantages for students (for example, the development of metacognitive skills) and for teachers (for example, it saves the teacher's time), but there are several challenges related to its implementation, such as calculation of the final grade, checking reliability, and rewarding an evaluator for peer-assessment.

In this paper we propose new methods for the calculation of grades in peer-assessment. We propose a measure of reliability and a method for grading peer-evaluations in a peer-assessment exercise. These metrics are based on the analysis of two distinguished scenarios, which takes into account the number of possible evaluators and evaluator expertise (domain knowledge and evaluation skills). We pursue an approach that models assessment LA with a geometric model.

In [4] we analyzed a case study based on the master-level Project Management course at the University of Zagreb. Our analysis confirmed the need for a deeper analysis of reliability in peer-assessment. Further exploration of data related to peer-assessment learning analytics in MOOCs is expected. Having additional data should result in improvement of the model and in recommendations on the applicability of scenarios, on parameters, and on the analysis of the acceptable error of the assessment set.

Also, we intend to implement our model (algorithms and the supporting recommendation system) as a peer-assessment plug-in for the Moodle LMS.

Finally, we conclude that well-founded mathematical modeling, based on more than just descriptive statistics, should be used more often in learning analytics.
7. REFERENCES
[1] Brown, G., Bull, J., Pendlebury, M., "Assessing Student Learning in Higher Education", Psychology Press, 1997.
[2] Divjak, B., "Implementation of Learning Outcomes in Mathematics for Non-Mathematics Major by Using E-Learning", in Teaching Mathematics Online: Emergent Technologies and Methodologies, A. A. Juan, M. A. Huertas, S. Trenholm, and C. Steegmann, Eds., IGI Global, 2012, pp. 119–140.
[3] Divjak, B., "Assessment of Complex, Non-Structured Mathematical Problems", in IMA International Conference on Barriers and Enablers to Learning Maths, 2015.
[4] Divjak, B., Maretić, M., "Learning Analytics for e-Assessment: The State of the Art and One Case Study", CECIIS, 2015.
[5] Divjak, B., Maretić, M., "Geometry for Learning Analytics", KoG (Scientific and Professional Journal of the Croatian Society for Constructive Geometry and Computer Graphics), vol. 19, 2015.
[6] Ellis, C., "Broadening the scope and increasing the usefulness of learning analytics: The case for assessment analytics", Br. J. Educ. Technol., vol. 44, no. 4, pp. 662–664, 2013.
[7] Entwistle, N. J., "Teaching for Understanding at University: Deep Approaches and Distinctive Ways of Thinking", Basingstoke, Hampshire: Palgrave Macmillan, 2009.
[8] Ferguson, R., "The state of learning analytics in 2012: a review and future challenges", Tech. Rep. KMI-12-01, 2012.
[9] Hwang, C. L., Yoon, K., "Multiple Attribute Decision Making and Applications", Springer-Verlag, New York, 1981.
[10] Jonsson, A., Svingby, G., "The use of scoring rubrics: Reliability, validity and educational consequences", Educational Research Review, 2007.
[11] Moodle LMS, https://moodle.org/. Plugins available at https://moodle.org/plugins/ (accessed January 10th, 2016).
[12] Papamitsiou, Z., Economides, A. A., "Learning Analytics and Educational Data Mining in Practice: A Systematic Literature Review of Empirical Evidence", Educational Technology & Society, 17(5), pp. 49–64, 2014.
[13] Reyes, J. A., "The Skinny on Big Data in Education: Learning Analytics Simplified", TechTrends: Linking Research and Practice to Improve Learning, vol. 59, pp. 75–80, 2015.
[14] Sadler, P., Good, E., "The impact of self- and peer-grading on student learning", Educ. Assess., vol. 11, no. 1, pp. 37–41, 2006.