Assessment analytics for peer-assessment: a model and implementation

Blaženka Divjak (bdivjak@foi.hr), Darko Grabar (darko.grabar@foi.hr), Marcel Maretić (mmaretic@foi.hr)
Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 42000 Varaždin, Croatia

ABSTRACT
Learning analytics should go beyond data analysis and include approaches and algorithms that are meaningful for learner performance, that can be interpreted by the teacher, and that can be related to learning outcomes. Assessment analytics has been lagging behind other research in learning analytics. This holds true especially for peer-assessment analytics.

In this paper we present a mathematical model for peer-assessment based on the use of scoring rubrics for criteria-based assessment. We propose methods for the calculation of the final grade along with reliability measures of peer-assessment. Modeling is motivated and driven by the identified peer-assessment scenarios.

Use of peer-assessment based on a sound model provides the benefits of deeper learning while addressing the issues of validity and reliability.

Categories and Subject Descriptors
D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures; H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Design, Measurement, Reliability

Keywords
peer-assessment, assessment, analytic tools for learners, assessment learning analytics

1. BACKGROUND ON ASSESSMENT LEARNING ANALYTICS
Learning analytics (LA) is all about the usefulness of data once they have been collected and analyzed [6]. Research in LA is interdisciplinary, and it must be emphasized that LA includes aspects of human judgment and goes beyond data analysis (business analytics): it has to make sense of information, come to decisions and take action based on data [13]. This is the leitmotiv of the research presented in this paper.

LA has to be useful to the vast majority of students. The so-called average student has to be taken into account when setting the goals of LA, not only the under-performing or over-performing students. Teaching practice shows that a meaningful analysis of assessment results is of interest to all students.

Assessment is both ubiquitous and very meaningful as far as students and teachers are concerned (Ellis in [6]). It is an essential part of the teaching and learning process, especially in formal education, because assessment guides learning for the vast majority of students. Ellis at the same time claims that assessment analytics is lagging behind other types of learning analytics. There are several reasons for this. Among them, we argue that insufficient granularity of assessment data presents a difficulty for the interpretation of results.

So-called networked learning (see [12]; e.g. Massive Open Online Courses (MOOCs), social learning platforms, online learning and e-learning in general) presents a completely new playground for learning analytics. In networked learning the number of participants rapidly increases, along with the interactions between learners in the form of discussions and mutual learning. We focus here on a special type of assessment: peer-assessment. Use of peer-assessment and self-assessment is appealing and very appropriate for a task leading to a certificate in a MOOC with enrollment measured in tens of thousands.
This approach generates a huge amount of assessment data, but it also calls for sound metrics for the calculation of the final grade and for estimates of the reliability of the assessment data. Peer-assessment has additional benefits in the learning process, but also additional disadvantages (cf. [4]). Among the disadvantages are issues of reliability and validity of assessment.

To address validity, we advise the use of scoring rubrics, as they contribute to the quality of assessments by facilitating valid judgments of complex competencies [10]. Based on the analysis of 75 studies, Jonsson and Svingby [10] conclude that the use of scoring rubrics enhances the reliability of assessments, especially if the rubrics are analytic, topic-specific, and complemented with examples and/or rater training. Otherwise, scoring rubrics do not facilitate valid judgment of performance assessments. Besides this, rubrics have the potential to promote learning and/or improve instruction.

The aim of this paper is to model peer-assessment and to discuss issues of final grade calculation and reliability of raters' judgments. Jonsson and Svingby note that variations in raters' judgments can occur either across raters, known as inter-rater reliability, or in the consistency of a single rater, called intra-rater reliability. Referring to [1], Jonsson and Svingby state that a major threat to reliability is the lack of consistency of an individual grader. Reports rarely mention this measure. On the other hand, inter-rater reliability is in some form mentioned in more than half of the reports, but many of these simply use percentage as a measure of agreement. This is in line with Sadler and Good's critique [14] of the poor quality of quantitative research regarding self-assessment. The situation has improved since. Nevertheless, the majority of current research still uses overly simple statistical measures in order to determine correlations that might indicate reliability.

In the following sections we describe two major peer-assessment scenarios we have recognized and for which we have developed a mathematical model. After that we present and analyze a model for these scenarios.

2. SCENARIOS FOR PEER-ASSESSMENT
Reliability of peer-assessment depends on many factors, but the consistency of the individual evaluator was recognized very early as the most important (see [1]). On the other hand, having more assessments per assignment increases the reliability of peer-assessment with relatively inexperienced evaluators. From experienced evaluators (experts) we presume high expertise in the domain knowledge and prior experience in evaluation. Similarly, an inexperienced evaluator is an individual with a relatively high level of domain knowledge (a high baseline) but lacking experience in evaluation (e.g. peer assessment by senior undergraduates).

We analyze scenarios with respect to the experience of the evaluator, as shown in the scenario grid (Fig. 1). We have placed a continuum of possible scenarios in a grid with four quadrants. Within the four quadrants we recognize two interesting scenarios for peer-assessment and discard the other two as either unrealistic or inappropriate.

Figure 1: Scenario grid (horizontal axis: size of the peer-assessment group, small to large; vertical axis: evaluator experience, inexperienced to experienced; Scenario A: inexperienced evaluators with a large group, Scenario B: experienced evaluators with a small group; the remaining two quadrants are marked as not appropriate)

In the first scenario, let us call it Scenario A, participants are inexperienced evaluators (for example, undergraduate students with introductory domain knowledge and no experience in peer-assessment), whereas in Scenario B evaluators have higher expertise in the evaluated domain (i.e. teachers, graduate students or senior undergraduates) and prior training in assessment. In Scenario A, the lack of experience in evaluation must be compensated with a quantity of peer-assessments, i.e. a larger group size in peer-assessment. On the other hand, setting the group size too large in Scenario B is a needless waste of experts' time.

A detailed analysis is given in Table 1.
Table 1: Scenario table

Playground / use cases
  Scenario A: Networked learning (MOOCs, online learning and e-learning in general, see [12]); voting for awards where the general audience is involved.
  Scenario B: Multiple graduates/postgraduates assess complex student work [3]; peer assessment of research papers; evaluation of competitive research projects.

Evaluators' characteristics
  Scenario A: A considerable number of relatively inexperienced evaluators in the area they assess.
  Scenario B: A few experienced evaluators who are experts in the area they assess.

Resources to rely on
  Scenario A: Inexpensive evaluator workload in almost unlimited quantities.
  Scenario B: Expertise of evaluators and their judgment that can be trusted.

Reliability threat
  Scenario A: Intra-rater and inter-rater inconsistency.
  Scenario B: Experts do not have equal expertise in all evaluation criteria.

Strategy to increase reliability
  Scenario A: Quantity of assessments that might be convergent (statistically speaking).
  Scenario B: Quality of a small number of assessments, without outliers.

3. OVERVIEW OF THE PEER-ASSESSMENT ACTIVITY
The peer-assessment activity starts after the work on the assignment task has been completed. In the general case peer-assessment consists of two phases. We identify the following activities in the whole process.

Phase 1: Assessment of assignments
  i. Learners assess a (predefined) number of assigned assignments
  ii. Analysis of peer-assessments (grouped by assignment)
  iii. Calculation of the assignment grade

Phase 2: Assessment of the assessments
  i. Analysis of peer-assessments (grouped by grader)
  ii. Calculation of the assessment grade

The first phase starts with learners assessing the assignment work of their peers. We assume that each participant grades several assignments (at least 2). At the end of the first phase a reliability check has to be performed and the final grade has to be calculated. The second phase is concerned with the quality of assessments relative to the evaluator. As an outcome of the second phase, graders can receive a grade (points) for the quality of their assessments.

4. MATHEMATICAL MODEL FOR PEER-ASSESSMENT
We recognize three challenges: (1) calculation of the final grade based on different assessment scenarios, (2) measurement of the assessment's reliability, and (3) measurement of the reliability of each grader (for grading of the graders).

4.1 Overview of the assignment grading
A grading G from a scoring rubric with n criteria is a tuple of numbers G = (g_1, \ldots, g_n). We consider gradings as points in an n-dimensional space endowed with a metric d, i.e. a function that measures the distance between points (i.e. gradings) and satisfies the axioms of a metric space. In [5] we proposed the use of the non-Euclidean taxicab metric d_1, but for the purpose of this paper it is sufficient to think of d as any distance metric. Please consult [5] for additional details.

4.2 Calculation of the assignment's final grade
An assignment graded through peer-assessment will receive several peer gradings. These have to be analyzed, and if estimated as reliable they are used as input for the calculation of the final grade.

The simplest approach is to calculate the final grade of an assignment as the mean value of the received assessments. Let S = \{S_{k,1}, \ldots, S_{k,m}\} denote a set of peer gradings for assignment k; then the mean grade is

  M(S) = (a_1, \ldots, a_n), \quad a_i = \frac{1}{m} \sum_{j=1}^{m} c_{k,i}^{(j)},

where c_{k,i}^{(j)} denotes the score given to criterion i of assignment k by peer j. M(S) is the center of mass of the set S. This method of grade calculation is suitable for Scenario A. We can say that M(S) is sensitive to quantity and less sensitive to outliers (it "respects the decision of the majority").

For Scenario B, we propose an alternative grade calculation method (see [5]). In Scenario B we assume that peers are experienced evaluators. The final grade is calculated as the so-called optimal final grade O(S), defined by

  O(S) = (of_1, \ldots, of_n), \quad of_i = \frac{1}{2}(w_i + b_i), \quad \text{i.e. } O(S) = \frac{1}{2}\bigl(W(S) + B(S)\bigr),

where W(S) and B(S) represent amalgamations of the worst and best received gradings, respectively, defined by

  W(S) = (w_1, \ldots, w_n), \quad w_i = \min_j c_{k,i}^{(j)},
  B(S) = (b_1, \ldots, b_n), \quad b_i = \max_j c_{k,i}^{(j)}.

This approach is inspired by Hwang and Yoon's TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution) method of multi-criteria decision making [9]. When evaluators are trusted experts, we do not expect "wild" gradings (outliers). Here, it is expected that after just a few initial evaluations any additional gradings will have little effect on the final grade O(S).

A summary of our recommendations for the two scenarios A and B is given in Table 2.

Table 2: Grading method recommendations
  Scenario A: Mean value grading. Reliability provided by the quantity of evaluations.
  Scenario B: Optimal value grading. Reliability provided by the quality of evaluators.

With optimal value grading we have the opportunity to allow experts to skip grading for certain criteria. For example, this would be reasonable if an expert is not an expert for all the criteria. To be able to calculate O(S) it is sufficient to have every criterion covered by at least one expert.
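For illustration only, the following is a minimal Python sketch (ours, not part of the original model description) of the two grade calculations, together with the taxicab metric d_1 suggested in [5]; all names are our own, and the handling of skipped criteria mentioned above is omitted.

```python
from typing import List, Sequence

Grading = Sequence[float]  # one peer grading: a score per rubric criterion


def taxicab(g1: Grading, g2: Grading) -> float:
    """Taxicab (L1) distance d1 between two gradings."""
    return sum(abs(a - b) for a, b in zip(g1, g2))


def mean_grade(gradings: List[Grading]) -> List[float]:
    """M(S): component-wise mean of the received gradings (Scenario A)."""
    m, n = len(gradings), len(gradings[0])
    return [sum(g[i] for g in gradings) / m for i in range(n)]


def optimal_grade(gradings: List[Grading]) -> List[float]:
    """O(S): midpoint of the component-wise worst and best gradings (Scenario B)."""
    n = len(gradings[0])
    worst = [min(g[i] for g in gradings) for i in range(n)]  # W(S)
    best = [max(g[i] for g in gradings) for i in range(n)]   # B(S)
    return [(w + b) / 2 for w, b in zip(worst, best)]


# Three peer gradings of one assignment on a four-criterion rubric (made-up data)
S = [(3, 0, 2, 2), (2, 1, 2, 3), (3, 1, 3, 2)]
print(taxicab(S[0], S[1]))   # 3.0
print(mean_grade(S))         # approx. [2.67, 0.67, 2.33, 2.33]
print(optimal_grade(S))      # [2.5, 0.5, 2.5, 2.5]
```

On this made-up data the two methods differ only slightly; the difference grows when a single outlier grading is present, which is exactly the situation the reliability check below is meant to detect.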
4.3 Reliability of the peer-assessment
A prerequisite for the calculation of the assignment's final grade is determining whether a received set of peer-assessments is (sufficiently) reliable, i.e. acceptable.

For reasoning about reliability it is necessary to have granular data. The importance of granular scoring data is illustrated by the example in Table 3. Gradings S_1 and S_2 agree on the summative level, but are very distinct at the granular level. This is an example of an unreliable peer-grading set whose incoherence is not visible on the summative level.

Table 3: Highlighting the importance of granular data

        C1   C2   C3   C4   Σ
  S_1    3    0    2    2    7
  S_2    0    1    3    3    7

(The per-criterion scores C1–C4 are the granular data; the Σ column is the summative total.)

The diameter of a set of gradings S = \{S_1, \ldots, S_n\} is defined as

  \operatorname{diam} S = \max_{i,j} d(S_i, S_j).

We consider a set S of peer gradings reliable if diam S (the maximal pairwise distance between gradings) is less than 2e, where e is an acceptable error given in advance. Note that the diameter of the set S is also the diameter of an encompassing sphere, so we can say that a reliable peer-grading set fits within an encompassing e-sphere.

If a set of peer-assessments is estimated as not acceptable (unreliable) on the granular level, then the final grade cannot be calculated. A recommendation about the acceptability of a particular peer-assessment set can be given to the teacher or course designer by LA. This can be implemented in the learning management system (LMS, for example Moodle). Practical issues are discussed in Section 5.

4.4 Grading process
An assessment set can turn out to be unacceptable because of a single outlier grading. As an attempt to eliminate the outlier grading, we propose to search for a maximal acceptable subset of the received peer-assessments. If such a subset can be found, it is then used as input for the final grade calculation. As a measure of last resort, a supervisor's intervention is requested. In a course with a large student enrollment (thousands for a MOOC) this will be avoided as much as possible. However, if present, the instructor's assessment becomes the final grade (no need for calculation). This is described in Algorithm 1.

Algorithm 1: Semi-autonomous Grading Process
  input:  set of gradings S = {S^(1), ..., S^(m)},
          acceptable error e >= 0,
          grading calculation method g,
          critical size N (e.g. N = 3)
  output: final grade, or an indication that the gradings S are invalid
  1  find a maximal S' ⊆ S with acceptable error
  2  if #(S') >= N then
  3      find S'' of size #(S'') = #(S') with minimal diameter
  4      return g(S'') as a proper grade for assignment k
  5  else ask for teacher intervention (grading)
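As a sketch of how Algorithm 1 could be implemented, the Python code below (our illustration, with names of our own choosing) reads "maximal subset with acceptable error" as the largest subset whose diameter is below 2e and, among subsets of that size, picks one of minimal diameter; the brute-force search over subsets is adequate for the small per-assignment grading sets assumed here, and is not claimed to be the authors' implementation. Returning None corresponds to line 5 (ask for teacher intervention).

```python
from itertools import combinations
from typing import Callable, List, Optional, Sequence

Grading = Sequence[float]
Metric = Callable[[Grading, Grading], float]


def diameter(gradings: List[Grading], d: Metric) -> float:
    """diam S: maximal pairwise distance between gradings in S."""
    if len(gradings) < 2:
        return 0.0
    return max(d(a, b) for a, b in combinations(gradings, 2))


def semi_autonomous_grade(S: List[Grading], d: Metric, e: float,
                          g: Callable[[List[Grading]], List[float]],
                          N: int = 3) -> Optional[List[float]]:
    """Algorithm 1: grade from a reliable subset of S, or None if the teacher must step in."""
    for size in range(len(S), 0, -1):
        # All subsets of this size whose diameter stays within the acceptable error.
        acceptable = [list(sub) for sub in combinations(S, size)
                      if diameter(list(sub), d) < 2 * e]
        if acceptable:
            # A maximal acceptable subset exists at this size; take one of minimal diameter.
            best = min(acceptable, key=lambda sub: diameter(sub, d))
            return g(best) if size >= N else None
    return None
```

Combined with the previous sketch, a call such as semi_autonomous_grade(S, taxicab, e=2.0, g=mean_grade) either returns a final grade computed from a reliable subset or signals that the assignment needs the teacher's own grading.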
4.5 Normalization
The metric d can be linearly scaled to obtain a normalized metric d' with values within the interval [0, 1]. A distance of d' = 1 then corresponds to the maximal distance between the worst and the best possible gradings. This would facilitate general recommendations for setting the acceptable error e on a normalized scale (setting e' = 0.2, for example). Additionally, it could facilitate comparison of data from different tasks (within a course, or from different courses).

4.6 Evaluation of peer-assessments (awarding the graders)
The goal of the second phase of the peer-assessment process is to reward the graders for their effort. Graders (peers) who have graded consistently and accurately (near the final grade) should be rewarded more than inconsistent and inaccurate graders.

Let us assume that a maximum of A points is awarded for the peer-assessment task. Then grader k can be awarded A_i points for each of the m gradings G_i that he or she was assigned, where A_i is calculated by the following formula:

  A_i(d_i) := \begin{cases} \dfrac{A}{m\,e}\,(e - d_i), & d_i < e, \\[4pt] 0, & d_i \ge e, \end{cases}

where d_i = d(G_i, F) is the distance of the grading G_i from the final grade F of the corresponding assignment.

This has the effect that 0 points are awarded for a grading outside of the e-sphere around the final grade F, while for a grading within this e-sphere A_i is proportional to (e - d_i).

Finally, grader k is awarded a total of A(k) points for the gradings G_1, \ldots, G_m, where A(k) is calculated as the sum of the A_i(d_i).

Figure 2: Points awarded to a grader for grading G_i (A_i plotted against the distance d_i from F_i: A/m at d_i = 0, decreasing linearly to 0 at d_i = e, and 0 beyond e)
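The award formula translates directly into code. The short sketch below (again ours, with hypothetical names) assumes the final grade F of each assessed assignment has already been computed by one of the methods of Section 4.2.

```python
from typing import Callable, List, Sequence, Tuple

Grading = Sequence[float]
Metric = Callable[[Grading, Grading], float]


def award_for_grading(d_i: float, A: float, m: int, e: float) -> float:
    """A_i(d_i): points for one grading at distance d_i from the final grade F."""
    if d_i >= e:
        return 0.0                       # outside the e-sphere around F: no points
    return (A / (m * e)) * (e - d_i)     # linear: A/m at d_i = 0, down to 0 at d_i = e


def total_award(graded: List[Tuple[Grading, Grading]], d: Metric,
                A: float, e: float) -> float:
    """A(k): total points for grader k over pairs (grading G_i, final grade F_i)."""
    m = len(graded)
    return sum(award_for_grading(d(G_i, F_i), A, m, e) for G_i, F_i in graded)
```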
5. IMPLEMENTATION
Support for peer-assessment LA is lacking in assessment analytics in general. We analyze the current implementation in the Moodle LMS, where the peer-assessment activity is implemented with the Workshop plugin.

In a Workshop activity, students receive a grade for their work and another grade for the quality of their assessments of other students' assignments. Each participant in a Workshop thus gets a grade for their submission and a grade for their assessments. These grades are visible as separate grade items in the student's gradebook.

The current implementation of Workshop calculates the assignment grade as a weighted mean of the received assessment gradings. Received gradings are not analyzed for reliability. If the teacher wishes to override or influence the calculated assignment grade, he or she can (a) additionally provide his or her own assessment and set its corresponding weight to a higher value, or (b) completely override the final grade. As we have argued here and in [5], we find this method inadequate. Therefore, we have proposed alternative methods for the calculation of the final grade.

Assessment grade calculation is more complex. The goal is to estimate the quality of each assessment. One assessment is singled out as the best one – it is the assessment closest to the mean value of all assessments. This selected assessment is assigned the highest grade. Other assessments receive grades based on their distance from the selected assessment. The teacher can influence this process by setting the parameter which determines how quickly a grade should decrease relative to the distance.

We are currently developing a new Moodle plugin for peer-assessment. This plugin will address the identified problems of the current implementation according to our model.

6. CONCLUSION. FURTHER RESEARCH
Peer-assessment has many advantages for students (for example, the development of metacognitive skills) and for teachers (for example, it saves the teacher's time), but there are several challenges related to its implementation, such as calculation of the final grade, checking reliability, and rewarding an evaluator for peer-assessment.

In this paper we propose new methods for the calculation of grades in peer-assessment. We propose a measure of reliability and a method for grading peer-evaluations in a peer-assessment exercise. These metrics are based on the analysis of two distinguished scenarios, which takes into account the number of possible evaluators and evaluator expertise (domain knowledge and evaluation skills). We pursue an approach that models assessment LA with a geometric model.

In [4] we analyzed a case study based on the master-level Project Management course at the University of Zagreb. Our analysis confirmed the need for a deeper analysis of reliability in peer-assessment. Further exploration of data related to peer-assessment learning analytics in MOOCs is expected. Having additional data should result in improvement of the model and in recommendations on the applicability of scenarios, on parameters, and on the analysis of the acceptable error of the assessment set.

Also, we intend to implement our model (algorithms and the supporting recommendation system) as a peer-assessment plug-in for the Moodle LMS.

Finally, we conclude that well-founded mathematical modeling, based on more than just descriptive statistics, should be used more often in learning analytics.
7. REFERENCES
[1] Brown, G., Bull, J., Pendlebury, M., "Assessing Student Learning in Higher Education", Psychology Press, 1997.
[2] Divjak, B., "Implementation of Learning Outcomes in Mathematics for Non-Mathematics Major by Using E-Learning", in Teaching Mathematics Online: Emergent Technologies and Methodologies, A. A. Juan, M. A. Huertas, S. Trenholm, and C. Steegmann, Eds., IGI Global, 2012, pp. 119–140.
[3] Divjak, B., "Assessment of Complex, Non-Structured Mathematical Problems", in IMA International Conference on Barriers and Enablers to Learning Maths, 2015.
[4] Divjak, B., Maretić, M., "Learning Analytics for e-Assessment: The State of the Art and One Case Study", CECIIS, 2015.
[5] Divjak, B., Maretić, M., "Geometry for Learning Analytics", KoG (Scientific and Professional Journal of the Croatian Society for Constructive Geometry and Computer Graphics), vol. 19, 2015.
[6] Ellis, C., "Broadening the scope and increasing the usefulness of learning analytics: The case for assessment analytics", Br. J. Educ. Technol., vol. 44, no. 4, pp. 662–664, 2013.
[7] Entwistle, N. J., "Teaching for Understanding at University: Deep Approaches and Distinctive Ways of Thinking", Basingstoke, Hampshire: Palgrave Macmillan, 2009.
[8] Ferguson, R., "The state of learning analytics in 2012: a review and future challenges", Tech. Rep. KMI-12-01, 2012.
[9] Hwang, C. L., Yoon, K., "Multiple Attribute Decision Making and Applications", Springer-Verlag, New York, 1981.
[10] Jonsson, A., Svingby, G., "The use of scoring rubrics: Reliability, validity and educational consequences", Educational Research Review, 2007.
[11] Moodle LMS, https://moodle.org/. Plugins available at https://moodle.org/plugins/ (accessed January 10th, 2016).
[12] Papamitsiou, Z., Economides, A. A., "Learning Analytics and Educational Data Mining in Practice: A Systematic Literature Review of Empirical Evidence", Educational Technology & Society, 17(5), pp. 49–64, 2014.
[13] Reyes, J. A., "The Skinny on Big Data in Education: Learning Analytics Simplified", TechTrends: Linking Research and Practice to Improve Learning, vol. 59, pp. 75–80, 2015.
[14] Sadler, P., Good, E., "The impact of self- and peer-grading on student learning", Educ. Assess., vol. 11, no. 1, pp. 37–41, 2006.