<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessment analytics for peer-assessment: a model and implementation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Blaženka Divjak</string-name>
          <email>bdivjak@foi.hr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Darko Grabar</string-name>
          <email>darko.grabar@foi.hr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcel Maretić</string-name>
          <email>mmaretic@foi.hr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Organization and Informatics, University of Zagreb</institution>
          <addr-line>Pavlinska 2, 42000 Varaždin</addr-line>
          <country country="HR">Croatia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Organization and Informatics, University of Zagreb</institution>
          <addr-line>Pavlinska 2, 42000 Varaždin</addr-line>
          <country country="HR">Croatia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Organization and Informatics, University of Zagreb</institution>
          <addr-line>Pavlinska 2, 42000 Varaždin</addr-line>
          <country country="HR">Croatia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Learning analytics should go beyond data analysis and include approaches and algorithms that are meaningful for learner performance, that can be interpreted by the teacher, and that can be related to learning outcomes. Assessment analytics has been lagging behind other research in learning analytics; this holds true especially for peer-assessment analytics. In this paper we present a mathematical model for peer-assessment based on the use of scoring rubrics for criteria-based assessment. We propose methods for the calculation of the final grade, along with reliability measures of peer-assessment. The modeling is motivated and driven by the identified peer-assessment scenarios.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Jonsson and Svingby in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] conclude that the use of scoring rubrics enhances the reliability of assessments, especially if the rubrics are analytic, topic-specific, and complemented with examples and/or rater training. Otherwise, scoring rubrics do not facilitate valid judgment of performance assessments. Besides this, rubrics have the potential to promote learning and/or improve instruction.
      </p>
      <p>
        The aim of this paper is to model peer-assessment and to discuss issues of final grade calculation and the reliability of raters' judgments. Jonsson and Svingby note that variations in raters' judgments can occur either across raters, known as inter-rater reliability, or in the consistency of one single rater, called intra-rater reliability. Referring to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Jonsson and Svingby state that a major threat to reliability is the lack of consistency of an individual grader, yet reports rarely mention this measure. On the other hand, inter-rater reliability is mentioned in some form in more than half of the reports, but many of these simply use percentage agreement as the measure. This is in agreement with Sadler and Good's critique in [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ] of the poor quality of quantitative research regarding self-assessment. The situation has improved since then; nevertheless, the majority of current research still uses overly simple statistical measures to determine correlations that might indicate reliability.
      </p>
      <p>In the following sections we describe the two major peer-assessment scenarios we have recognized, and then present and analyze the mathematical model we have developed for them.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SCENARIOS FOR PEER-ASSESSMENT</title>
      <p>
        Reliability of peer-assessment depends on many factors, but the consistency of the individual evaluator was recognized very early as the most important one (see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). On the other hand, having more assessments per assignment increases the reliability of peer-assessment with relatively inexperienced evaluators. From experienced evaluators (experts) we presume high expertise in the domain knowledge and prior experience in evaluation. By contrast, an inexperienced evaluator is an individual with a relatively high level of domain knowledge (a high baseline) but lacking experience in evaluation (e.g. a senior undergraduate performing peer-assessment).
      </p>
      <p>We analyze scenarios with respect to the experience of the evaluator, as shown in the scenario grid (Fig. 1). We have placed a continuum of possible scenarios in a grid with four quadrants, within which we recognize two interesting scenarios for peer-assessment and discard the other two as either unrealistic or inappropriate.</p>
      <p>In the first scenario, call it Scenario A, participants are inexperienced evaluators (for example, undergraduate students with introductory domain knowledge and no experience in peer-assessment), whereas in Scenario B the evaluators have higher expertise in the evaluated domain (e.g. teachers, graduate students or senior undergraduates) and prior training in assessment. In Scenario A, the lack of experience in evaluation must be compensated by the quantity of peer-assessments, i.e. a larger peer-assessment group size. On the other hand, setting the group size too large in Scenario B needlessly wastes the experts' time.</p>
      <p>[Figure 1: Scenario grid. Horizontal axis: size of the peer-assessment group (small to large); vertical axis: evaluator experience (inexperienced evaluators, Scenario A; experienced evaluators, Scenario B). Off-diagonal quadrants are marked as not appropriate.]</p>
    </sec>
    <sec id="sec-3">
      <title>3. OVERVIEW OF THE PEER-ASSESSMENT</title>
    </sec>
    <sec id="sec-4">
      <title>ACTIVITY</title>
      <p>The peer-assessment activity starts after work on the assignment task has been completed. In the general case, peer-assessment consists of two phases. We identify the following activities in the whole process.</p>
      <sec id="sec-4-1">
        <title>Phase 1: Assessment of assignments</title>
        <p>i. Learners assess a (predefined) number of assigned assignments
ii. Analysis of peer-assessments (grouped by assignment)
iii. Calculation of the assignment grade</p>
      </sec>
      <sec id="sec-4-2">
        <title>Phase 2: Assessment of the assessments</title>
        <p>i. Analysis of peer-assessments (grouped by grader)
ii. Calculation of the assessment grade</p>
        <p>The first phase starts with learners assessing the assignment work of their peers. We assume that each participant grades several assignments (at least two). At the end of the first phase a reliability check has to be performed and the final grade has to be calculated. The second phase is concerned with the quality of the assessments relative to the evaluator. As an outcome of the second phase, graders can receive a grade (points) for the quality of their assessments.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. MATHEMATICAL MODEL FOR PEER</title>
    </sec>
    <sec id="sec-6">
      <title>ASSESSMENT</title>
      <p>We recognized three challenges: (1) calculation of the final grade under different assessment scenarios, (2) measurement of the assessment's reliability, and (3) measurement of the reliability of each grader (for grading the graders).</p>
    </sec>
    <sec id="sec-7">
      <title>4.1 Overview of the assignment grading</title>
      <p>
        A grading G from a scoring rubric with n criteria is a tuple of numbers G = (g_1, \dots, g_n). We consider gradings as points in an n-dimensional space endowed with a metric d, i.e. a function that measures the distance between points (gradings) and satisfies the axioms of a metric space.
In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] we proposed the use of the non-Euclidean taxicab metric d_1, but for the purposes of this paper it is sufficient to think of d as any distance metric.
      </p>
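      <p>For illustration, the taxicab metric d_1 can be computed as in the following Python sketch; treating a grading simply as a tuple of per-criterion scores is our simplifying assumption.</p>
      <preformat>
# A grading is a tuple of per-criterion scores (g_1, ..., g_n).
def d1(G, H):
    """Taxicab (L1) distance between two gradings."""
    assert len(G) == len(H), "gradings must use the same rubric"
    return sum(abs(g - h) for g, h in zip(G, H))

# Example: two gradings on a three-criterion rubric.
print(d1((3, 2, 5), (4, 2, 3)))  # 1 + 0 + 2 = 3
      </preformat>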
    </sec>
    <sec id="sec-8">
      <title>4.2 Calculation of the assignment’s final grade</title>
      <p>An assignment graded through peer-assessment receives several peer gradings, which have to be analyzed. If estimated to be reliable, these gradings are used as input for the calculation of the final grade.</p>
      <p>The simplest approach is to calculate the final grade of an assignment as the mean value of the received assessments.</p>
      <p>Let S = \{S_{k,1}, \dots, S_{k,m}\} denote the set of peer gradings for assignment k. The mean grade is M(S) = (a_1^f, \dots, a_n^f), where a_i^f = \frac{1}{m} \sum_{j=1}^{m} c_i^{(k,j)} and c_i^{(k,j)} denotes the score assigned to criterion i by the j-th grading of assignment k. M(S) is the center of mass of the set S. This method of grade calculation is suitable for scenario A. We can say that M(S) is sensitive to quantity and less sensitive to outliers (it "respects the decision of the majority").</p>
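      <p>A minimal Python sketch of the mean grade M(S), the componentwise average of the received gradings (gradings as tuples is our assumption):</p>
      <preformat>
def mean_grade(S):
    """Center of mass of the grading set S: the componentwise mean."""
    m = len(S)     # number of received gradings
    n = len(S[0])  # number of rubric criteria
    return tuple(sum(g[i] for g in S) / m for i in range(n))

# Example: three peer gradings of one assignment.
print(mean_grade([(3, 2, 5), (4, 2, 3), (5, 2, 4)]))  # (4.0, 2.0, 4.0)
      </preformat>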
      <p>
        For scenario B, we propose an alternative grade calculation
method (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). In scenario B we assume that peers are experienced evaluators. The final grade is calculated as the so-called optimal final grade O(S) = (o_1^f, \dots, o_n^f), where each component o_i^f is computed from W(S) and B(S), the amalgamations of the worst and best received gradings respectively, as defined therein.
This approach is inspired by Hwang and Yoon's TOPSIS
(Technique for Order of Preference by Similarity to Ideal
Solution) method of multi-criteria decision making in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
When evaluators are trusted experts, we do not expect "wild" gradings (outliers). Here, it is expected that after just a few initial evaluations, additional gradings will have no effect on the final grade O(S). Please consult [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for additional
details.
      </p>
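      <p>The precise definitions of W(S), B(S) and O(S) are given in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; purely as an illustration, the following sketch assumes that W and B are the componentwise worst and best received scores and that O(S) is their midpoint, a reading consistent with the property that gradings between the extremes do not move O(S).</p>
      <preformat>
def optimal_grade(S):
    """Illustrative reading of O(S); see [5] for the actual definition."""
    n = len(S[0])
    W = [min(g[i] for g in S) for i in range(n)]  # worst amalgamation (assumed)
    B = [max(g[i] for g in S) for i in range(n)]  # best amalgamation (assumed)
    return tuple((W[i] + B[i]) / 2 for i in range(n))
      </preformat>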
      <p>A summary of our recommendations for the two scenarios A and B is given in Table 2.</p>
    </sec>
    <sec id="sec-9">
      <title>4.3 Reliability of the peer-assessment</title>
      <p>A prerequisite for the calculation of the assignment's final grade is determining whether the received set of peer-assessments is (sufficiently) reliable, i.e. acceptable. Reasoning about reliability requires granular data. The importance of granular scoring data is illustrated by the example in Table 3: gradings S1 and S2 agree on the summative level but are very distinct at the granular level. This is an example of an unreliable peer-grading set whose incoherence is not visible on the summative level. We consider a set S of peer gradings reliable if diam S (the maximal pairwise distance between gradings) is less than 2e, where e is an acceptable error given in advance.</p>
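      <p>A sketch of this reliability check under any metric d: the set is reliable when its diameter stays below 2e.</p>
      <preformat>
from itertools import combinations

def diameter(S, d):
    """Largest pairwise distance within the grading set S."""
    return max((d(g, h) for g, h in combinations(S, 2)), default=0.0)

def is_reliable(S, d, e):
    """S is acceptable when it fits within an encompassing e-sphere."""
    return diameter(S, d) &lt; 2 * e
      </preformat>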
      <p>Note that the diameter of the set S is also the diameter of an encompassing sphere, so we can say that a reliable peer-grading set fits within an encompassing e-sphere. If a set of peer-assessments is estimated as unacceptable (unreliable) on the granular level, then the final grade cannot be calculated. A recommendation about the acceptability of a particular peer-assessment set can be given to the teacher or course designer by learning analytics; this can be implemented in the learning management system (LMS, for example Moodle). Related practical issues are discussed in Section 5.</p>
    </sec>
    <sec id="sec-10">
      <title>4.4 Grading process</title>
      <p>An assessment set can turn out to be unacceptable because of a single outlier grading. As an attempt to eliminate the outlier grading, we propose searching for a maximal acceptable subset of the received peer-assessments. If such a subset can be found, it is used as input for the final grade calculation. As a measure of final resort, a supervisor's intervention is requested. In a course with a large student enrollment (thousands for a MOOC) this should be avoided as much as possible. However, if present, the instructor's assessment becomes the final grade (no further calculation is needed). This is described in Algorithm 1.</p>
      <p>Algorithm 1: Semi-autonomous Grading Process</p>
      <preformat>
input : set of gradings S = {S^(1), ..., S^(m)},
        acceptable error e >= 0,
        grading calculation method g,
        critical size N (e.g. N = 3)
output: final grade, or an indication that the gradings S are invalid
1  find a maximal S' ⊆ S with acceptable error
2  if #(S') >= N then
3      find S'' of size #(S'') = #(S') with minimal diameter
4      return g(S'') as the grade for assignment k
5  else ask for teacher intervention (grading)
      </preformat>
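      <p>A Python sketch of Algorithm 1, assuming per-assignment grading sets are small enough for an exhaustive subset search, with diameter() as defined in Section 4.3; returning None signals that teacher intervention is needed.</p>
      <preformat>
from itertools import combinations

def diameter(S, d):
    return max((d(g, h) for g, h in combinations(S, 2)), default=0.0)

def semi_autonomous_grade(S, d, e, g, N=3):
    """Grade with g() over the tightest maximal acceptable subset of S."""
    for size in range(len(S), 1, -1):  # try maximal subsets first
        acceptable = [T for T in combinations(S, size)
                      if diameter(T, d) &lt; 2 * e]
        if acceptable:
            if size &lt; N:
                break  # too few reliable gradings: escalate
            # Among maximal acceptable subsets, pick the one of minimal diameter.
            return g(min(acceptable, key=lambda T: diameter(T, d)))
    return None  # ask for teacher intervention (grading)
      </preformat>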
    </sec>
    <sec id="sec-11">
      <title>4.5 Normalization</title>
      <p>The metric d can be linearly scaled to obtain a normalized metric d' with values in the interval [0, 1]. A distance of d' = 1 corresponds to the maximal distance, i.e. the distance between the worst and the best possible gradings.</p>
      <p>This would facilitate general recommendations for setting the acceptable error e on a normalized scale (setting e' = 0.2, for example). Additionally, it could facilitate comparison of data from different tasks (within a course, or from different courses).</p>
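      <p>A sketch of this normalization; g_min and g_max, the worst and best possible gradings on the rubric, are assumed inputs.</p>
      <preformat>
def normalized_metric(d, g_min, g_max):
    """Scale d linearly so that the worst-to-best distance equals 1."""
    d_max = d(g_min, g_max)  # largest achievable distance on the rubric
    return lambda G, H: d(G, H) / d_max
      </preformat>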
    </sec>
    <sec id="sec-12">
      <title>4.6 Evaluation of peer-assessments (awarding the graders)</title>
      <p>The goal of the second phase of the peer-assessment process is to reward the graders for their effort. Graders (peers) who have graded consistently and accurately (near the final grade) should be rewarded more than inconsistent and inaccurate graders.</p>
      <p>Let us assume that a maximum of A points is awarded for the peer-assessment task. Then grader k can be awarded A_i points for each of the m gradings G_i that he or she was assigned, where A_i is a piecewise function, non-increasing in d_i (the distance of the grading G_i from the final grade) and parametrized by A, m and the acceptable error e.</p>
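      <p>Since the exact piecewise formula is not reproduced above, the following sketch is an illustration only: full credit A/m for a grading within the acceptable error e of the final grade, decaying linearly to zero afterwards (the cutoff factor is a hypothetical knob).</p>
      <preformat>
def award(d_i, A, m, e, cutoff=2.0):
    """Points for one grading at distance d_i from the final grade."""
    per_grading = A / m
    if d_i &lt;= e:
        return per_grading  # consistent and accurate: full share
    if d_i >= cutoff * e:
        return 0.0          # too far off: no points
    # Linear decay between e and cutoff*e (cutoff is an assumed parameter).
    return per_grading * (cutoff * e - d_i) / ((cutoff - 1) * e)
      </preformat>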
    </sec>
    <sec id="sec-13">
      <title>5. IMPLEMENTATION</title>
      <p>Support for peer-assessment learning analytics is lacking in assessment analytics in general. We analyze the current implementation in the Moodle LMS, where the peer-assessment activity is implemented by the Workshop plugin.</p>
      <p>In a Workshop activity, each participant receives a grade for their submission and another grade for the quality of their assessments of other students' assignments. These grades are visible as separate grade items in the student's gradebook.</p>
      <p>
        The current implementation of Workshop calculates the assignment grade as a weighted mean of the received assessment gradings. Received gradings are not analyzed for reliability. A teacher who wishes to override or influence the calculated assignment grade can (a) additionally provide their own assessment and set its weight to a higher value, or (b) completely override the final grade. As we have argued here and in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we find this method inadequate. Therefore, we proposed alternative methods for the calculation of the final grade.
      </p>
      <p>Assessment grade calculation is more complex. The goal is to estimate the quality of each assessment. One assessment is singled out as the best one: the assessment closest to the mean value of all assessments. This selected assessment is assigned the highest grade; the other assessments receive grades based on their distance from the selected assessment. The teacher can influence this process by setting a parameter that determines how quickly a grade decreases with distance.</p>
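      <p>A simplified sketch of the calculation just described (not Moodle's actual code; the strictness parameter stands in for the plugin's comparison setting, and d is any metric on gradings).</p>
      <preformat>
def assessment_grades(assessments, d, top=100.0, strictness=1.0):
    """Best assessment = closest to the mean; others decay with distance."""
    m, n = len(assessments), len(assessments[0])
    mean = tuple(sum(a[i] for a in assessments) / m for i in range(n))
    best = min(assessments, key=lambda a: d(a, mean))
    return [max(0.0, top - strictness * d(a, best)) for a in assessments]
      </preformat>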
      <p>We are currently developing a new Moodle plugin for peer-assessment. This plugin will address the identified problems of the current implementation according to our model.</p>
    </sec>
    <sec id="sec-14">
      <title>6. CONCLUSION AND FURTHER RESEARCH</title>
      <p>Peer-assessment has many advantages for students (for example, the development of metacognitive skills) and for teachers (for example, saving the teacher's time), but there are several challenges related to its implementation, such as the calculation of the final grade, the reliability check, and awarding the evaluators.</p>
      <p>
        In this paper we propose new methods for the calculation of grades in peer-assessment, a measure of reliability, and a method for grading peer-evaluations in a peer-assessment exercise. These methods are based on an analysis of two distinguished scenarios that takes into account the number of available evaluators and evaluator expertise (domain knowledge and evaluation skills). We pursue an approach that models assessment learning analytics with a geometric model.
In [
In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] we analyzed a case study based on the master-level Project Management course at the University of Zagreb. Our analysis confirmed the need for a deeper analysis of reliability in peer-assessment. We expect to further explore data related to peer-assessment learning analytics in MOOCs. Additional data should lead to improvements of the model and to recommendations on the applicability of the scenarios, the parameters, and the analysis of the acceptable error of the assessment set.
      </p>
      <p>Also, we intend to implement our model (algorithms and the
supporting recommendation system) as a peer-assessment
plug-in for the Moodle LMS.</p>
      <p>Finally, we conclude that well-founded mathematical modeling, based on more than just descriptive statistics, should be used more often in learning analytics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Brown</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Bull</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Pendlebury</surname>, <given-names>M.</given-names></string-name>, "<article-title>Assessing Student Learning in Higher Education</article-title>", Psychology Press, <year>1997</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Divjak</surname>, <given-names>B.</given-names></string-name>, "<article-title>Implementation of Learning Outcomes in Mathematics for Non-Mathematics Major by Using E-Learning</article-title>", in <source>Teaching Mathematics Online: Emergent Technologies and Methodologies</source>, A. A. Juan, M. A. Huertas, S. Trenholm, and C. Steegmann, Eds. <source>IGI Global</source>, <year>2012</year>, pp. 119-140.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Divjak</surname>, <given-names>B.</given-names></string-name>, "<article-title>Assessment of Complex, Non-Structured Mathematical Problems</article-title>", in IMA International Conference on Barriers and Enablers to Learning Maths, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Divjak</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Maretić</surname>, <given-names>M.</given-names></string-name>, "<article-title>Learning Analytics for e-Assessment: The State of the Art and One Case Study</article-title>", <source>CECIIS</source>, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><surname>Divjak</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Maretić</surname>, <given-names>M.</given-names></string-name>, "<article-title>Geometry for Learning Analytics</article-title>", <source>Scientific and Professional Information Journal of Croatian Society for Constructive Geometry and Computer Graphics</source>, KoG <volume>19</volume>, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Ellis</surname>, <given-names>C.</given-names></string-name>, "<article-title>Broadening the scope and increasing the usefulness of learning analytics: The case for assessment analytics</article-title>", <source>Br. J. Educ. Technol.</source>, vol. <volume>44</volume>, no. <issue>4</issue>, pp. 662-664, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>Entwistle</surname>, <given-names>N. J.</given-names></string-name>, "<article-title>Teaching for understanding at university: deep approaches and distinctive ways of thinking</article-title>". Basingstoke, Hampshire: Palgrave Macmillan, <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Ferguson</surname>, <given-names>R.</given-names></string-name>, "<article-title>The state of learning analytics in 2012: a review and future challenges</article-title>", <source>Tech. Rep. KMI-12-01</source>, March <year>2012</year>, p. <fpage>18</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Hwang</surname>, <given-names>C. L.</given-names></string-name>, <string-name><surname>Yoon</surname>, <given-names>K.</given-names></string-name>, "<article-title>Multiple Attribute Decision Making: Methods and Applications</article-title>", NY, Springer-Verlag, <year>1981</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><surname>Jonsson</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Svingby</surname>, <given-names>G.</given-names></string-name>, "<article-title>The use of scoring rubrics: Reliability, validity and educational consequences</article-title>", <source>Educational Research Review</source>, <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name><surname>Papamitsiou</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Economides</surname>, <given-names>A. A.</given-names></string-name>, "<article-title>Learning Analytics and Educational Data Mining in Practice: A Systematic Literature Review of Empirical Evidence</article-title>", <source>Educational Technology &amp; Society</source>, <volume>17</volume>(<issue>5</issue>), <fpage>49</fpage>-<lpage>64</lpage>, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name><surname>Reyes</surname>, <given-names>Jacqueleen A.</given-names></string-name>, "<article-title>The Skinny on Big Data in Education: Learning Analytics Simplified</article-title>", <source>TechTrends: Linking Research and Practice to Improve Learning</source> <volume>59</volume> (April): 75-80, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name><surname>Sadler</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Good</surname>, <given-names>E.</given-names></string-name>, "<article-title>The impact of self- and peer-grading on student learning</article-title>", <source>Educ. Assess.</source>, vol. <volume>11</volume>, no. <issue>1</issue>, pp. 37-41, <year>2006</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>