<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patricia Gutierrez</string-name>
          <email>patricia@iiia.csic.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nardine Osman</string-name>
          <email>nardine@iiia.csic.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carles Sierra</string-name>
          <email>sierra@iiia.csic.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IIIA-CSIC</institution>
          ,
          <addr-line>Campus de la UAB, Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we introduce an automated assessment service for online learning support in the context of communities of learners. The goal is to introduce automatic tools to support the task of assessing the massive numbers of students found in Massive Open Online Courses (MOOCs). The final assessments are a combination of the tutor's assessments and peer assessments. We build a trust graph over the referees and use it to compute weights for the assessment aggregations. The model proposed intends to be a support for intelligent online learning applications that encourage students' interactions within communities of learners and benefit from their feedback to build trust measures and provide automatic marks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] propose methods to estimate peer
reliability and correct peer biases. They present results over
real-world data from 63,000 peer assessments of two Coursera
courses. The models proposed are probabilistic, and they
are compared to the grade estimation algorithm used on
Coursera's platform, which does not take into account
individual biases and reliabilities. Differently from them, we
place more trust in students who grade like the tutor and
do not consider students' biases. When a student is biased,
his/her trust measure will be very low and his/her opinion will
have a moderate impact on the final marks.
[
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposes the CrowdGrader framework, which defines a
crowdsourcing algorithm for peer evaluation. The accuracy
degree (i.e. reputation) of each student is measured as the
distance between his/her self-assessment and the aggregated
opinion of the peers, weighted by their accuracy degrees. The
algorithm thus implements a reputation system for students,
where higher accuracy leads to higher influence on the
consensus grades. Differently from this work, we give more
weight to those peers that have similar opinions to those of
the tutor.
      </p>
      <p>In this paper, and differently from previous works, we want
to study the reliability of student assessments when
compared with tutor assessments. Although part of the learning
process is that students participate in the definition of the
evaluation criteria, tutors want to be certain that the
scoring of the students' works is fair and as close as possible to
their own expert opinion.</p>
      <p>
        Our inspiration comes from a use case explored in the
EU-funded project PRAISE [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. PRAISE enables online virtual
communities of students with shared interests and goals to
come together and share their music practice with each other,
so the process of learning becomes social. It provides tools
for giving and receiving feedback, as feedback is considered
an essential part of the learning process. Tutors define lesson
plans as pedagogical workflows of activities, such as
uploading recorded songs, automatic performance analysis, peer
feedback, or reflexive pedagogy analysis. The goal of any
lesson plan is to improve student skills, for instance, the
performance speed competence or the interpretation maturity
level. Assessments of students' performances have to
evaluate the achievement of these skills. Once a lesson plan is
defined, PRAISE's interface tools allow students to navigate
through the activities, to upload assignments, to practice, to
assess each other, and so on. The tools allow tutors to
monitor what students have done and to assess them. In this
work we concentrate on the development of a service that
can be included as part of a lesson plan and helps tutors
in the overall task of assessing the students participating in
the lesson plan. This assessment is based on aggregating
students' assessments, taking into consideration the trust
that tutors have in the students' individual capabilities in
judging each other's work.
      </p>
      <p>To achieve our objective we propose in this paper an
automated assessment method (Section 2) based on tutor
assessments, aggregations of peer assessments, and trust
measures derived from peer interactions. We experimentally
evaluate (Section 3) the accuracy of the method over
different topologies of student interactions (i.e. different types of
student grouping). The results obtained are based on
simulated data, leaving the validation with real data for future
work. We then conclude with a discussion of the results
(Section 4).
      </p>
    </sec>
    <sec id="sec-1b">
      <title>2. COLLABORATIVE ASSESSMENT</title>
      <p>In this section we introduce the formal model of the method
and the algorithms for collaborative assessment.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Notation and preliminaries</title>
      <p>We say an online course has a tutor τ, a set of peer students
S, and a set of assignments A that need to be marked by the
tutor and/or students with respect to a given set of criteria
C.</p>
      <p>The automated assessment state S is then defined as the
tuple:</p>
      <p>S = ⟨R, A, C, L⟩</p>
      <p>R = {τ} ∪ S defines the set of possible referees (or markers),
where a referee could either be the tutor τ or some student
s ∈ S. A is the set of submitted assignments that need to
be marked, and C = ⟨c_1, …, c_n⟩ is the set of criteria that
assignments are marked upon. L is the set of marks (or
assessments) made by referees, such that L : R × A → [0, θ]^n (we
assume marks to be real numbers between 0 and some
maximum value θ). In other words, we define a single assessment
as: ε_a^α = M⃗, where a ∈ A, α ∈ R, and M⃗ = ⟨m_1, …, m_n⟩
describes the marks provided by the referee α on the n criteria
of C, with m_i ∈ [0, θ].</p>
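      <p>As a concrete illustration, the state S = ⟨R, A, C, L⟩ and its assessment records can be sketched as plain data structures. This is a hypothetical Python encoding of ours (names such as Assessment and THETA are not from the paper):</p>

```python
# Hypothetical encoding of the assessment state S = <R, A, C, L>.
from dataclasses import dataclass

THETA = 10.0  # maximum mark value (theta)

@dataclass(frozen=True)
class Assessment:
    referee: str     # the tutor or a student id (an element of R)
    assignment: str  # an element of A
    marks: tuple     # one mark in [0, THETA] per criterion in C

criteria = ("speed", "maturity")            # C, with n = 2
referees = {"tutor", "dave", "patricia"}    # R = {tutor} union S
L = [
    Assessment("tutor", "ex1", (5.0, 5.0)),
    Assessment("dave", "ex1", (6.0, 6.0)),
]
```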
      <p>
        Similarity between marks. We define a similarity function
sim : [0, θ]^n × [0, θ]^n → [0, 1] to determine how close two
assessments ε and ε′ are. We calculate the similarity between
assessments ε = ⟨m_1, …, m_n⟩ and ε′ = ⟨m′_1, …, m′_n⟩ as
follows:

sim(ε, ε′) = 1 − (1/(n·θ)) · Σ_{i=1..n} |m_i − m′_i|

This measure satisfies the basic properties of a fuzzy
similarity [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Other similarity measures could be used.
Trust relations between referees. Tutors need to decide
up to which point they can believe in the assessments made
by peers. We use two different intuitions to build up this
belief. First, if the tutor and a student have both assessed
some assignments, their similarity gives a hint of how close
the judgements of the student and the tutor are. Similarly,
we can define the judgement closeness of any two students by
looking into the assignments evaluated by both of them. In
case there are no assignments evaluated by both the tutor and one
particular student, we could simply not take that student's
opinion into account, because the tutor would not know how
much to trust the judgement of this student; or, as we do
in this paper, we approximate that unknown trust by looking
into the chain of trust between the tutor and the student
through other students. To model this we define two
different types of trust relations:
      </p>
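      <p>The similarity function can be sketched in code, assuming (as in the definition above) the mean absolute difference per criterion, normalised by n·θ:</p>

```python
# Sketch of sim : [0, theta]^n x [0, theta]^n -> [0, 1].
def sim(m, m_prime, theta=10.0):
    n = len(m)
    return 1.0 - sum(abs(a - b) for a, b in zip(m, m_prime)) / (n * theta)
```

      <p>For example, sim(⟨5, 5⟩, ⟨6, 6⟩) = 0.9 and sim(⟨2, 2⟩, ⟨8, 8⟩) = 0.4, the values that appear in the example of Section 2.2.</p>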
      <p>Direct trust: This is the trust between referees α, β ∈ R
that have at least one assignment assessed in common.
The trust value is the average of the similarities of their
assessments over the same assignments. Let A_{α,β} be
the set of all assignments that have been assessed by
both referees, that is, A_{α,β} = {a | ε_a^α ∈ L and ε_a^β ∈ L}. Then,</p>
      <p>T_D(α, β) = ( Σ_{a ∈ A_{α,β}} sim(ε_a^α, ε_a^β) ) / |A_{α,β}|</p>
      <p>We could also define direct trust as the conjunction of
the similarities for all common assignments as:</p>
      <p>T_D(α, β) = ⋀_{a ∈ A_{α,β}} sim(ε_a^α, ε_a^β)</p>
      <p>However, this would not be practical, as a significant
difference in just one assessment of those assessed by
two referees would make their mutual trust very low.</p>
      <p>Indirect trust: This is the trust between referees α, β ∈ R
without any assignment assessed by both of them.
We compute this trust as a transitive measure over
chains of referees for which we have pair-wise direct
trust values. We define a trust chain as a sequence of
referees q_j = ⟨α_1, …, α_i, α_{i+1}, …, α_{m_j}⟩ where α_i ∈ R,
α_1 = α, α_{m_j} = β, and T_D(α_i, α_{i+1}) is defined for
all pairs (α_i, α_{i+1}) with i ∈ [1, m_j − 1]. We denote by
Q(α, β) the set of all trust chains between α and β.
Thus, indirect trust is defined as an aggregation of the
direct trust values over these chains as follows:</p>
      <p>T_I(α, β) = max_{q_j ∈ Q(α,β)} Π_{i ∈ [1, m_j − 1]} T_D(α_i, α_{i+1})</p>
      <p>
        Hence, indirect trust is based on the notion of
transitivity.¹
¹T_I is based on the fuzzy similarity relation sim
presented before, which fulfils the ⊗-Transitivity property:
sim(u, v) ⊗ sim(v, w) ≤ sim(u, w), ∀ u, v, w ∈ V, where ⊗ is
a t-norm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Ideally, we would like not to overrate the trust of the tutor in
a student; that is, we would like that T_D(α, β) ≥ T_I(α, β) in
all cases. Guaranteeing this in all cases is impossible, but we
can decrease the number of overtrusted students by selecting
an operator that gives low values to T_I. In particular, we
prefer to use the product (Π) operator, because this is the
t-norm that gives the smallest possible values. Other operators
could be used, for instance the min function.</p>
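      <p>The max-product aggregation over trust chains can be sketched as a small depth-first search. This is an illustrative implementation of ours (the paper does not prescribe one); it returns 0 when no chain exists, in which case T_I is actually undefined:</p>

```python
# Sketch of indirect trust T_I(alpha, beta): the maximum, over all chains
# of direct-trust edges from alpha to beta, of the product of the direct
# trust values along the chain.
def indirect_trust(direct_trust, alpha, beta):
    # Build an adjacency list; direct trust is symmetric (sim is symmetric).
    neighbours = {}
    for (u, v), t in direct_trust.items():
        neighbours.setdefault(u, []).append((v, t))
        neighbours.setdefault(v, []).append((u, t))
    best = 0.0

    def dfs(node, product, visited):
        nonlocal best
        if node == beta:
            best = max(best, product)
            return
        for nxt, t in neighbours.get(node, []):
            if nxt not in visited:
                dfs(nxt, product * t, visited | {nxt})

    dfs(alpha, 1.0, {alpha})
    return best
```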
      <p>Trust Graph. To provide automated assessments, our
proposed method aggregates the assessments on a given
assignment taking into consideration how trusted each
marker/referee is from the point of view of the tutor (i.e.
taking into consideration the trust of the tutor in the referee's
marking of assignments). The algorithm that computes the
students' final assessments is based on a graph defined as
follows:</p>
      <p>G = ⟨R, E, w⟩</p>
      <p>where the set of nodes R is the set of referees in S, E ⊆
R × R is the set of edges between referees with direct or indirect
trust relations, and w : E → [0, 1] provides the trust value.
We denote by D ⊆ E the set of edges that link referees with
direct trust, that is, D = {e ∈ E | T_D(e) ≠ ⊥}. And similarly,
I ⊆ E for indirect trust, I = {e ∈ E | T_I(e) ≠ ⊥} \ D. The w
values will be used as weights to combine peer assessments
and are defined as:
w(e) = T_D(e), if e ∈ D;</p>
      <p>w(e) = T_I(e), if e ∈ I.</p>
    </sec>
    <sec id="sec-2b">
      <title>2.2 Computing collaborative assessments</title>
      <p>Algorithm 1 implements the collaborative assessment method.
We keep the notation (α, β) to refer to the edge connecting
nodes α and β in the trust graph, and Q(α, β) to refer to the set
of trust chains between α and β.</p>
      <p>The first thing the algorithm does is to build a trust graph
from L. Then, the final assessments are computed as
follows. If the tutor marks an assignment, then the tutor's mark
is considered the final mark. Otherwise, a weighted average
(μ) of the marks of student peers is calculated for this
assignment, where the weight of each peer is the trust value
between the tutor and that peer. Other forms of
aggregation could be considered to calculate μ; for instance, a peer
assessment may be discarded if it is very far from the rest
of the assessments, or if the referee's trust falls below a certain
threshold.</p>
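      <p>The decision rule just described (the tutor's mark is final when it exists; otherwise a trust-weighted average of the peers' marks) can be sketched as follows. The function name and data layout are illustrative, not the paper's:</p>

```python
# Sketch of the aggregation step of the collaborative assessment method.
def collaborative_mark(marks, w):
    """marks: {referee: tuple of marks}; w: {referee: tutor's trust in [0, 1]}."""
    if "tutor" in marks:
        return marks["tutor"]          # the tutor's mark is final
    total = sum(w[r] for r in marks)   # normalising constant
    n = len(next(iter(marks.values())))
    return tuple(
        sum(w[r] * marks[r][i] for r in marks) / total for i in range(n)
    )
```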
      <p>Figure 1 shows four trust graphs built from four assessment
histories that correspond to a chronological sequence of
assessments. The criteria C in this example are speed
and maturity, and the maximum mark value is θ = 10. For
simplicity we only represent those referees that have made
assessments in L. In Figure 1(a) there is one node,
representing the tutor, who has made the only assessment, over the
assignment ex1, and there are no links to other nodes as no one
else has assessed anything. In (b) student Dave assesses the
same exercise as the tutor and thus a link is created between
them. The trust value w(tutor, Dave) = T_D(tutor, Dave) is
high since their marks were similar. In (c) a new assessment
by Dave is added to L with no consequences in the graph
construction. In (d) student Patricia adds an assessment on
ex2 that allows us to build a direct trust between Dave and
Patricia and an indirect trust between the tutor and
Patricia, through Dave. The automated assessments generated
in case (d) are: ⟨5, 5⟩ for exercise 1 (which preserves the
tutor's assessment) and ⟨3.7, 3.7⟩ for exercise 2 (which uses a
weighted aggregation of the peers' assessments).</p>
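      <p>The figures reported for case (d) can be re-derived numerically. A sketch, assuming the similarity of Section 2.1 with θ = 10 and n = 2 criteria:</p>

```python
# Re-deriving the automated mark <3.7, 3.7> for exercise 2 in case (d).
theta, n = 10.0, 2

def sim(m, m_prime):
    return 1.0 - sum(abs(a - b) for a, b in zip(m, m_prime)) / (n * theta)

w_dave = sim((5, 5), (6, 6))               # direct trust tutor-Dave
w_patricia = w_dave * sim((2, 2), (8, 8))  # indirect trust tutor-Patricia, via Dave
mark_ex2 = (w_dave * 2 + w_patricia * 8) / (w_dave + w_patricia)
```

      <p>This gives w(tutor, Dave) = 0.9, w(tutor, Patricia) = 0.36 and a weighted mark of about 3.71 on each criterion, reported as 3.7.</p>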
      <p>Note that the trust graph built from L is not necessarily
connected. A tutor wants to reach a point in which the graph
is totally connected because that means that the
collaborative assessment algorithm generates an assessment for every
assignment. Figure 2 shows an example of a trust graph of
a particular learning community involving 50 peer students
and a tutor. When S has a history of 5 tutor assessments
and 25 student assessments (|L| = 30), we observe that not
all nodes are connected. As the number of assessments
[Figure 1: (a) L = {ε_ex1^tutor = ⟨5,5⟩}; (b) L = {ε_ex1^tutor = ⟨5,5⟩, ε_ex1^dave = ⟨6,6⟩};
(c) L = {ε_ex1^tutor = ⟨5,5⟩, ε_ex1^dave = ⟨6,6⟩, ε_ex2^dave = ⟨2,2⟩};
(d) L = {ε_ex1^tutor = ⟨5,5⟩, ε_ex1^dave = ⟨6,6⟩, ε_ex2^dave = ⟨2,2⟩, ε_ex2^patricia = ⟨8,8⟩}]
[Figure 2: (a) |L| = 30; (b) |L| = 200]
increases, the trust graph becomes denser and eventually it
gets completely connected. In (b) and (c) we see a complete
graph.</p>
    </sec>
    <sec id="sec-2c">
      <title>3. EXPERIMENTAL PLATFORM AND EVALUATION</title>
      <p>In this section we describe how we generate simulated
social networks, describe our experimental platform, define our
benchmarks and discuss experimental results.</p>
    </sec>
    <sec id="sec-3">
      <title>3.1 Social Network Generation</title>
      <p>
        Several models for social network generation have been
proposed, reflecting different characteristics present in real social
communities. Topological and structural features of such
networks have been explored in order to understand which
generating model best resembles the structure of real
communities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>A social network can be defined as a graph N where the set
of nodes represents the individuals of the network and the
set of edges represents connections or social ties among those
individuals. In our case, individuals are the members of the
learning community: the tutor and students. Connections
represent the social ties and they are usually the result of
interactions in the learning community. For instance, a social
relation will be born between two students if they interact
with each other, say by collaboratively working on a project
together. In our experimentation, we rely on the social
network in order to simulate which student will assess the
assignment of which other student. We assume students will
assess the assignments of students they know, as opposed
to picking random assignments. As such, we clarify that
social networks are different from the trust graph of
Section 2. While the nodes of both graphs are the same, edges
[Figure 2: (c) |L| = 400]
of the social network represent social ties, whereas edges in
the trust graph represent how much one referee trusts
another in judging others' work.</p>
      <p>
        To model social networks where relations represent social
ties, we follow three different approaches: the Erdős–Rényi
model for random networks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the Barabási–Albert model
for power law networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and a hierarchical model for
cluster networks.
      </p>
      <sec id="sec-3-1">
        <title>3.1.1 Random Networks</title>
        <p>The Erdős–Rényi model for random networks consists of a
graph containing n nodes connected randomly. Each
possible edge between two vertices may be included in the graph
with probability p and may not be included with probability
(1 − p). In addition, in our case there is always an edge
between the node representing the tutor and the rest of the nodes,
as the tutor knows all of its students (and may eventually
mark any of those students).</p>
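        <p>The generation scheme just described (Erdős–Rényi links among the students, plus a tutor linked to everyone) can be sketched as:</p>

```python
# Sketch of random (Erdos-Renyi) network generation with a tutor hub.
import random

def random_network(n_students=50, p=0.5, seed=0):
    rng = random.Random(seed)
    nodes = list(range(n_students + 1))                  # node 0 is the tutor
    edges = {(0, s) for s in range(1, n_students + 1)}   # tutor knows everyone
    for u in range(1, n_students + 1):
        for v in range(u + 1, n_students + 1):
            if rng.random() < p:                         # keep edge with prob. p
                edges.add((u, v))
    return nodes, edges
```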
        <p>The degree distribution of random graphs follows a Poisson
distribution. Figure 3(a) shows an example of a random
graph with 51 nodes and p = 0.5, and its degree distribution.
Note that the point with degree 50 represents the tutor node,
while the degrees of the rest of the nodes fit a Poisson distribution.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.2 Power Law Networks</title>
        <p>The Barabási–Albert model for power law networks bases
its graph generation on the notions of growth and
preferential attachment. The generation scheme is as follows.
Nodes are added one at a time. Starting with a small
number of initial nodes, at each time step we add a new node
with m edges linked to nodes already part of the network.
In our experiments, we start with m + 1 initial nodes. The
edges are not placed uniformly at random but preferentially,
in proportion to the degree of the network nodes. The
probability p that the new node is connected to a node i already
in the network depends on the degree k_i of node i, such
that: p = k_i / Σ_{j=1..n} k_j.</p>
        <p>As above, there is also always an
edge between the node representing the tutor and the rest
of the nodes.</p>
        <p>
          The degree distribution of this network follows a power law.
Figure 3(b) shows an example of a power law
graph with 51 nodes and m = 16, and its degree distribution.
The point with degree 50 describes the tutor node, while the
rest of the nodes closely resemble a power law distribution.
Recent empirical results on large real-world networks often
show, among other features, degree distributions
following a power law [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
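        <p>A minimal sketch of the preferential-attachment scheme described above; for brevity, the always-present tutor edges of our experiments are omitted here:</p>

```python
# Sketch of Barabasi-Albert generation: start from m+1 fully connected
# nodes, then attach each new node with m edges chosen with probability
# proportional to the current degrees.
import random

def power_law_network(n_nodes=51, m=16, seed=0):
    rng = random.Random(seed)
    nodes = list(range(m + 1))
    edges = {(i, j) for i in nodes for j in nodes if i < j}  # seed clique
    degree = {v: m for v in nodes}
    for new in range(m + 1, n_nodes):
        # Degree-proportional sampling: each node appears degree[v] times.
        population = [v for v in nodes for _ in range(degree[v])]
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(population))
        for t in targets:
            edges.add((t, new))
            degree[t] += 1
        degree[new] = m
        nodes.append(new)
    return nodes, edges
```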
      </sec>
      <sec id="sec-3-3">
        <title>3.1.3 Cluster Networks</title>
        <p>As our focus is on learning communities, we also experiment
with a third type of social network: the cluster network,
which is based on the notions of groups and hierarchy. Such
networks consist of a graph composed of a number of fully
connected clusters (where we believe clusters may represent
classrooms or similar pedagogical entities). Additionally,
as above, all the nodes are connected with the tutor node.
Figure 3(c) shows an example of a cluster graph with 51
nodes and 5 clusters of 10 nodes each, and its degree distribution.
The point with degree 50 describes the tutor, while the rest
of the nodes have degree 10, since every student is fully
connected with the rest of his/her classroom.</p>
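        <p>The cluster topology can be sketched as follows (node 0 stands for the tutor; parameters follow the example above):</p>

```python
# Sketch of a cluster network: fully connected clusters of students,
# plus a tutor (node 0) connected to every student.
def cluster_network(n_clusters=5, cluster_size=10):
    students = list(range(1, n_clusters * cluster_size + 1))
    edges = {(0, s) for s in students}            # tutor knows every student
    for c in range(n_clusters):
        members = students[c * cluster_size:(c + 1) * cluster_size]
        for i, u in enumerate(members):
            for v in members[i + 1:]:
                edges.add((u, v))                 # fully connect the cluster
    return edges
```

        <p>Each student then has degree 10: the 9 classmates plus the tutor, as in Figure 3(c).</p>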
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3.2 Experimental Platform</title>
      <p>In our experimentation, given an initial automated
assessment state S = ⟨R, A, C, L⟩ with an empty set of assessments
L = {}, we want to simulate tutor and peer assessments
so that the collaborative assessment method can eventually
generate a reliable and definitive set of assessments for all
assignments.</p>
      <p>To simulate assessments, we say each student is defined by
a profile that describes how good his/her assessments are. The
profile is essentially defined by the measure, or distance, d ∈
[0, 1] that specifies how close the student's assessments are
to those of the tutor.</p>
      <p>We then assume the simulator knows how the tutor and each
student would assess an assignment. This becomes necessary
in our simulation, since we generate student assessments in
terms of their distance to the tutor's, even if the
tutor does not choose to actually assess the assignment in
question. This simulator's knowledge of the values of all
possible assessments is generated as follows:</p>
      <p>For every assignment a ∈ A, we calculate the tutor's
assessment, which is randomly generated according to
the function f_τ : A → [0, θ]^n. This assessment
essentially describes what mark the tutor would give to a, if
the tutor decided to assess it.</p>
      <p>For every assignment a ∈ A, we also calculate the
assessment of each student s ∈ S. This is calculated
according to the function f_s : A → [0, θ]^n, such that
sim(f_τ(a), f_s(a)) = 1 − d_s.
[Figure 3: (a) Random Network (approx. graph density 0.5);
(b) Power Law Network (approx. graph density 0.5);
(c) Cluster Network (approx. graph density 0.2)]
We note that we only need
to calculate s's assessment of a if the student who
submitted the assignment is a neighbour of s in N.
We note that the above only calculates what the assessments
would be, if referees were to assess assignments.</p>
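      <p>One simple way to realise f_s at a given profile distance d is to shift each tutor mark by d·θ and clip to [0, θ], so that (absent clipping) sim(f_τ(a), f_s(a)) = 1 − d. This shifting scheme is our assumption; the paper only fixes the distance:</p>

```python
# Sketch: derive a student's assessment from the tutor's at distance d.
def student_assessment(tutor_marks, d, theta=10.0):
    # Shift every mark by d * theta, clipped to the valid range [0, theta].
    return tuple(min(theta, max(0.0, m + d * theta)) for m in tutor_marks)
```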
    </sec>
    <sec id="sec-5">
      <title>3.3 Benchmark</title>
      <p>Given an initial automated assessment state S = ⟨R, A, C, L⟩
with an empty set of assessments L = {}, a set of student
profiles Pr = {d_s}_{s ∈ S}, and a social network N (whose
set of nodes is R), we simulate individual tutor and
students' assessments. When a referee in R assesses an
assignment in A is explained shortly. However, we note here
that the value of each generated assessment is equivalent to
that calculated for the simulator's knowledge (see Section 3.2
above).</p>
      <p>In our benchmark, we consider the three types of social
networks introduced earlier: random social networks (with 51
nodes, p = 0.5, and approximate density of 0.5), power law
networks (with 51 nodes, m = 16, and approximate density
of 0.5), and cluster networks (with 51 nodes, 5 clusters of 10
nodes each, and approximate density of 0.2). Examples of
these generated networks are shown in Figure 3.
We say one assignment is submitted by each student,
resulting in |S| = 50 and |A| = 50. The range within which a referee
(tutor or student) may mark a given assignment with
respect to a given criterion is [0, 10]. And the set of criteria is
C = ⟨speed, maturity⟩. The criteria essentially measure the
speed of playing a musical piece, and the maturity level of
the student's performance.</p>
      <p>An assessment profile is generated for each student at the
beginning of the execution, resulting in a set of student
profiles Pr = {d_s}_{s ∈ S}, where d ∈ [0, 0.5]. We consider here two
cases for generating the set of student profiles Pr: a first
case where d is picked randomly following a power law
distribution (Figure 4(a)), and a second case where d is picked
randomly following a uniform distribution (Figure 4(b)).
With simulated individual assessments, we then run the
collaborative assessment method in order to compute an
automated assessment. We also compute the `error' of the
collaborative assessment method, whose range is [0, 1], over
the set of assignments A as follows:</p>
      <p>error = (1/|A|) Σ_{a ∈ A} (1 − sim(f_τ(a), μ(a)))</p>
      <p>where μ(a) describes the automated assessment for a given
assignment a ∈ A.
[Figure 4: (a) Power law profile generation; (b) Uniform profile generation]
With the settings presented above, we run two different
experiments. The results presented are an average over 50
executions. The two experiments are presented next.
In experiment 1, students provide their assessments before
the tutor. Each student provides assessments for a
randomly chosen number a of peer assignments (of course, where
assignments are those of their neighboring peers in N). We
run the experiment for 5 different values of a = {3, 4, 5, 6, 7}.
After the students provide their assessments, the tutor starts
assessing assignments incrementally. After every tutor
assessment, the error over the set of automated assessments is
calculated. Notice that the collaborative assessment method
takes the tutor's assessment, when it exists, to be the final
assessment. As such, the number of automated assessments
calculated based on aggregating students' assessments is
reduced over time. Finally, when the tutor has assessed all 50
students, the resulting error is 0.</p>
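      <p>The error measure of Section 3.3 can be sketched as follows, assuming the mean over assignments of 1 − sim between the tutor's would-be mark f_τ(a) and the automated mark μ(a):</p>

```python
# Sketch of the error of the collaborative assessment method, in [0, 1].
def error(tutor_marks, automated_marks, theta=10.0):
    """Both arguments map assignment ids to tuples of marks."""
    def sim(m, m_prime):
        n = len(m)
        return 1.0 - sum(abs(a - b) for a, b in zip(m, m_prime)) / (n * theta)
    return sum(
        1.0 - sim(tutor_marks[a], automated_marks[a]) for a in tutor_marks
    ) / len(tutor_marks)
```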
      <p>In experiment 2, the tutor provides its assessments before
the students. The tutor in this experiment will assess a
randomly chosen number of assignments, where this
number is based on the percentage a of the total number of
assignments. We run the experiment for 4 different values
of a = {5, 10, 15, 20}. After the tutor provides its
assessments, students' assessments are performed. In every
iteration, a student s randomly selects a neighbor in N and
assesses his/her assignment (in case it has not been assessed before
by s; otherwise another connected peer is chosen). We note
that in the case of random and power law networks (denser
networks), a total number of 1000 student assessments is
performed, whereas in the case of cluster networks (a looser
network), a total of 400 student assessments is performed.
We note that initially the trust graph is not fully connected,
so the service is not able to provide automated assessments
for all assignments. When the graph gets fully connected, the
service generates automated assessments for all assignments,
and we start measuring the error after every new iteration.</p>
    </sec>
    <sec id="sec-6">
      <title>3.4 Evaluation</title>
      <p>In experiment 1, we observe (Figure 5) that the error
decreases when the number of tutor assessments increases, as
expected, until it reaches 0 when the tutor has assessed all 50
students. This decrease is quite stable and we do not
observe abrupt error variations or important error increments
from one iteration to the next. More variations are observed
in the initial iterations, since the service has only a few
assessments from which to deduce the weights of the trust graph and to
calculate the final outcome.</p>
      <p>In the case of experiment 2 (Figure 6), the error diminishes
slowly as the number of student assessments increases,
although it never reaches 0. Since the number of tutor
assessments is fixed in this experiment, we have an error threshold
(a lower bound) which is linked to the students' assessment
profiles: the closer they are to the tutor's, the lower this threshold will
be. In fact, in both experiments we observe that when using
a power law distribution profile (Figure 4(a)) the automated
assessment error is lower than when using a uniform
distribution profile (Figure 4(b)). This is because, when using a
power law distribution, more student profiles are generated
whose assessments are closer to the tutor's.</p>
      <p>In general, the error trends observed in all experiments
comparing different social network scenarios (random, cluster or
power law) show a similar behavior. Taking a closer look at
experiment 2, cluster social graphs have the lowest error, and
we observe that assessments on all assignments are achieved
earlier (that is, the trust graph gets connected earlier). We
attribute this to the topology of the fully connected
clusters, which favors the generation of indirect edges earlier
in the graph between the tutor and the nodes of each
cluster. Power law social graphs have lower error than random
networks in most cases. This can be attributed to the
criterion of preferential attachment in their network generation,
which favors the creation of some highly connected nodes.
[Figure 5: Experiment 1] [Figure 6: Experiment 2]
Such nodes are likely to be assessed more frequently, since
more peers are connected to them. Then, the automated
assessments of these highly connected peers are performed
with more available information, which could lead to more
accurate outcomes.</p>
    </sec>
    <sec id="sec-7">
      <title>4. DISCUSSION</title>
      <p>The collaborative assessment model proposed in this paper
is thought of as a support in the creation of intelligent
online learning applications that encourage student
interactions within communities of learners. It goes beyond
current tutor-student online learning tools by making students
participate in the learning process of the whole group,
providing mutual assessment and making the overall learning
process much more collaborative.</p>
      <p>The use of AI techniques is key for the future of online
learning communities. The application presented in this paper is
especially useful in the context of MOOCs: with a low
number of tutor assessments, and encouraging students to
interact and provide assessments among each other, direct and
indirect trust measures can be calculated among peers and
automated assessments can be generated.</p>
      <p>Several error indicators can be designed and displayed to the
tutor managing the course, which we leave for future work.
For example, the error indicators may inform the tutor which
assignments have not received any assessments yet, or which
deduced marks are considered unreliable. For example, a
deduced mark on a given assignment may be considered
unreliable if all the peer assessments that have been provided
for that assignment are considered not to be trusted by the
tutor, as they fall below a preselected acceptable trust
threshold. Alternatively, a reliability measure may also be assigned
to the computed trust measure T_D. For instance, if there
is only one assignment that has been assessed by α and β,
then the computed T_D(α, β) will not be as reliable as when
a number of assignments have been assessed by α and β. As such,
some reliability threshold may be used that defines the
minimum number of assignments that both α and β need
to assess for T_D(α, β) to be considered reliable. Observing
such error indicators, the tutor can decide to assess more
assignments, and as a result the error may improve or the set
of deduced assessments may increase. Finally, if the error
reaches a level of acceptance, the tutor can decide to
endorse and publish the marks generated by the collaborative
assessment method.</p>
      <p>Another interesting question for future work is presented
next. Missing connections might be detected in the trust
graph that would improve its connectivity or maximize the
number of direct edges. The question that follows then is:
what assignments should be suggested to which peers such
that the trust graph and the overall assessment outcome
would improve?
Additionally, future work may also study different approaches
for calculating the indirect trust value between two referees.
In this paper, we use the product operator. We suggest
studying a number of operators, and running an experiment to test
which is most suitable. To do such a test, we may
calculate the indirect trust values for edges that do have a direct
trust measure, and then see which approach for calculating
indirect trust gets closest to the direct trust measures.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work is supported by the Agreement Technologies project
(CONSOLIDER CSD2007-0022, INGENIO 2010) and the
PRAISE project (EU FP7 grant number 388770).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Praise project: http://www.iiia.csic.es/praise/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barabási</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Albert</surname>
          </string-name>
          .
          <article-title>Emergence of scaling in random networks</article-title>
          .
          <source>Science</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>de Alfaro</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Shavlovsky</surname>
          </string-name>
          .
          <article-title>CrowdGrader: Crowdsourcing the evaluation of homework assignments</article-title>
          .
          <source>Technical report 1308.5273, arxiv.org</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Erdős</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rényi</surname>
          </string-name>
          .
          <article-title>On random graphs</article-title>
          .
          <source>Publicationes Mathematicae</source>
          ,
          <year>1959</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Fiumara</surname>
          </string-name>
          .
          <article-title>Topological features of online social networks</article-title>
          .
          <source>Communications in Applied and Industrial Mathematics</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Godo</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          .
          <article-title>Logical approaches to fuzzy similarity-based reasoning: an overview</article-title>
          .
          <source>Preferences and Similarities</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Koller</surname>
          </string-name>
          .
          <article-title>Tuned models of peer assessment in MOOCs</article-title>
          .
          <source>Proc. of the 6th International Conference on Educational Data Mining (EDM</source>
          <year>2013</year>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>