    Probing the Landscape: Toward a Systematic Taxonomy
       of Online Peer Assessment Systems in Education
Dmytro Babik, James Madison University, 421 Bluestone Dr., Harrisonburg, VA 22807, +1 (540) 568-3064, babikdx@jmu.edu
Edward F. Gehringer, North Carolina State University, Department of Computer Science, Raleigh, NC 27695, +1 (919) 515-2066, efg@ncsu.edu
Jennifer Kidd, Old Dominion University, 166-7 Education Building, Norfolk, VA 23529, +1 (757) 683-3248, jkidd@odu.edu
Ferry Pramudianto, North Carolina State University, Department of Computer Science, Raleigh, NC 27695, +1 (919) 513-0816, fferry@ncsu.edu
David Tinapple, Arizona State University, Dixie Gammage Hall, Tempe, AZ 85287, +1 (480) 965-3122, david.tinapple@asu.edu


ABSTRACT
We present a research framework for a taxonomy of online educational peer-assessment systems. This framework enables researchers in technology-supported peer assessment to understand the current landscape of technologies supporting student peer review and assessment, specifically its affordances and constraints. The framework helps identify the major themes in existing and potential research and formulate an agenda for future studies. It also informs educators and system-design practitioners about use cases and design options.

Keywords
Peer assessment, peer review, system design, rubric, scale

1. INTRODUCTION
In the twenty years that the web has been widely used in education, dozens, if not hundreds, of online peer assessment systems have appeared. They have been conceived by educators in many disciplines, such as English, Computer Science, and Design, to name a few. Topping [29] highlighted computer-aided peer assessment as an important pedagogical approach to developing higher-level competencies. Surprisingly, most of these systems have been designed "from the ground up": to date, there is little evidence that the designers and developers of one system have consulted other systems to see what existing techniques are appropriate to their experience and what can be done better. Several authors have conducted reviews of existing peer assessment approaches [4, 5, 8, 11, 19, 25, 28]. To the best of our knowledge, however, no one has proposed a systematic research framework for exploring and generalizing the affordances and constraints of educational technology-enabled peer assessment systems.

Our Peerlogic project, funded by the National Science Foundation under grants 1432347, 1431856, 1432580, 1432690, and 1431975, is pursuing two primary goals: (1) to systematically explore the domain of technology-enabled peer assessment systems, and (2) to develop an arsenal of web services for a wide range of applications in such systems. We have examined a number of these systems, including such better-known ones as Calibrated Peer Review [21], CritViz [27], CrowdGrader [7], Expertiza [10], Mobius SLIP [2], Peerceptiv [5], and peerScholar [14]. We adopt the term "online peer assessment system" to describe the broad range of computer applications purposefully designed and developed to support student peer review and assessment. Specifically, we define an online peer-assessment system as a web-based application that facilitates the peer assessment workflow, such as collecting submission artifacts, allocating reviewers to critique and/or evaluate designated artifacts submitted by peers, setting deadlines, and guiding reviewers on the format of the qualitative and quantitative feedback. This term covers the class of systems described in the literature as "computer (technology, IT, CIT, ICT, network, internet, web, cloud)-aided (assisted, based, enabled, mediated, supported)" peer assessment (review, evaluation) systems, in any combination of these terms. Online peer-assessment systems are a subset of the general class of social computing systems that involve peer review (including social networking and social-media applications, such as wikis, blogs, and discussion forums), but are distinguished by having specific workflow constraints and being directed at specific educational goals.

The purpose of this paper is to set up a framework for the systematic review and analysis of the current state of online peer assessment systems. We contrast our study with the earlier surveys by Luxton-Reilly [19] and Søndergaard and Mulder [25], which considered the facilities of individual systems one by one and then contrasted them. Our approach is to discuss functionalities of systems and then describe how individual systems realize those functionalities. Thus, in a sense, it is a dual of the earlier papers; alternatively, one might say it applies the jigsaw technique [33] to them. Because of space limitations, this paper only begins to apply the taxonomy, which we will elaborate and extend in a future paper.

We use our framework to examine the affordances and limitations of the systems that have been developed since 2005 and how they address pedagogical, philosophical, and technological decisions. We also exploit the framework to develop a research agenda to guide future studies. In this paper, we begin to address these important research questions: What is the current state of online peer assessment in education? How is technology transforming and advancing student peer review?

We address this study to several audiences: peer assessment researchers, practitioners, system designers, and educational technologists. Researchers in learning analytics can learn what peer-assessment data can be extracted and mined. Software designers can learn what has been designed and implemented in the past. Instructors applying peer-review pedagogy in their classes can find which systems and functionality would best meet their needs. Instructors may turn to ed-tech specialists and instructional designers to answer these questions; thus, the latter also constitute an audience for this work. Conversely, marketers of these systems may identify the unique features of their systems so they can inform their constituencies.
2. FRAMEWORK AND METHODOLOGY
2.1 Framework
We applied a grounded theory approach to construct our framework. First, we identified the use cases occurring in online peer assessment. For this, we used an informal focus group in which faculty using peer assessment in their pedagogy described various situations and scenarios. In addition, academic papers on peer assessment were reviewed and relevant practices were brought to the discussion. Through this discussion of practices, the peer assessment use cases were identified and categorized. Next, these use case categories were formalized as objectives of the peer assessment process. Thus, we obtained a classification of system-independent peer assessment objectives and the respective use cases that support these objectives (Table 1).

Table 1. Primary objectives for online peer assessment systems
- Objective I. Eliciting evaluation: How do student reviewers input evaluation data (quantitative and qualitative, structured and semi-structured)? What input controls are used to elicit responses?
- Objective II. Assessing achievement and generating learning analytics: How are peer assessment results computed and presented to instructors and to students? What assessment metrics can be used?
- Objective III. Structuring automated peer assessment workflow: What is the process of online peer review? What variations of this process exist?
- Objective IV. Reducing or controlling for evaluation biases: How can assessment subjectivity be reduced or controlled for? What metrics of assessment inaccuracy can be used?
- Objective V. Changing the social atmosphere of the learning community: How can online peer assessment be conducted to achieve higher-level learning and other benefits?

These objectives and use cases are system-independent because they are not determined by the system in which they are realized but rather by user needs independent of any system. In this paper, for illustration purposes, we focus only on Objective I (Table 2).

Next, we examined a sample set of online peer assessment systems to identify how these use cases are implemented as functionality (features). In this study, we focus on functionality relevant specifically to student peer-to-peer interactions in the review and assessment process and ignore complementary functionality that is germane to any learning, knowledge management, or communication system (such as learning-object content management). A given use case may be implemented in various systems as different ensembles of features, with varying design options. Therefore, functionality and design options are system-dependent. For each functionality, specific design options were identified and categorized.

Visually, our framework can be represented as hierarchically organized layers, where the top layer comprises objectives; objectives determine use cases, use cases are supported by functionality, and functionality is implemented through specific design options (Figure 1).

Figure 1. Research framework for a taxonomy of online peer assessment systems.
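To make the layered structure concrete, the sketch below encodes the four layers as nested data types and instantiates them with Objective I as it appears in Table 2 below. The class and field names are our own illustrative choices and are not drawn from any surveyed system.

```python
# A minimal sketch of the framework's layers:
# objectives -> use cases -> functionalities -> design options.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Functionality:
    name: str                  # system-dependent feature, e.g., "Rubrics"
    design_options: List[str]  # e.g., ["Holistic", "Specific/analytic"]

@dataclass
class UseCase:
    name: str                  # system-independent use case
    functionalities: List[Functionality] = field(default_factory=list)

@dataclass
class Objective:
    name: str                  # system-independent objective
    use_cases: List[UseCase] = field(default_factory=list)

# Objective I, populated with the entries listed in Table 2.
objective_i = Objective(
    name="I. Eliciting evaluation",
    use_cases=[
        UseCase("Eliciting quantitative peer evaluation", [
            Functionality("Rubrics", ["Holistic", "Specific/analytic"]),
            Functionality("Scales", ["Rating", "Ranking"]),
        ]),
        UseCase("Eliciting qualitative peer evaluation, critiques, comments", [
            Functionality("Critique artifact media types",
                          ["Plain text", "Rich text/hypertext/URL",
                           "Inline file annotation", "Multimedia attachments"]),
            Functionality("Contextualization of critiques",
                          ["Non-contextualized", "Contextualized"]),
        ]),
    ],
)
```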
2.2 Data Collection and Analysis
Data collection was conducted through iterative paper presentations, system demonstrations, and discussions, documented as written notes and video recordings (including screencasts) shared online. Over three years, the authors have reviewed and experimented with multiple available systems, designed and implemented their own systems, systematically reviewed the literature, and collaborated with other creators and users of systems in research and practice.

The identified, categorized, and formalized themes, patterns, use cases, and design choices led to the construction of the framework. We then used this framework to design questionnaires for surveys and structured interviews to collect additional data on each identified system. The collected data were synthesized in a spreadsheet with formally defined "cases" and "variables". Our current sample includes 40 systems described in the literature and found on the web. For the purpose of this paper, we illustrate our analysis with a subsample of selected systems (Figure 2). Finally, the multi-case method will be used to complete our taxonomy and to answer our research questions in the full study.

3. SAMPLE ANALYSIS
To demonstrate the application of our research framework to the analysis of online peer assessment systems in education, in this paper we focus on Objective I, "Eliciting evaluation". We analyze the input mechanisms and controls that students use to conduct peer assessment. In general, the review process involves two tasks: (a) providing quantitative evaluations based on one or more criteria and using some scale, and (b) providing qualitative critiques or comments on peers' artifacts. Therefore, this objective is manifested in two distinct use cases: (I) "Eliciting quantitative peer evaluation" and (II) "Eliciting qualitative peer evaluation, critiquing and commenting". Use case I is supported by two functionalities: the rubrics and scales used for quantitative assessment. Use case II is likewise supported by two functionalities: critique artifact media types and contextualization of critiques (Table 2). Below we present the taxonomy of specific design choices available for these functionalities and illustrate them with examples from specific systems.

Table 2. The application of the research framework to the analysis of Objective I
Objective (system-independent): I. Evaluation elicitation
- Use case I (system-independent): Eliciting quantitative peer evaluation
  Functionality (features) and design options (system-dependent):
  - Rubrics: holistic; specific/analytic
  - Scales: rating; ranking
- Use case II (system-independent): Eliciting qualitative peer evaluation, critiques, comments
  Functionality (features) and design options (system-dependent):
  - Critique artifact media types: plain text; rich text/hypertext/URL; inline file annotation; multimedia attachments
  - Contextualization of critiques: non-contextualized; contextualized

3.1 Eliciting Quantitative Peer Evaluation
3.1.1 Rubrics
Rubrics are used at all levels of education to evaluate a wide variety of products. A rubric is an assessment tool that communicates expectations for an assignment submission. A well-designed rubric consists of three essential components: evaluation criteria, quality-level definitions, and a scoring strategy [20]. Evaluation criteria are the factors deemed important, on which the goodness of the submission will be judged. Quality-level definitions specify achievement levels (e.g., "meets standards", "needs improvement") and help assessors understand what counts as evidence of those levels. The scoring strategy translates reviewer judgments into usable, often numeric, representations.
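As a minimal illustration of these three components, the sketch below models a rubric as a list of criteria, each carrying quality-level definitions and the points they map to. The criterion prompts, level labels, and point values are hypothetical, not taken from any surveyed system.

```python
# Hypothetical sketch of the three rubric components: criteria,
# quality-level definitions, and the points behind a scoring strategy.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Criterion:
    prompt: str             # evaluation criterion shown to the reviewer
    levels: Dict[str, int]  # quality-level definition -> points

@dataclass
class Rubric:
    # A single overall criterion corresponds to a holistic rubric;
    # several distinct criteria correspond to a specific/analytic one.
    criteria: List[Criterion]

essay_rubric = Rubric(criteria=[
    Criterion("Is the argument clearly organized?",
              {"Needs improvement": 1, "Meets standards": 2, "Exceeds standards": 3}),
    Criterion("Are claims supported by evidence?",
              {"Needs improvement": 1, "Meets standards": 2, "Exceeds standards": 3}),
])
```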
Rubrics can be categorized as holistic or specific/analytic [13, 15]. In a holistic rubric, a submission is judged as a whole, with a single value or category representing its overall quality. In contrast, a specific/analytic rubric requires evaluations on several distinct criteria.

In the context of peer review, we found that the term "rubric" has been used more loosely to describe a multitude of evaluative processes and structures. Some systems offer wide flexibility in the design of rubrics, which may or may not contain all three elements, while other systems are more restrictive. Such flexibility leaves assessment decisions to the instructor: the type of rubric, the number of criteria, the number of achievement levels, the point value for each level, and whether to use definitions, numeric scales, or both to delineate achievement levels. For example, in Canvas and Expertiza, a rubric can vary from a series of open-ended questions with no established quality levels or quantitative scores to an elaborate rubric with multiple criteria, detailed definitions, and a complex scoring strategy. In CritViz, a rubric is a set of questions that reviewers have to consider when evaluating peers' submissions. Mobius SLIP supports the creation of a qualitative rubric complete with the essential components but elicits holistic quantitative evaluation (Figure 2).

Typically, online peer review systems such as Expertiza, Calibrated Peer Review, Peerceptiv, and Canvas support specific/analytic rubrics because they generate more detailed feedback that helps students understand their performance on each criterion. Specific rubrics provide a more granular picture of an artifact's strengths and weaknesses and more guidance to students as they complete subsequent revisions or assignments. Some systems, such as Mobius SLIP and CritViz, favor holistic evaluations (even if some specific rubrics are provided); noticeably, these systems also rely on ranking (rather than rating) evaluations. Holistic rubrics make more sense for overall ranking, as it may be tedious for evaluators to rank multiple products on each of several criteria.

Limited choices in rubric design reduce instructors' control over the pedagogical implications of using different rubric types, but free them to focus on other aspects of instruction. Instructors new to assessment may appreciate not having to make too many of these decisions. Some systems fall in the middle, dictating some parameters but allowing flexibility with others. For example, Peerceptiv allows instructors to determine the number of criteria, but requires each criterion to have a 7-point scale unaccompanied by elaborated definitions. If rubric design is a critical factor in an institution's use of the peer review process, instructors must carefully vet and select the system that best fits their assignment and assessment requirements.

In the context of peer review, rubrics are also associated with higher student achievement [18] and higher reliability of peer evaluations [12, 30]. Several studies suggested that students need to engage with rubrics in order for them to be effective [20]. Providing rubrics when an assignment is first given and asking students to complete self- and peer reviews have been shown to be effective ways to facilitate this engagement.

Figure 2. Screenshots of selected online peer assessment systems: (a) Canvas, (b) Expertiza, (c) CritViz, (d) Mobius SLIP.
While rubrics are typically viewed as an assessment tool, many researchers have suggested that they have a second, often overlooked, instructional purpose. When used formatively, rubrics can illuminate strengths and weaknesses and suggest a direction for future improvements. Rubrics help students understand what to change in their work and help educators see where future instruction should be directed. Interestingly, studies of student perceptions of rubrics suggest that students value these formative purposes. Students observed that rubrics clarify the objectives for their work and help them plan their approach, check their work, and reflect on feedback from others. They also report producing better submissions, earning higher grades, and feeling less anxious about assignments when they are provided with a rubric [20].

Empirical studies support students' impressions, providing evidence that rubrics support teaching and learning and contribute to higher achievement [13, 20].

Online peer review systems offer a variety of means for supporting the formative use of rubrics. Some allow different rubrics to be used for different rounds of peer review within a single assignment; others offer calibration to show students how their peer evaluations compare to the instructor's assessment of a selected sample assignment. Many systems allow student achievement scores to be calculated in different ways depending on whether peer review is used for formative or summative purposes. These features, while important to this discussion, are beyond the purview of this paper and will be discussed in a future publication.

3.1.2 Scales
In general, quantitative evaluations may be conducted using either ranking or rating [9]. Rating refers to the comparison of different items using a common absolute, or cardinal, scale (either numeric or categorical). Ranking, sometimes also called forced-distribution rating, means comparing different items directly to one another on a relative, or ordinal, scale [22]. Both ranking and rating have their strengths and weaknesses, and there is still little consensus as to which has greater predictive validity [1, 16, 17].

Generally, ranking and rating are expected to correlate, but some studies have demonstrated that ordinal (ranking-based) evaluations contain significantly less noise than cardinal (rating-based) evaluations [23, 32]. A cardinal scale in the context of peer evaluations is also susceptible to score inflation, whereas an ordinal scale is immune to this problem [9]. When a cardinal scale is used, an evaluator may "smokescreen" his or her preferences by giving all evaluated artifacts the same rating, and may severely inflate scores by giving all artifacts the same high rating (or, similarly, severely degrade scores by giving all artifacts the same low rating). Thus, a cardinal scale is vulnerable to social or personal biases (e.g., "never give the highest rating") and idiosyncratic shocks (e.g., mood or inconsistency in evaluation style). When an ordinal scale is used, an evaluator must construct an explicit total ordering of the artifacts based on their perceived quality [24], which makes the evaluation more robust. Psychological evidence suggests that evaluators are better at making comparative judgments than absolute ones [26, 31].

The ordinal scale also has its drawbacks. It forces evaluators to discriminate between artifacts that may be perceived to have very similar quality as much as between artifacts whose qualities are far apart. Some ordinal scales may implicitly emphasize items earlier in the list and lead to their higher ranking. Evaluating on ordinal scales places a higher cognitive load on evaluators because it requires them to compare multiple items against each other. Thus, rubrics that use ordinal scales tend to contain fewer criteria and, consequently, may not draw evaluators' attention to as many salient features of the artifact under review. Scores in rating-based systems are usually determined by calculating a weighted average of the scores given on various criteria, which means they depend on multiple independent decisions by each evaluator, rather than on a single decision about how to rank a submission relative to others.
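The contrast between the two routes can be illustrated with a short sketch: a rating-based score aggregates several independent per-criterion judgments into a weighted average, whereas a ranking-based evaluation reduces to a single ordering of the assigned artifacts. The criterion names, weights, and values below are invented for illustration and do not model any particular system.

```python
# Illustrative contrast between rating-based and ranking-based evaluation.
from typing import Dict, List

def rating_score(criterion_scores: Dict[str, float],
                 weights: Dict[str, float]) -> float:
    """Rating: aggregate several independent per-criterion judgments
    into one cardinal score via a weighted average."""
    total = sum(weights.values())
    return sum(criterion_scores[c] * w for c, w in weights.items()) / total

def ranking_order(ordinal_judgments: Dict[str, int]) -> List[str]:
    """Ranking: the evaluator makes a single relative judgment (1 = best);
    return the assigned artifacts from best to worst."""
    return sorted(ordinal_judgments, key=ordinal_judgments.get)

# Rating-based: one reviewer, one artifact, three criteria.
print(rating_score({"clarity": 4, "evidence": 3, "style": 5},
                   {"clarity": 0.5, "evidence": 0.3, "style": 0.2}))  # -> 3.9

# Ranking-based: one reviewer orders the whole set of assigned artifacts.
print(ranking_order({"artifact_A": 2, "artifact_B": 1, "artifact_C": 3}))
# -> ['artifact_B', 'artifact_A', 'artifact_C']
```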
Most online peer assessment systems, e.g., Calibrated Peer Review, Peerceptiv, and Expertiza, are rating-based. Typically, a rating scale is presented as a drop-down menu or a validated text box. Ranking-based systems have also been gaining prominence thanks to the strengths of the ordinal evaluation approach. In CritViz, for example, students "drag and drop" submission artifacts to position them in rank order according to the reviewer's perception of their quality. Yet other systems attempt to combine both evaluation scales in a single control. For example, in Mobius SLIP, the SLIP Slider control (Figure 2) records ratings on a 0-100 scale as well as rankings, which can then be used separately to generate analytics and grading data. Naturally, for such controls to function, they must exclude the possibility of assigning the same rating to any two artifacts, although they allow placing two artifacts close to each other to indicate approximately the same level of quality. Another example of a system that supports both ranking and rating scales is peerScholar [14], where the instructor can configure an assignment to use either a rating scale or a ranking scale. Inasmuch as long rubrics also seem to elicit more textual feedback, systems that use ranking may provide the author with less feedback on the quality of the submission and less guidance on how to improve it [35].
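The behavior described for such combined controls can be sketched as follows. This is our own illustration of a control that records a distinct 0-100 position per artifact and derives a ranking from those positions; it is not Mobius SLIP's actual implementation, and all names are hypothetical.

```python
# Sketch of a combined rating/ranking control: distinct 0-100 positions,
# with the ranking derived from those positions (1 = highest-placed).
from typing import Dict, List, Tuple

def place(positions: Dict[str, float], artifact: str, value: float) -> None:
    """Record a slider position, rejecting exact ties so a strict ordering exists."""
    if not 0 <= value <= 100:
        raise ValueError("position must be on the 0-100 scale")
    if value in positions.values():
        raise ValueError("two artifacts cannot share the same position")
    positions[artifact] = value

def derive(positions: Dict[str, float]) -> List[Tuple[str, float, int]]:
    """Return (artifact, rating, rank) triples, rank 1 for the highest-placed artifact."""
    ordered = sorted(positions.items(), key=lambda kv: kv[1], reverse=True)
    return [(art, pos, rank) for rank, (art, pos) in enumerate(ordered, start=1)]

positions: Dict[str, float] = {}
place(positions, "artifact_A", 82.0)
place(positions, "artifact_B", 81.5)  # close to A: similar quality, still strictly ordered
place(positions, "artifact_C", 40.0)
print(derive(positions))
# [('artifact_A', 82.0, 1), ('artifact_B', 81.5, 2), ('artifact_C', 40.0, 3)]
```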
3.2 Eliciting Qualitative Peer Evaluation
3.2.1 Critique Artifact Media Types
Critiques, as the verbal component of reviews, can be provided in different formats. The most obvious and typical design choice is to prompt the reviewer to post a plain-text comment in a text box; most systems provide a web form combining rubric questions and text boxes to fill out. Plain-text feedback is the most basic and arguably the fastest way to provide feedback. Textual critiques can be enhanced by allowing rich-text formatting (varying font faces and sizes, bullet points, alignment, hyperlinks, etc.) using WYSIWYG editors. Including a hyperlink in the text feedback further expands the options by referencing an externally hosted copy of the submission artifact (which can be edited and/or annotated) or by referencing externally hosted multimedia critique artifacts, such as voice and video recordings, screencasts, and HTML documents. Only a few systems (e.g., Canvas) allow internal hosting of multimedia critique artifacts, but arguments have been made that this type of critique substantially improves the provider's efficiency and the recipient's experience.
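The media-type options above can be summarized, purely for illustration, as an enumeration attached to a critique record. The class and field names, and the example URL, are hypothetical and do not describe any surveyed system's data model.

```python
# Hypothetical summary of the critique-artifact media types listed in Table 2.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class CritiqueMedia(Enum):
    PLAIN_TEXT = auto()             # text box on the review form
    RICH_TEXT_OR_URL = auto()       # WYSIWYG formatting and/or links to external artifacts
    INLINE_ANNOTATION = auto()      # comments anchored inside the submission file
    MULTIMEDIA_ATTACHMENT = auto()  # uploaded image, audio, video, or screencast

@dataclass
class Critique:
    reviewer_id: str
    media: CritiqueMedia
    body: str                       # text, markup, or a reference to the stored file
    external_url: Optional[str] = None

note = Critique("rev-17", CritiqueMedia.RICH_TEXT_OR_URL,
                body="<p>Consider restructuring section 2.</p>",
                external_url="https://example.org/annotated-copy")  # placeholder URL
```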
The next step up in providing rich critiques is inline file annotation. Several systems take advantage of third-party APIs that allow inline annotation of submission artifacts uploaded as files. For instance, Mobius SLIP and Canvas utilize a document viewer called Crocodoc, which renders various file formats as an HTML document and allows reviewers to select portions of the document and annotate them in place. Annotation includes highlighting, commenting, and adding text and primitive graphic elements. This feature is similar to adding comments in a Microsoft Word file or a Google Doc. Crocodoc supports both non-anonymous and anonymous file annotation. While the Crocodoc API is used by a number of systems, after its acquisition by Box in 2013 it is expected to be replaced by a new API with similar, and possibly more advanced, inline file annotation functionality. Web annotation [34] is another possible implementation of inline annotation in web-based online peer assessment systems, but no systems in our illustrative sample rely on it; therefore, this option needs to be explored further. To the best of our knowledge, no existing online peer review system offers its own "native," custom-built inline file annotation functionality.

Since text critiques may not offer the expressiveness and clarity that other media can provide, users have been requesting that reviewers be allowed to attach multimedia files containing critique artifacts (e.g., images, voice or video recordings) as an alternative to inserting URLs to such externally hosted files in plain- or rich-text comments. Such an option, for example, would allow reviewers who are more comfortable using traditional media (e.g., pen and paper) to write their critiques offline, scan them into PDF or image files, and then attach them to the original submission artifacts. As another example, some reviewers may be more productive when providing their critiques as voice or screencast recordings made directly in the system. In our sample, only Canvas offers such options, but since they are available in other social learning applications, such as VoiceThread (voicethread.com), it is reasonable to expect increasing availability of such functionality in online peer assessment in the near future.

3.2.2 Contextualization of Critiques
A number of factors influence how well the author of a submission artifact is able to understand and relate to a reviewer's feedback: the spatial relationship of the critique artifacts to the submission artifact, the placement of critiques in the specific context of the submission artifact, and the granularity of comments. For example, directly annotating an issue in a fragment of the submission artifact, rather than trying to explain in an overall, "detached" critique where the issue is located and how to fix it, simplifies communication between the reviewer and the author. Moreover, annotation is more suitable for providing specific, fine-grained comments, while filling out a text box is more appropriate for global comments.

We define this aspect of eliciting qualitative evaluation as the contextualization of critiques. Naturally, the system's interface design determines how much critiques can be contextualized in relation to submission artifacts. Moreover, the interface implementation of other functionalities, such as rubrics, scales, and critique artifact media types, closely interplays with the implementation of critique contextualization. Contextualization of critiques thus has two options: (a) "detached", non-contextualized ("a single comment per submission"); and (b) contextualized ("multiple comments on various fragments of the submission"). While the former is typically available in all systems in our sample, the latter is implemented either as an entry space (text box) associated with a specific criterion or question in the rubric (e.g., Expertiza, CritViz) or as inline file annotation with Crocodoc (e.g., Mobius SLIP, Canvas). Further exploration of this functionality and the design options for its implementation will be provided in the full study.
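A minimal sketch of these two options, with hypothetical field names: a detached critique is a comment with no anchor, while a contextualized critique carries a reference to the fragment of the submission it addresses (here, a pair of character offsets; a real system might anchor to a rubric criterion or a page region instead).

```python
# Hypothetical data model for detached versus contextualized critiques.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Comment:
    text: str
    # None -> detached (one comment on the whole submission);
    # (start, end) character offsets -> contextualized, tied to a fragment.
    anchor: Optional[Tuple[int, int]] = None

@dataclass
class Review:
    submission_id: str
    comments: List[Comment]

review = Review("sub-42", [
    Comment("Overall, the argument is persuasive but the data section is thin."),
    Comment("This claim needs a citation.", anchor=(1043, 1101)),
    Comment("The figure label does not match the caption.", anchor=(2310, 2366)),
])

detached = [c for c in review.comments if c.anchor is None]
contextualized = [c for c in review.comments if c.anchor is not None]
```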
4. CONCLUSION
We have presented our initial attempt at formulating a research framework for a taxonomy of educational online peer assessment systems. This framework enables researchers of technology-supported peer assessment to understand the current landscape of technologies supporting student peer review and assessment, specifically its affordances and constraints. Importantly, this framework helps identify the major research questions in existing and potential research and formulate an agenda for future studies. It also informs educators and system-design practitioners about use cases and design options in this particular branch of educational technology.

Using a grounded theory approach, we identified several primary objectives for online peer assessment systems and combined them into the research framework. To illustrate the application of this framework in this research-in-progress paper, we presented a sample analysis of how the use cases supporting the objective of eliciting quantitative and qualitative peer evaluations are implemented in several different systems. In the future full study, we intend to apply the multi-case method to conduct a complete analysis of the objectives based on a large sample of online peer assessment systems.

5. REFERENCES
[1] Alwin, D. F., & Krosnick, J. A. 1985. The Measurement of Values in Surveys: A Comparison of Ratings and Rankings. Public Opinion Quarterly, 49(4), 535–552. http://doi.org/10.1086/268949
[2] Babik, D., Iyer, L., & Ford, E. 2012. Towards a Comprehensive Online Peer Assessment System: Design Outline. Lecture Notes in Computer Science, 7286 LNCS, 1–8.
[3] Babik, D., Singh, R., Zhao, X., & Ford, E. 2015. What You Think and What I Think: Studying Intersubjectivity in Knowledge Artifacts Evaluation. Information Systems Frontiers. http://doi.org/10.1007/s10796-015-9586-x
[4] Bouzidi, L., & Jaillet, A. 2009. Can Online Peer Assessment Be Trusted? Educational Technology & Society, 12(4), 257–268.
[5] Cho, K., & Schunn, C. D. 2007. Scaffolded Writing and Rewriting in the Discipline: A Web-based Reciprocal Peer Review System. Computers & Education, 48(3), 409–426. http://doi.org/10.1016/j.compedu.2005.02.004
[6] Davies, P. 2000. Computerized Peer Assessment. Innovations in Education and Teaching International, 37(4), 346–355.
[7] De Alfaro, L., & Shavlovsky, M. 2014. CrowdGrader: A Tool for Crowdsourcing the Evaluation of Homework Assignments. In Proceedings of the 45th ACM Technical Symposium on Computer Science Education (pp. 415–420). New York, NY, USA: ACM. http://doi.org/10.1145/2538862.2538900
[8] Doiron, G. 2003. The Value of Online Student Peer Review, Evaluation and Feedback in Higher Education. CDTL Brief, 6(9), 1–2.
[9] Douceur, J. R. 2009. Paper Rating vs. Paper Ranking. ACM SIGOPS Operating Systems Review, 43(2), 117–121.
[10] Gehringer, E., Ehresman, L., Conger, S. G., & Wagle, P. 2007. Reusable Learning Objects Through Peer Review: The Expertiza Approach. Innovate: Journal of Online Education, 3(5), 4.
[11] Gikandi, J. W., Morrow, D., & Davis, N. E. 2011. Online Formative Assessment in Higher Education: A Review of the Literature. Computers & Education, 57(4), 2333–2351. http://doi.org/10.1016/j.compedu.2011.06.004
[12] Hafner, J., & Hafner, P. 2003. Quantitative Analysis of the Rubric as an Assessment Tool: An Empirical Study of Student Peer-Group Rating. International Journal of Science Education, 25(12), 1509–1528.
[13] Jonsson, A., & Svingby, G. 2007. The Use of Scoring Rubrics: Reliability, Validity and Educational Consequences. Educational Research Review, 2(2), 130–144. http://doi.org/10.1016/j.edurev.2007.05.002
[14] Joordens, S., Desa, S., & Paré, D. 2009. The Pedagogical Anatomy of Peer Assessment: Dissecting a peerScholar Assignment. Journal of Systemics, Cybernetics & Informatics, 7(5). Retrieved from http://www.iiisci.org/journal/CV$/sci/pdfs/XE123VF.pdf
[15] Kavanagh, S., & Luxton-Reilly, A. 2016. Rubrics Used in Peer Assessment (pp. 1–6). ACM Press. http://doi.org/10.1145/2843043.2843347
[16] Krosnick, J. A. 1999. Maximizing Questionnaire Quality. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of Political Attitudes (pp. 37–57). San Diego, CA, US: Academic Press.
[17] Krosnick, J. A., Thomas, R., & Shaeffer, E. 2003. How Does Ranking Rate?: A Comparison of Ranking and Rating Tasks. In Conference Papers – American Association for Public Opinion Research.
[18] Liu, C. C., Lu, K. H., Wu, L. Y., & Tsai, C. C. 2016. The Impact of Peer Review on Creative Self-efficacy and Learning Performance in Web 2.0 Learning Activities. Journal of Educational Technology & Society, 19(2), 286–297. Retrieved from http://www.jstor.org/stable/jeductechsoci.19.2.286
[19] Luxton-Reilly, A. 2009. A Systematic Review of Tools That Support Peer Assessment. Computer Science Education, 19(4), 209–232. http://doi.org/10.1080/08993400903384844
[20] Reddy, Y. M., & Andrade, H. 2010. A Review of Rubric Use in Higher Education. Assessment & Evaluation in Higher Education, 35(4), 435–448. http://doi.org/10.1080/02602930902862859
[21] Russell, A. A. 2001. Calibrated Peer Review: A Writing and Critical-Thinking Instructional Tool. UCLA, Chemistry, 2001. Retrieved from http://www.unc.edu/opt-ed/eval/bp_stem_ed/russell.pdf
[22] Schleicher, D. J., Bull, R. A., & Green, S. G. 2008. Rater Reactions to Forced Distribution Rating Systems. Journal of Management, 35(4), 899–927. http://doi.org/10.1177/0149206307312514
[23] Shah, N. B., Bradley, J. K., Parekh, A., Wainwright, M., & Ramchandran, K. 2013. A Case for Ordinal Peer Evaluation in MOOCs. NIPS Workshop on Data Driven Education. Retrieved from http://lytics.stanford.edu/datadriveneducation/papers/shahetal.pdf
[24] Slovic, P. 1995. The Construction of Preference. American Psychologist, 50(5), 364–371. http://doi.org/10.1037/0003-066X.50.5.364
[25] Søndergaard, H., & Mulder, R. 2012. Collaborative Learning Through Formative Peer Review: Pedagogy, Programs and Potential. Computer Science Education, December 2012, 1–25.
[26] Spetzler, C. S., & Stael Von Holstein, C.-A. S. 1975. Probability Encoding in Decision Analysis. Management Science, 22(3), 340–358.
[27] Tinapple, D., Olson, L., & Sadauskas, J. 2013. CritViz: Web-Based Software Supporting Peer Critique in Large Creative Classrooms. Bulletin of the IEEE Technical Committee on Learning Technology, 15(1), 29.
[28] Topping, K. J. 1998. Peer Assessment between Students in Colleges and Universities. Review of Educational Research, 68(3), 249–276. http://doi.org/10.3102/00346543068003249
[29] Topping, K. J. 2005. Trends in Peer Learning. Educational Psychology, 25(6), 631–645. http://doi.org/10.1080/01443410500345172
[30] Vista, A., Care, E., & Griffin, P. 2015. A New Approach Towards Marking Large-scale Complex Assessments: Developing a Distributed Marking System that Uses an Automatically Scaffolding and Rubric-targeted Interface for Guided Peer-review. Assessing Writing, 24, 1–15.
[31] Wang, H., Dash, D., & Druzdzel, M. J. 2002. A Method for Evaluating Elicitation Schemes for Probabilistic Models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 32(1), 38–43.
[32] Waters, A., Tinapple, D., & Baraniuk, R. 2015. BayesRank: A Bayesian Approach to Ranked Peer Grading. In ACM Conference on Learning at Scale, Vancouver.
[33] Jigsaw (teaching technique). 2016, June 20. In Wikipedia, the free encyclopedia. Retrieved from https://en.wikipedia.org/wiki/Jigsaw_(teaching_technique)
[34] Web annotation. 2016, May 20. In Wikipedia, the free encyclopedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Web_annotation&oldid=721222451
[35] Yadav, R. K., & Gehringer, E. F. 2016. Metrics for Automated Review Classification: What Review Data Show. In Y. Li, M. Chang, M. Kravcik, E. Popescu, R. Huang, Kinshuk, & N.-S. Chen (Eds.), State-of-the-Art and Future Directions of Smart Learning (pp. 333–340). Springer Singapore. Retrieved from http://link.springer.com/chapter/10.1007/978-981-287-868-7_41