<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RaMon, a Rating Monitoring System for Educational Environments</article-title>
      </title-group>
      <contrib-group>
<aff id="aff1">
          <label>1</label>
          <institution>Escuela de Ingeniería de Bilbao (UPV/EHU)</institution>
          ,
          <addr-line>Bilbao</addr-line>
          ,
          <country country="ES">Spain</country>
          <email>mikel.v@ehu.eus</email>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Escuela Universitaria de Ingeniería de Vitoria-Gasteiz (UPV/EHU)</institution>
          ,
          <addr-line>Vitoria-Gasteiz</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>When more than one rater is involved in the assessment and scoring of a work, the scores are affected by each rater's thinking processes, knowledge level and personal preferences, among other issues. These idiosyncrasies are known as rater effects and can dramatically affect the evaluation process. Even when instruments such as evaluation rubrics are used to increase the fairness and impartiality of the evaluation, rater effects may still be present and affect the scoring. Rater effects can remarkably influence the final score of those assessable elements in which several raters are involved. Therefore, identifying and trying to avoid those effects is crucial for a fair evaluation. However, identifying these effects is not always an easy task, and scoring leaders need tools that help them in this process. This paper presents RaMon, a system for monitoring raters and controversial evaluations using visualization techniques. The authors have tested the system using data from a course with more than 100 evaluations made by 15 raters, which has helped to detect several rater effects.</p>
      </abstract>
      <kwd-group>
        <kwd>Scoring leaders</kwd>
        <kwd>rater effects</kwd>
        <kwd>monitoring</kwd>
        <kwd>visualizations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
All formal educational environments imply some kind of assessment or scoring of the work done. In some cases, there is only one teacher involved in the evaluation, but in other cases, e.g., Final Year Projects or Doctoral Theses, the evaluation is performed by several raters. When the evaluation is carried out by more than one rater, monitoring both the scores and the raters is required, as there can be an important rater effect on the final mark of a work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Rater effects are systematic patterns in evaluation behaviour that may arise unconsciously, due to the different personal perceptions and tendencies of the raters, or deliberately, to bias some student's score in a positive or negative direction. To guarantee the quality and fairness of the evaluation, rater effects have to be detected and avoided.
      </p>
      <p>
With the purpose of avoiding rater effects and guaranteeing a fair marking that truly reflects the student's performance, the standardization of the assessment criteria is the first step [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, accomplishing a uniform marking standard for all the students in those works assessed by several raters is difficult, even with settled criteria. A staff member may indicate a very good performance level for a student on a particular criterion while another staff member may grade it just as adequate [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The analysis of these differences may reveal the different behaviors and cognitive processes of the raters during the assessment, and could facilitate the adoption of remediation actions such as improving rater selection, training, or monitoring procedures in the evaluation processes. Those actions could help reduce or minimize the impact of rater inaccuracy or bias in scores and improve the assessment procedure [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
In many situations, the data gathered during an evaluation process may include different students, with different works, and each work being scored by different raters, so analyzing it to detect rater effects is not trivial. It is therefore important to provide software that automates some aspects of rater monitoring [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; for example, by analyzing statistics related to particular raters and automatically detecting certain scoring patterns.
      </p>
<p>This paper presents RaMon (Rating Monitoring), a system that helps monitor evaluations and detect and measure rater effects. The system provides automatic analysis of statistics and graphical visualizations that help detect rater effects and controversial evaluations.</p>
      <p>
RaMon has been tested in the assessment of Final Year Projects (FYP). In the context of FYPs, assuring impartial and unbiased evaluations is very difficult due to the existence of different evaluation boards and the large number of raters involved [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
This field has been chosen for two main reasons: (1) the assessment of FYPs has been identified as one of the major concerns and problems in FYP development, and (2) the authors of this paper have been working intensively on the improvement of the development and evaluation processes of Final Year Projects. In order to overcome the problems in FYPs, the authors proposed a methodology involving formative rubric-based assessment [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. The implementation of the new methodology and the use of rubrics have helped make the assessment less obscure and more objective, as the evaluation criteria are known by both students and lecturers, and a higher level of coherence and agreement in the assessment has been achieved [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, controversies in several evaluations were observed and, thus, the need arose for a means to supervise the evaluation process in order to assure its fairness.
      </p>
<p>This paper first presents the necessity of monitoring ratings. Next, a visual rating monitoring system called RaMon is presented. Afterwards, the two main monitoring aspects of RaMon are described: monitoring of raters and monitoring of controversial evaluations. Finally, some conclusions and future work are presented.</p>
    </sec>
    <sec id="sec-2">
      <title>Rating monitoring necessity</title>
<p>
        Monitoring ratings in those contexts where multiple raters are involved is crucial to assure a fair evaluation. In the literature, different rater effects have been identified [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>Leniency/Severity effect: the rater's tendency to give significantly lower (severity) or higher (leniency) scores than those given by other raters.</p>
        </list-item>
        <list-item>
          <p>Central tendency effect: the tendency to give scores only from the middle of the scale, avoiding the highest and lowest values.</p>
        </list-item>
        <list-item>
          <p>Randomness effect: giving scores that are inconsistent with those of the other raters. This effect can appear if the rater does not know the evaluation criteria or lacks sufficient knowledge to assess the work.</p>
        </list-item>
        <list-item>
          <p>Halo/Horn effect: the bias in which the rater always gives a student similar grades based on some preconceived impression, rather than considering the assessment criteria for the work being evaluated.</p>
        </list-item>
        <list-item>
          <p>Differential Leniency/Severity effect: the tendency to bias, in a positive or negative way, the scores of a particular group of students for some purpose.</p>
        </list-item>
      </list>
      <p>
To identify these kinds of effects, statistical analyses, including summaries of score distributions that depict the performance of each rater, are usually carried out [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. When analyzing the rating patterns of a rater, at least the mean scoring and the discriminability should be examined [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Mean scoring refers to the mean level of scoring of each rater, whereas discriminability is related to the dispersion of a rater's scores across different ratees. These data allow evaluating the leniency/severity effect: if the mean scoring of a rater is very high, the rater may be too lenient; if the mean is too low, the rater may be too severe.
      </p>
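      <p>To make these two statistics concrete, the following Python sketch (illustrative only, with synthetic data; it is not RaMon's actual implementation) computes the mean scoring and the dispersion for each rater from a list of (rater, project, score) records:</p>
      <preformat>
# Minimal sketch: per-rater mean scoring and discriminability,
# from hypothetical (rater, project, score) records.
from collections import defaultdict
from statistics import mean, stdev

ratings = [
    ("r1", "p1", 8.5), ("r2", "p1", 7.0),
    ("r1", "p2", 6.0), ("r3", "p2", 9.0), ("r3", "p3", 8.0),
]

by_rater = defaultdict(list)
for rater, project, score in ratings:
    by_rater[rater].append(score)

for rater, scores in sorted(by_rater.items()):
    m = mean(scores)                               # mean scoring level
    d = stdev(scores) if len(scores) > 1 else 0.0  # dispersion (discriminability)
    print(f"{rater}: mean={m:.2f}  dispersion={d:.2f}  n={len(scores)}")
      </preformat>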
<p>This information can be visualized in different ways. For example, Fig. 1 shows this information for those raters who have carried out at least three evaluations in the assessment context used throughout the paper (FYPs). The visualization options include traditional boxplots (Fig. 1a) or violin plots (Fig. 1b), where, in addition to the grade distribution, the density of the grades for each value is also shown.</p>
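      <p>Plots like those in Fig. 1 can be produced with standard tools. The following sketch, using matplotlib and synthetic scores rather than the course data, draws both visualization options side by side:</p>
      <preformat>
# Sketch: per-rater score distributions as box and violin plots (synthetic data).
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
raters = [f"rater {i}" for i in range(1, 6)]
scores = [rng.normal(loc=7 + 0.3 * i, scale=1.0, size=12).clip(0, 10)
          for i in range(5)]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4))
ax_box.boxplot(scores)                          # Fig. 1a style
ax_box.set_xticks(range(1, 6), raters)
ax_box.set_ylabel("score")
ax_violin.violinplot(scores, showmedians=True)  # Fig. 1b style
ax_violin.set_xticks(range(1, 6), raters)
plt.tight_layout()
plt.show()
      </preformat>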
      <p>
In this figure, rater 4 presents a suspicious performance: a restricted score range and a high average. But is rater 4 really a lenient rater, or is this mainly due to the high quality of the works assessed? To answer this question, the validity of the ratings should also be considered. This can be achieved using one of two approaches: an accuracy framework or an agreement framework [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The former estimates the quality of the scores by comparing the scores of the raters with true scores, whereas the latter compares the score of each rater with those given by the others.
      </p>
      <p>The first approach is suitable, for example, in contexts in which students carry out
a peer-review process and the lecturers provide a real evaluation. However, it cannot
be implemented, for instance, for the Final Year Project evaluation where a true score
is not available.</p>
      <p>In addition, identifying the cases in which a controversial evaluation has occurred
is necessary, because even if a rater effect has not been previously detected, an
evaluation with significant differences among raters may indicate some kind of problem
that needs to be analyzed.</p>
<p>Monitoring both raters and controversial evaluations allows detecting problems and taking remediation actions to improve the fairness of the assessment. The next section presents RaMon, a system that relies on visualization techniques to provide monitoring of raters and controversial evaluations.</p>
    </sec>
    <sec id="sec-3">
      <title>RaMon, a visual rating monitoring system</title>
<p>As stated in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], visualization is an important part of the learning analytics area [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which tries to improve the understanding of learning and its processes. Visualizations can help gain a deeper insight into the evaluation process and improve pedagogical interventions [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. RaMon relies on visualizations for monitoring both raters and rated assignments in order to detect rater effects and to find controversial evaluations.</p>
<p>In addition, RaMon allows the scoring leaders to define alarms that are raised whenever a rater or a rated assignment with an agreement or an accuracy below a settled threshold is detected.</p>
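      <p>The following sketch illustrates the idea behind such alarms (the function name, variable names and threshold value are hypothetical, not RaMon's actual API): any rater or assignment whose agreement falls below the threshold settled by the scoring leader is flagged:</p>
      <preformat>
# Illustrative sketch of RaMon-style threshold alarms (names are hypothetical).
AGREEMENT_THRESHOLD = 0.6  # hypothetical value settled by the scoring leader

def raise_alarms(agreement_by_entity, threshold=AGREEMENT_THRESHOLD):
    """Return raters or assignments whose agreement falls below the threshold."""
    return {entity: value
            for entity, value in agreement_by_entity.items()
            if value &lt; threshold}

alarms = raise_alarms({"project 67": 0.31, "project 12": 0.82, "rater 7": 0.55})
for entity, value in alarms.items():
    print(f"ALARM: {entity} (agreement {value:.2f})")
      </preformat>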
<sec id="sec-3-1">
        <title>Evaluation monitoring</title>
        <p>Fig. 2 shows an example of an alarm when a controversial evaluation has been identified. Clicking on the alarm icon allows the user to visualize the information of the evaluations that raised the alarm.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Rater Monitoring</title>
        <p>When an alarm regarding raters' performance is highlighted, the system behaves in a similar way, allowing the user to directly access the suspicious raters' information.</p>
        <p>The next sections describe in detail some of the visualization capabilities that RaMon provides for monitoring both raters and controversial evaluations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Monitoring of raters</title>
<p>As shown in Fig. 1, rater 4 presents a very small score range and a high mean rating. But is rater 4 really a lenient rater? This behavior could also be due to the high quality of the works graded or the small number of works assessed. In order to answer this question, more information is required, e.g., the information provided by either an accuracy framework or an agreement framework.</p>
<p>RaMon allows the user to analyze this information through the visualizations depicted in the next sections.</p>
      <sec id="sec-4-1">
        <title>Distribution of ratings</title>
<p>When analyzing the dispersion of ratings, box or violin plots such as those shown in Fig. 1 are not enough to monitor the raters and extract accurate conclusions about the presence of rater effects.</p>
        <p>One of the factors that can affect the scoring dispersion is the number of projects
each rater has evaluated. When this number is very small, it is not rare to have a small
scoring range. Therefore, RaMon can enrich the information provided with the
number of projects evaluated by each rater as shown in Fig. 3. The users can choose to
visualize the data using violin plots, as shown in the figure, or box plots according to
their preferences.</p>
<p>In this case (see Fig. 3), rater 4 has evaluated a relatively small number of projects, so the small score range might be influenced by this factor. However, comparing raters 4 and 3, the sizes of their score ranges are very different whilst the numbers of evaluated projects are similar.</p>
<p>Using an accuracy framework or an agreement framework can provide deeper insight into the raters' performance, including the fairness of the scores. As the true score is not available in FYP evaluation, an agreement framework is used and, thus, the visualization of the score distribution is enriched with the average agreement score of each rater (see Fig. 4). According to the information shown in Fig. 1 or Fig. 3, rater 4 could be identified as suspicious of being lenient, i.e., always giving very good marks. However, analyzing the data in Fig. 4, it can be seen that rater 4 has a high agreement score. Therefore, the small dispersion of the marks of rater 4 is probably due to the quality of the projects this rater has evaluated.</p>
<p>On the other hand, even if raters 7 and 8 have a higher dispersion, they have a smaller agreement with the other members of the evaluation board. Although this agreement score does not allow detecting rater effects by itself, raters 7 and 8 should be analyzed in more detail using other statistics.</p>
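        <p>The paper does not detail the formula behind the agreement score; one simple possibility, sketched below with invented scores, is to average, per rater, the absolute difference between each of the rater's grades and the mean grade given by the other board members to the same project:</p>
        <preformat>
# Sketch of one possible agreement measure (an assumption, not RaMon's formula):
# compare each rater's score with the mean score of the other board members.
from collections import defaultdict

# scores[project][rater] = grade given by that rater to that project
scores = {
    "p1": {"r1": 8.0, "r2": 7.5, "r3": 8.5},
    "p2": {"r1": 6.0, "r2": 6.5, "r4": 9.5},
}

diffs = defaultdict(list)
for project, board in scores.items():
    for rater, grade in board.items():
        others = [g for r, g in board.items() if r != rater]
        if others:
            diffs[rater].append(abs(grade - sum(others) / len(others)))

for rater, ds in sorted(diffs.items()):
    # Smaller mean absolute difference means higher agreement with the board.
    print(f"{rater}: mean abs. deviation from board = {sum(ds)/len(ds):.2f}")
        </preformat>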
<p>As previously described, raters sometimes give scores based on personal criteria or interests (Differential Leniency/Severity effect). In many institutions, the supervisor of a FYP is a member of the evaluation board, and raters might perform differently depending on whether they are rating their own pupil's work or others'. Therefore, detecting differences in the distribution of the ratings according to the role of the rater (supervisor or member of the evaluation board) might also be helpful.</p>
<p>RaMon provides different visualizations, such as the violin plots shown in Fig. 5, to analyze the dispersion of each rater's scores according to their role in the rated project: member of the evaluation board only, or supervisor.</p>
<p>All the raters shown in Fig. 5, with the exception of rater 8, present very different plots depending on their role. It can be inferred that those raters tend to give higher marks, with smaller dispersions, when they supervise the project being evaluated (see, for example, rater 6). This behavior might be considered evidence of the Differential Leniency effect.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Deviations in the ratings</title>
<p>Analyzing the dispersion of the ratings is interesting but provides limited information for detecting whether a rater is lenient or harsh. To detect this aspect, the deviation of each rater from the true score is required. However, as mentioned above, this information is not available in FYP evaluation. In contexts where the true score is not available, RaMon uses the average score of the work. Moreover, when the rating is obtained by aggregating the ratings of several elements, the system allows comparing the deviation from the project average for the different components in order to detect in which aspects the raters are more critical.</p>
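        <p>As an illustration of this computation (with synthetic data; the component names merely follow the FYP example), the deviation of each rater from the project average can be obtained separately for each assessed component:</p>
        <preformat>
# Sketch: deviation of each rater's score from the project average,
# computed separately per assessed component (synthetic data).
from collections import defaultdict

# ratings[(project, component)][rater] = score
ratings = {
    ("p1", "Final report"): {"r1": 8.0, "r2": 6.5},
    ("p1", "Oral defense"): {"r1": 7.0, "r2": 7.5},
    ("p2", "Final report"): {"r1": 9.0, "r3": 7.0},
}

deviation = defaultdict(lambda: defaultdict(list))
for (project, component), board in ratings.items():
    avg = sum(board.values()) / len(board)
    for rater, score in board.items():
        deviation[component][rater].append(score - avg)

for component, per_rater in deviation.items():
    for rater, devs in sorted(per_rater.items()):
        m = sum(devs) / len(devs)  # positive: lenient; negative: severe
        print(f"{component} / {rater}: mean deviation {m:+.2f}")
        </preformat>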
        <p>For example, Fig. 6 shows the deviations for the Final report (a) and the Oral
defense (b). In this case, it can be derived that raters 7 and 8 are lenient whereas raters 2
and 6 seem to be more severe in the assessment of the Final report (Fig. 6a).</p>
<p>However, it is also interesting to analyze both plots together to see differences according to the evaluable element being rated. For example, according to Fig. 6, rater 15 seems to be very lenient for the Final report (a) whilst being very severe for the Oral defense (b). This can be due to the rater giving greater relevance to the presentation and evaluating it more thoughtfully, or to the rater not having read the Final report very carefully and preferring not to be very severe in its evaluation.</p>
<p>RaMon also provides the means to analyze the difference in behavior between those projects under the supervision of the rater and those in which he or she has only been a member of the evaluation board, in order to identify Differential Leniency/Severity effects.</p>
<p>Fig. 7 shows the deviation from the average of the Final report for different raters. In this figure, it can be detected that raters 3 and 9 are more severe when evaluating projects they have not supervised.</p>
<p>This kind of figure might enrich the analysis done from previous visualizations. For example, rater 4, who was suspected of giving high marks according to the initial analysis, turns out to be harsher than his or her counterparts in the evaluation board, because his or her grades are a bit lower than the average even in the projects under his or her supervision.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Analysis of the rubrics</title>
<p>So far, all the visualizations shown have been limited to the overall score distribution, regardless of the way the score has been computed. However, when the score is computed using evaluation rubrics, RaMon supplies further analysis capabilities.</p>
        <p>
These capabilities help analyze the tendencies of raters when performing a rubric-based evaluation. The frequency distribution of ratings, especially when graphically shown, helps detect each rater's tendencies [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. It makes evident whether raters tend to select the upper or lower categories (Leniency/Severity effect) or the middle ones (Central tendency effect).
        </p>
<p>For example, Fig. 8 shows the frequency distribution of the performance levels selected for each dimension of the Oral defense rubric. Analyzing this plot, it can be observed that raters 6 and 8 have a greater tendency to select higher performance levels for the projects under their supervision in certain dimensions (the Content dimension for rater 6; the Content and Time dimensions for rater 8) whilst using the whole range of performance levels for projects supervised by others.</p>
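        <p>Such a frequency distribution can be derived with a simple tally. The sketch below (with synthetic records and hypothetical level names) counts how often each rater selects each performance level, per dimension and per role:</p>
        <preformat>
# Sketch: frequency of each rubric performance level selected by a rater,
# split by dimension and by role (supervisor vs. board member); synthetic data.
from collections import Counter, defaultdict

# Each record: (rater, role, dimension, performance_level)
records = [
    ("r6", "supervisor", "Content", "Excellent"),
    ("r6", "board", "Content", "Adequate"),
    ("r8", "supervisor", "Time", "Excellent"),
    ("r8", "board", "Time", "Poor"),
    ("r8", "supervisor", "Time", "Excellent"),
]

freq = defaultdict(Counter)
for rater, role, dimension, level in records:
    freq[(rater, role, dimension)][level] += 1

for rater, role, dimension in sorted(freq):
    print(f"{rater} ({role}, {dimension}): {dict(freq[(rater, role, dimension)])}")
        </preformat>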
<p>In addition, RaMon can enrich this visualization by using different colors and transparency levels according to the number of evaluations made or the agreement level of the raters. This way, if the grades from a rater are very biased but the rater has made few evaluations, the rater effect can be considered less conclusive.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Monitoring of controversial evaluations</title>
<p>In order to identify controversial evaluations, RaMon can use an accuracy framework when a true assessment is available, or an agreement framework otherwise. Plotting the accuracy or agreement value can help identify those evaluations in which suspicious behaviors are occurring (Fig. 9).</p>
<p>Alternatively, RaMon can show the scores given to each assignment by each rater. Fig. 10 shows all the assignments where rater 7 has been part of the evaluation board. In the example used throughout this paper, the evaluation board for each assignment was formed by 2 or 3 raters.</p>
<p>Analyzing either Fig. 9 or Fig. 10, it can be observed that there is a problem in the evaluation of project 67. Its agreement score is very low and there is a rater (number 7) who has given a remarkably higher grade than the other members of the evaluation board. Therefore, this project should be analyzed in more detail.</p>
<p>Once this situation is detected for any project, RaMon offers different ways to analyze the details of the evaluation, considering each dimension of the rubrics used. For example, Fig. 11 shows a heatmap for the Final report rubric for project 67.</p>
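      <p>A heatmap of this kind can be rendered, for instance, with matplotlib; the sketch below uses invented performance levels, rater identifiers and dimension names merely to illustrate the layout of Fig. 11:</p>
      <preformat>
# Sketch of a Fig. 11-style rubric heatmap for a single project (synthetic data).
import matplotlib.pyplot as plt
import numpy as np

dimensions = ["Structure", "Content", "Conclusions", "Bibliography"]
raters = ["rater 6", "rater 7", "rater 12"]
# levels[i][j]: performance level chosen by rater j for dimension i (1-4 scale)
levels = np.array([[1, 4, 2],
                   [2, 4, 3],
                   [3, 3, 3],
                   [2, 3, 2]])

fig, ax = plt.subplots()
im = ax.imshow(levels, cmap="viridis", aspect="auto")
ax.set_xticks(range(len(raters)), raters)
ax.set_yticks(range(len(dimensions)), dimensions)
fig.colorbar(im, label="performance level")
plt.show()
      </preformat>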
<p>Although the raters agree on the performance level for the Conclusions dimension, and there are no great differences in the Bibliography dimension, there are great differences in all the other dimensions. In this case, rater 7 always gives significantly higher scores whilst rater 6 tends to give lower scores. This suggests that the evaluation of this work should be reviewed to assure a fair score.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and future work</title>
<p>In educational environments where assessment is carried out by several raters, monitoring the evaluation results can be useful to assure the fairness of the process. In this paper, RaMon, a system for rating monitoring in educational environments, has been presented.</p>
<p>RaMon supplies different visual ways of analyzing the assessment information that might help identify different rater effects and controversial evaluations. To this end, in addition to the mean scoring and the dispersion of the grades, RaMon uses diverse metrics (e.g., agreement or accuracy) to determine the quality of the ratings.</p>
<p>RaMon has been applied in the context of the evaluation of Final Year Projects, where more than 100 projects were evaluated by 15 raters. The visualizations provided have helped detect different issues regarding both the raters and the project evaluations.</p>
<p>For example, RaMon has helped find differentially lenient raters and identify some controversial project evaluations (i.e., projects with low agreement among the evaluation board members).</p>
<p>Considering this information, remediation actions could be taken to improve the assessment process. For example, differentially lenient raters can be warned so that they become less biased, and controversial evaluations can be reviewed by other raters, trying to achieve a fairer assessment.</p>
      <p>In the near future, RaMon is going to be applied in more courses where multi-rater
evaluations are carried out. Moreover, in some of these courses, the accuracy
framework is going to be used to analyze the rater effects in a student peer-review
assignment where a true score, given by the teaching staff, is available.</p>
<p>In addition, the availability of more data about the academic record of each student will allow analyzing the performance of a student over time, trying to detect the presence of Halo/Horn effects in the evaluations.</p>
<p>Acknowledgements. This work is supported by the Basque Government (IT980-16), the University of the Basque Country UPV/EHU (EHUA16/22) and SNOLA, a Thematic Network of Excellence officially recognized by the Spanish Ministry of Economy and Competitiveness (TIN2015-71669-REDT).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
<string-name>
            <surname>Engelhard</surname>
            ,
            <given-names>G., Jr.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Unfolding Rater Accuracy in Performance Assessments</article-title>
          .
          <source>Rasch Meas. Trans</source>
          .
          <volume>28</volume>
          ,
          <fpage>1489</fpage>
          -
          <lpage>1491</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>K.L.</given-names>
          </string-name>
          :
          <article-title>Statistical analysis of final year project marks in the computer engineering undergraduate program</article-title>
          .
          <source>IEEE Trans. Educ</source>
          .
          <volume>44</volume>
          ,
          <fpage>258</fpage>
          -
          <lpage>261</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Teo</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>D.J.:</given-names>
          </string-name>
          <article-title>A systematic approach to the implementation of final year project in an electrical engineering undergraduate course</article-title>
          .
          <source>IEEE Trans. Educ</source>
          .
          <volume>41</volume>
          ,
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Rater effects in creativity assessment: A mixed methods investigation</article-title>
          .
          <source>Think. Ski. Creat</source>
          .
          <volume>15</volume>
          ,
          <fpage>13</fpage>
          -
          <lpage>25</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Wolfe</surname>
            ,
            <given-names>E.W.</given-names>
          </string-name>
          :
          <article-title>Identifying rater effects using latent trait models</article-title>
          .
          <source>Psychol. Sci</source>
          .
          <volume>46</volume>
          ,
          <fpage>35</fpage>
          -
          <lpage>51</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Wolfe</surname>
            ,
            <given-names>E.W.</given-names>
          </string-name>
          :
          <article-title>Methods for monitoring rating quality: Current practices and suggested changes</article-title>
          .
          <source>Iowa City IA Pearson</source>
          .
(
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Valderrama</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rullan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pons</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mans</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giné</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiménez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peig</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Guidelines for the final year project assessment in engineering</article-title>
. In: Proceedings of the IEEE Frontiers in Education Conference. pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
. IEEE Computer Society, San Antonio, Texas, USA
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Villamañe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Análisis y mejora de los marcos actuales de desarrollo y evaluación de los Trabajos Fin de Grado mediante el uso de las TIC, (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Villamañe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrero</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Álvarez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larrañaga</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arruarte</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elorriaga</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Dealing with common problems in engineering degrees' Final Year Projects</article-title>
. In: Proceedings of the IEEE Frontiers in Education Conference. pp.
          <fpage>2663</fpage>
          -
          <lpage>2670</lpage>
          . IEEE Computer Society, Madrid (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Villamañe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Álvarez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larrañaga</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrero</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          : Desarrollo y validación de un conjunto de rúbricas para la evaluación de Trabajos Fin de Grado.
          <source>ReVisión</source>
          .
          <volume>10</volume>
          ,
          <fpage>17</fpage>
          -
          <lpage>27</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Myford</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolfe</surname>
            ,
            <given-names>E.W.</given-names>
          </string-name>
          :
          <article-title>Detecting and measuring rater effects using many-facet Rasch measurement: part I</article-title>
          .
          <source>J. Appl. Meas</source>
          .
          <volume>4</volume>
          ,
          <fpage>386</fpage>
          -
          <lpage>422</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>K.F.E.</given-names>
          </string-name>
          ,
<string-name>
            <surname>Kwong</surname>
            ,
            <given-names>J.Y.Y.</given-names>
          </string-name>
          :
          <article-title>Effects of rater goals on rating patterns: Evidence from an experimental field study</article-title>
          .
          <source>J. Appl. Psychol</source>
          .
          <volume>92</volume>
          ,
          <fpage>577</fpage>
          -
          <lpage>585</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
<string-name>
            <surname>Tinsley</surname>
            ,
            <given-names>H.E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          :
          <article-title>Interrater Reliability and Agreement</article-title>
          . In: Tinsley, H.E.A., Brown, S.D. (eds.)
          <source>Handbook of Applied Multivariate Statistics and Mathematical Modeling</source>
          . pp.
          <fpage>95</fpage>
          -
          <lpage>124</lpage>
          . Academic Press, San Diego (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kay</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bull</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>New Opportunities with Open Learner Models and Visual Learning Analytics</article-title>
          . In: Conati, C., Heffernan, N., Mitrovic, A., Verdejo, M.F. (eds.)
          <source>Artificial Intelligence in Education</source>
          . pp.
          <fpage>666</fpage>
          -
          <lpage>669</lpage>
          . Springer International Publishing, Cham (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Siemens</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Learning analytics: envisioning a research discipline and a domain of practice</article-title>
          . In: Proceedings of the International Conference on Learning Analytics and Knowledge. pp.
          <fpage>4</fpage>
          -
          <lpage>8</lpage>
          . ACM (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Pardo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dawson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning Analytics: How can Data be used to Improve Learning Practice</article-title>
          . In: Reimann, P., Bull, S., Kickmeier-Rust, M., Vatrapu, R.K., Wasson, B. (eds.)
          <source>Measuring and visualizing learning in the information-rich classroom</source>
          . pp.
          <fpage>41</fpage>
          -
          <lpage>55</lpage>
          . Routledge (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Tervakari</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silius</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koro</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paukkeri</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pirttilä</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Usefulness of information visualizations based on educational data</article-title>
          . In: Proceedings of the IEEE Global Engineering Education Conference. pp.
          <fpage>142</fpage>
          -
          <lpage>151</lpage>
          . IEEE Computer Society (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>