=Paper=
{{Paper
|id=Vol-2308/isee2019paper04
|storemode=property
|title=Toward the Automatic Assessment of Text Exercises
|pdfUrl=https://ceur-ws.org/Vol-2308/isee2019paper04.pdf
|volume=Vol-2308
|authors=Jan Philip Bernius,Bernd Bruegge
|dblpUrl=https://dblp.org/rec/conf/se/BerniusB19
}}
==Toward the Automatic Assessment of Text Exercises==
Jan Philip Bernius, Bernd Bruegge
Department of Informatics, Technical University of Munich, Munich, Germany
janphilip.bernius@tum.de, bruegge@in.tum.de
Abstract—Exercises are an essential part of learning. Manual assessment of exercises requires effort from instructors and can also lead to quality problems and inconsistencies between assessments. Especially with growing student populations, this also leads to delayed grading, and it becomes more and more difficult to provide individual feedback.

The goal is to provide timely responses to homework submissions in large classes. By reducing the required effort for assessments, instructors can invest more time in supporting students and providing individual feedback.

This paper argues that automated assessment provides more individual feedback for students, combined with quicker feedback and grading cycles. We introduce a concept for the automatic assessment of text exercises using machine learning techniques. We also describe our plans to use this concept in a case study with 1900 students.
I. INTRODUCTION AND PROBLEM

Instructors face a large population of students in their courses. Students require feedback on their exercises to reflect on their progress [1]. The concept of interactive learning [2, 3] helps to increase the interaction between instructors and students, but also increases the workload for instructors. Software engineering students need to learn constructive and creative capabilities. It is important for the instructor to facilitate the problem-solving learning process. Concrete problem-solving strategies are taught in paradigms accepted by the profession [4]. Each paradigm provides a set of problem-solving exercises. These are usually textual exercises that involve the application of problem-solving techniques.

Exercises are a proven method to train higher cognitive skills, including the acquisition of domain-specific knowledge, analysis and design methods, and the evaluation of the results. Trivial exercises, such as multiple-choice quizzes, do not stimulate higher cognitive skills and do not reflect engineers' daily work [1].

Exercises help students to learn, understand and apply a paradigm. A student needs feedback to reflect on and improve their solution to the exercise. The assessment of text exercises demands time-intensive effort from instructors, preventing them from spending time on improving their lectures, having discussions with their students, or updating exercises to incorporate technology evolution.

Increasing student populations make it harder to keep assessments fair and of equal quality. Students do not benefit from quantitative feedback alone [5]. Qualitative feedback helps students to improve. Splitting assessment efforts among multiple instructors can lead to inconsistencies. Providing timely or instant feedback in a large class is hard [6]. Waiting for feedback delays the students' learning progress and hinders interactive learning. We strive toward a system that provides automated text assessments based on instructor feedback, decreasing students' feedback waiting times.

This paper is structured as follows: Section I introduces the domain and outlines the problems with the current correction process for text exercises. Our vision is described in Section II in the form of a visionary scenario. Section III describes the assessment workflow of a possible implementation and VIRTUAL ONE-TO-ONE, a machine learning based mechanism for providing individualized feedback for students in large classes. Section IV proposes our evaluation approach. Section V discusses applicability and limitations of the system. We present related work in Section VI, and Section VII concludes the paper.
II. VISIONARY SCENARIO

The following scenario describes how we envision improving the assessment of text exercises:

Anna and Tom are students participating in a software engineering course. During a lecture, the instructor starts an in-class text exercise to be completed in the assessment system. Anna and Tom both submit a solution to the system. The instructor starts manually assessing a set of submissions selected by the system. The system asks the instructor to assess Anna's solution. The instructor provides a score and a comment explaining his assessment. After receiving the assessment, the system decides to assess Tom's solution automatically based on the assessments provided previously. Anna and Tom get individual feedback for their solutions to reflect on their learning progress.

Tom is not satisfied with his submission after receiving his feedback. He decides to improve his work and resubmits a refined version of his solution. The system automatically assesses Tom's resubmission and provides a new assessment. Tom is now satisfied with his assessment and finishes the exercise.
III. ASSESSMENT WORKFLOW

In a first prototypical implementation, we extend the ArTEMiS system, which is already capable of automatically assessing programming and modelling exercises [1, 7], by adding semi-automated text assessment. A student submits their solution for a text exercise to the ArTEMiS system.
Fig. 1. Automatic assessment workflow, considering manual and automatic assessment.
The activity diagram in Fig. 1 depicts the assessment workflow. The system supports two means of assessment: manual assessment provided by the instructor (Section III-A) and automatic assessment generated by the system based on an assessment model (Section III-B). ArTEMiS decides which assessment method is required for each submission based on the quality of the assessment model. Both means of assessment provide a set of feedback items. The assessment of the submission is a composition of all feedback items. The final score is the sum of all feedback scores (see Fig. 2). Students review the assessment of their submission. If they are not satisfied, they can submit a refined solution for assessment, enabling continuous interactive learning [1] with text exercises.
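To make this composition concrete, the following is a minimal sketch of the entities and the routing decision described above. It is not taken from the ArTEMiS codebase: the class names, the `automatic` flag, and the `CONFIDENCE_THRESHOLD` value are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Illustrative threshold; the paper does not fix a concrete criterion.
CONFIDENCE_THRESHOLD = 0.8

@dataclass
class Feedback:
    """One feedback item: a score and a comment for a single text block."""
    text_block: str
    score: float
    comment: str
    automatic: bool = False  # automatic items act as proxies for manual ones

@dataclass
class Assessment:
    """An assessment composes feedback items; the total score is their sum."""
    feedback_items: list[Feedback] = field(default_factory=list)

    def total_score(self) -> float:
        return sum(item.score for item in self.feedback_items)

def route_submission(model_confidence: float) -> str:
    """Decide between manual and automatic assessment based on model quality."""
    return "automatic" if model_confidence >= CONFIDENCE_THRESHOLD else "manual"
```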
A. Manual Assessment

ArTEMiS selects text exercise submissions for manual assessment by instructors if the assessment model does not allow for a confident assessment. Instructors are used to grading exercises using a set of rubrics. A rubric defines a set of traits of the students' submissions, which are evaluated based on a scale [9]. Rubrics can exist at different levels of detail, such as only listing aspects of the assignment or defining different scoring levels. If instructors do not define a rubric explicitly beforehand, they build a rubric in their mind while assessing. Instructors break down a submission into blocks and match each block with a rubric. As illustrated in Fig. 3, instructors define text blocks themselves as a phrase, sentence or paragraph by selecting a piece of text as they see fit. They assess each block quantitatively and qualitatively using a score and a feedback comment (see Feedback in Fig. 2).

Fig. 2. The relevant entities in the system, depicted in a class diagram. A student creates a submission for a text exercise. An assessment is a composition of multiple feedback items referencing text blocks. A feedback item can be a manual or an automatic feedback item. An instructor provides manual feedback. Automatic feedback items are a proxy [8] for manual feedback items. A similarity cluster aggregates the vector representations of text blocks. The assessment model consists of many similarity clusters.

Fig. 3. Assessment of a student submission for the problem statement "Explain the difference between the bridge pattern and the strategy pattern." Example question taken from an EIST exam. Instructors define text blocks to build up their assessment. Each block is assessed with a score and a feedback text. The total score is based on all feedback items in the assessment.
B. Automatic Assessment

ArTEMiS assesses submissions automatically if the quality of the assessment model allows for a confident assessment. The assessment model is trained based on the manual assessments of text blocks provided by instructors. Fig. 4 depicts the automatic assessment process.

Fig. 4. The automatic assessment process, zoomed into the "Assess automatically" activity in Fig. 1.

For automatic assessment, submissions first need to be broken down into text blocks automatically. Second, a vector representation of the text blocks is calculated as an input value for further computations. Third, an assessment needs to be generated for each text block.

A first, simple approach is to use sentences as text blocks. We split submissions into sentences using delimiter characters (. : ? !) or line breaks. In a later stage, we plan to apply techniques such as topic modelling for text block calculation if the simple approach does not provide sufficient results. All text blocks need feedback to complete an assessment.
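The delimiter-based splitting can be sketched in a few lines. This is a minimal illustration of the simple approach just described, not the ArTEMiS implementation; the regular expression is an assumption.

```python
import re

def split_into_text_blocks(submission: str) -> list[str]:
    """Split a submission into sentence-level text blocks using the
    delimiters (. : ? !) followed by whitespace, or line breaks."""
    blocks = re.split(r"[.:?!]\s+|\n+", submission)
    # Drop empty fragments and surrounding whitespace.
    return [block.strip() for block in blocks if block.strip()]

print(split_into_text_blocks(
    "The bridge pattern decouples an abstraction from its implementation. "
    "The strategy pattern selects an algorithm at runtime."
))
```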
ArTEMiS calculates a vector representation for each text block. To do so, blocks are translated into a multi-dimensional vector space, following the word2vec algorithm [10, 11] and its doc2vec extension for sentences and paragraphs [12]. The algorithm can employ different strategies to calculate one-hot word vectors.
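As a sketch of this step, the following trains a doc2vec model [12] on the text blocks of all submissions and infers a vector for a new block. It uses the gensim library as one possible implementation; the hyperparameter values are illustrative assumptions, not values from the paper.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Text blocks gathered from all submissions of the same exercise.
blocks = [
    "the bridge pattern decouples an abstraction from its implementation",
    "the strategy pattern selects an algorithm at runtime",
]

# Tag each block so the model learns one vector per block.
corpus = [TaggedDocument(words=block.split(), tags=[i])
          for i, block in enumerate(blocks)]

# Illustrative hyperparameters; tuning is left to the evaluation.
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# Infer a vector representation for a block of a new submission.
vector = model.infer_vector("the strategy pattern is a behavioral pattern".split())
```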
Using the resulting vector representations, we apply cluster analysis to detect clusters of submission blocks [13] across all submissions of the same exercise. These clusters list the different statements submitted by all students as part of their solutions.
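A possible sketch of this step is shown below. The paper cites correlation clustering [13]; as a simpler stand-in, this example uses agglomerative clustering over cosine distance from scikit-learn, since the number of clusters is not known in advance. The distance threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Doc2vec vectors of all text blocks, one row per block
# (random placeholders here, standing in for the inferred vectors).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 100))

# Cluster without fixing the number of clusters; blocks closer than the
# cosine-distance threshold end up in the same similarity cluster.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.4,  # illustrative value
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(vectors)  # labels[i] = cluster index of block i
```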
Our primary assumption is that a single feedback item can be valid for text blocks from multiple submissions. Feedback for a text block can be applied to other text blocks within the same similarity cluster. This allows the system to provide VIRTUAL ONE-TO-ONE feedback: real instructor feedback is automatically applied to equivalent text blocks in a new submission. ArTEMiS chooses a previously assessed text block located closely within the same similarity cluster, the nearest neighbour. The instructor feedback is selected for the new submission, and ArTEMiS creates an automatic feedback item, a proxy for the manual feedback item (see Fig. 2).

If a cluster does not contain a manual feedback item, the system decides that an automatic assessment is not possible and requests a manual assessment from the instructor.
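The feedback transfer can be sketched as a nearest-neighbour lookup within the block's similarity cluster. This is a minimal illustration under the assumptions above; the function and data layout are hypothetical, and returning `None` models the fall-back to manual assessment.

```python
import numpy as np

def transfer_feedback(new_vector, cluster_vectors, cluster_feedback):
    """Pick the manually assessed block closest to the new block (cosine
    similarity) and reuse its feedback as an automatic feedback item.
    Returns None if the cluster holds no manual feedback, which triggers
    a manual assessment request."""
    assessed = [(vec, fb) for vec, fb in zip(cluster_vectors, cluster_feedback)
                if fb is not None]
    if not assessed:
        return None  # no manual feedback in this cluster -> assess manually

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_vec, best_fb = max(assessed, key=lambda pair: cosine(new_vector, pair[0]))
    # The automatic item is a proxy for the manual feedback item (see Fig. 2).
    return {"score": best_fb["score"], "comment": best_fb["comment"],
            "automatic": True}
```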
IV. EVALUATION APPROACH

We plan to conduct a case study to evaluate the automated assessment quality in the Introduction to Software Engineering (EIST) lecture taught at the Technical University of Munich to 1900 students. Students in the course complete weekly homework exercises. We will use the system for text exercise submissions and assessments in two stages.

As the first stage, we conduct a shadow test using our prototypical implementation. The learners submit their solutions to a text question using our system. Instructors establish a truth set by assessing all submissions manually. Automatic assessment is not used during this stage. The truth set will be used for a quantitative evaluation of the automatic assessment accuracy by comparing automatic assessments with the corresponding manual assessments.

Hypothesis 1: Automatic assessments of text exercises following the presented concept produce results identical to manual assessments with an accuracy greater than 85%.

In a qualitative study, we will interview the instructors to analyze the block-based assessment concept (Sec. III-A) and its applicability to grading and providing feedback.

Hypothesis 2: The assessment concept allows capturing all feedback necessary for the assessment of text exercises. No information is lost compared to traditional assessment.

In the second stage, we will conduct a second study in a later EIST lecture to evaluate the complete automatic assessment workflow. We will evaluate how many manual assessments are needed to generate accurate assessments, and the effects on assessment time.

Hypothesis 3: Employing automatic assessment can save more than 50% of the total required assessment time for all submissions. The assessment time per submission will increase compared to paper-based assessments.

A qualitative study with student interviews will assess the usefulness of automated feedback for students. Further, we want to understand students' feelings toward automatic feedback.
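As a small illustration of how Hypothesis 1 could be checked against the truth set, the following computes the share of text blocks whose automatic score matches the manual score. The data layout is a hypothetical simplification; the paper does not fix an exact agreement metric.

```python
def assessment_accuracy(automatic_scores, manual_scores):
    """Fraction of text blocks where the automatic score equals the
    manual truth-set score; Hypothesis 1 expects a value above 0.85."""
    matches = sum(a == m for a, m in zip(automatic_scores, manual_scores))
    return matches / len(manual_scores)

print(assessment_accuracy([2, 0, 1, 2], [2, 0, 2, 2]))  # 0.75
```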
V. DISCUSSION

We discuss the applicability, limitations and implications of automatic text assessment. Feedback generated following the concepts introduced in this paper can only be as good as the feedback provided by the instructor. The system supports the assessment process by automating the repetitive work involved in assessing text submissions.

Grading based on automatic assessment leads to ethical problems. It is unclear whether non-native language or special figures of speech could lead to decreased scores. Applications in grading should be preceded by an extensive evaluation of assessment quality. Applications in grading are out of scope for this paper; we propose application in a two-phase grading process only. We intend to apply the system as a learning-support system. The generated feedback should help students during their learning progress and should not be used during a grading process.

The applicability of the described system depends on the variety of possible solutions. Exercises with a variable answer space require more knowledge for assessment, increasing the complexity. The system focuses on assessing exercises from the lower spectrum of the revised Bloom's Taxonomy: Remember, Understand, Apply and Analyze [14]. Exercises of these categories provide a lower variability of possible solutions and therefore limit the number of similarity clusters. Exercises from the categories Evaluate and Create are out of scope for this paper.

The design of the system allows for a hybrid assessment approach. A future system could combine manual and automatic feedback to further reduce the effort for instructors. This could be especially useful if a certain aspect of the solution has a larger variability. A possible example is an exercise asking for two definitions and a comparison of the terms. The variability for the definitions is small, but the variability for the comparison part is larger. A hybrid approach allows instructors to focus the manual assessment on the comparison part as soon as the definitions can be assessed confidently.
VI. RELATED WORK

Kiefer and Pado suggest a system to simplify the grading process by presenting responses to instructors in a sorted manner [15]. Submissions are sorted by similarity to a defined sample solution. Terms used in both the sample solution and the submission are highlighted. The tool supports instructors during the grading process but does not automatically assess submissions. The only criterion is the sample solution. Instructor assessments are not considered for the following submissions.
Wolska et al. and Basu et al. suggest a grading process where instructors grade submissions sorted by clusters of similar submissions, for exercises in the domains of German as a foreign language [16] and the United States Citizenship Exam [17]. They propose clusters of entire submissions, in contrast to the text block based clustering approach presented in this paper. Basu et al. introduce grading of an entire cluster of submissions as a single action [17].
Gradescope Inc. offers its tool Gradescope, a commercial solution for grading assistance and "AI-assisted Grading". Their core product offers a rubric-based grading system, allowing instructors to define a set of scores with feedback comments per exercise. Instructors manually select rubrics for each submission. Changes to the scores and comments in a rubric are applied to previously assessed submissions. The "AI-assisted Grading" feature creates groups of submissions (compare with similarity clusters), allowing the instructor to select rubrics for the entire group of submissions, similar to the approach of Basu et al. [17]. The automatic creation of groups is limited to multiple-choice and fill-in-the-blank exercises. It does not offer an automatic grouping of text questions.
These works focus on traditional exam assessment. Their primary objective is an accelerated grading process rather than providing feedback through comments. The focus of our approach is primarily on providing more qualitative feedback to students on homework and in-class assignments.

VII. CONCLUSION

Assessments of text exercises today require time-intensive effort from instructors. We argue that an automated process to generate VIRTUAL ONE-TO-ONE feedback can reduce assessment effort for instructors and increase the amount of feedback for students. The system should use machine learning techniques to detect text blocks of the same meaning in submissions and automatically link real instructor feedback to equivalent blocks.

REFERENCES

[1] S. Krusche and A. Seitz, "Increasing the Interactivity in Software Engineering MOOCs - A Case Study," in 31st Conference on Software Engineering Education and Training, 2019.
[2] D. Kolb, Experiential Learning: Experience As The Source Of Learning And Development. Prentice Hall, 1984, vol. 1.
[3] S. Krusche, A. Seitz, J. Börstler, and B. Bruegge, "Interactive Learning: Increasing Student Participation through Shorter Exercise Cycles," in 19th Australasian Computing Education Conf. ACM, 2017, pp. 17–26.
[4] T. S. Kuhn, The Structure of Scientific Revolutions. University of Chicago Press, 1996.
[5] P. Sadler and E. Good, "The Impact of Self- and Peer-Grading on Student Learning," Educational Assessment, vol. 11, no. 1, pp. 1–31, Feb. 2006.
[6] G. Jerse and M. Lokar, "Providing Better Feedback for Students Using Projekt Tomo," in 1st ISEE Workshop, 2018, pp. 28–31.
[7] S. Krusche and A. Seitz, "ArTEMiS - An Automatic Assessment Management System for Interactive Learning," in 49th Technical Symposium on Computer Science Education. ACM, 2018.
[8] B. Bruegge and A. Dutoit, Object-Oriented Software Engineering Using UML, Patterns, and Java, 3rd ed. Prentice Hall, 2009.
[9] B. E. Walvoord and V. J. Anderson, Effective Grading: A Tool for Learning and Assessment in College, 2nd ed. Jossey-Bass, 2009.
[10] J. Mitchell and M. Lapata, "Vector-based Models of Semantic Composition," in 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2008, pp. 236–244.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," CoRR, vol. 1301.3781, 2013.
[12] Q. Le and T. Mikolov, "Distributed Representations of Sentences and Documents," in 31st International Conference on Machine Learning, vol. 32, 2014, pp. II-1188–II-1196.
[13] N. Bansal, A. Blum, and S. Chawla, "Correlation Clustering," Machine Learning, vol. 56, no. 1–3, pp. 89–113, Jul. 2004.
[14] D. Krathwohl, "A Revision of Bloom's Taxonomy: An Overview," Theory into Practice, vol. 41, no. 4, pp. 212–218, 2002.
[15] C. Kiefer and U. Pado, "Freitextaufgaben in Online-Tests – Bewertung und Bewertungsunterstützung" [Free-text exercises in online tests: grading and grading support], HMD Praxis der Wirtschaftsinformatik, vol. 52, no. 1, pp. 96–107, Feb. 2015.
[16] M. Wolska, A. Horbach, and A. Palmer, "Computer-Assisted Scoring of Short Responses: The Efficiency of a Clustering-Based Approach in a Real-Life Task," in Advances in Natural Language Processing. Springer, 2014, pp. 298–310.
[17] S. Basu, C. Jacobs, and L. Vanderwende, "Powergrading: a Clustering Approach to Amplify Human Effort for Short Answer Grading," Transactions of the Association for Computational Linguistics, vol. 1, pp. 391–402, 2013.