=Paper=
{{Paper
|id=Vol-2308/isee2019paper04
|storemode=property
|title=Toward the Automatic Assessment of Text Exercises
|pdfUrl=https://ceur-ws.org/Vol-2308/isee2019paper04.pdf
|volume=Vol-2308
|authors=Jan Philip Bernius,Bernd Bruegge
|dblpUrl=https://dblp.org/rec/conf/se/BerniusB19
}}
==Toward the Automatic Assessment of Text Exercises==
Jan Philip Bernius, Bernd Bruegge
Department of Informatics, Technical University of Munich, Munich, Germany
janphilip.bernius@tum.de, bruegge@in.tum.de
Abstract—Exercises are an essential part of learning. Manual assessment of exercises requires effort from instructors and can also lead to quality problems and inconsistencies between assessments. Especially with growing student populations, this also leads to delayed grading, and it becomes more and more difficult to provide individual feedback.

The goal is to provide timely responses to homework submissions in large classes. By reducing the required effort for assessments, instructors can invest more time in supporting students and providing individual feedback.

This paper argues that automated assessment provides more individual feedback for students, combined with quicker feedback and grading cycles. We introduce a concept for the automatic assessment of text exercises using machine learning techniques. We also describe our plans to use this concept in a case study with 1900 students.
I. INTRODUCTION AND PROBLEM

Instructors face a large population of students in their courses. Students require feedback on their exercises to reflect on their progress [1]. The concept of interactive learning [2, 3] helps to increase the interaction between instructors and students, but also increases the workload for instructors. Software engineering students need to learn constructive and creative capabilities. It is important for the instructor to facilitate the problem-solving learning process. Concrete problem-solving strategies are taught in paradigms accepted by the profession [4]. Each paradigm provides a set of problem-solving exercises. These are usually textual exercises that involve the application of problem-solving techniques.

Exercises are a proven method to train higher cognitive skills, including the acquisition of domain-specific knowledge, analysis and design methods, and the evaluation of the results. Trivial exercises, such as multiple-choice quizzes, do not stimulate higher cognitive skills and do not reflect engineers' daily work [1].

Exercises help students to learn, understand and apply a paradigm. A student needs feedback to reflect on and improve their solution to the exercise. The assessment of text exercises demands time-intensive effort from instructors, preventing them from spending time on improving their lectures, having discussions with their students, or updating exercises to incorporate technology evolution.

Increasing student populations make it harder to keep assessments fair and of equal quality. Students do not benefit from quantitative feedback alone [5]. Qualitative feedback helps students to improve. Splitting assessment efforts among multiple instructors can lead to inconsistencies. Providing timely or instant feedback in a large class is hard [6]. Waiting for feedback delays the students' learning progress and hinders interactive learning. We strive toward a system that provides automated text assessments based on instructor feedback, decreasing students' feedback waiting times.

This paper is structured as follows: Section I introduces the domain and outlines the problems with the current correction process for text exercises. Our vision is described in Section II in the form of a visionary scenario. Section III describes the assessment workflow of a possible implementation and VIRTUAL ONE-TO-ONE, a machine learning based mechanism for providing individualized feedback for students in large classes. Section IV proposes our evaluation approach. Section V discusses applicability and limitations of the system. We present related work in Section VI, and Section VII concludes the paper.
II. VISIONARY SCENARIO

The following scenario describes how we envision improving the assessment of text exercises:

Anna and Tom are students participating in a software engineering course. During a lecture, the instructor starts an in-class text exercise to be completed in the assessment system. Anna and Tom both submit a solution to the system. The instructor starts manually assessing a set of submissions selected by the system. The system asks the instructor to assess Anna's solution. The instructor provides a score and a comment explaining his assessment. After receiving the assessment, the system decides to assess Tom's solution automatically based on the assessments provided previously. Anna and Tom get individual feedback for their solutions to reflect on their learning progress.

Tom is not satisfied with his submission after receiving his feedback. He decides to improve his work and resubmits a refined version of his solution. The system automatically assesses Tom's resubmission and provides a new assessment. Tom is now satisfied with his assessment and finishes the exercise.
III. ASSESSMENT WORKFLOW

In a first prototypical implementation, we extend the ArTEMiS system, which is already capable of automatically assessing programming and modelling exercises [1, 7], by adding semi-automated text assessment. A student submits their solution for a text exercise to the ArTEMiS system.
Fig. 1. Automatic assessment workflow, considering manual and automatic assessment.
The activity diagram in Fig. 1 depicts the assessment workflow. The system supports two means of assessment: manual assessment provided by the instructor (Section III-A) and automatic assessment generated by the system based on an assessment model (Section III-B). ArTEMiS decides which assessment method is required for each submission based on the quality of the assessment model. Both means of assessment provide a set of feedback items. The assessment of the submission is a composition of all feedback items. The final score is the sum of all feedback scores (see Fig. 2). Students review the assessment of their submission. If they are not satisfied, they can submit a refined solution for assessment, enabling continuous interactive learning [1] with text exercises.
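To make this composition concrete, the following is a minimal sketch of the entities and the routing decision described above. It is not taken from the ArTEMiS codebase: the class names, the `automatic` flag, and the `CONFIDENCE_THRESHOLD` value are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Illustrative threshold; the paper does not fix a concrete criterion.
CONFIDENCE_THRESHOLD = 0.8

@dataclass
class Feedback:
    """One feedback item: a score and a comment for a single text block."""
    text_block: str
    score: float
    comment: str
    automatic: bool = False  # automatic items act as proxies for manual ones

@dataclass
class Assessment:
    """An assessment composes feedback items; the total score is their sum."""
    feedback_items: list[Feedback] = field(default_factory=list)

    def total_score(self) -> float:
        return sum(item.score for item in self.feedback_items)

def route_submission(model_confidence: float) -> str:
    """Decide between manual and automatic assessment based on model quality."""
    return "automatic" if model_confidence >= CONFIDENCE_THRESHOLD else "manual"
```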
A. Manual Assessment

ArTEMiS selects text exercise submissions for manual assessment by instructors if the assessment model does not allow for a confident assessment. Instructors are used to grading exercises using a set of rubrics. A rubric defines a set of traits of the students' submissions, which are evaluated based on a scale [9]. Rubrics can exist at different levels of detail, such as only listing aspects of the assignment or defining different scoring levels. If instructors do not define a rubric explicitly beforehand, they build a rubric in their mind while assessing. Instructors break down a submission into blocks and match each block with a rubric. As illustrated in Fig. 3, instructors define text blocks themselves as a phrase, sentence or paragraph by selecting a piece of text as they see fit. They assess each block quantitatively and qualitatively using a score and a feedback comment (see Feedback in Fig. 2).

Fig. 2. The relevant entities in the system, depicted in a class diagram. A student creates a submission for a text exercise. An assessment is a composition of multiple feedback items referencing text blocks. A feedback item can be a manual or an automatic feedback item. An instructor provides manual feedback. Automatic feedback items are a proxy [8] for manual feedback items. A similarity cluster aggregates the vector representations of text blocks. The assessment model consists of many similarity clusters.

Fig. 3. Assessment of a student submission for the problem statement "Explain the difference between the bridge pattern and the strategy pattern." Example question taken from an EIST exam. Instructors define text blocks to build up their assessment. Each block is assessed with a score and a feedback text. The total score is based on all feedback items in the assessment.
B. Automatic Assessment

ArTEMiS assesses submissions automatically if the quality of the assessment model allows for a confident assessment. The assessment model is trained based on the manual assessments of text blocks provided by instructors. Fig. 4 depicts the automatic assessment process.

Fig. 4. The automatic assessment process, zoomed into the "Assess automatically" activity in Fig. 1.

For automatic assessment, submissions first need to be broken down into text blocks automatically. Second, a vector representation of the text blocks is calculated as an input value for further computations. Third, an assessment needs to be generated for each text block.

A first, simple approach is to use sentences as text blocks. We split submissions into sentences using delimiter characters (. : ? !) or line breaks. In a later stage, we plan to apply techniques such as topic modelling for text block calculation if the simple approach does not provide sufficient results. All text blocks need feedback to complete an assessment.
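The delimiter-based splitting can be sketched in a few lines. This is a minimal illustration of the simple approach just described, not the ArTEMiS implementation; the regular expression is an assumption.

```python
import re

def split_into_text_blocks(submission: str) -> list[str]:
    """Split a submission into sentence-level text blocks using the
    delimiters (. : ? !) followed by whitespace, or line breaks."""
    blocks = re.split(r"[.:?!]\s+|\n+", submission)
    # Drop empty fragments and surrounding whitespace.
    return [block.strip() for block in blocks if block.strip()]

print(split_into_text_blocks(
    "The bridge pattern decouples an abstraction from its implementation. "
    "The strategy pattern selects an algorithm at runtime."
))
```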
ArTEMiS calculates a vector representation for each text block. To do so, blocks are translated into a multi-dimensional vector space, following the word2vec algorithm [10, 11] and its doc2vec extension for sentences and paragraphs [12]. The algorithm can employ different strategies to calculate one-hot word vectors.
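As a sketch of this step, the following trains a doc2vec model [12] on the text blocks of all submissions and infers a vector for a new block. It uses the gensim library as one possible implementation; the hyperparameter values are illustrative assumptions, not values from the paper.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Text blocks gathered from all submissions of the same exercise.
blocks = [
    "the bridge pattern decouples an abstraction from its implementation",
    "the strategy pattern selects an algorithm at runtime",
]

# Tag each block so the model learns one vector per block.
corpus = [TaggedDocument(words=block.split(), tags=[i])
          for i, block in enumerate(blocks)]

# Illustrative hyperparameters; tuning is left to the evaluation.
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# Infer a vector representation for a block of a new submission.
vector = model.infer_vector("the strategy pattern is a behavioral pattern".split())
```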
Using the resulting vector representations, we apply cluster analysis to detect clusters of submission blocks [13] across all submissions of the same exercise. These clusters list the different statements submitted by all students as part of their solutions.
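A possible sketch of this step is shown below. The paper cites correlation clustering [13]; as a simpler stand-in, this example uses agglomerative clustering over cosine distance from scikit-learn, since the number of clusters is not known in advance. The distance threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Doc2vec vectors of all text blocks, one row per block
# (random placeholders here, standing in for the inferred vectors).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 100))

# Cluster without fixing the number of clusters; blocks closer than the
# cosine-distance threshold end up in the same similarity cluster.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.4,  # illustrative value
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(vectors)  # labels[i] = cluster index of block i
```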
Our primary assumption is that a single feedback item can be valid for text blocks from multiple submissions. Feedback for a text block can be applied to other text blocks within the same similarity cluster. This allows the system to provide VIRTUAL ONE-TO-ONE feedback: real instructor feedback is automatically applied to equivalent text blocks in a new submission. ArTEMiS chooses a previously assessed text block located closely within the same similarity cluster, the nearest neighbour. The instructor feedback is selected for the new submission, and ArTEMiS creates an automatic feedback item, a proxy for the manual feedback item (see Fig. 2).

If a cluster does not contain a manual feedback item, the system decides that an automatic assessment is not possible and requests a manual assessment from the instructor.
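The feedback transfer can be sketched as a nearest-neighbour lookup within the block's similarity cluster. This is a minimal illustration under the assumptions above; the function and data layout are hypothetical, and returning `None` models the fall-back to manual assessment.

```python
import numpy as np

def transfer_feedback(new_vector, cluster_vectors, cluster_feedback):
    """Pick the manually assessed block closest to the new block (cosine
    similarity) and reuse its feedback as an automatic feedback item.
    Returns None if the cluster holds no manual feedback, which triggers
    a manual assessment request."""
    assessed = [(vec, fb) for vec, fb in zip(cluster_vectors, cluster_feedback)
                if fb is not None]
    if not assessed:
        return None  # no manual feedback in this cluster -> assess manually

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_vec, best_fb = max(assessed, key=lambda pair: cosine(new_vector, pair[0]))
    # The automatic item is a proxy for the manual feedback item (see Fig. 2).
    return {"score": best_fb["score"], "comment": best_fb["comment"],
            "automatic": True}
```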
IV. EVALUATION APPROACH

We plan to conduct a case study to evaluate the automated assessment quality in the Introduction to Software Engineering (EIST) lecture taught at the Technical University of Munich to 1900 students. Students in the course complete weekly homework exercises. We will use the system for text exercise submissions and assessments in two stages.

As the first stage, we conduct a shadow test using our prototypical implementation. The learners submit their solutions to a text question using our system. Instructors establish a truth set by assessing all submissions manually. Automatic assessment is not used during this stage. The truth set will be used for a quantitative evaluation of the automatic assessment accuracy by comparing automatic assessments with the corresponding manual assessments.

Hypothesis 1: Automatic assessments of text exercises following the presented concept produce results identical to manual assessments with an accuracy greater than 85%.

In a qualitative study, we will interview the instructors to analyze the block-based assessment concept (Sec. III-A) and its applicability to grading and providing feedback.

Hypothesis 2: The assessment concept allows capturing all feedback necessary for the assessment of text exercises. No information is lost compared to traditional assessment.

In the second stage, we will conduct a second study in a later EIST lecture to evaluate the complete automatic assessment workflow. We will evaluate how many manual assessments are needed to generate accurate assessments, and the effects on assessment time.

Hypothesis 3: Employing automatic assessment can save more than 50% of the total required assessment time for all submissions. The assessment time per submission will increase compared to paper-based assessments.

A qualitative study with student interviews will assess the usefulness of automated feedback for students. Further, we want to understand students' feelings toward automatic feedback.
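As a small illustration of how Hypothesis 1 could be checked against the truth set, the following computes the share of text blocks whose automatic score matches the manual score. The data layout is a hypothetical simplification; the paper does not fix an exact agreement metric.

```python
def assessment_accuracy(automatic_scores, manual_scores):
    """Fraction of text blocks where the automatic score equals the
    manual truth-set score; Hypothesis 1 expects a value above 0.85."""
    matches = sum(a == m for a, m in zip(automatic_scores, manual_scores))
    return matches / len(manual_scores)

print(assessment_accuracy([2, 0, 1, 2], [2, 0, 2, 2]))  # 0.75
```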
V. DISCUSSION

We discuss the applicability, limitations and implications of automatic text assessment. Feedback generated following the concepts introduced in this paper can only be as good as the feedback provided by the instructor. The system supports the assessment process by automating the repetitive work involved in assessing text submissions.

Grading based on automatic assessment leads to ethical problems. It is unclear whether non-native language or special figures of speech could lead to decreased scores. Applications in grading should be preceded by an extensive evaluation of assessment quality. Applications in grading are out of scope for this paper; we propose application in a two-phase grading process only. We intend to apply the system as a learning-support system. The generated feedback should help students during their learning progress and should not be used during a grading process.

The applicability of the described system depends on the variety of possible solutions. Exercises with a variable answer space require more knowledge for assessment, increasing the complexity. The system focuses on assessing exercises from the lower spectrum of the revised Bloom's Taxonomy: Remember, Understand, Apply and Analyze [14]. Exercises of these categories provide a lower variability of possible solutions and therefore limit the number of similarity clusters. Exercises from the categories Evaluate and Create are out of scope for this paper.

The design of the system allows for a hybrid assessment approach. A future system could combine manual and automatic feedback to further reduce the effort for instructors. This could be especially useful if a certain aspect of the solution has a larger variability. A possible example is an exercise asking for two definitions and a comparison of the terms. The variability for the definitions is small, but the variability for the comparison part is larger. A hybrid approach allows instructors to focus the manual assessment on the comparison part as soon as the definitions can be assessed confidently.
VI. RELATED WORK

Kiefer and Pado suggest a system to simplify the grading process by presenting responses to instructors in a sorted manner [15]. Submissions are sorted by similarity to a defined sample solution. Terms used in both the sample solution and the submission are highlighted. The tool supports instructors during the grading process but does not automatically assess submissions. The only criterion is the sample solution. Instructor assessments are not considered for the following submissions.
Wolska et al. and Basu et al. suggest a grading process where instructors grade submissions sorted by clusters of similar submissions, for exercises in the domains of German as a foreign language [16] and the United States Citizenship Exam [17]. They propose clusters of entire submissions, in contrast to the text block based clustering approach presented in this paper. Basu et al. introduce grading of an entire cluster of submissions as a single action [17].
Gradescope Inc. offers its tool Gradescope, a commercial solution for grading assistance and "AI-assisted Grading". Their core product offers a rubric-based grading system, allowing instructors to define a set of scores with feedback comments per exercise. Instructors manually select rubrics for each submission. Changes to the scores and comments in a rubric are applied to previously assessed submissions. The "AI-assisted Grading" feature creates groups of submissions (compare with similarity clusters), allowing the instructor to select rubrics for the entire group of submissions, similar to the approach of Basu et al. [17]. The automatic creation of groups is limited to multiple-choice and fill-in-the-blank exercises. It does not offer an automatic grouping of text questions.
These works focus on traditional exam assessment. Their primary objective is an accelerated grading process rather than providing feedback through comments. The focus of our approach is primarily on providing more qualitative feedback to students on homework and in-class assignments.

VII. CONCLUSION

Assessments of text exercises today require time-intensive effort from instructors. We argue that an automated process to generate VIRTUAL ONE-TO-ONE feedback can reduce assessment effort for instructors and increase the amount of feedback for students. The system should use machine learning techniques to detect text blocks of the same meaning in submissions and automatically link real instructor feedback to equivalent blocks.

REFERENCES

[1] S. Krusche and A. Seitz, "Increasing the Interactivity in Software Engineering MOOCs - A Case Study," in 31st Conference on Software Engineering Education and Training, 2019.
[2] D. Kolb, Experiential Learning: Experience As The Source Of Learning And Development. Prentice Hall, 1984, vol. 1.
[3] S. Krusche, A. Seitz, J. Börstler, and B. Bruegge, "Interactive Learning: Increasing Student Participation through Shorter Exercise Cycles," in 19th Australasian Computing Education Conf. ACM, 2017, pp. 17–26.
[4] T. S. Kuhn, The Structure of Scientific Revolutions. University of Chicago Press, 1996.
[5] P. Sadler and E. Good, "The Impact of Self- and Peer-Grading on Student Learning," Educational Assessment, vol. 11, no. 1, pp. 1–31, Feb. 2006.
[6] G. Jerse and M. Lokar, "Providing Better Feedback for Students Using Projekt Tomo," in 1st ISEE Workshop, 2018, pp. 28–31.
[7] S. Krusche and A. Seitz, "ArTEMiS - An Automatic Assessment Management System for Interactive Learning," in 49th Technical Symposium on Computer Science Education. ACM, 2018.
[8] B. Bruegge and A. Dutoit, Object-Oriented Software Engineering Using UML, Patterns, and Java, 3rd ed. Prentice Hall, 2009.
[9] B. E. Walvoord and V. J. Anderson, Effective Grading: A Tool for Learning and Assessment in College, 2nd ed. Jossey-Bass, 2009.
[10] J. Mitchell and M. Lapata, "Vector-based Models of Semantic Composition," in 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2008, pp. 236–244.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," CoRR, vol. 1301.3781, 2013.
[12] Q. Le and T. Mikolov, "Distributed Representations of Sentences and Documents," in 31st International Conference on Machine Learning, vol. 32, 2014, pp. II-1188–II-1196.
[13] N. Bansal, A. Blum, and S. Chawla, "Correlation Clustering," Machine Learning, vol. 56, no. 1–3, pp. 89–113, Jul. 2004.
[14] D. Krathwohl, "A Revision of Bloom's Taxonomy: An Overview," Theory into Practice, vol. 41, no. 4, pp. 212–218, 2002.
[15] C. Kiefer and U. Pado, "Freitextaufgaben in Online-Tests – Bewertung und Bewertungsunterstützung" [Free-text exercises in online tests: grading and grading support], HMD Praxis der Wirtschaftsinformatik, vol. 52, no. 1, pp. 96–107, Feb. 2015.
[16] M. Wolska, A. Horbach, and A. Palmer, "Computer-Assisted Scoring of Short Responses: The Efficiency of a Clustering-Based Approach in a Real-Life Task," in Advances in Natural Language Processing. Springer, 2014, pp. 298–310.
[17] S. Basu, C. Jacobs, and L. Vanderwende, "Powergrading: a Clustering Approach to Amplify Human Effort for Short Answer Grading," Transactions of the Association for Computational Linguistics, vol. 1, pp. 391–402, 2013.