The Determination of the Learning Performance
based on Assessment Item Analysis
Doru Anastasiu Popescu1,*,† , Ovidiu Domșa2,† and Nicolae Bold3,†
1
  Department of Mathematics and Computer Science, University of Pitești, Romania
2
  1 Decembrie 1918 University of Alba Iulia, Romania
3
  UASVM Bucharest, Faculty of Management and Rural Development, Slatina Branch, Slatina, Romania


Abstract
The analysis of the performance of the educational process is one of the essential aspects of the contemporary approach to the educational system. Technology has permitted the analysis of various components of the learning process, which has developed into the field of learning analytics. This paper presents the model and implementation of a concept that uses learning analytics to determine the outcome of an educational process and its performance. Here, performance refers to the group's understanding of a specific concept, measured using the results of systematic evaluation over a period of time. The model, called Course Item Management Generation (CIM-GET), is part of a larger model that is centered on the educational assessment process and which uses machine learning-based techniques and evolutionary algorithms to generate assessment tests used for learning purposes. The current model uses statistical and item response analysis parameters in order to create a report on the items within the tests given over a period of time to specific students within a university faculty.
In the first part, the CIM-GET model is presented in the context of the larger model, called the Dynamic Model for Assessment and Interpretation of Results (DMAIR); then several results obtained after the technical and statistical implementation are presented. The CIM-GET model uses items from an item dataset, extracted using machine learning-based tools by the defining keywords of the item (which also represent the topics of the item), and forms an optimal test using a generation algorithm (e.g., a genetic algorithm). After a test is given to students, the results are stored in a database, a report is output and a list of topics that need to be revised is generated. Finally, the practical results of the presented model are shown, in order to demonstrate their practical importance.

Keywords
assessment, education, item analysis, test, answer evaluation




1. Introduction
The development of computing models and methods has permitted the creation of several
research fields that study and implement these models and methods in the educational domain.
Moreover, the social context and the problems that arose regarding educational activity
acted as a catalyst for this phenomenon. As a result, the activity of a typical

WSDM 2023 Crowd Science Workshop on Collaboration of Humans and Learning Algorithms for Data Labeling, March
3, 2023, Singapore
*
  Corresponding author.
†
  These authors contributed equally.
dopopan@gmail.com (D. A. Popescu); domsa.ovidiu@gmail.com (O. Domșa); bold1_nicolae@yahoo.com
(N. Bold)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
teacher has been changing due to the inclusion of technological developments, especially in
educational management.
We should also take into consideration the major subject of recent research in the educational
domain, which is the creation of meaningful and effective educational processes, especially
using digital technology and computing-based methods. The purpose of an educational process
is the fulfillment of its objectives, and one of the main ways to achieve these objectives is an
objective assessment. The objectivity of an assessment process is related to the appropriate
design of the assessment tools and the valid analysis of the assessment results, which can be
conceptually accomplished by using specific pedagogical methods, such as Universal Design for
Learning (UDL) [1] or, more specifically, Universal Design for Assessment (UDA) [2], technically
implemented using computing methods, such as machine learning and evolutionary algorithms,
and thoroughly studied using, for example, Learning Analytics (LA) [3] or statistical indicators
[4, 5]. The main objective of a successful automated design and analysis of an assessment test
is to come as close as possible to a human-centered design with similar requirements, because
human experience is still hard to surpass in the design and analysis of specific assessment tests
and items [6].
In order to achieve such an objective, any type of assessment design must take into
consideration design and analysis frameworks that check four major aspects [7]:
communication, orientation, learning experience and evaluation, with regard to the reliability
and validity of the assessment. While communication refers to the correct reciprocal transmission
and understanding of the assessment objectives by all the participants in the assessment,
orientation refers to the optimal choice of the assessment form based on the studied content.
As for the other two aspects, the learning experience takes into account the closeness of
the assessment to real-life situations and the validity stands for the extent to which the
assessment objectives were accomplished. Also, several factors of the assessment must be
taken into consideration, such as subject content, electronic flexibility, language usage, format
options, time limits or a direct link with the goals and objectives of the course.
This paper presents the development of the CIM-GET model, describing its architecture, its
implementation and its results. In short, the model consists of a specific method for assessment
item analysis in a specific educational context. The model is based on the hypothesis that the
statistical data related to an assessment item are influenced by the conceptual understanding of
the item subject. Also, several factors, such as item-based factors (e.g., the degree of difficulty of
the item, the theoretical / practical nature of the item, the item type, the item number), statistical
and item test factors (e.g., mean and standard deviation, item discrimination, item attempts,
reliability coefficient), student-centered factors (e.g., student educational level) or group-centered
factors (e.g., assessment score mean), which will be presented in the next sections, are to be
taken into consideration regarding the item response.
In this matter, given a specific period of educational time, such as a semester or a year, and the
periodic assessment of the students by a teacher, the item analysis conceptualized in the CIM-GET
model, with its practical implementation, identifies the notions that should be elaborated more
during the courses, indicated by lower rates of correct answers to the items that check these
specific notions during the periodic assessment tests. The model is also enhanced by supplementary
item verification mechanisms that use automated clustering of items in order to predict whether
an item is prone to lower rates of correct response based on the factors taken into consideration,
a tendency that is then confirmed by the factual item analysis. Naturally, the automated clustering
provides finer results when a larger factor set is taken into consideration.
The CIM-GET model is a modular part of an integrated model called the Dynamic Model for
Assessment and Interpretation of Results (DMAIR), which is formed of three main components:
test generation, answer check and response analysis. For each component, a different model
with its implementation will be described further. The DMAIR model takes into account several
areas of educational assessment, especially item generation using various methods, such as
machine learning, natural language processing and evolutionary algorithms, used to automatically
generate items for a specific assessment test under several requirements related to the test (the
item subject, the degree of difficulty of the item, the theoretical / practical nature of the item and
the item type). The CIM-GET model is used to analyze the answers, being responsible within the
integrated model for the Answer Evaluation (AE) and Item Analysis (IA) parts.
For the detailed description of the model and its implementation, the paper is structured in
several sections. First, several literature landmarks and trends from the related research fields
are presented. The next section presents the description of the CIM-GET model and the
integrated model that CIM-GET is part of, followed by a short description of a web-based
implementation of the model and several results that show the practical potential of such an
implementation.



2. Literature review
Extensive literature has been published regarding the optimization of the assessment process
with respect to the design and analysis of its components. In this matter, most of the automated
educational assessment area consists of the development of assessment models and tools for
Question Generation (QG) and Answer Evaluation (AE). An important part of the AE branch is
dedicated to Automated Essay Scoring (AES), as shown in [8, 9, 10], which has been researched
extensively in recent years.
Regarding the QG branch, the majority of research papers have been directed at the generation of
objective questions, such as multiple-choice [11, 12], true-false [13] or open-cloze questions
[14, 15]. Classical subjects of QG research have long been related to the formulation of questions
from learning material, and recent research has been extensively related to sentence-to-question
generation [16] and the generation of questions from any type of text [17], including with
artificial intelligence [18]. In order to gauge the extent of the research on this subject, an
empirical search of article subjects in scientific databases revealed that the topic attracts wide
interest: a search on the Google Scholar paper database for the specific keyword „automatic
question generation” returned 292 unique results for 2022. As for the methods used to accomplish
this task, one of the most used is Natural Language Processing (NLP), which has been developed
and refined over time.
For the AE branch, the research is focused on the analysis of short and essay answers, which
also uses NLP-based techniques in order to accomplish a performant analysis of the text in
the answer. One of the most researched topics is the evaluation of the correctness of the
response, especially for specific types of questions (e.g., multiple-choice questions [19, 20]).
However, an increasing interest can be observed in the automatic evaluation of essay-type
items [21, 22].
Another important part of the research in educational assessment is related to Item Analysis (IA),
a field situated at the border of several domains, such as statistics, psychometrics, assessment and
education. It showcases a wide range of research topics related to the mathematical and statistical
aspects of assessment analysis [23], which remain landmarks for the item analysis topic and are
integrated intensively in learning management systems as basic functionalities for the
human-centered analysis of educational activity on a specific platform. Item analysis is an
extremely important method for studying student performance over given periods of time [24].
For this subject, two approaches are considered the best fitted for item analysis: Classical Test
Theory (CTT) and Item Response Theory (IRT). While CTT extensively uses statistical tools [25],
such as proportions, averages and correlations, and is used in smaller-scale assessment contexts,
IRT has a more recent development and is studied with respect to its more adaptive character [26].
The adaptive character of the IRT method consists in the important weight given to the human
factor in the assessment process. One of the most important differences between the two
approaches concerns the previous learning experience of the assessee: IRT creates an adaptive
analysis based on a measurement precision which takes into account latent-attribute values, while
CTT starts with the assumption that this precision is equal for all individuals [27]. In this paper,
tangential concepts are used for the description of the development of the CIM-GET model,
especially regarding the statistical item analysis.
In a further development of the research literature, an important field with recent serious
practical implications in the educational process is Deep Knowledge Tracing [28]. It has gained
a lot of exposure during the continuous development of online education, due to the fact that it
proposes the analysis and prediction of student educational behaviour based on previous personal
learning experiences.
In order to accomplish the purpose of the current paper, we will use these literature concepts and
follow a specific approach that has a conceptual basis in several aspects of the cited literature.
In this matter, the assessment item generation serves a better integration of the assessment
model, and the cited IA literature provides an introductory basis for the description of the
CIM-GET model.


3. Model description
3.1. DMAIR integrated model
The DMAIR model comprises several components that are essential to an assessment system.
This system must be formed of three main functionalities: generation of items (func1 (I)), check
mechanisms (func2 (II)) and answer evaluation (func3 (III)). The module responsible for the
generation of items, func1 (I), uses methods and tools for obtaining assessment tests suited to
specific requirements; the check mechanism, func2 (II), is related to the validation of the answers
given by the users; and the answer evaluation module, func3 (III), which will be presented further
as the CIM-GET model, introduces the item analysis for the generated items and answers and is
the most closely related to learning analytics. A visual depiction of the model, including a
graphical user interface component, is presented in Figure 1.

Figure 1: Visual depiction of the DMAIR model.

3.1.1. Model structure
The main components of the model are the questions (named items after the generation process),
the test, the requirements and the generation mechanism. A question, like a request or an
exercise, is a particular case of an item; for this reason, we will refer further to questions,
requests, exercises etc. as items. An item 𝑞(𝑖𝑑; 𝑠𝑡; 𝑑𝑑; 𝑉 ; 𝑡𝑝; 𝑡) is an object formed of the following components:

    • the identification number of the item 𝑖𝑑_𝑞, which has the role of being the unique identifier
      of the item in the implementation phase;
    • the statement 𝑠𝑡_𝑞, which is formed of a phrase or a set of phrases that describes the
      initial data and the requests of the item, which must be solved;
    • the set of keywords 𝑘𝑤_𝑞, which consists in the list of keywords that describe the best
      the topic of the item;
    • the degree of difficulty 𝑑𝑑_𝑞, 𝑑𝑑_𝑞 ∈ [0, 1]; the degree of difficulty is calculated as the
      ratio between the number of incorrect answers to a specific item and the total number of
      answers. The degree of difficulty can also be calculated using the method presented in
      [29];
    • choices set 𝑉 _𝑞 (wherever necessary), which can be formed of a list of two or more
      possible answers when the item type is multiple or is null when the item type is short or
      essay;
    • the theoretical or practical character of the question 𝑡𝑝, 𝑡𝑝 ∈ {0, 1}, where 0 is theoretical
      and 1 is practical;
    • the item type 𝑡_𝑞, 𝑡_𝑞 ∈ {'multiple', 'short', 'essay'}, illustrating the type of the item: whether
      it has choices or the answer is a textual one given by the user, in the case of the short and
      essay types.

The item dataset, denoted further by 𝐵𝐷1, contains items that are automatically generated
using NLP methods or introduced manually by a teacher.
   The 𝐵𝐷1 dataset is schematically represented in Figure 2, where we can also see the main
components of a general item: the item type (𝑡_𝑞), the statement 𝑠𝑡_𝑞, the choice set 𝑉 _𝑞, the
list of keywords 𝑘𝑤_𝑞 and the degree of difficulty 𝑑𝑑_𝑞.

Figure 2: Visual representation of an item 𝑞.
A test 𝑇 (𝑆, 𝐷𝐷, 𝑇 𝑃, 𝑄𝑇 ) is a set of items 𝑞𝑖 , 𝑖 = 1..𝑆, where 𝑆 is the number of items that form
the test and 𝐷𝐷 is the degree of difficulty of the test (a data-structure sketch covering items and
tests is given after the list below):

$$DD = \sum_{i=1}^{S} qdd_i \qquad (1)$$

   In the equation, 𝑞𝑑𝑑𝑖 is the degree of difficulty of the item 𝑞𝑖 within the test 𝑇 .
The other components of a test are:
   • 𝑇 𝑃 is the theoretical-practical ratio, which gives the predominant type of the test. 𝑇 𝑃 ∈
     [0, 1], the value of the ratio being the proportion of theoretical questions, and the difference
     1 − 𝑇 𝑃 being the proportion of practical questions;
   • 𝑄𝑇 introduces the predominant item type in the test; 𝑄𝑇 is an array with three values:
     [𝑞𝑡_𝑚, 𝑞𝑡_𝑠, 𝑞𝑡_𝑒]. The values of the array contain the number of items of each type
     within the test, 𝑞𝑡_𝑚 being the number of multiple-choice items, 𝑞𝑡_𝑠 being the number
     of short-type items and 𝑞𝑡_𝑒 being the number of essay-type items.
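
To make the data model concrete, the sketch below encodes the item and test structures described above as Python dataclasses. It is a minimal illustration rather than the paper's implementation (which is web-based, in PHP); the field and property names simply mirror the notation of this section (dd_q, tp, t_q, DD, TP, QT).

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Item:
    """An item q(id; st; dd; V; tp; t) together with its keyword list kw_q."""
    id_q: int                        # unique identifier
    st_q: str                        # statement
    kw_q: List[str]                  # keywords describing the topic
    dd_q: float                      # degree of difficulty in [0, 1]
    tp: int                          # 0 = theoretical, 1 = practical
    t_q: str                         # 'multiple', 'short' or 'essay'
    v_q: Optional[List[str]] = None  # choice set; None for short/essay items

@dataclass
class Test:
    """A test T(S, DD, TP, QT) assembled from a list of items."""
    items: List[Item] = field(default_factory=list)

    @property
    def DD(self) -> float:
        # equation (1): the sum of the item difficulties
        return sum(q.dd_q for q in self.items)

    @property
    def TP(self) -> float:
        # proportion of theoretical items; 1 - TP is the practical proportion
        return sum(1 for q in self.items if q.tp == 0) / len(self.items)

    @property
    def QT(self) -> List[int]:
        # [qt_m, qt_s, qt_e]: counts of multiple, short and essay items
        return [sum(1 for q in self.items if q.t_q == t)
                for t in ('multiple', 'short', 'essay')]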

3.1.2. Model functionality
The generation mechanism uses a predefined set of actions that describe the generation and
evaluation of the generated assessment tests.
Input data consists of:
   • the desired subject, given by the set of 𝑛𝑟_𝑘 keywords provided by the user, 𝑘𝑤 = {𝑘𝑤_1,
     𝑘𝑤_2, . . . , 𝑘𝑤_{𝑛𝑟_𝑘}};
   • the number of questions required for each keyword, 𝑛𝑟_𝑘𝑤 = {𝑛𝑟_𝑘𝑤_1, 𝑛𝑟_𝑘𝑤_2, . . . ,
     𝑛𝑟_𝑘𝑤_{𝑛𝑟_𝑘}};
   • the desired degree of difficulty 𝐷𝐷_𝑢;
   • the desired theoretical-practical ratio 𝑇 𝑃 _𝑢;
   • the desired predominant question type 𝑄𝑇 _𝑢 = <𝑞𝑡_𝑚, 𝑞𝑡_𝑠, 𝑞𝑡_𝑒>.
The model functionality contains the next main algorithms:
   • the item generation algorithm (which will be denoted further by 𝐺𝑒𝑛𝑇 𝑒𝑠𝑡), correspondent
     to the functionality func1 (I), which contains actions related to the specific generation.
     This action can be established by following a specific set of steps:
         – step 1. Each keyword is parsed and, for each of them, a cluster of questions whose
            keywords are similar to the current one is formed. The similarity is computed
            using NLP methods and the clusters are formed using the K-means ML technique.
            For each keyword 𝑘𝑤𝑖 , 𝑛𝑟_𝑘𝑤𝑖 questions are taken into consideration.
        – step 2. The partial dataset of questions 𝐶𝑖 that can be used for the generation of
           the test is formed. The main requirement taken into consideration is the subject of
           the test.
        – step 3. The test is generated based on other requirements using a specific type of
           method (e.g., genetic algorithms).
   • the check mechanisms algorithm (which will be denoted further by 𝐶ℎ𝑘𝐼𝑡𝑒𝑚), corresponding
     to the functionality func2 (II), related to the automated check of answers, which
     will be developed in future research;
   • the answer evaluation algorithm (which will be denoted further by 𝐸𝑣𝑎𝑙𝑆𝑡𝑢𝑑), corresponding
     to the functionality func3 (III) and presented in the CIM-GET model section; it
     represents the part of the model responsible for learning analytics and will be described
     in the next section. This algorithm also uses the item prediction algorithm (denoted
     further by 𝐼𝑡𝑒𝑚𝑃𝑟𝑒𝑑), which will be presented in the last part of Section 3.
   The item generation uses the GenTest algorithm in order to generate the tests using the methods
presented before, where M is the number of tests to be generated. A schematic approach of this
algorithm is:

for i = 1, M do
    select items from BD1, resulting in the clusters Cj, j = 1, nr_kw;
    a test Ti is generated using questions from the clusters;
    the test Ti is rendered and given to the students
endfor
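
As an illustration only, the following Python sketch mimics GenTest under simplifying assumptions: the NLP similarity is replaced by a keyword-overlap (Jaccard) measure and the genetic algorithm by a greedy pick of items closest to the requested difficulty. It reuses the Item and Test dataclasses from the earlier sketch; the function names are hypothetical and not part of the DMAIR implementation.

from typing import List

def keyword_overlap(a: List[str], b: List[str]) -> float:
    """Jaccard similarity between keyword lists (stand-in for the NLP similarity)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def gen_test(bd1: List[Item], kw: List[str], nr_kw: List[int], dd_u: float) -> Test:
    """Greedy sketch of GenTest: one cluster per requested keyword, then pick
    the items whose difficulty keeps the test closest to the requested DD_u."""
    chosen: List[Item] = []
    target = dd_u / max(1, sum(nr_kw))   # average per-item difficulty implied by DD_u
    for keyword, count in zip(kw, nr_kw):
        # step 1 (simplified): the cluster of items matching the requested keyword
        cluster = [q for q in bd1 if keyword_overlap([keyword], q.kw_q) > 0]
        # step 3 (simplified): prefer items whose difficulty is close to the target
        cluster.sort(key=lambda q: abs(q.dd_q - target))
        chosen.extend(cluster[:count])
    return Test(items=chosen)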


3.2. CIM-GET model
3.2.1. Model components
The CIM-GET model represents a part of the DMAIR model, being one of the three main modules
presented at the end of the previous subsection. In this matter, this model consists of the answer
evaluation module, which aims at determining the assessment performance, especially the student
performance on a specific item.
The CIM-GET module is designed starting from the premise that an incorrect answer to an item
may indicate that the subject of the item is not fully understood, especially under certain conditions
(e.g., the other items within the test are answered correctly by the same student, the item repeatedly
receives wrong answers from many students etc.). In this matter, the model takes into account
several factors in order to determine the direct causality between the poor understanding of the
subject and the incorrect answer to an item with the respective subject.
The model needs additional variables in order to be completely described. These variables are:
    • the period of time 𝑇 ;
    • the number of tests given in the time 𝑇 , 𝑀 ;
    • the frequency of the assessment 𝑓 ;
    • the number of the students in the group 𝑁 ;
The model structure consists of several components:
    • the item 𝑞, described in the previous subsection, but with several additional characteristics,
      that will be presented in the next list;
    • the test 𝑇 , also described in the previous subsection, which will be enriched with several
      statistical indicators;
    • the student result 𝑆, which contains information related to the assessment results of a
      specific student;
    • the group of students result 𝐺, which contains statistical information related to the assess-
      ment results of a specific group of students (e.g., class, group).
Within the model, an item is considered to be correctly answered (marked with 1) or incorrectly
answered (marked with 0). For items that have fractional point values, we consider that an item
which has received 0.5 points or more is marked as correctly answered, and as incorrectly answered
if the value is less than 0.5 points. The additional characteristics of an item q for the CIM-GET
model are:
    • the scores to an item obtained by all students 𝑠𝑐_𝑞, stored as an array;
    • the average score of the item 𝑚_𝑞, which is the average value of all the responses of all
      students to the item 𝑞, taking into account also the fractional values of the scores;
    • the number of correct answers 𝑙_𝑞, which contains the number of all correct answers
      (marked with 1), as stated previously;
    • the number of students that answered the item 𝑡𝑎_𝑞, which determines the number of all
      students that answered the item;
    • the total number of attempts 𝑎𝑡_𝑞, which stores the number of the attempts for the item
      𝑞, for the case in which the teacher permits several attempts for an item;
    • the average number of attempts 𝑚𝑎𝑡_𝑞, which stores the average number of attempts for
      an item q, for the case in which the teacher permits several attempts for an item;
    • the standard deviation 𝑠𝑑_𝑞, calculated as the usual standard deviation of the item scores
      using the formula below, where 𝑠𝑐_𝑞𝑖 is the score of student 𝑖 on the item 𝑞, 𝑚_𝑞 is the
      average score of the item 𝑞 and 𝑁 is the number of students that responded to the item:

      $$sd\_q = \sqrt{\frac{\sum_{i=1}^{N} (sc\_q_i - m\_q)^2}{N}} \qquad (2)$$

    • the upper-lower count 𝑢𝑐_𝑞 and 𝑙𝑐_𝑞, considering that the group of students is split into
      three groups: high (27% of students), middle (46% of students) and low (27% of students)
      scores; thus, 𝑢𝑐_𝑞 equals the count of correct answers from the upper 27% of 𝑁 and 𝑙𝑐_𝑞
      equals the count of correct answers from the lower 27% of 𝑁 ; for example, from a list of
      100 answers sorted descendingly by score, 𝑢𝑐_𝑞 would be the number of correct answers
      among the first 27 answers and 𝑙𝑐_𝑞 would be the number of correct answers among the
      last 27 answers;
    • the item discrimination 𝑑_𝑞, 𝑑_𝑞 ∈ [-1, +1], which determines for an item the amount
      of discrimination between the responses of the upper and the lower group and which is
      calculated using the formula below, where 𝑢𝑐_𝑞 is the upper count, 𝑙𝑐_𝑞 is the lower
      count and 𝑁 is the number of students that responded to the item:

      $$d\_q = \frac{uc\_q - lc\_q}{0.27 \times N} \qquad (3)$$
    • the point biserial 𝑝𝑏𝑠_𝑞, 𝑝𝑏𝑠_𝑞 ∈ [-1, +1], which shows whether the item discriminates
      high-performing students from low-performing students (and thus whether the question is
      well written) and which is calculated as the Pearson correlation coefficient between a
      student's score on the item q and that student's total score on the other items of the test
      (a computational sketch of these statistics is given after this list).
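
The following Python sketch reads these definitions as code, computing the item statistics from a matrix of (possibly fractional) scores. It is illustrative only and not the paper's implementation; the 0.5-point threshold and the 27% upper/lower split follow the text above.

import math
from typing import Dict, List

def item_stats(scores: List[List[float]], item: int) -> Dict[str, float]:
    """Compute dd_q, m_q, l_q, sd_q, uc_q, lc_q, d_q and pbs_q for one item.

    scores[s][i] is the score of student s on item i; a score of at least
    0.5 points counts as a correct answer (marked 1), as in the model."""
    n = len(scores)
    col = [row[item] for row in scores]
    correct = [1 if x >= 0.5 else 0 for x in col]

    m_q = sum(col) / n                                      # average item score
    l_q = sum(correct)                                      # number of correct answers
    dd_q = 1 - l_q / n                                      # proportion of incorrect answers
    sd_q = math.sqrt(sum((x - m_q) ** 2 for x in col) / n)  # equation (2)

    # upper/lower 27% groups, ranked by total test score
    totals = [sum(row) for row in scores]
    order = sorted(range(n), key=lambda s: totals[s], reverse=True)
    k = max(1, round(0.27 * n))
    uc_q = sum(correct[s] for s in order[:k])
    lc_q = sum(correct[s] for s in order[-k:])
    d_q = (uc_q - lc_q) / (0.27 * n)                        # equation (3)

    # point biserial: Pearson correlation between the item score and the
    # score obtained on the rest of the test
    rest = [totals[s] - col[s] for s in range(n)]
    m_r = sum(rest) / n
    cov = sum((col[s] - m_q) * (rest[s] - m_r) for s in range(n))
    den = math.sqrt(sum((x - m_q) ** 2 for x in col)) * math.sqrt(sum((r - m_r) ** 2 for r in rest))
    pbs_q = cov / den if den else 0.0

    return {"m_q": m_q, "l_q": l_q, "dd_q": dd_q, "sd_q": sd_q,
            "uc_q": uc_q, "lc_q": lc_q, "d_q": d_q, "pbs_q": pbs_q}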

The additional characteristics for a given test T for the CIM-GET module are:

    • the test length 𝑡_𝑙, related to the number of questions in the test;
    • the average score of the test 𝑚_𝑇 ;
    • the diversity index of the item type 𝐷_𝑇 , which shows the diversity of the item types
      taken into consideration (multiple-choice, short answer and essay) and which is calculated
      as a Simpson’s Index of Diversity, as follows, where 𝑞𝑡_𝑚 is the number of multiple-choice
      items, 𝑞𝑡_𝑠 the number of short-type items and 𝑞𝑡_𝑒 the number of essay-type items:

      $$D\_T = \frac{qt\_m \times (qt\_m - 1) + qt\_s \times (qt\_s - 1) + qt\_e \times (qt\_e - 1)}{t\_l \times (t\_l - 1)} \qquad (4)$$
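
As a quick check of equation (4), the small helper below computes D_T for the Test dataclass from the earlier sketch (illustrative code, not part of the implementation). For the tests used in Section 4, each with 5 multiple-choice items, QT = [5, 0, 0] and D_T = (5·4)/(5·4) = 1.

def diversity_index(test: Test) -> float:
    """Equation (4): diversity index of the item types in a test."""
    qt_m, qt_s, qt_e = test.QT
    t_l = len(test.items)
    if t_l < 2:
        return 0.0  # the index is undefined for fewer than two items
    return (qt_m * (qt_m - 1) + qt_s * (qt_s - 1) + qt_e * (qt_e - 1)) / (t_l * (t_l - 1))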
  The characteristics of a student results component 𝑆 are:

    • the total score of a student to all tests 𝑡𝑡_𝑆;
    • the average score of a student to all tests 𝑚𝑡_𝑆;
    • the total score of a student to individual tests 𝑡_𝑆;
    • the average scores of a student to individual tests 𝑚_𝑆;
    • the total score of a student to the items of the same subject 𝑡𝑠_𝑆.

  The characteristics of a group results component 𝐺 are:

    • the average score of a group to all tests 𝑚𝑡_𝐺;
    • the average scores of a group to individual tests 𝑚_𝐺.

3.2.2. Model functionality
The model has a simple premise and is built on the item generation phase. In short, after the
𝐺𝑒𝑛𝑇𝑒𝑠𝑡 algorithm is applied, the items are evaluated using the methodology described previously
for 𝐸𝑣𝑎𝑙𝑆𝑡𝑢𝑑. A visual representation of the model can be seen in Figure 3.
   The functionality of the model consists of the actions that can be performed within the model.
The two main actions are:
   1. the determination of the subjects that need to be revised based on the answers given by
      the students to the periodic assessment, denoted by 𝐸𝑣𝑎𝑙𝑆𝑡𝑢𝑑;
   2. the determination of the probability of an item to be correctly answered by a student or
      by a group using k-means clustering, denoted by 𝐼𝑡𝑒𝑚𝑃 𝑟𝑒𝑑.
The algorithm 𝐸𝑣𝑎𝑙𝑆𝑡𝑢𝑑 consists of the following steps:
   1. The students log in and solve the tests.
   2. For each student and a specific test, a report is generated, created by following the next
      steps:
         a) The items that have obtained lower values of 𝑚_𝑞 and 𝑙_𝑞 are filtered.
        b) The values of the item parameters 𝑑_𝑞, 𝑝𝑏𝑠_𝑞, 𝑡𝑎_𝑞, 𝑑𝑑_𝑞, 𝐷_𝑇 , 𝑡𝑠_𝑆 are verified.
          c) The subjects of the items are then extracted, and it is verified whether other items
             with the same subject have also obtained lower values of 𝑚_𝑞 and 𝑙_𝑞 for a large
             number of students.
   3. The subjects of the items that validate the rule presented in substep 2c) are output.
   4. The reports are introduced in a dataset of reports, referred to further as 𝐵𝐷2.
Figure 3: Visual depiction of the CIM-GET model.


  A schematic approach of this algorithm is:

for i = 1, M do
    for j = 1, N do
        student Sj solves test Ti;
        a report Rij is generated for Sj;
        Rij is introduced in BD2
    endfor
endfor
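
A Python sketch of the report-building core of EvalStud is given below; it flags the items with the fewest correct answers (using the 27% threshold discussed in Section 4) and aggregates their keywords into a list of revisable topics. It reuses Item and item_stats() from the earlier sketches, and the function name build_report is hypothetical.

from collections import Counter
from typing import Dict, List, Tuple

def build_report(items: List[Item], stats: Dict[int, dict],
                 threshold: float = 0.27) -> Tuple[List[int], List[str]]:
    """Sketch of steps 2a and 2c of EvalStud for a single test.

    stats maps an item id to the dictionary produced by item_stats()."""
    # step 2a: keep the items with the lowest number of correct answers
    # (roughly the bottom 27% of items, the threshold used in Section 4)
    ranked = sorted(items, key=lambda q: stats[q.id_q]["l_q"])
    k = max(1, round(threshold * len(items)))
    flagged = ranked[:k]

    # step 2c (simplified): collect the keywords of the flagged items;
    # keywords shared by several low-scoring items are listed first.
    # A fuller version would also verify the step-2b parameters
    # (d_q, pbs_q, ta_q, dd_q, D_T, ts_S) before reporting a topic.
    counts = Counter(kw for q in flagged for kw in q.kw_q)
    topics = [kw for kw, _ in counts.most_common()]
    return [q.id_q for q in flagged], topics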

   The algorithm 𝐼𝑡𝑒𝑚𝑃𝑟𝑒𝑑 consists in applying k-means clustering to a set of data, which serves
as a training set for the algorithm, in order to determine whether an item is likely to be answered
correctly or incorrectly by a specific student. In this matter, two clusters are formed, 𝐶𝑜𝑟𝑟𝑒𝑐𝑡 and
𝐼𝑛𝑐𝑜𝑟𝑟𝑒𝑐𝑡, corresponding to the likelihood of an item being answered correctly or incorrectly by
a student. The training data consists of the following values: 𝑑𝑑_𝑞, 𝑙_𝑇 , 𝑡_𝑞, 𝑁 , 𝑚𝑡_𝑆, 𝑚_𝑆.
This algorithm will be extended in further research.
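
A minimal sketch of ItemPred, assuming scikit-learn is available, is shown below: it clusters past (item, student) feature vectors into two groups and labels the cluster with the higher observed correctness as Correct. The feature encoding and function names are illustrative assumptions, not the paper's implementation.

import numpy as np
from sklearn.cluster import KMeans

def fit_item_pred(features: np.ndarray, outcomes: np.ndarray):
    """Fit the two ItemPred clusters on past answer records.

    features: one row per answered item, e.g. [dd_q, l_T, t_q, N, mt_S, m_S],
              with the item type t_q encoded numerically;
    outcomes: 1 if the item was answered correctly, 0 otherwise."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
    labels = km.labels_
    # name the cluster with the higher observed correctness rate "Correct"
    correct_cluster = int(np.argmax([outcomes[labels == c].mean() for c in (0, 1)]))
    return km, correct_cluster

def predict_item(km: KMeans, correct_cluster: int, x: np.ndarray) -> str:
    """Predict whether a new (item, student) pair is likely to be answered correctly."""
    return "Correct" if km.predict(x.reshape(1, -1))[0] == correct_cluster else "Incorrect"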
Figure 4: Screenshots of the implementation of the CIM-GET model.


4. Implementation and results
An implementation of the 𝐺𝑒𝑛𝑇𝑒𝑠𝑡 part of the DMAIR model has already been made and is
presented in previous research, such as [29].
The implementation was made using PHP, and the interface was created using the Bootstrap
library, which is based on HTML, CSS and JavaScript. A representation of the interface for the
test generation component and the item analysis component is shown in Figure 4.
   As for the results obtained with the implementation, a specific context with several parameters
was considered. The item dataset 𝐵𝐷1 is not presented in this paper due to the large amount of
data, but it is available in a repository [30]. The tests and the items related to them
Table 1
The items taken into account and their characteristics
    ID     Statement                                                                                        Keywords                                 𝑑𝑑_𝑞
   T1Q1    Which of the following is an operating system?                                                   system, operating                        0,05
   T1Q2    Two differences between Windows and Linux are:                                                   system, operating, Windows               0,80
   T1Q3    What is the file path for the file "hello.txt" found in partition D:, in the folder "Lucrari"?   path, system                             0,35
   T1Q4    How many bits form a byte?                                                                       storage, system                          0,05
   T1Q5    Which of these variants is a storage-related hardware?                                           hardware, storage, system                0,15
   T2Q1    The Microsoft Word program is                                                                    Word, category                           0,35
   T2Q2    The extension of a file created in Word is:                                                      Word, extension                          0,00
   T2Q3    Selecting the entire content of a document is done with the key combination:                     keyboard shortcut, selecting, Word       0,10
   T2Q4    The process of removing an unwanted part of an image is called                                   image, editing, Word                     0,70
   T2Q5    What is the name of the direction of a page used for viewing and printing?                       Word, formatting                         0,75
   T3Q1    A program that is used to view websites is called:                                               browser, Internet                        0,05
   T3Q2    What is the term for unsolicited emails?                                                         Internet, email                          0,05
   T3Q3    TCP/IP is:                                                                                       protocol, Internet                       0,30
   T3Q4    URL means:                                                                                       browser, URL, Internet                   0,45
   T3Q5    Which of the following variants is not a browser?                                                browser, Internet                        0,45
   T4Q1    The step-by-step procedure for solving a problem is called:                                      programming, algorithm                   0,20
   T4Q2    This characteristic of algorithms often draws the line between what is feasible...               programming, algorithm, characteristic   0,75
   T4Q3    A water lily covers the surface of the water in 30 days. How many days do it...                  programming, exercise                    0,45
   T4Q4    What is the result of the expression: (5 > 7) AND (0 < 2 * 5 < 15)?                              programming, boolean, exercise           0,55
   T4Q5    What is the minimum number of comparisons to sort ascendingly ...?                               programming, exercise, sort              0,80
   T5Q1    What is the name of the intersection of a column and a row in a worksheet?                       Excel, row, line, cell                   0,05
   T5Q2    What function in Excel returns the sum of a number range?                                        function, Excel, sum                     0,20
   T5Q3    The process of arranging the elements of a column in a particular ...                            sort, Excel                              0,55
   T5Q4    In Excel, the rows are numbered with:                                                            Excel, row, line, cell                   0,40
   T5Q5    Which function in Excel returns the average of a range of numbers?                               function, Excel, average                 0,20


are presented in Table 1.
The initial context consisted of a group of 20 students who participated in an ICT course over a
semester (14 weeks); 5 tests were given during this period. Each test was generated to contain
5 questions on specific subjects related to the usage of various applications (Word, Excel) or to
notions regarding the Internet, programming and operating systems. All questions were of the
multiple-choice type. The columns of Table 1 show the item unique identifier (𝑖𝑑_𝑞), the item
statement (𝑠𝑡_𝑞), the item list of keywords (𝑘𝑤_𝑞) and the degree of difficulty of the item (𝑑𝑑_𝑞).
   For the items described in Table 1, the responses were analyzed by determining the values
of the model parameters taken into account. For this specific example, the score was equal to
the number of correct responses, because every question was worth 1 point. The results are
shown in Table 2, whose columns show the degree of difficulty (𝑑𝑑_𝑞), the standard deviation
(𝑠𝑑_𝑞), the item discrimination (𝑑_𝑞), the point-biserial (𝑝𝑏𝑠_𝑞), the average score (𝑚_𝑞) and the
number of correct answers (𝑙_𝑞).
   After analyzing the responses, several items were determined to be more difficult than the
others. The list of subjects that can be revised, obtained after the analysis of the
Table 2
The analysis results of the items
                               Score     𝑑𝑑_𝑞   𝑠𝑑_𝑞   𝑑_𝑞   𝑝𝑏𝑠_𝑞   𝑚_𝑞    𝑙_𝑞
                      T1Q1          19   0,05   0,22   0,2    0,48   0,95   19
                      T1Q2           4   0,80   0,40   0,6    0,06   0,20    4
                      T1Q3          13   0,35   0,48   0,8    0,26   0,65   13
                      T1Q4          19   0,05   0,22   0,2    0,17   0,95   19
                      T1Q5          17   0,15   0,36   0,2    0,04   0,85   17
                      T2Q1          13   0,35   0,48   0,8    0,26   0,65   13
                      T2Q2          20   0,00   0,00   0,0       -   1,00   20
                      T2Q3          18   0,10   0,30   0,4    0,24   0,90   18
                      T2Q4           6   0,70   0,46   0,8    0,28   0,30    6
                      T2Q5           5   0,75   0,43   0,8    0,55   0,25    5
                      T3Q1          19   0,05   0,22   0,2    0,40   0,95   19
                      T3Q2          19   0,05   0,22   0,2    0,16   0,95   19
                      T3Q3          14   0,30   0,46   1,0    0,56   0,70   14
                      T3Q4          11   0,45   0,50   1,0    0,43   0,55   11
                      T3Q5          11   0,45   0,50   0,4   -0,16   0,55   11
                      T4Q1          16   0,20   0,40   0,4   -0,03   0,80   16
                      T4Q2           5   0,75   0,43   0,2   -0,22   0,25    5
                      T4Q3          11   0,45   0,50   0,8    0,16   0,55   11
                      T4Q4           9   0,55   0,50   0,8    0,22   0,45    9
                      T4Q5           4   0,80   0,40   0,4    0,11   0,20    4
                      T5Q1          19   0,05   0,22   0,2    0,37   0,95   19
                      T5Q2          16   0,20   0,40   0,4    0,30   0,80   16
                      T5Q3           9   0,55   0,50   0,8    0,18   0,45    9
                      T5Q4          12   0,40   0,49   0,8    0,00   0,60   12
                      T5Q5          16   0,20   0,40   0,6    0,30   0,80   16


results, contains topics such as operating systems, the Windows OS, programming, Microsoft
Word, formatting, algorithms, algorithm characteristics and practical applications related to
programming. The number of items that were selected was approximately 27% of the total number
of items. The selection of the items was made based on a threshold with a statistical meaning,
analogous to the 27% upper-lower split: the lowest-scoring 27% of the items or, in the case of
large item sets, the items that obtained a score lower than 27% of the maximum score of the test.
The items that generated these revisable topics were Q2 from Test 1, Q4 and Q5 from Test 2 and
Q2 and Q5 from Test 4, which obtained the lowest numbers of correct responses.
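
Under the assumptions of the earlier sketches, this selection rule can be written in a few illustrative lines: the lowest-scoring 27% of items are flagged or, for large item sets, the items scoring below 27% of the maximum test score (here the maximum equals the number of students, since each item is worth 1 point); the cut-off large_set is an arbitrary illustrative value.

from typing import Dict, List

def select_revisable(items: List[Item], stats: Dict[int, dict],
                     max_score: float, large_set: int = 100) -> List[Item]:
    """Flag items for revision using the 27% threshold described above."""
    if len(items) < large_set:
        # small sets: take roughly the lowest-scoring 27% of the items
        ranked = sorted(items, key=lambda q: stats[q.id_q]["l_q"])
        return ranked[:max(1, round(0.27 * len(items)))]
    # large sets: flag items whose score is below 27% of the maximum score
    return [q for q in items if stats[q.id_q]["l_q"] < 0.27 * max_score]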
   The item analysis confirmed that the mentioned items had the highest degrees of difficulty.
The other parameters related to the validity of the test showed that the majority of the questions
were designed properly. For each parameter, the following results were obtained:
    • The item discrimination (𝑑_𝑞) showed that most of the items with a lower degree of
      difficulty were not good discriminators, while the more difficult ones discriminated better
      between the highest and the lowest scores, which is desirable in an assessment test.
Table 3
The items that were selected and their characteristics
                              Score   𝑑𝑑_𝑞    𝑠𝑑_𝑞       𝑑_𝑞   𝑝𝑏𝑠_𝑞   𝑚_𝑞    𝑙_𝑞
                      T1Q2        4    0,80    0,40      0,6    0,06   0,20    4
                      T2Q4        6    0,70    0,46      0,8    0,28   0,30    6
                      T2Q5        5    0,75    0,43      0,8    0,55   0,25    5
                      T4Q2        5    0,75    0,43      0,2   -0,22   0,25    5
                      T4Q5        4    0,80    0,40      0,4    0,11   0,20    4


    • The point-biserial coefficient shows that several items can be improved in order to form a
      well-designed test. Values below 0.1 indicate items that can be improved as discriminators,
      especially among those with higher degrees of difficulty.
    • The score, the degree of difficulty and the standard deviation were correlated (the items
      with 𝑠𝑑_𝑞 between 0.40 and 0.46 were the ones with the lowest scores and the highest
      degrees of difficulty).


5. Conclusions
The most important part of the assessment performance is related to the good design of the
assessment test. In this matter, the implementation of this model provides a very useful tool for
designing the items of the test well and, at the same time, provides information on topics that can
be revised during a period of time in an educational context. This implementation can therefore
be extremely helpful for determining the subjects that need additional time for teaching and
understanding. The model proves to be a viable one, due to the nature of the issue it responds to
and the methods used to solve this issue. Given the fact that assessment is currently one of the
most researched topics in education, the model and its implementation can be considered
important for obtaining a well-designed test, once the implementation is scaled to more general
environments.
Regarding the determination of revisable topics during an educational period based on the
assessment, the traditional item-analysis approach was a starting point that provided both proven
scientific tools for the analysis of the items and the responses and a checking tool to validate the
results obtained using the approach of the CIM-GET model. In this matter, the statistical data
resulting from the item-analysis approach proved to be a standard validator for the methods used
in the described model.
As future work, the model will be improved with an automatic answer-checking tool and with
refinements of the tools presented in the paper. At the same time, the semi-automated aspects of
the model will be made fully automatic in future research. Also, the implementation and
documentation of the DMAIR model will be completed and described in further research, leading
to an assessment tool that can provide useful and accurate results for the assessment process.
References
 [1] S. L. Craig, S. J. Smith, B. Frey, Professional development with universal design for learning:
     supporting teachers as learners to increase the implementation of udl, Professional De-
     velopment in Education 48 (2019) 22 – 37. doi:https://doi.org/10.1080/19415257.
     2019.1685563.
 [2] L. R. Ketterlin-Geller, Knowing what all students know: Procedures for developing
     universal design for assessment, The Journal of Technology, Learning and Assessment 4
     (2005). URL: https://ejournals.bc.edu/index.php/jtla/article/view/1649.
 [3] D. Clow, An overview of learning analytics, Teaching in Higher Education 18 (2013) 683–
     695. URL: https://doi.org/10.1080/13562517.2013.827653. doi:10.1080/13562517.2013.
     827653. arXiv:https://doi.org/10.1080/13562517.2013.827653.
 [4] L. Bokander, Psychometric Assessments, Taylor & Francis, 2022, pp. 454–465. URL: http://
     urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-58770. doi:10.4324/9781003270546-36, [eds.]
     S. Li, P. Hiver & M. Papi.
 [5] T. Moses, A Review of Developments and Applications in Item Analysis, 2017, pp. 19–46.
     doi:10.1007/978-3-319-58689-2_2.
 [6] M. Webb, D. Gibson, A. Forkosh-Baruch, Challenges for information technology supporting
     educational assessment, Journal of Computer Assisted Learning 29 (2013) 451–462. URL:
     https://onlinelibrary.wiley.com/doi/abs/10.1111/jcal.12033. doi:https://doi.org/10.1111/jcal.12033.
 [7] S. Z. Siddiqui, Framework for an effective assessment: From rocky roads to silk route, Pak-
     istan Journal of Medical Sciences 32 (2017) 505 – 509. doi:10.12669/pjms.332.12334.
 [8] A. Ben-Simon, R. Bennett, Toward more substantively meaningful automated essay scoring,
     Journal of Technology, Learning, and Assessment 6 (2007) 4–44.
 [9] P. Deane, On the relation between automated essay scoring and modern views of the
     writing construct, Assessing Writing 18 (2013) 7–24. URL: https://www.sciencedirect.com/
     science/article/pii/S1075293512000451. doi:https://doi.org/10.1016/j.asw.2012.
     10.002, automated Assessment of Writing.
[10] J. Gardner, M. O’Leary, L. Yuan, Artificial intelligence in educational assessment: ‘break-
     through? or buncombe and ballyhoo?’, Journal of Computer Assisted Learning 37 (2021)
     1207–1216. doi:https://doi.org/10.1111/jcal.12577.
[11] B. Das, M. Majumder, S. Phadikar, A. A. Sekh, Multiple-choice question generation with
     auto-generated distractors for computer-assisted educational assessment, Multimedia
     Tools and Applications 80 (2021) 1–19. doi:10.1007/s11042-021-11222-2.
[12] S. K. Saha, D. R. CH, Development of a practical system for computerized evaluation of
     descriptive answers of middle school level students, Interactive Learning Environments 30
     (2019) 215–228. URL: https://doi.org/10.1080/10494820.2019.1651743.
     doi:10.1080/10494820.2019.1651743.
[13] B. Zou, P. Li, L. Pan, A. T. Aw, Automatic true/false question generation for educational
     purpose, in: Proceedings of the 17th Workshop on Innovative Use of NLP for Building
     Educational Applications (BEA 2022), Association for Computational Linguistics, Seattle,
     Washington, 2022, pp. 61–70. URL: https://aclanthology.org/2022.bea-1.10. doi:10.18653/
     v1/2022.bea-1.10.
[14] B. Das, M. Majumder, Factual open cloze question generation for assessment of learner’s
     knowledge 14 (2017). doi:10.1186/s41239-017-0060-3.
[15] A. Malafeev, Automatic generation of text-based open cloze exercises, volume 436, 2014,
     pp. 140–151. doi:10.1007/978-3-319-12580-0_14.
[16] H. Ali, Y. Chali, S. A. Hasan, Automatic question generation from sentences, in: Actes
     de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles
     courts, ATALA, Montréal, Canada, 2010, pp. 213–218. URL: https://aclanthology.org/2010.
     jeptalnrecital-court.36.
[17] X. Zheng, Automatic question generation from freeform text, 2022. doi:10356_163315.
[18] C. Diwan, S. Srinivasa, G. Suri, S. Agarwal, P. Ram, AI-based learning content generation
     and learning pathway augmentation to increase learner engagement, Computers and
     Education: Artificial Intelligence 4 (2023) 100110. URL: https://www.sciencedirect.com/
     science/article/pii/S2666920X22000650. doi:https://doi.org/10.1016/j.caeai.2022.100110.
[19] S. Burrows, I. Gurevych, B. Stein, The Eras and Trends of Automatic Short Answer Grading,
     Artificial Intelligence in Education 25 (2015) 60–117. URL: http://link.springer.com/article/
     10.1007/s40593-014-0026-8. doi:10.1007/s40593-014-0026-8.
[20] M. J. A. Aziz, F. D. Ahmad, A. A. A. Ghani, R. Mahmod, Automated marking system for
     short answer examination (ams-sae), 2009 IEEE Symposium on Industrial Electronics &
     Applications 1 (2009) 47–51.
[21] V. Zhong, W. Shi, W.-t. Yih, L. Zettlemoyer, Romqa: A benchmark for robust, multi-
     evidence, multi-answer question answering, 2022. URL: https://arxiv.org/abs/2210.14353.
     doi:10.48550/ARXIV.2210.14353.
[22] D. R. Ch, S. K. Saha, Automatic multiple choice question generation from text: A survey,
     IEEE Transactions on Learning Technologies 13 (2020) 14–25.
[23] G. Rasch, An individualistic approach to item analysis, Readings in mathematical social
     science (1966) 89–108.
[24] A. K. Hussein, A. M. A. Al-Hussein, Testing & the Impact of Item Analysis in Improving
     Students’ Performance in End-of-Year Final Exams, English Linguistics Research 11
     (2022) 30–36. URL: https://ideas.repec.org/a/jfr/elr111/v11y2022i2p30-36.html.
[25] M. R. Novick, The axioms and principal results of classical test theory, Journal of Mathe-
     matical Psychology 3 (1966) 1–18. doi:http://dx.doi.org/10.1016/0022-2496(66)
     90002-2.
[26] D. J. Weiss, M. E. Yoes, Item response theory, Advances in Educational and Psychological
     Testing: Theory and Applications (1991) 69–95. doi:http://dx.doi.org/10.1007/
     978-94-009-2195-5_3.
[27] R. K. Hambleton, R. W. Jones, An ncme instructional module on comparison of classical test
     theory and item response theory and their applications to test development, Educational
     Measurement: Issues and Practice 12 (1993) 38–47. doi:http://dx.doi.org/10.1111/
     j.1745-3992.1993.tb00543.x.
[28] G. Abdelrahman, Q. Wang, B. Nunes, Knowledge tracing: A survey, ACM Comput. Surv.
     55 (2023). URL: https://doi.org/10.1145/3569576. doi:10.1145/3569576.
[29] D. A. Popescu, N. Bold, The development of a web application for assessment by tests
     generated using genetic-based algorithms, CEUR Workshop Proceedings (2016).
[30] N. Bold, Item Dataset, https://github.com/nicolaebold/cim_get, 2023.