<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Determination of the Learning Performance based on Assessment Item Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Doru Anastasiu Popescu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ovidiu Doms</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolae Bold</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Mathematics and Computer Science, University of Pitești</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UASVM Bucharest, Faculty of Management and Rural Development, Slatina Branch</institution>
          ,
          <addr-line>Slatina</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The analysis of the performance of the educational process is one of the essential aspects of the contemporary approach to the educational system. Technology has permitted the analysis of various components of the learning process, which has developed into the field of learning analytics. This paper presents the model and implementation of a concept that uses learning analytics to determine the outcome of an educational process and its performance. Here, performance refers to the group understanding of a specific concept, measured using the results of systematic evaluation over a period of time. The model, called Course Item Management Generation (CIM-GET), is part of a larger model that is centered on the educational assessment process and that uses machine learning-based techniques and evolutionary algorithms to generate assessment tests used for learning purposes. The current model uses statistical and item response analysis parameters in order to create a report regarding the items within the tests that are given over a period of time to specific students within a university faculty. In the first part, the CIM-GET model is presented in the context of the larger model, called Dynamic Model for Assessment and Interpretation of Results (DMAIR); then several results obtained after the technical and statistical implementation are presented. The CIM-GET model uses items from an item dataset, extracted using machine learning-based tools by the defining keywords of the item, which also represent the topics of an item; these items form an optimal test built by a generation algorithm (e.g., a genetic algorithm). After the test is given to students, the results are stored in a database, a report is output and a list of topics that need to be revised is generated. Finally, the practical results of the presented model are shown, in order to illustrate their practical importance.</p>
      </abstract>
      <kwd-group>
<kwd>assessment</kwd>
        <kwd>education</kwd>
        <kwd>item analysis</kwd>
        <kwd>test</kwd>
        <kwd>answer evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The development of computing models and methods has permitted the creation of several
research fields that study and implement these models and methods in the educational domain.
The social context and the problems that arose regarding educational activity
were also a catalyst for this phenomenon. As a result, the activity of a typical
teacher has been changing due to the inclusion of technological developments, especially in
educational management.</p>
      <p>
We should also take into consideration the major subject of recent research in the educational
domain, which is based on creating a meaningful and effective educational process, especially
using digital technology and computing-based methods. The goal of an educational process
is the fulfillment of its learning objectives, and one of the main means of achieving them
is objective assessment. The objectivity of an assessment process depends on the
appropriate design of the assessment tools and the valid analysis of the assessment results,
which can be conceptually accomplished by the use of specific pedagogical methods, such as
Universal Design for Learning (UDL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or, more specifically, Universal Design for Assessment
(UDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], technically implemented using computing methods, such as machine learning and
evolutionary algorithms, and thoroughly studied using, for example, Learning Analytics (LA)
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or statistical indicators [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The main objective of a successful automated design and
analysis of an assessment test is to come as close as possible to a human-centered design
with similar requirements, because human experience is still hard to surpass in terms of
assessment test and item design and analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
In order to achieve such a specific objective, any type of assessment design must take
into consideration design and analysis frameworks that check four major aspects [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
communication, orientation, learning experience and evaluation, with regard to the reliability and
validity of the assessment. While communication refers to the correct reciprocal transmission
and understanding of the assessment objectives by all the participants in the assessment, the
orientation refers to the optimal choice of the assessment form based on the studied content.
As for the other two aspects, the learning experience takes into account the closeness of
the assessment to real-life situations, and the validity stands for the extent to which the
assessment objectives were accomplished. Also, several factors of the assessment must be
taken into consideration, such as subject content, electronic flexibility, language usage, format
options, time limits or a direct link with the goals and objectives of the course.</p>
      <p>This paper presents the development of the CIM-GET model, describing the architecture of the
model, its implementation and its results. In short, the model consists of a specific method
related to assessment item analysis in a specific educational context. The model is
based on the hypothesis that the statistical data related to an assessment item is influenced by
the conceptual understanding of the item subject. Also, several factors, such as item-based
factors (e.g., the degree of difficulty of the item, the theoretical / practical nature of the item,
the item type, the item number), statistical and item test factors (e.g., mean and standard deviation,
item discrimination, item attempts, reliability coefficient), student-centered factors (e.g., student
educational level) or group-centered factors (e.g., assessment score mean), which will be
presented in the next sections, are to be taken into consideration regarding the item response.
Given a specific period of educational time, such as a semester or a year, and the
periodic assessment of the students administered by a teacher, the item analysis conceptualized in
the CIM-GET model, with its practical implementation, identifies the notions that should be
elaborated further during the courses, due to lower rates of correct answers for the items that
check these specific notions during the periodic assessment tests. The model is also enhanced
by supplementary item verification mechanisms that use automated clustering of items
in order to predict whether an item is prone to lower rates of correct response
based on the factors taken into consideration, a tendency that is then confirmed by the factual
item analysis. Naturally, the automated clustering provides finer results as the factor
set taken into consideration grows.
      </p>
<p>The CIM-GET model is a modular part of an integrated model denoted Dynamic Model for
Assessment and Interpretation of Results (DMAIR), which is formed of three main components:
test generation, answer checking and response analysis. For each component, a
different model with its implementation will be described further. The DMAIR model takes into
account several areas of educational assessment, especially regarding item generation using
various methods, such as machine learning, natural language processing and evolutionary
algorithms, used to automatically generate items for a specific assessment test under several
requirements related to the test (the item subject, the degree of difficulty of the item, the
theoretical / practical nature of the item and the item type). The CIM-GET model is used to
analyze the answers, being responsible within the integrated model for the Answer Evaluation (AE)
and Item Analysis (IA) parts.</p>
<p>For the detailed description of the model and its implementation, the paper is structured in
several sections. In the next section, several literature landmarks and trends from the relevant
research fields are presented. The following section presents the description of the CIM-GET model
and the integrated model that CIM-GET is part of, followed by a short description of a web-based
implementation of the model and several results that show the practical potential of an
implementation of this model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
Extensive literature has been published regarding the optimization of the assessment process
with respect to the design and analysis of the assessment components. Most of
the automated educational assessment area consists of the development of assessment models
and tools for Question Generation (QG) and Answer Evaluation (AE). An important part of the
AE branch is dedicated to Automated Essay Scoring (AES), as shown in [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ], which has
been researched extensively in recent years.
      </p>
      <p>
Regarding the QG branch, the majority of research papers have been directed at the generation of
objective questions, such as multiple-choice [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], true-false [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or open-cloze questions
[14, 15]. Classical subjects of QG research are related to the formulation of
questions from learning material, while recent research has been extensively related to
sentence-to-question generation [16] and the generation of questions from any type of text
[17], including by artificial intelligence [18]. In order to gauge the extent of the research on
this subject, an empirical search of article subjects in scientific databases revealed that
the topic attracts wide interest in the research community. This search was based on
specific keywords (e.g., the keyword "automatic question generation");
the Google Scholar paper database returned 292 unique results for
2022. As for the methods used to accomplish this task, one of the most used is
Natural Language Processing (NLP), which has been developed and refined over time.</p>
      <p>For the AE branch, the research is focused on the analysis of short and essay answers, which
also uses NLP-based techniques in order to accomplish an accurate analysis of the text
of the answer. One of the most researched topics is the evaluation of the correctness of the
response, especially for specific types of questions (e.g., multiple-choice questions [19, 20]).
However, an increasing interest can be observed in automatic answer evaluation
for essay-type items [21, 22].
      </p>
<p>Another important part of the research in educational assessment is related to Item Analysis (IA),
a field situated at the border of several domains, such as statistics, psychometrics, assessment and
education. It showcases a wide range of research topics related to the mathematical and statistical
aspects of assessment analysis [23], which remain landmarks of the item analysis topic
and are integrated intensively in learning management systems as basic functionalities for the
human-centered analysis of educational activity on a specific platform. Item analysis
is an extremely important method for studying student performance over given periods of
time [24]. For this purpose, two approaches are considered the best fitted for item analysis:
Classical Test Theory (CTT) and Item Response Theory (IRT). While CTT relies extensively on
statistical tools [25], such as proportions, averages and correlations, and is used for
smaller-scale assessment contexts, IRT has a more recent development and is studied with
respect to its more adaptive character [26]. The adaptive character of the IRT method consists
of the greater account taken of the human factor in the assessment process. One of
the most important differences between the two approaches concerns the previous learning
experience of the assessee: IRT creates an adaptive analysis based on a measurement
precision that takes into account latent-attribute values, while CTT starts with the assumption
that this precision is equal for all individuals [27]. In this paper, tangential concepts are used
for the description of the development of the CIM-GET model, especially regarding statistical
item analysis.</p>
<p>In a further development of the research literature, an important field that has recently had
serious practical implications in the educational process is Deep Knowledge Tracing [28]. It has
gained a lot of exposure during the recent period of continuous development of online education,
due to the fact that it proposes the analysis and prediction of student educational behaviour
based on previous personal learning experiences.</p>
<p>In order to accomplish the purpose of the current paper, we use these literature concepts and
follow a specific approach with conceptual bases in several aspects of the cited literature.
In this matter, the assessment item generation serves a better integration of the assessment
model, and the cited IA literature provides an introduction to the description of the CIM-GET
model.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Model description</title>
      <sec id="sec-3-1">
        <title>3.1. DMAIR integrated model</title>
<p>The DMAIR model comprises several components that are essential to an assessment system.
This system must be formed of three main functionalities: generation of items (func1 (I)), check
mechanisms (func2 (II)) and answer evaluation (func3 (III)). While the module responsible for
the generation of items, func1 (I), uses methods and tools for obtaining assessment tests suited to
specific requirements, the check mechanism, func2 (II), is related to the validation of the answers
given by the users, and the answer evaluation module, func3 (III), which will be presented further
as the CIM-GET model, introduces the item analysis for the generated
items and answers and is the most closely related to learning analytics. A visual depiction of the
model, including a graphical user interface component, is presented in the next figure.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Model structure</title>
<p>The main components of the model are the questions, named items after the generation process,
the test, the requirements and the generation mechanism. The question is a particular case of
an item, as is a request or an exercise; for this reason, questions, requests, exercises
etc. will be referred to further as items. An item q(id_q, st_q, kw_q, dd_q, ch_q, tp, it_q) is an
object formed of the following components:
• the identification number of the item, id_q, which has the role of being the unique identifier
of the item in the implementation phase;
• the statement st_q, which is formed of a phrase or a set of phrases that describes the
initial data and the requests of the item, which must be solved;</p>
          <p>
• the set of keywords kw_q, which consists of the list of keywords that best describe
the topic of the item;
• the degree of difficulty dd_q, dd_q ∈ [0, 1]; the degree of difficulty is calculated as the
ratio between the number of incorrect answers to a specific item and the total number of
answers. The degree of difficulty can also be calculated using the method presented in
[29];
• the choices set ch_q (wherever necessary), which can be formed of a list of two or more
possible answers when the item type is multiple, or is null when the item type is short or
essay;
• the theoretical or practical character of the question tp; tp ∈ {0, 1}, where 0 is theoretical
and 1 is practical;
• the item type it_q, it_q ∈ {'multiple', 'short', 'essay'}, illustrating the type of the item, whether
it has choices or the answer is a textual one, given by the user, in the case of the short and essay
types.
          </p>
<p>The item dataset, denoted further by BD1, contains items that are automatically generated
using NLP methods or introduced manually by a teacher.</p>
<p>The BD1 dataset is schematically represented in Figure 2, where we can also see the main
components of a general item: the item type (it_q), the statement st_q, the choice set ch_q, the
list of keywords kw_q and the degree of difficulty dd_q.</p>
<p>A test T(Q, dd_T, tp_T, it_T) is a set of items q_i, i = 1, ..., n, where Q is the set of items that form
the test and dd_T is the degree of difficulty of the test:
dd_T = ∑_{i=1}^{n} dd_{q_i} (1)</p>
          <p>
In the equation, dd_{q_i} consists of the degree of difficulty of the item q_i within the test T.
The other components of a test are:
• tp_T is the theoretical-practical ratio, which gives the predominant type of the test; tp_T ∈
[0, 1], the value of the ratio consisting of the proportion of theoretical questions, with
the difference 1 − tp_T being the proportion of practical questions;
• it_T introduces the predominant item type in the test; it_T is an array with three values:
[nr_m, nr_s, nr_e]. The values of the array contain the number of items of each type
within the test, nr_m being the number of multiple-choice items, nr_s being the number
of short-type items and nr_e being the number of essay-type items.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Model functionality</title>
          <p>The generation mechanism uses a predefined set of actions that describe the generation and
evaluation of the generated assessment tests.</p>
<p>The input data consists of:
• the desired subject, given by the set of nr_kw keywords provided by the user, kw_i, i = 1,
2, . . . , nr_kw;
• the number of questions required for each keyword, nr_q = (nr_q_1, nr_q_2, . . . ,
nr_q_nr_kw);
• the desired degree of difficulty dd_d;
• the desired theoretical-practical ratio tp_d;
• the desired predominant question type it_d = &lt;nr_m, nr_s, nr_e&gt;.</p>
<p>The model functionality contains the following main algorithms:
• the item generation algorithm (denoted further by GenTest), corresponding
to the functionality func1 (I), which contains the actions related to the actual generation.
These actions follow a specific set of steps:
– step 1: Each keyword is parsed and, for each of them, a cluster of questions that
have keywords similar to the current one is formed. The similarity is computed
using NLP methods and the clusters are formed using the ML-based technique
K-means. The nr_q_i number of questions is taken into consideration for each
keyword kw_i;
– step 2: The partial dataset of questions that can be used for the generation of
the test is formed. The main requirement taken into consideration is the subject of
the test;
– step 3: The test is generated based on the other requirements using a specific type of
method (e.g., genetic algorithms).
• the check mechanisms algorithm,
corresponding to the functionality func2 (II), related to the automated check of answers, which
will be developed in future research;
• the answer evaluation algorithm,
corresponding to the functionality func3 (III), presented in the CIM-GET model
section, which represents the part of the model responsible for learning analytics and
which will be described in the next section. This algorithm also uses the item prediction
algorithm, which will be presented in the last part of section 3.</p>
<p>The item generation uses the GenTest algorithm in order to generate tests using the methods
presented before, where M is the number of tests to be generated. A schematic approach of this
algorithm is:
for i = 1, M do
    select items from BD1, resulting clusters C_j, j = 1, nr_kw;
    a test T_i is generated using questions from the clusters;
    the test T_i is visually generated and given to the students
endfor</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. CIM-GET model</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Model components</title>
<p>The CIM-GET model represents a part of the DMAIR model, being one of the three main modules
presented at the end of the previous subsection. This model consists of
the answer evaluation module, which aims at the determination of the assessment performance,
especially regarding the student performance for a specific item.</p>
<p>The CIM-GET module is designed starting from the premise that an incorrect answer to an item
may indicate that the subject of the item is not fully understood, especially under certain conditions
(e.g., the other items within the test are answered correctly by the same student, the item
repeatedly receives wrong answers from multiple students etc.). The model takes into account
several factors in order to determine the direct causality between a poor understanding of the
subject and an incorrect answer to an item with the respective subject.</p>
<p>The model needs additional variables in order to be completely described. These variables are:
• the period of time P;
• the number of tests given in the period P, M;
• the frequency of the assessment;
• the number of students in the group, N.
The model structure consists of several components:
• the item q, described in the previous subsection, but with several additional characteristics
that will be presented in the next list;
• the test T, also described in the previous subsection, which will be enriched with several
statistical indicators;
• the student result S, which contains information related to the assessment results of a
specific student;
• the group of students result G, which contains statistical information related to the
assessment results of a specific group of students (e.g., class, group).</p>
<p>Within the model, an item is considered to be correctly answered (marked with 1) or incorrectly
answered (marked with 0). As a premise, for items that have fractional scores, we
will take into consideration that an item which has received 0.5 points or more is
marked as correctly answered, and as incorrectly answered if the value is less than 0.5 points.
The additional characteristics of an item q for the CIM-GET model are:
• the scores to an item obtained by all students, sc_q, stored as an array;
• the average score of the item, avg_q, which is the average value of all the responses of all
students to the item q, taking into account also the fractional values of the scores;
• the number of correct answers, nc_q, which contains the number of all correct answers
(marked with 1), as stated previously;
• the number of students that answered the item, ns_q;
• the total number of attempts, na_q, which stores the number of attempts for the item
q, for the case in which the teacher permits several attempts for an item;
• the average number of attempts, aa_q, which stores the average number of attempts for
an item q, for the case in which the teacher permits several attempts for an item;
• the standard deviation sd_q, calculated as the usual standard deviation of the item scores,
where sc_{q,i} are the item scores for the item q, avg_q is the average score of the item q
and N is the number of students that responded to the item:
sd_q = √( ∑_{i=1}^{N} (sc_{q,i} − avg_q)² / N ) (2)
• the upper-lower counts UC_q and LC_q, considering that the group of students is split into
three groups by score: high (27% of students), middle (46% of students) and low (27% of students);
thus, UC_q equals the count of correct answers from the upper 27% of N and LC_q
equals the count of correct answers from the lower 27% of N; for example, from a list of
100 answers sorted descendingly by score, UC_q would be the number of correct answers
from the first 27 answers and LC_q would be the number of correct answers from the last
27 answers;
• the item discrimination disc_q, disc_q ∈ [−1, +1], which determines for an item the amount
of discrimination between the responses of the upper and the lower group and which is
calculated using the following formula, where UC_q is the upper count, LC_q is the lower
count and N is the number of students that responded to the item:
disc_q = (UC_q − LC_q) / (0.27 × N) (3)
• the point biserial pb_q, pb_q ∈ [−1, +1], which shows whether the item discriminates
high-performing students from low-performing students, determining whether the question is
well written, and which is calculated as a Pearson correlation coefficient between a student's
score on the item q and the number of that student's correct
answers to the other items than q in the test.</p>
<p>The additional characteristics of a given test T for the CIM-GET module are:
• the test length ln_T, related to the number of questions in the test;
• the average score of the test avg_T;
• the diversity index of the item type DI_T, which shows the diversity of the item types
taken into consideration (multiple-choice, short answer and essay) and which is calculated
as a Simpson's Index of Diversity, as follows, where nr_m is the number of multiple-choice
items, nr_s the number of short-type items, nr_e the number of essay-type items and
n = nr_m + nr_s + nr_e:
DI_T = 1 − (nr_m(nr_m − 1) + nr_s(nr_s − 1) + nr_e(nr_e − 1)) / (n(n − 1)) (4)
The characteristics of a student results component S are:
• the total score of a student on all tests, ts_S;
• the average score of a student on all tests, avg_S;
• the total scores of a student on individual tests, ts_{S,T};
• the average scores of a student on individual tests, avg_{S,T};
• the total score of a student on the items of the same subject, ts_{S,kw}.</p>
<p>The characteristics of a group results component G are:
• the average score of a group on all tests, avg_G;
• the average scores of a group on individual tests, avg_{G,T}.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Model functionality</title>
<p>The model has a simple premise and is built on the generation phase of the items. In short, after
the GenTest algorithm is applied, the evaluation of the items is made, using the methodology
described previously. A visual representation of the model can be seen in Figure 3.</p>
<p>The functionality of the model consists of the actions that can be performed within the model.
The two main actions are:
1. the determination of the subjects that need to be revised, based on the answers given by
the students to the periodic assessments (the report generation algorithm);
2. the determination of the probability of an item being correctly answered by a student or
by a group, using k-means clustering (the item prediction algorithm).</p>
          <p>The algorithm  consists in the navigation of the following steps:
1. The students log in and solve the tests.
2. For each student and a specific test, a report is generated, created by following the next
steps:
a) The items that have obtained lower values of _ and _ are filtered.
b) The values of the item parameters _, _, _, _, _ , _ are verified.
c) The subjects of the items are then extracted and verified to have obtained lower
values for _ and _ in other items with the same subject for a large number of
students.
3. The subjects of the items that validate the rule presented in substep 2c) are output.
4. The reports are introduced in a dataset of reports, referred further as 2.</p>
<p>A schematic approach of this algorithm is:
for i = 1, M do
    for j = 1, N do
        student S_j solves test T_i;
        a report R_{i,j} is generated for S_j;
        R_{i,j} is introduced in BD2
    endfor
endfor</p>
          <p>The algorithm   consists in applying a k-means clustering to the set of data, which
will be a training set for the algorithm, in order to determine whether an item is likely to be
answered correctly / incorrectly for a specific student. In this matter, two clusters are formed,
 and , correspondent to the probability of an item to be responded correctly
/ incorrectly by a student. The training data will consist in the next values: _, _ , _, ,
_, _. This algorithm will be extended in a further research.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Implementation and results</title>
<p>An implementation of the test generation part of the DMAIR model has been made; it is presented
in previous research, such as the one described in [29].</p>
      <p>The implementation was made using the PHP web programming language and the interface
was created using the Bootstrap library, which is based on HTML, CSS and JavaScript languages.
A representation of the interface for the generation of tests component and the item analysis
component is shown in the next figure.</p>
<p>As for the results obtained with the implementation, a specific context with several
parameters was considered. The item dataset BD1 is not presented in this paper due to the
large amount of data, but it is available in a repository [30]. The tests and the items related to them
are presented in Table 1.</p>
<p>The initial context was considered to be formed of a group of 20 students who participated in
an ICT course for a period of a semester (14 weeks); 5 tests were given during
this period. Each test was generated so as to contain 5 questions on specific subjects related
to the usage of various applications (Word, Excel) or notions regarding the Internet, programming
and operating systems. The type of all questions was multiple-choice. The columns presented
in Table 1 show the item unique identifier (id_q), the item statement (st_q), the item list of
keywords (kw_q) and the degree of difficulty of the item (dd_q).</p>
<p>For the items described in Table 1, the responses were analyzed by determining the values
of the model parameters taken into account. For this specific example, the score was
equal to the number of correct responses, due to the fact that every question had a score of 1
point. The results are shown in Table 2. The columns presented in Table 2 show the degree of
difficulty (dd_q), the standard deviation (sd_q), the item discrimination (disc_q), the point biserial
(pb_q), the average score (avg_q) and the number of correct answers (nc_q).</p>
<p>After the analysis of the responses, several items were determined to be more difficult than the
others, and the list of subjects that can be revised, obtained after the analysis of the
results, contains topics such as operating systems, Windows OS, programming, Microsoft
Word, formatting, algorithm, algorithm characteristics and practical applications related to
programming. The number of items that were selected was approximately 27% of
the total number of items. The selection of the items was made based on a threshold with
statistical meaning, as in the case of the upper-lower count, namely 27% of the total
number of items or, in the case of large sets of items, the items that obtained a score lower than
27% of the maximum score of the test. The items that generated these revisable topics were
Q2 from Test 1, Q4 and Q5 from Test 2 and Q2 and Q5 from Test 4, which obtained the lowest
number of correct responses.</p>
<p>The item analysis confirmed that the mentioned items had the highest degree of difficulty.
The other parameters related to the validity of the test showed that the majority of the questions
were designed properly. For each parameter, the following results were obtained:
• The item discrimination (disc_q) showed that the majority of items with a lower degree
of difficulty were not good discriminators, while the more difficult ones discriminated
better between the best scores and the lower ones, which is desirable in an assessment
test.
• The point-biserial coefficient shows that several items can be improved in order to form a
well-designed test. Values below 0.1 indicate items that can be improved as
discriminators, especially among those with higher degrees of difficulty.
• The score, the degree of difficulty and the standard deviation were correlated
(the items with values of sd_q between 0.40 and 0.46 were the ones
with the lowest score and the highest degree of difficulty).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
<p>The most important part of assessment performance is related to the good design of the
assessment test. The implementation of this model provides a really useful tool
for a good design of the items of the test and, at the same time, provides information related
to topics that should be revised during a period of time in an educational context. Thus,
this implementation can be extremely helpful for the determination of the subjects that need
additional time for teaching and understanding. The model proves to be viable, due to the
nature of the issue it responds to and the methods used to solve this issue. Given the fact
that assessment is currently one of the most researched topics in education, the model and its
implementation can be considered important for obtaining a well-designed test, once
the implementation is scaled to more general environments.</p>
<p>Regarding the issue of the determination of the revisable topics during an educational period of
time based on assessment, the traditional approach of item analysis was a starting point
that allowed both the usage of proven scientific tools for the analysis of the items and
the responses, and a checking tool to validate the results obtained using the approach of
the CIM-GET model. In this matter, the statistical data resulting from the item analysis approach
has proven to be a standard validator for the methods used in the described model.</p>
      <p>As future work, the model will be improved with an automatic answer checking tool, together
with the refinement of the tools presented in the paper. At the same time, the semi-automated
aspects of the model will be made fully automatic in future research. Also,
the implementation and documentation of the DMAIR model will be completed and described
in further research, leading to an assessment tool which can
provide useful and accurate results for the assessment process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1]
          <string-name><given-names>S. L.</given-names> <surname>Craig</surname></string-name>,
          <string-name><given-names>S. J.</given-names> <surname>Smith</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Frey</surname></string-name>,
          <article-title>Professional development with universal design for learning: supporting teachers as learners to increase the implementation of UDL</article-title>,
          <source>Professional Development in Education</source>
          <volume>48</volume>
          (<year>2019</year>)
          <fpage>22</fpage>-<lpage>37</lpage>.
          doi:10.1080/19415257.2019.1685563.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Ketterlin-Geller</surname>
          </string-name>
          ,
          <article-title>Knowing what all students know: Procedures for developing universal design for assessment</article-title>
          ,
          <source>The Journal of Technology, Learning and Assessment</source>
          <volume>4</volume>
          (
          <year>2005</year>
          ). URL: https://ejournals.bc.edu/index.php/jtla/article/view/1649.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Clow</surname>
          </string-name>
          ,
<article-title>An overview of learning analytics</article-title>,
          <source>Teaching in Higher Education</source>
          <volume>18</volume>
          (<year>2013</year>)
          <fpage>683</fpage>-<lpage>695</lpage>.
          URL: https://doi.org/10.1080/13562517.2013.827653.
          doi:10.1080/13562517.2013.827653.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bokander</surname>
          </string-name>
, Psychometric Assessments, in: S. Li, P. Hiver &amp; M. Papi (Eds.), Taylor &amp; Francis,
          <year>2022</year>, pp.
          <fpage>454</fpage>-<lpage>465</lpage>.
          URL: http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-58770.
          doi:10.4324/9781003270546-36.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Moses</surname>
          </string-name>
          ,
<article-title>A Review of Developments and Applications in Item Analysis</article-title>,
          <year>2017</year>, pp.
          <fpage>19</fpage>-<lpage>46</lpage>.
          doi:10.1007/978-3-319-58689-2_2.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Forkosh-Baruch</surname>
          </string-name>
          ,
          <article-title>Challenges for information technology supporting educational assessment</article-title>
          ,
          <source>Journal of Computer Assisted Learning</source>
          <volume>29</volume>
          (
          <year>2013</year>
          )
          <fpage>451</fpage>
          -
          <lpage>462</lpage>
. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/jcal.12033. doi:10.1111/jcal.12033.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          ,
<article-title>Framework for an effective assessment: From rocky roads to silk route</article-title>
          ,
          <source>Pakistan Journal of Medical Sciences</source>
          <volume>32</volume>
          (
          <year>2017</year>
          )
          <fpage>505</fpage>
          -
          <lpage>509</lpage>
. doi:10.12669/pjms.332.12334.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben-Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <article-title>Toward more substantively meaningful automated essay scoring</article-title>
          ,
          <source>Journal of Technology, Learning, and Assessment</source>
          <volume>6</volume>
          (
          <year>2007</year>
          )
          <fpage>4</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Deane</surname>
          </string-name>
          ,
          <article-title>On the relation between automated essay scoring and modern views of the writing construct</article-title>
          ,
          <source>Assessing Writing</source>
          <volume>18</volume>
          (
          <year>2013</year>
          )
          <fpage>7</fpage>
          -
          <lpage>24</lpage>
. URL: https://www.sciencedirect.com/science/article/pii/S1075293512000451. doi:10.1016/j.asw.2012.10.002.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
<string-name><given-names>M.</given-names> <surname>O'Leary</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Yuan</surname></string-name>,
          <article-title>Artificial intelligence in educational assessment: 'breakthrough? or buncombe and ballyhoo?'</article-title>,
          <source>Journal of Computer Assisted Learning</source>
          <volume>37</volume>
          (<year>2021</year>)
          <fpage>1207</fpage>-<lpage>1216</lpage>.
          doi:10.1111/jcal.12577.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
<string-name><given-names>B.</given-names> <surname>Das</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Majumder</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Phadikar</surname></string-name>,
          <string-name><given-names>A. A.</given-names> <surname>Sekh</surname></string-name>,
          <article-title>Multiple-choice question generation with auto-generated distractors for computer-assisted educational assessment</article-title>,
          <source>Multimedia Tools and Applications</source>
          <volume>80</volume>
          (<year>2021</year>)
          <fpage>1</fpage>-<lpage>19</lpage>.
          doi:10.1007/s11042-021-11222-2.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
<string-name><given-names>S. K.</given-names> <surname>Saha</surname></string-name>,
          <string-name><given-names>D. R.</given-names> <surname>CH</surname></string-name>,
          <article-title>Development of a practical system for computerized evaluation of descriptive answers of middle school level students</article-title>,
          <source>Interactive Learning Environments</source>
          <volume>30</volume>
          (<year>2019</year>)
          <fpage>215</fpage>-<lpage>228</lpage>.
          URL: https://doi.org/10.1080/10494820.2019.1651743.
          doi:10.1080/10494820.2019.1651743.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Aw</surname>
          </string-name>
          ,
          <article-title>Automatic true/false question generation for educational purpose</article-title>
          ,
<source>in: Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)</source>,
          Association for Computational Linguistics, Seattle, Washington, 2022, pp. 61-70.
          URL: https://aclanthology.org/2022.bea-1.10. doi:10.18653/v1/2022.bea-1.10.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>