<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Determination of the Learning Performance based on Assessment Item Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Doru Anastasiu Popescu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ovidiu Doms</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolae Bold</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Mathematics and Computer Science, University of Pitești</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UASVM Bucharest, Faculty of Management and Rural Development, Slatina Branch</institution>
          ,
          <addr-line>Slatina</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The analysis of the performance of the educational process is one of the essential aspects of the contemporary approach to the educational system. Technology has permitted the analysis of various components of the learning process, which has developed into the field of learning analytics. This paper presents the model and implementation of a concept that uses learning analytics to determine the outcome of an educational process and its performance. Here, performance refers to the group understanding of a specific concept, measured using the results of systematic evaluation over a period of time. The model, called Course Item Management Generation (CIM-GET), is part of a larger model that is centered on the educational assessment process and that uses machine learning-based techniques and evolutionary algorithms to generate assessment tests used for learning purposes. The current model uses statistical and item response analysis parameters in order to create a report regarding the items within the tests that are given over a period of time to specific students within a university faculty. In the first part, the CIM-GET model is presented in the context of the larger model, called Dynamic Model for Assessment and Interpretation of Results (DMAIR); then several results obtained after the technical and statistical implementation are presented. The CIM-GET model uses items from an item dataset, extracted using machine learning-based tools by the defining keywords of the item, which also represent the topics of an item; these items form an optimal test built by a generation algorithm (e.g., a genetic algorithm). After the test is given to students, the results are stored in a database, a report is output and a list of topics that need to be revised is generated. Finally, the practical results of the presented model are shown, in order to illustrate their practical importance.</p>
      </abstract>
      <kwd-group>
<kwd>assessment</kwd>
        <kwd>education</kwd>
        <kwd>item analysis</kwd>
        <kwd>test</kwd>
        <kwd>answer evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The development of computing models and methods has permitted the creation of several
research fields that study and implement these models and methods in the educational domain.
The social context and the problems that arose regarding educational activity
were also a catalyst for this phenomenon. As a result, the activity of a typical
teacher has been changing due to the inclusion of technological developments, especially in
educational management.</p>
      <p>
We should also take into consideration the major subject of recent research in the educational
domain, which is based on creating a meaningful and effective educational process, especially
using digital technology and computing-based methods. The goal of an educational process
is the fulfillment of its learning objectives, and one of the main means of achieving them
is objective assessment. The objectivity of an assessment process depends on the
appropriate design of the assessment tools and the valid analysis of the assessment results,
which can be conceptually accomplished by the use of specific pedagogical methods, such as
Universal Design for Learning (UDL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or, more specifically, Universal Design for Assessment
(UDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], technically implemented using computing methods, such as machine learning and
evolutionary algorithms, and thoroughly studied using, for example, Learning Analytics (LA)
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or statistical indicators [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The main objective of a successful automated design and
analysis of an assessment test is to come as close as possible to a human-centered design
with similar requirements, because human experience is still hard to surpass in terms of
assessment test and item design and analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
In order to achieve such a specific objective, any type of assessment design must take
into consideration design and analysis frameworks that check four major aspects [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
communication, orientation, learning experience and evaluation, with regard to the reliability and
validity of the assessment. While communication refers to the correct reciprocal transmission
and understanding of the assessment objectives by all the participants in the assessment, the
orientation refers to the optimal choice of the assessment form based on the studied content.
As for the other two aspects, the learning experience takes into account the closeness of
the assessment to real-life situations, and the validity stands for the extent to which the
assessment objectives were accomplished. Also, several factors of the assessment must be
taken into consideration, such as subject content, electronic flexibility, language usage, format
options, time limits or a direct link with the goals and objectives of the course.</p>
      <p>This paper presents the development of the CIM-GET model, describing the architecture of the
model, its implementation and its results. In short, the model consists of a specific method
related to assessment item analysis in a specific educational context. The model is
based on the hypothesis that the statistical data related to an assessment item is influenced by
the conceptual understanding of the item subject. Also, several factors, such as item-based
factors (e.g., the degree of difficulty of the item, the theoretical / practical nature of the item,
the item type, the item number), statistical and item test factors (e.g., mean and standard deviation,
item discrimination, item attempts, reliability coefficient), student-centered factors (e.g., student
educational level) or group-centered factors (e.g., assessment score mean), which will be
presented in the next sections, are to be taken into consideration regarding the item response.
Given a specific period of educational time, such as a semester or a year, and the
periodic assessment of the students administered by a teacher, the item analysis conceptualized in
the CIM-GET model, with its practical implementation, identifies the notions that should be
elaborated further during the courses, due to lower rates of correct answers for the items that
check these specific notions during the periodic assessment tests. The model is also enhanced
by supplementary item verification mechanisms that use automated clustering of items
in order to predict whether an item is prone to lower rates of correct response
based on the factors taken into consideration, a tendency that is then confirmed by the factual
item analysis. Naturally, the automated clustering provides finer results as the factor
set taken into consideration grows.
      </p>
<p>The CIM-GET model is a modular part of an integrated model denoted Dynamic Model for
Assessment and Interpretation of Results (DMAIR), which is formed of three main components:
test generation, answer checking and response analysis. For each component, a
different model with its implementation will be described further. The DMAIR model takes into
account several areas of educational assessment, especially regarding item generation using
various methods, such as machine learning, natural language processing and evolutionary
algorithms, used to automatically generate items for a specific assessment test under several
requirements related to the test (the item subject, the degree of difficulty of the item, the
theoretical / practical nature of the item and the item type). The CIM-GET model is used to
analyze the answers, being responsible within the integrated model for the Answer Evaluation (AE)
and Item Analysis (IA) parts.</p>
<p>For the detailed description of the model and its implementation, the paper is structured in
several sections. In the next section, several literature landmarks and trends from the relevant
research fields are presented. The following section presents the description of the CIM-GET model
and the integrated model that CIM-GET is part of, followed by a short description of a web-based
implementation of the model and several results that show the practical potential of an
implementation of this model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
Extensive literature has been published regarding the optimization of the assessment process
with respect to the design and analysis of the assessment components. Most of
the automated educational assessment area consists of the development of assessment models
and tools for Question Generation (QG) and Answer Evaluation (AE). An important part of the
AE branch is dedicated to Automated Essay Scoring (AES), as shown in [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ], which has
been researched extensively in recent years.
      </p>
      <p>
Regarding the QG branch, the majority of research papers have been directed at the generation of
objective questions, such as multiple-choice [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], true-false [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or open-cloze questions
[14, 15]. Classical subjects of QG research are related to the formulation of
questions from learning material, while recent research has been extensively related to
sentence-to-question generation [16] and the generation of questions from any type of text
[17], including by artificial intelligence [18]. In order to gauge the extent of the research on
this subject, an empirical search of article subjects in scientific databases revealed that
the topic attracts wide interest in the research community. This search was based on
specific keywords (e.g., the keyword "automatic question generation");
the Google Scholar paper database returned 292 unique results for
2022. As for the methods used to accomplish this task, one of the most used is
Natural Language Processing (NLP), which has been developed and refined over time.</p>
      <p>For the AE branch, the research is focused on the analysis of short and essay answers, which
also uses NLP-based techniques in order to accomplish an accurate analysis of the text
of the answer. One of the most researched topics is the evaluation of the correctness of the
response, especially for specific types of questions (e.g., multiple-choice questions [19, 20]).
However, an increasing interest can be observed in automatic answer evaluation
for essay-type items [21, 22].
      </p>
<p>Another important part of the research in educational assessment is related to Item Analysis (IA),
a field situated at the border of several domains, such as statistics, psychometrics, assessment and
education. It showcases a wide range of research topics related to the mathematical and statistical
aspects of assessment analysis [23], which remain landmarks of the item analysis topic
and are integrated intensively in learning management systems as basic functionalities for the
human-centered analysis of educational activity on a specific platform. Item analysis
is an extremely important method for studying student performance over given periods of
time [24]. For this purpose, two approaches are considered the best fitted for item analysis:
Classical Test Theory (CTT) and Item Response Theory (IRT). While CTT relies extensively on
statistical tools [25], such as proportions, averages and correlations, and is used for
smaller-scale assessment contexts, IRT has a more recent development and is studied with
respect to its more adaptive character [26]. The adaptive character of the IRT method consists
of the greater account taken of the human factor in the assessment process. One of
the most important differences between the two approaches concerns the previous learning
experience of the assessee: IRT creates an adaptive analysis based on a measurement
precision that takes into account latent-attribute values, while CTT starts with the assumption
that this precision is equal for all individuals [27]. In this paper, tangential concepts are used
for the description of the development of the CIM-GET model, especially regarding statistical
item analysis.</p>
<p>In a further development of the research literature, an important field that has recently had
serious practical implications in the educational process is Deep Knowledge Tracing [28]. It has
gained a lot of exposure during the recent period of continuous development of online education,
due to the fact that it proposes the analysis and prediction of student educational behaviour
based on previous personal learning experiences.</p>
<p>In order to accomplish the purpose of the current paper, we use these literature concepts and
follow a specific approach with conceptual bases in several aspects of the cited literature.
In this matter, the assessment item generation serves a better integration of the assessment
model, and the cited IA literature provides an introduction to the description of the CIM-GET
model.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Model description</title>
      <sec id="sec-3-1">
        <title>3.1. DMAIR integrated model</title>
<p>The DMAIR model comprises several components that are essential to an assessment system.
This system must be formed of three main functionalities: generation of items (func1 (I)), check
mechanisms (func2 (II)) and answer evaluation (func3 (III)). While the module responsible for
the generation of items, func1 (I), uses methods and tools for obtaining assessment tests suited to
specific requirements, the check mechanism, func2 (II), is related to the validation of the answers
given by the users, and the answer evaluation module, func3 (III), which will be presented further
as the CIM-GET model, introduces the item analysis for the generated
items and answers and is the most closely related to learning analytics. A visual depiction of the
model, including a graphical user interface component, is presented in the next figure.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Model structure</title>
<p>The main components of the model are the questions, named items after the generation process,
the test, the requirements and the generation mechanism. The question is a particular case of
an item, as is a request or an exercise; for this reason, questions, requests, exercises
etc. will be referred to further as items. An item q(id_q, st_q, kw_q, dd_q, ch_q, tp, it_q) is an
object formed of the following components:
• the identification number of the item, id_q, which has the role of being the unique identifier
of the item in the implementation phase;
• the statement st_q, which is formed of a phrase or a set of phrases that describes the
initial data and the requests of the item, which must be solved;</p>
          <p>
• the set of keywords kw_q, which consists of the list of keywords that best describe
the topic of the item;
• the degree of difficulty dd_q, dd_q ∈ [0, 1]; the degree of difficulty is calculated as the
ratio between the number of incorrect answers to a specific item and the total number of
answers. The degree of difficulty can also be calculated using the method presented in
[29];
• the choices set ch_q (wherever necessary), which can be formed of a list of two or more
possible answers when the item type is multiple, or is null when the item type is short or
essay;
• the theoretical or practical character of the question tp; tp ∈ {0, 1}, where 0 is theoretical
and 1 is practical;
• the item type it_q, it_q ∈ {'multiple', 'short', 'essay'}, illustrating the type of the item, whether
it has choices or the answer is a textual one, given by the user, in the case of the short and essay
types.
          </p>
<p>The item dataset, denoted further by BD1, contains items that are automatically generated
using NLP methods or introduced manually by a teacher.</p>
<p>The BD1 dataset is schematically represented in Figure 2, where we can also see the main
components of a general item: the item type (it_q), the statement st_q, the choice set ch_q, the
list of keywords kw_q and the degree of difficulty dd_q.</p>
<p>A test T(Q, dd_T, tp_T, it_T) is a set of items q_i, i = 1, ..., n, where Q is the set of items that form
the test and dd_T is the degree of difficulty of the test:
dd_T = ∑_{i=1}^{n} dd_{q_i} (1)</p>
          <p>
In the equation, dd_{q_i} consists of the degree of difficulty of the item q_i within the test T.
The other components of a test are:
• tp_T is the theoretical-practical ratio, which gives the predominant type of the test; tp_T ∈
[0, 1], the value of the ratio consisting of the proportion of theoretical questions, with
the difference 1 − tp_T being the proportion of practical questions;
• it_T introduces the predominant item type in the test; it_T is an array with three values:
[nr_m, nr_s, nr_e]. The values of the array contain the number of items of each type
within the test, nr_m being the number of multiple-choice items, nr_s being the number
of short-type items and nr_e being the number of essay-type items.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Model functionality</title>
          <p>The generation mechanism uses a predefined set of actions that describe the generation and
evaluation of the generated assessment tests.</p>
<p>The input data consists of:
• the desired subject, given by the set of nr_kw keywords provided by the user, kw_i, i = 1,
2, . . . , nr_kw;
• the number of questions required for each keyword, nr_q = (nr_q_1, nr_q_2, . . . ,
nr_q_nr_kw);
• the desired degree of difficulty dd_d;
• the desired theoretical-practical ratio tp_d;
• the desired predominant question type it_d = &lt;nr_m, nr_s, nr_e&gt;.</p>
<p>The model functionality contains the following main algorithms:
• the item generation algorithm (denoted further by GenTest), corresponding
to the functionality func1 (I), which contains the actions related to the actual generation.
These actions follow a specific set of steps:
– step 1: Each keyword is parsed and, for each of them, a cluster of questions that
have keywords similar to the current one is formed. The similarity is computed
using NLP methods and the clusters are formed using the ML-based technique
K-means. The nr_q_i number of questions is taken into consideration for each
keyword kw_i;
– step 2: The partial dataset of questions that can be used for the generation of
the test is formed. The main requirement taken into consideration is the subject of
the test;
– step 3: The test is generated based on the other requirements using a specific type of
method (e.g., genetic algorithms).
• the check mechanisms algorithm,
corresponding to the functionality func2 (II), related to the automated check of answers, which
will be developed in future research;
• the answer evaluation algorithm,
corresponding to the functionality func3 (III), presented in the CIM-GET model
section, which represents the part of the model responsible for learning analytics and
which will be described in the next section. This algorithm also uses the item prediction
algorithm, which will be presented in the last part of section 3.</p>
<p>The item generation uses the GenTest algorithm in order to generate tests using the methods
presented before, where M is the number of tests to be generated. A schematic approach of this
algorithm is:
for i = 1, M do
    select items from BD1, resulting clusters C_j, j = 1, nr_kw;
    a test T_i is generated using questions from the clusters;
    the test T_i is visually generated and given to the students
endfor</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. CIM-GET model</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Model components</title>
<p>The CIM-GET model represents a part of the DMAIR model, being one of the three main modules
presented at the end of the previous subsection. This model consists of
the answer evaluation module, which aims at the determination of the assessment performance,
especially regarding the student performance for a specific item.</p>
<p>The CIM-GET module is designed starting from the premise that an incorrect answer to an item
may indicate that the subject of the item is not fully understood, especially under certain conditions
(e.g., the other items within the test are answered correctly by the same student, the item
repeatedly receives wrong answers from multiple students etc.). The model takes into account
several factors in order to determine the direct causality between a poor understanding of the
subject and an incorrect answer to an item with the respective subject.</p>
<p>The model needs additional variables in order to be completely described. These variables are:
• the period of time P;
• the number of tests given in the period P, M;
• the frequency of the assessment;
• the number of students in the group, N.
The model structure consists of several components:
• the item q, described in the previous subsection, but with several additional characteristics
that will be presented in the next list;
• the test T, also described in the previous subsection, which will be enriched with several
statistical indicators;
• the student result S, which contains information related to the assessment results of a
specific student;
• the group of students result G, which contains statistical information related to the
assessment results of a specific group of students (e.g., class, group).</p>
<p>Within the model, an item is considered to be correctly answered (marked with 1) or incorrectly
answered (marked with 0). As a premise, for items that have fractional scores, we
will take into consideration that an item which has received 0.5 points or more is
marked as correctly answered, and as incorrectly answered if the value is less than 0.5 points.
The additional characteristics of an item q for the CIM-GET model are:
• the scores to an item obtained by all students, sc_q, stored as an array;
• the average score of the item, avg_q, which is the average value of all the responses of all
students to the item q, taking into account also the fractional values of the scores;
• the number of correct answers, nc_q, which contains the number of all correct answers
(marked with 1), as stated previously;
• the number of students that answered the item, ns_q;
• the total number of attempts, na_q, which stores the number of attempts for the item
q, for the case in which the teacher permits several attempts for an item;
• the average number of attempts, aa_q, which stores the average number of attempts for
an item q, for the case in which the teacher permits several attempts for an item;
• the standard deviation sd_q, calculated as the usual standard deviation of the item scores,
where sc_{q,i} are the item scores for the item q, avg_q is the average score of the item q
and N is the number of students that responded to the item:
sd_q = √( ∑_{i=1}^{N} (sc_{q,i} − avg_q)² / N ) (2)
• the upper-lower counts UC_q and LC_q, considering that the group of students is split into
three groups by score: high (27% of students), middle (46% of students) and low (27% of students);
thus, UC_q equals the count of correct answers from the upper 27% of N and LC_q
equals the count of correct answers from the lower 27% of N; for example, from a list of
100 answers sorted descendingly by score, UC_q would be the number of correct answers
from the first 27 answers and LC_q would be the number of correct answers from the last
27 answers;
• the item discrimination disc_q, disc_q ∈ [−1, +1], which determines for an item the amount
of discrimination between the responses of the upper and the lower group and which is
calculated using the following formula, where UC_q is the upper count, LC_q is the lower
count and N is the number of students that responded to the item:
disc_q = (UC_q − LC_q) / (0.27 × N) (3)
• the point biserial pb_q, pb_q ∈ [−1, +1], which shows whether the item discriminates
high-performing students from low-performing students, determining whether the question is
well written, and which is calculated as a Pearson correlation coefficient between a student's
score on the item q and the number of that student's correct
answers to the other items than q in the test.</p>
<p>The additional characteristics of a given test T for the CIM-GET module are:
• the test length ln_T, related to the number of questions in the test;
• the average score of the test avg_T;
• the diversity index of the item type DI_T, which shows the diversity of the item types
taken into consideration (multiple-choice, short answer and essay) and which is calculated
as a Simpson's Index of Diversity, as follows, where nr_m is the number of multiple-choice
items, nr_s the number of short-type items, nr_e the number of essay-type items and
n = nr_m + nr_s + nr_e:
DI_T = 1 − (nr_m(nr_m − 1) + nr_s(nr_s − 1) + nr_e(nr_e − 1)) / (n(n − 1)) (4)
The characteristics of a student results component S are:
• the total score of a student on all tests, ts_S;
• the average score of a student on all tests, avg_S;
• the total scores of a student on individual tests, ts_{S,T};
• the average scores of a student on individual tests, avg_{S,T};
• the total score of a student on the items of the same subject, ts_{S,kw}.</p>
<p>The characteristics of a group results component G are:
• the average score of a group on all tests, avg_G;
• the average scores of a group on individual tests, avg_{G,T}.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Model functionality</title>
<p>The model has a simple premise and is built on the generation phase of the items. In short, after
the GenTest algorithm is applied, the evaluation of the items is made, using the methodology
described previously. A visual representation of the model can be seen in Figure 3.</p>
<p>The functionality of the model consists of the actions that can be performed within the model.
The two main actions are:
1. the determination of the subjects that need to be revised, based on the answers given by
the students to the periodic assessments (the report generation algorithm);
2. the determination of the probability of an item being correctly answered by a student or
by a group, using k-means clustering (the item prediction algorithm).</p>
          <p>The algorithm  consists in the navigation of the following steps:
1. The students log in and solve the tests.
2. For each student and a specific test, a report is generated, created by following the next
steps:
a) The items that have obtained lower values of _ and _ are filtered.
b) The values of the item parameters _, _, _, _, _ , _ are verified.
c) The subjects of the items are then extracted and verified to have obtained lower
values for _ and _ in other items with the same subject for a large number of
students.
3. The subjects of the items that validate the rule presented in substep 2c) are output.
4. The reports are introduced in a dataset of reports, referred further as 2.</p>
<p>A schematic approach of this algorithm is:
for i = 1, M do
    for j = 1, N do
        student S_j solves test T_i;
        a report R_{i,j} is generated for S_j;
        R_{i,j} is introduced in BD2
    endfor
endfor</p>
          <p>The algorithm   consists in applying a k-means clustering to the set of data, which
will be a training set for the algorithm, in order to determine whether an item is likely to be
answered correctly / incorrectly for a specific student. In this matter, two clusters are formed,
 and , correspondent to the probability of an item to be responded correctly
/ incorrectly by a student. The training data will consist in the next values: _, _ , _, ,
_, _. This algorithm will be extended in a further research.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Implementation and results</title>
<p>An implementation of the test generation part of the DMAIR model has been made; it is presented
in previous research, such as the one described in [29].</p>
      <p>The implementation was made using the PHP web programming language and the interface
was created using the Bootstrap library, which is based on HTML, CSS and JavaScript languages.
A representation of the interface for the generation of tests component and the item analysis
component is shown in the next figure.</p>
<p>As for the results obtained with the implementation, a specific context with several
parameters was considered. The item dataset BD1 is not presented in this paper due to the
large amount of data, but it is available in a repository [30]. The tests and the items related to them
are presented in Table 1.</p>
<p>The initial context was considered to be formed of a group of 20 students who participated in
an ICT course for a period of a semester (14 weeks); 5 tests were given during
this period. Each test was generated so as to contain 5 questions on specific subjects related
to the usage of various applications (Word, Excel) or notions regarding the Internet, programming
and operating systems. The type of all questions was multiple-choice. The columns presented
in Table 1 show the item unique identifier (id_q), the item statement (st_q), the item list of
keywords (kw_q) and the degree of difficulty of the item (dd_q).</p>
<p>For the items described in Table 1, the responses were analyzed by determining the values
of the model parameters taken into account. For this specific example, the score was
equal to the number of correct responses, due to the fact that every question had a score of 1
point. The results are shown in Table 2. The columns presented in Table 2 show the degree of
difficulty (dd_q), the standard deviation (sd_q), the item discrimination (disc_q), the point biserial
(pb_q), the average score (avg_q) and the number of correct answers (nc_q).</p>
<p>After the analysis of the responses, several items were determined to be more difficult than the
others, and the list of subjects that can be revised, obtained after the analysis of the
results, contains topics such as operating systems, Windows OS, programming, Microsoft
Word, formatting, algorithm, algorithm characteristics and practical applications related to
programming. The number of items that were selected was approximately 27% of
the total number of items. The selection of the items was made based on a threshold with
statistical meaning, as in the case of the upper-lower count, namely 27% of the total
number of items or, in the case of large sets of items, the items that obtained a score lower than
27% of the maximum score of the test. The items that generated these revisable topics were
Q2 from Test 1, Q4 and Q5 from Test 2 and Q2 and Q5 from Test 4, which obtained the lowest
number of correct responses.</p>
<p>The item analysis confirmed that the mentioned items had the highest degree of difficulty.
The other parameters related to the validity of the test showed that the majority of the questions
were designed properly. For each parameter, the following results were obtained:
• The item discrimination (disc_q) showed that the majority of items with a lower degree
of difficulty were not good discriminators, while the more difficult ones discriminated
better between the best scores and the lower ones, which is desirable in an assessment
test.
• The point-biserial coefficient shows that several items can be improved in order to form a
well-designed test. Values below 0.1 indicate items that can be improved as
discriminators, especially among those with higher degrees of difficulty.
• The score, the degree of difficulty and the standard deviation were correlated
(the items with values of sd_q between 0.40 and 0.46 were the ones
with the lowest score and the highest degree of difficulty).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
<p>The most important part of assessment performance is related to the good design of the
assessment test. The implementation of this model provides a really useful tool
for a good design of the items of the test and, at the same time, provides information related
to topics that should be revised during a period of time in an educational context. Thus,
this implementation can be extremely helpful for the determination of the subjects that need
additional time for teaching and understanding. The model proves to be viable, due to the
nature of the issue it responds to and the methods used to solve this issue. Given the fact
that assessment is currently one of the most researched topics in education, the model and its
implementation can be considered important for obtaining a well-designed test, once
the implementation is scaled to more general environments.</p>
<p>Regarding the issue of the determination of the revisable topics during an educational period of
time based on assessment, the traditional approach of item analysis was a starting point
that allowed both the usage of proven scientific tools for the analysis of the items and
the responses, and a checking tool to validate the results obtained using the approach of
the CIM-GET model. In this matter, the statistical data resulting from the item analysis approach
has proven to be a standard validator for the methods used in the described model.</p>
      <p>As future work, the model will be improved with an automatic answer checking tool, together
with the refinement of the tools presented in the paper. At the same time, the semi-automated
aspects of the model will be made fully automatic in future research. Also,
the implementation and documentation of the DMAIR model will be completed and described
in further research, leading to an assessment tool which can
provide useful and accurate results for the assessment process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1]
          <string-name><given-names>S. L.</given-names> <surname>Craig</surname></string-name>,
          <string-name><given-names>S. J.</given-names> <surname>Smith</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Frey</surname></string-name>,
          <article-title>Professional development with universal design for learning: supporting teachers as learners to increase the implementation of UDL</article-title>,
          <source>Professional Development in Education</source>
          <volume>48</volume>
          (<year>2019</year>)
          <fpage>22</fpage>-<lpage>37</lpage>.
          doi:10.1080/19415257.2019.1685563.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Ketterlin-Geller</surname>
          </string-name>
          ,
          <article-title>Knowing what all students know: Procedures for developing universal design for assessment</article-title>
          ,
          <source>The Journal of Technology, Learning and Assessment</source>
          <volume>4</volume>
          (
          <year>2005</year>
          ). URL: https://ejournals.bc.edu/index.php/jtla/article/view/1649.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Clow</surname>
          </string-name>
          ,
<article-title>An overview of learning analytics</article-title>,
          <source>Teaching in Higher Education</source>
          <volume>18</volume>
          (<year>2013</year>)
          <fpage>683</fpage>-<lpage>695</lpage>.
          URL: https://doi.org/10.1080/13562517.2013.827653.
          doi:10.1080/13562517.2013.827653.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bokander</surname>
          </string-name>
, Psychometric Assessments, in: S. Li, P. Hiver &amp; M. Papi (Eds.), Taylor &amp; Francis,
          <year>2022</year>, pp.
          <fpage>454</fpage>-<lpage>465</lpage>.
          URL: http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-58770.
          doi:10.4324/9781003270546-36.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Moses</surname>
          </string-name>
          ,
<article-title>A Review of Developments and Applications in Item Analysis</article-title>,
          <year>2017</year>, pp.
          <fpage>19</fpage>-<lpage>46</lpage>.
          doi:10.1007/978-3-319-58689-2_2.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Forkosh-Baruch</surname>
          </string-name>
          ,
          <article-title>Challenges for information technology supporting educational assessment</article-title>
          ,
          <source>Journal of Computer Assisted Learning</source>
          <volume>29</volume>
          (
          <year>2013</year>
          )
          <fpage>451</fpage>
          -
          <lpage>462</lpage>
. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/jcal.12033. doi:10.1111/jcal.12033.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          ,
<article-title>Framework for an effective assessment: From rocky roads to silk route</article-title>
          ,
          <source>Pakistan Journal of Medical Sciences</source>
          <volume>32</volume>
          (
          <year>2017</year>
          )
          <fpage>505</fpage>
          -
          <lpage>509</lpage>
. doi:10.12669/pjms.332.12334.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben-Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <article-title>Toward more substantively meaningful automated essay scoring</article-title>
          ,
          <source>Journal of Technology, Learning, and Assessment</source>
          <volume>6</volume>
          (
          <year>2007</year>
          )
          <fpage>4</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Deane</surname>
          </string-name>
          ,
          <article-title>On the relation between automated essay scoring and modern views of the writing construct</article-title>
          ,
          <source>Assessing Writing</source>
          <volume>18</volume>
          (
          <year>2013</year>
          )
          <fpage>7</fpage>
          -
          <lpage>24</lpage>
. URL: https://www.sciencedirect.com/science/article/pii/S1075293512000451. doi:10.1016/j.asw.2012.10.002.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
<string-name><given-names>M.</given-names> <surname>O'Leary</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Yuan</surname></string-name>,
          <article-title>Artificial intelligence in educational assessment: 'breakthrough? or buncombe and ballyhoo?'</article-title>,
          <source>Journal of Computer Assisted Learning</source>
          <volume>37</volume>
          (<year>2021</year>)
          <fpage>1207</fpage>-<lpage>1216</lpage>.
          doi:10.1111/jcal.12577.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
<string-name><given-names>B.</given-names> <surname>Das</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Majumder</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Phadikar</surname></string-name>,
          <string-name><given-names>A. A.</given-names> <surname>Sekh</surname></string-name>,
          <article-title>Multiple-choice question generation with auto-generated distractors for computer-assisted educational assessment</article-title>,
          <source>Multimedia Tools and Applications</source>
          <volume>80</volume>
          (<year>2021</year>)
          <fpage>1</fpage>-<lpage>19</lpage>.
          doi:10.1007/s11042-021-11222-2.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
<string-name><given-names>S. K.</given-names> <surname>Saha</surname></string-name>,
          <string-name><given-names>D. R.</given-names> <surname>CH</surname></string-name>,
          <article-title>Development of a practical system for computerized evaluation of descriptive answers of middle school level students</article-title>,
          <source>Interactive Learning Environments</source>
          <volume>30</volume>
          (<year>2019</year>)
          <fpage>215</fpage>-<lpage>228</lpage>.
          URL: https://doi.org/10.1080/10494820.2019.1651743.
          doi:10.1080/10494820.2019.1651743.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Aw</surname>
          </string-name>
          ,
          <article-title>Automatic true/false question generation for educational purpose</article-title>
          ,
<source>in: Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)</source>,
          Association for Computational Linguistics, Seattle, Washington, 2022, pp. 61-70.
          URL: https://aclanthology.org/2022.bea-1.10. doi:10.18653/v1/2022.bea-1.10.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>