AUTOMATIC TEST GENERATION: AN ALGORITHM FOR AN AUTOMATED TESTING SYSTEM Roman Horbatiuk 1, Taras Sitkar 1, Roman Lutsyshyn 1, Stepan Sitkar 1 and Mykhailo Ozhha 1 1 Ternopil Volodymyr Hnatiuk National Pedagogical University, Maksyma Kryvonosa street, 2, Ternopil, 46001, Ukraine Abstract The main objective of this paper is to present the development of an automated testing system (ATS) for assessing and improving knowledge quality in educational institutions. The paper will explore the design, implementation, and assessment of the ATS, highlighting its significant features, benefits, limitations, and potential areas for further research. Furthermore, the paper will also discuss the system's impact on enhancing students' knowledge acquisition, retention, and application. Finally, the study will emphasize the importance of incorporating technology into the education sector to foster better learning outcomes, thus contributing to the ongoing discourse in the field of educational technology. Keywords Testing system, assessing knowledge quality, ATS, algorithm. 1. Introduction An essential component of the educational system is evaluating students' knowledge. It helps teachers determine how well students are absorbing material and where they might need more assistance [1]. Additionally, it gives students feedback on their progress, which is crucial for their growth. Written exams, projects, and presentations are just a few examples of the various assessment formats. The use of formative assessment methods to improve student learning has received more attention in recent years [2]. Formative assessment is a continuous process that offers feedback to teachers and students frequently rather than just at the conclusion of a unit or course. There are many different ways to conduct formative assessments, including tests, surveys, and games [3]. These assessments are frequently low-stakes, which means they don't factor into the student's final grade [4]. Instead, they give students immediate feedback on how well they comprehend a particular idea or subject. This feedback can then be used by the teacher to adjust their instruction and provide additional support to students who may be struggling [5]. Students who participate in formative assessments report a greater sense of progress and accomplishment in their learning. One of the main advantages of formative assessment is that it enables teachers to spot potential problem areas in students at an early stage [6]. By doing this, teachers can keep their students from falling behind or losing motivation. Additionally, formative evaluations can encourage students to adopt a growth mindset, because they realize that their knowledge and understanding can be improved with effort and practice rather than being fixed at a certain level [7]. Teachers can use a variety of formative assessment techniques in their classes. Exit tickets, for instance, are a quick and simple way to evaluate students' comprehension at the conclusion of a lesson. Proceedings ITTAP'2023: 3rd International Workshop on Information Technologies: Theoretical and Applied Problems, November 22–24, 2023, Ternopil, Ukraine, Opole, Poland EMAIL: gorbaroman@gmail.com (A. 1); sitkar@gmail.com (A. 2); lutsyshyn.ds@gmail.com (A. 3); sitkars@gmail.com (A. 4); misha.ochga@gmail.com (A. 5) ORCID: 0000-0002-1497-1866 (A. 1); 0000-0002-5120-341X (A. 2); 0000-0002-3390-874X (A. 3); 0000-0003-4599-454X (A. 4); 0000-0002-6954-0318 (A. 5) ©️ 2020 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) Students can be asked to write down one new concept they learned or one unanswered question they have about the subject. Future instruction can then be modified using this feedback [8]. Peer assessments, self-assessments, and reflective journals are additional examples of formative assessments [11]. Formative evaluations can be used to guide school-wide decision-making in addition to giving students and teachers feedback. Data from formative assessments, for instance, can be used to pinpoint areas where teachers or students might need more help or professional development [12]. The creation of school-wide initiatives to enhance student learning outcomes can then be guided by this information. The use of formative assessments in the classroom comes with some difficulties. Making sure that the assessments are in line with the course's learning objectives and standards is a common challenge [13]. Additionally, designing and implementing formative assessments can take a lot of time, particularly if teachers employ a variety of methods. The teacher must also possess a certain level of expertise to interpret the data from formative assessments and use it effectively to guide instruction [14]. Despite these difficulties, formative assessment is an effective method for fostering learning and participation among students in the classroom. Formative assessments help to make sure that all students have the support and resources they need to succeed by giving students and teachers continuous feedback. They are crucial to any successful educational system as a result. While formative assessments have numerous advantages, they can be difficult for teachers to implement and manage effectively. Designing and administering assessments, for example, can be time-consuming, especially if teachers use a variety of techniques. Furthermore, collecting and analyzing data from formative assessments can be challenging, especially if teachers use paper-based methods. These difficulties can make it difficult for teachers to consistently use formative assessments and provide students with the feedback they require to succeed. Many of these issues could be addressed with the help of an automated formative assessment system. A system like this could help teachers use formative assessment techniques more consistently by streamlining the process of designing and administering assessments [9]. Furthermore, an automated system could collect and analyze formative assessment data, providing teachers with valuable insights into student understanding and performance [10]. Teachers could save time and focus more on providing students with the support and feedback they require by automating many of the tasks associated with formative assessment. Using an automated system for formative assessment has several potential advantages. An automated system, for example, could aid in the alignment of assessments with learning objectives and standards, making it easier for teachers to design effective assessments [16]. Furthermore, an automated system could provide students with immediate feedback on their understanding of a specific concept or topic, potentially increasing engagement and motivation [17].
Finally, an automated system could help to ensure that all students, regardless of teacher or classroom, have access to the same assessments and feedback. However, there are some potential drawbacks to using an automated system for formative assessment. For example, ensuring that the system is accessible to all students, including those with disabilities or who do not have access to technology at home, may be difficult. Additionally, teachers may require training and support to effectively use the system, particularly if they are unfamiliar with technology-based assessment methods. Finally, there may be concerns about student data privacy and security, especially if the system is hosted by a third-party provider. Despite these obstacles, an automated formative assessment system has the potential to be a useful tool for promoting student learning and engagement. A system like this could help to ensure that all students have the support and resources they need to succeed by streamlining the process of designing and administering assessments and providing immediate feedback to students. As such, it is a promising area for further research and development. 2. Development of an automated system for generating test tasks to assess the quality of knowledge 2.1. Description of the test task generation process Neural networks are a type of artificial intelligence that is increasingly being used in educational settings. Neural networks are built to mimic the structure and function of the human brain, and they can learn from data and make predictions or classifications based on it. The feedforward neural network is one of the most common types of neural networks used in education. Feedforward neural networks are designed to process input data through a series of layers, with each layer made up of a group of neurons that perform a specific function. The output of one layer is fed as input to the next layer, and the network's output is provided by the final layer. Feedforward neural networks can be used in education for a variety of tasks, including predicting student performance and identifying student misconceptions. The recurrent neural network is another type of neural network that has been used in education. Recurrent neural networks are built to process sequential data like text or speech by allowing information to flow from one time step to the next. As a result, recurrent neural networks are well-suited to tasks like language modeling and speech recognition. Many potential applications for neural networks in education exist, including personalized learning, student performance prediction, and student modeling. A neural network, for example, could be trained on data from previous students to predict how well a current student will perform on a specific task. This data could then be used to personalize instruction for the student, as well as provide additional support if necessary. Neural networks can also be used for student modeling, which is the process of creating a model of a student's knowledge and skills based on interactions with a learning system. This model can be used to provide personalized feedback and support to students in order to help them better understand the material. Overall, neural networks have a wide range of potential applications in education, and their use is likely to grow as more research in this area is conducted.
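To make the idea of a feedforward network concrete, the short NumPy sketch below (our illustration, not the system described in this paper) passes a small vector of student features through two layers, so that the output of each layer becomes the input of the next; the feature names and sizes are assumptions chosen only for the example.

import numpy as np

def relu(x):
    # Activation function that introduces nonlinearity between layers
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes the final score into a 0..1 "probability of success"
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Toy input: three features for one student (e.g. prior grade, attendance, quiz score)
x = np.array([0.7, 0.9, 0.4])

# Randomly initialised weights for a 3 -> 4 -> 1 feedforward network
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer

h = relu(W1 @ x + b1)       # hidden layer output becomes the next layer's input
y = sigmoid(W2 @ h + b2)    # final prediction, e.g. probability of passing a task

print("Predicted probability of success:", float(y))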
However, there are some drawbacks to using them, such as the need for large amounts of data and the possibility of bias in the data used to train the networks. Many potential benefits of neural networks for assessing student knowledge include their ability to process large amounts of data quickly and accurately, learn from experience, and adapt to changing circumstances. The following are some of the specific benefits of using neural networks for student assessment: 1. Increased precision: Neural networks can make highly accurate predictions and classifications based on complex data sets. When compared to traditional assessment methods, this can result in more accurate assessments of student knowledge. 2. Personalized assessment: Because neural networks can be trained on data from individual students, more personalized assessments that account for each student's unique strengths and weaknesses are possible. 3. Immediate feedback: Neural networks can provide students with immediate feedback, allowing them to correct misconceptions and improve their understanding of the material faster. 4. Time-saving: Neural networks can process large amounts of data quickly, which can save time for teachers and other educators. Despite these benefits, there are several drawbacks to using neural networks for student assessment. Among these limitations are: 1. Limited generalizability: Because neural networks are only as good as the data on which they are trained, they may not be effective in assessing students in contexts that differ significantly from the training data. 2. Potential bias: If the training data is biased, neural networks can be biased, leading to inaccurate assessments and reinforcing existing inequalities. 3. Lack of transparency: Because neural networks are difficult to interpret, it can be difficult to understand how they reach their conclusions and assess their reliability. 4. Technical requirements: Designing, training, and implementing neural networks requires significant technical expertise, which can be a barrier for some educators. Overall, while neural networks have many potential advantages for assessing student knowledge, they must be carefully considered in terms of their limitations and potential biases before being used in educational settings. Description of the data collection process Any automated system for assessing student knowledge must include a data collection process. A neural network must be trained on a large dataset that is representative of the population it will be used to assess in order to accurately assess student knowledge. Several steps are involved in the data collection process, including: Identifying the variables of interest: The first step in data collection is to identify the variables that will be used to assess student knowledge. These variables could include student demographics, prior academic performance, and assessment results. Choosing the sample: After identifying the variables of interest, the next step is to choose a sample of students to participate in the study. The sample should be representative of the population being evaluated and large enough to allow the neural network to be trained on a diverse set of data. Data collection: Once the sample has been chosen, data is collected using a variety of methods. This could include giving tests, gathering information from online learning platforms, or conducting interviews or surveys. 
Cleaning and organizing the data: After collecting the data, it must be cleaned and organized to ensure that it is accurate and usable. This may entail removing outliers or errors, converting data into a usable format, and labeling the data to indicate the correct answer or level of comprehension. Splitting the data into training and testing sets: The final step in the data collection process is to divide the data into two groups: training and testing. The neural network is trained using the training set, and its performance is evaluated using the testing set. To avoid overfitting, ensure that the testing set is distinct from the training set. Overall, data collection is an important step in creating an automated system for assessing student knowledge. Before training the neural network, it is critical to carefully consider the variables of interest, select a representative sample, and ensure that the data is accurate and usable. Overview of the neural network architecture used for the system The neural network architecture used in an automated system for assessing student knowledge can vary depending on the system's specific needs. However, several common components are typically included in the architecture, such as: Input layer: The input layer is in charge of receiving data that will be used to evaluate student knowledge. Data on student demographics, academic performance, and previous assessment results could all be included. Hidden layers: The neural network's hidden layers are in charge of processing input data and making predictions about student knowledge. The number of hidden layers and neurons in each layer can vary according to the difficulty of the problem being solved. Activation functions: Activation functions are used to introduce nonlinearity into the neural network, allowing it to model complex relationships between input data and predicted output. Output layer: The neural network's output layer is in charge of producing final predictions about student knowledge. Predictions about a student's understanding of specific concepts or overall performance on an assessment could be included. Loss function: The loss function calculates the difference between the predicted and actual output. This enables the neural network to adjust its weights and biases over time in order to improve its predictions. Optimization algorithm: The optimization algorithm is used during training to update the neural network's weights and biases in order to minimize the loss function. Overall, the architecture of the neural network used in an automated system for assessing student knowledge will be determined by the system's specific requirements. However, by incorporating input layers, hidden layers, activation functions, output layers, loss functions, and optimization algorithms, a neural network capable of accurately assessing student knowledge can be designed. Explanation of the training and validation process The training and validation process is an essential step in creating an automated system for assessing student knowledge using neural networks. The procedure entails training the neural network on a subset of the available data and validating its performance on another subset of the data. This procedure ensures that the neural network can generalize to new data and is not overfitting to the training data. The neural network is trained by feeding it input data and the corresponding output labels, which could be the correct answer or a level of understanding. 
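Before continuing with the details of the training loop, the following minimal sketch shows how the components just listed (input and hidden layers, activation functions, output layer, loss function, optimizer) and the training/validation split might fit together in practice. It uses Keras with synthetic data; all feature counts, layer sizes and hyperparameters are illustrative assumptions rather than the actual configuration of the system described here.

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Synthetic stand-in data: 500 students, 10 features each (demographics,
# prior performance, previous assessment results); label = mastered topic or not
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype("float32")

# Split into training and validation sets to detect overfitting
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = keras.Sequential([
    keras.Input(shape=(10,)),                     # input layer
    keras.layers.Dense(32, activation="relu"),    # hidden layers with nonlinear activations
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # output: predicted level of understanding
])

# Loss function and optimization algorithm adjust the weights and biases during training
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Validation data is monitored every epoch; rising validation loss signals overfitting
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=20, batch_size=32, verbose=2)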
Based on the input data and the current values of its weights and biases, the neural network makes predictions. A loss function, such as mean squared error or cross-entropy loss, is used to compare these predictions to the actual output labels. The optimization algorithm is then used to adjust the weights and biases to minimize the loss function and improve prediction accuracy. It is critical to monitor the neural network's performance on a separate validation set during the training process. This set of data is not used for training; rather, it is used to evaluate the neural network's performance as training progresses. This assists in determining whether the model is overfitting to the training data, which means it is learning specific examples from the training set rather than general concepts that could be applied to other unseen examples. Validation metrics such as accuracy, precision, recall, and F1 score are used to assess the neural network's performance on the validation set. If the neural network's performance on the validation set does not improve, it may be necessary to adjust the neural network's hyperparameters, such as the learning rate or the number of hidden layers, or to restructure the neural network. Overall, the training and validation process is essential for creating an accurate and dependable automated system for assessing student knowledge. It is possible to develop a system that can accurately assess student knowledge in a variety of contexts by monitoring the neural network's performance during the training process and adjusting its structure and hyperparameters as needed.
2.2. Automatic Generation of Multiple Choice Questions
We set out to create a system that uses computational tools to produce MCQs from English texts automatically. Such an approach reduces complexity and time, since questions can be generated without additional manual intervention. We drew connections between the different approaches and methods found in the literature on Question Generation in English. In an early stage of the work, with the intention of testing some of those resources, evolving the design, and determining whether they were feasible for future work (a preliminary trial was conducted after exploring various techniques from related works), we developed a workflow inspired by what we had learned and applicable to our goal. The system pipeline, divided into four stages (Pre-processing, Answer Selection, Question Generation, and Distractor Selection), is shown in Figure 1. Essentially, Pre-processing prepares the text for the following stages, while Answer Selection chooses answer candidates from that same text, which serve as the foundation for generating questions. As its name implies, Question Generation comprises the methods that produce the text of the questions (the stems). Question generation is performed iteratively over the selected answers: for each selected answer the system generates a suitable question and then moves on to the next answer. Finally, in Distractor Selection we use methods that select candidate distractors, i.e. the wrong options offered alongside the correct answer. Two alternative paths for producing questions were implemented; Answer Selection and Distractor Selection are common to both, and the difference between them lies in Question Generation.
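The four-stage pipeline can be pictured as a chain of functions, each feeding the next. The sketch below is only a structural outline with placeholder implementations (the function names and their trivial bodies are ours, not the system's); the concrete methods behind each stage are described in the following subsections.

from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    answer: str
    distractors: list

def preprocess(text: str) -> list:
    # Stage 1: split the source text into smaller passages
    # (co-reference resolution, lemmatization, etc. would also happen here)
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def select_answers(passage: str) -> list:
    # Stage 2: pick candidate answers (named entities, noun chunks, ...)
    # Placeholder: capitalised tokens stand in for real candidates
    return [tok for tok in passage.split() if tok.istitle()]

def generate_question(passage: str, answer: str) -> str:
    # Stage 3: turn the passage + answer into a question stem
    return f"Which term fits this context: '{passage[:60]}...'?"

def select_distractors(question: str, answer: str) -> list:
    # Stage 4: produce plausible but wrong options
    return [f"not-{answer}-{i}" for i in range(3)]

def build_mcqs(text: str) -> list:
    items = []
    for passage in preprocess(text):
        for answer in select_answers(passage):
            q = generate_question(passage, answer)
            items.append(MCQ(q, answer, select_distractors(q, answer)))
    return items

if __name__ == "__main__":
    for mcq in build_mcqs("Coimbra is a city in Portugal.\n\nQueen were a British rock band."):
        print(mcq)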
Although both methods are rooted in related work, the reason for implementing them is not only to compare them but also the degree of control we have over each. In the rule-based path we have complete control, and the rules can be changed at any time. The other path follows related work that relies on ANNs and data-driven models. Figure 1. Diagram of the system operation. Detailed explanations of the system stages are presented below, along with some examples, as well as the evaluation types and criteria chosen. In pre-processing, we prepare the input text as follows: the text is divided into smaller pieces if it is too long. To deal with time or processing constraints, it may be useful to sort the documents according to the relevance of their sentences. Alternatively, summarization can alleviate this issue by keeping the main concepts of the text while reducing its size, which can be advantageous in practice. Lemmatization and punctuation removal are two further pre-processing methods that can be used to normalize the words in a document. In our work we focused on the resolution of co-references. Pronouns and other referring expressions that make sense in running text are typically harder to handle because their meaning depends on other expressions in the context. When questions are generated automatically, these same words can end up in the questions, making them difficult to understand in isolation. To prepare the contents for the subsequent stages of the pipeline, it is advisable to substitute such expressions with the ones they refer to, which are easier to identify than the originals. The neuralcoref library was used to perform co-reference resolution, which allows us to link text spans that refer to the same thing and to rebuild the whole text with the referring expressions replaced. We used the "wikipedia" library to retrieve Wikipedia articles for testing and evaluation, which were used as source texts and as reference material for our dataset (Figure 2). Figure 2. Illustration of co-reference resolution on the Coimbra article from Wikipedia. We loaded the articles using a function that splits a document based on its section titles; the former section boundaries, marked by newlines and blank lines, were used to divide the file into sections.
Answer Selection
The next step, after the text has been divided into smaller documents and its content processed, is to identify terms or expressions that could be used as answers. The system generates questions from answers, so the answers have to be obtained first, similarly to answer-aware Transformer techniques for Question Generation. To achieve this, we can rely on statistical or linguistic information, such as TF-IDF, named entities, part-of-speech tags, etc. Texts often contain key concepts or ideas about certain topics, particularly in more domain-oriented texts. Word frequency can also serve as an indicator of usefulness: it can be used to spot potentially non-relevant words, such as stop words, and eliminate them. PoS tagging or shallow parsing can help identify the syntactic function of words, and we can also use words from a specific class (such as nouns or noun chunks) as answer candidates.
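A hedged sketch of this pre-processing step, together with the rule-based candidate extraction it feeds, is given below. It assumes the "wikipedia" and "neuralcoref" packages mentioned above (note that neuralcoref works with spaCy 2.x models); the choice of article title, the excerpt length and the way candidates are printed are our own illustrative choices, not details taken from the system.

import wikipedia       # used to fetch source articles from Wikipedia
import spacy
import neuralcoref     # co-reference resolution; requires a spaCy 2.x pipeline

# Load a spaCy English model and attach the co-reference resolver
nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

# Fetch a source article (the paper uses Wikipedia pages such as "Coimbra")
text = wikipedia.page("Coimbra").content

# Resolve co-references: pronouns and other referring expressions are replaced
# by what they refer to, so generated questions are understandable in isolation
doc = nlp(text[:2000])              # a short excerpt keeps the example fast
resolved_text = doc._.coref_resolved

# Rule-based answer candidates: named entities (with labels) and noun chunks
resolved_doc = nlp(resolved_text)
entity_candidates = [(ent.text, ent.label_) for ent in resolved_doc.ents]
chunk_candidates = [chunk.text for chunk in resolved_doc.noun_chunks]

print(entity_candidates[:10])
print(chunk_candidates[:10])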
Similarly, NER can be used to detect mentions of specific entities. The entity type can then determine how the sentence will be transformed into a question. In our development we tested different methods: clauses, unigrams (single words), bigrams, trigrams, combinations of terms, named entities, and noun chunks. We also experimented with Transformers for this purpose; the answer selection part was isolated from existing question generation pipelines. The final version, based on the evaluation of all these methods, includes named entities, noun chunks, and candidates selected by a Transformer. Our approach analyzes each sentence in isolation while searching for answer candidates. For named entities, we employ SpaCy to identify them and their entity labels. The process for noun chunks is similar, with the added step of assigning an entity label to each noun. Our Transformer method is based on the prepend question generation pipeline described below, from which we isolate the part that selects answers grounded in a given context and assign the corresponding entity labels, just as we do for noun chunks.
Question Generation
Once a candidate answer and the appropriate context (such as the sentence it comes from) are available, the objective is to generate the text of the question, i.e. the stem. In some systems, the term in the sentence that will be the answer is simply replaced with a blank space, so the sentence remains declarative. However, we decided early on that our system should transform the sentence into an actual question, as this seemed more interesting and would add value to the work. Rule-based approaches modify and restructure sentences by transforming their words. These changes must follow pre-existing rules, which are usually handcrafted, such as transformations of subject-verb structures. An alternative is the use of templates, which are typically simpler models that generate questions by inserting a specific element from the original document into an almost ready-to-use question. However, this method requires many more hand-crafted templates than rules, making it less flexible. The use of Transformers follows the same workflow mentioned previously. They are considered state-of-the-art models and can easily be obtained from the Hugging Face website; however, the lower degree of control they offer is a major drawback. The approach generates a question for each selected answer, prompting the model with each answer during generation, with the aim of producing questions for all selected answers. As in pre-processing, we process the source text section by section, repeating the same procedure for each section. The implemented methods were based on rules and on Transformers. While we reused the previously adapted Transformer for this purpose, we fully implemented the rule-based method, except for the libraries used for linguistic analysis, such as finding clauses and named entities.
Mechanics: Transformer
In the early stages of the project, we conducted preliminary experimentation to explore potential resources for future work and test their feasibility. During this testing we used a Transformer model based on T5, fine-tuned on SQuAD v1.1 for AQG.
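A minimal sketch of this kind of answer-aware question generation is given below, using the publicly available T5 checkpoint from [20]. The "answer: ... context: ..." input format follows that model's card on Hugging Face; the generation parameters and the example context are our assumptions, not the paper's exact settings.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# T5 model fine-tuned on SQuAD for answer-aware question generation, see [20]
MODEL_ID = "mrm8488/t5-base-finetuned-question-generation-ap"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def generate_question(answer: str, context: str, max_length: int = 64) -> str:
    # The model expects the answer prepended to the context in this format
    prompt = f"answer: {answer}  context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

context = "Coimbra is a city in Portugal, known for its historic university."
print(generate_question("Coimbra", context))
# Expected: a question whose answer is "Coimbra", e.g. something like
# "question: What city in Portugal is known for its historic university?"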
Following that, we tested further models from a GitHub repository. The repository contains answer-aware transformers, which require answers in order to generate questions, such as QG, QA-QG and QG Prepend, along with an answer-agnostic (E2E) generator. These transformers are also based on the T5 model and trained on SQuAD v1.1. QA-QG additionally selects answers, while QG is dedicated to Question Generation only. The system's final version is based on [20], which was the more effective one. However, we encountered difficulties in isolating its Question Generation segment, and results deteriorated when it was combined with other Answer Selection methods. The repository additionally provides the answer-agnostic transformer (E2E). As an illustration, the answer-agnostic transformer (E2E) produced the question "What is the name of Portugal's city?"; when "Portugal" is instead provided as the answer, the correct question is obtained.
Distractor Selection
We employed a BART transformer already fine-tuned on the RACE dataset [21], following the recommendations of [22]. The method includes both the question and the answer in the input, limited to a maximum length of 1024 tokens, and can return multiple distractors. The transformer we used is based on [22], but the dataset used to train it contains both declarative sentences and interrogative examples. Since our questions were not in the same style as those in RACE, some of the generated distractors may be less convincing, and there is a lack of alternative available models for this task. To give an example, we requested distractors for the answer "Roger Taylor" and obtained the following (a code sketch of this step is given after the evaluation overview below):
• "John Deacon";
• "Queen";
• "Freddie Mercury";
• "Jack Deacon";
• "The British musical group".
To summarize, we have described the pipeline created to automatically generate MCQs for English. The pipeline is broken down into stages, namely Pre-processing, Answer Selection, Question Generation, and Distractor Selection, and for each one we discussed how it was developed and the methods used. We now provide a detailed account of the evaluation carried out to compare the different methods and the conclusions drawn from it. This evaluation involved both automatic evaluation and human-opinion evaluation, the two distinct types of evaluation we encountered while examining related work. We describe the process of both types of evaluation, the metrics employed, and the results obtained. During the development and final stages of the project, it was necessary to evaluate the effectiveness of all considered approaches. During development, the evaluation was mainly used to determine which approaches were most effective; in the last phase of the work, it was conducted to draw conclusions about the developed system's performance. Different answer selection methods and question generation models were analyzed through automatic evaluation, using metrics from the reviewed scientific literature (BLEU and ROUGE) and reference data such as SQuAD v1.1, in which each passage contains a list of answers, their locations in the paragraph, and a human-created question. We used human opinions for the final assessment. This evaluation not only provides a means to determine system performance but also takes into account end users, who are the primary beneficiaries of the system.
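Returning briefly to the Distractor Selection step described above, a minimal sketch is given below. The checkpoint identifier is a placeholder to be replaced with the BART model fine-tuned on RACE that the paper relies on (cf. [21], [22]); the input layout and generation parameters are likewise our assumptions, since the paper only states that both the question and the answer are included in the input up to 1024 tokens.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder identifier: substitute the actual BART checkpoint fine-tuned on
# RACE for distractor generation (cf. [21], [22]); this exact id is hypothetical.
MODEL_ID = "some-org/bart-race-distractor-generation"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def generate_distractors(question: str, answer: str, context: str, n: int = 3) -> list:
    # Both the question and the answer are included in the input together with the
    # context; the sequence is truncated to the model's 1024-token limit
    prompt = f"{question} </s> {answer} </s> {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(
        **inputs,
        num_beams=max(n, 4),
        num_return_sequences=n,   # several candidate distractors per question
        max_length=32,
    )
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]

print(generate_distractors(
    "Who was the drummer of Queen?", "Roger Taylor",
    "Queen were a British rock band formed in London in 1970.",
))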
For the human-opinion evaluation we: 1) collected data from forms distributed to project participants (IPN); 2) used Mindflow to gather information on the quality of the generated questions; and 3) tested which distractor selection method yielded more relevant results. Automated evaluation can quickly arrive at conclusions about the performance of methods, which can aid in selecting the most effective ones. Furthermore, automatic evaluation is reproducible and subjects all methods to identical evaluation metrics and tasks. To determine the best answer selection methods, we tested:
• Terms: unigrams, i.e. individual words;
• Bigrams: combinations of two words;
• Trigrams: combinations of three words;
• Named Entities: terms denoting, e.g., individuals, geographical locations, dates, etc.;
• Noun Chunks: a noun together with the words that depend on it;
• Clauses: parts of a sentence or whole sentences;
• T+NEs+NCs: the combination of terms, named entities, and noun chunks.
The dev set of SQuAD v1.1 was used as the reference; it includes passages with answerable questions and their answers. Figure 3 shows a portion of the dataset: a passage from a Wikipedia article, human-created questions about it, and the corresponding answers, for example "Mediterranean" or "a Mediterranean climate" for the question "What kind of climate does southern California maintain?". All the methods mentioned previously were tested with and without stop words to determine their impact on Answer Selection (Table 1 and Table 2). The candidate answers were then ranked using TF-IDF, and the metrics were calculated as the average of the values for each paragraph.
{"context":"Southern California contains a Mediterranean climate, with infrequent rain and many sunny days. Summers are hot and dry, while winters are a bit warm or mild and wet. Serious rain can occur unusually. In the summers, temperature ranges are 90-60’s while as winters are 70-50’s, usually all of Southern California have Mediterranean climate. But snow is very rare in the Southwest of the state, it occurs on the Southeast of the state.",
"qas":[
{"answers":[{"answer_start":31, "text":"Mediterranean"}, {"answer_start":29, "text":"a Mediterranean climate"}, {"answer_start":31, "text":"Mediterranean"}], "question":"What kind of climate does southern California maintain?", "id":"5705fc3a52bb89140068976a"},
{"answers":[{"answer_start":59, "text":"infrequent rain"}, {"answer_start":59, "text":"infrequent rain"}, {"answer_start":59, "text":"infrequent rain"}], "question":"Other than many sunny days, what characteristic is typical for the climate in souther California?", "id":"5705fc3a52bb89140068976b"},
{"answers":[{"answer_start":243, "text":"60’s"}, {"answer_start":243, "text":"60’s"}, {"answer_start":243, "text":"60’s"}], "question":"What is the low end of the temperature range in summer?", "id":"5705fc3a52bb89140068976c"},
{"answers":[{"answer_start":353, "text":"very rare"}, {"answer_start":353, "text":"very rare"}, {"answer_start":353, "text":"very rare"}], "question":"How frequent is snow in the Southwest of the state?", "id":"5705fc3a52bb89140068976d"},
{"answers":[{"answer_start":269, "text":"70"}, {"answer_start":269, "text":"70"}, {"answer_start":269, "text":"70"}], "question":"What is the high end of the temperature range in winter?", "id":"5705fc3a52bb89140068976e"}
]}
Figure 3.
Example from SQuAD: a passage (context) from a Wikipedia article, examples of questions about the passage, and the answers identified in the text (with their locations in the text).
• All: proportion of the candidate answers present in at least one of the answers for the corresponding passage;
• Top 10: proportion of the ten best-scored candidate answers (according to TF-IDF) present in at least one of the answers for the corresponding passage;
• Last Position (LP): position of the last candidate answer that appears in at least one of the answers for the corresponding passage;
• BLEU-1 (B-1), BLEU-2 (B-2), BLEU-3 (B-3) and BLEU-4 (B-4);
• ROUGE-L (R-L).
AQG work commonly employs metrics such as BLEU (1, 2, 3, and 4) and ROUGE-L, which compare the similarity of two pieces of text using concepts such as n-grams and longest common subsequences. These metrics were used in both the Answer Selection and Question Generation steps. To perform the evaluation with each metric, all candidates are compared with all references that belong to the same passage. The comparison of answer selection methods can be seen in Table 1.
Table 1 Comparison of Answer Selection Methods
In addition to these metrics, we determined the proportion of candidate answers present in at least one reference answer for the corresponding passage (All) and the position of the last answer common to the candidate and reference sets (LP). The scores for the various metrics and answer selection methods are presented in Table 1. The best value for each metric is shown in bold; the "Named Entities" method has the highest score for "All", the proportion of candidate answers found in the references of the corresponding passages. The evaluation was then repeated with stop words excluded from the selected and reference answers, to test whether their presence had an effect on the scores. The results were broadly similar, including for BLEU-3 (although the scores there were not particularly high). Despite the values being relatively close, "Noun Chunks" was the best-scored method in BLEU-4, while "Trigrams" scored lower. The comparison of answer selection methods with stop words excluded is shown in Table 2.
Table 2 Comparison of Answer Selection Methods (without stop words)
For both sets of tests, the methods that favor single words ("Terms" and "T+NEs+NCs") had the second and third highest scores in "All", while their "Top 10" score was the highest (not counting "Named Entities", which was second best when stop words were not considered). In these metrics, a candidate answer is considered present in the references if it is contained in one of them, which gives single words an advantage over longer expressions. "Clauses" may have been the most effective method for "LP" because it generates numerous candidates, but this also reduces the likelihood of each candidate being contained in a reference. The method that selects answer candidates with exactly three words ("Trigrams") scores higher in BLEU-3 and BLEU-4, as does "Clauses". However, the scores of these two methods vary slightly when stop words are removed, as most trigrams and clauses tend to contain noun chunks that include stop words, while the other methods do not. Considering both groups of tests, the "Named Entities" method was the most consistent in achieving the highest scores. As a result, named entities were used to select answers in the subsequent tests. We also found a GitHub repository with transformers capable of both answer-aware and answer-agnostic answer selection.
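To make the automatic evaluation reproducible, BLEU-1 to BLEU-4 and ROUGE-L can be computed with standard libraries, as in the sketch below. This is our illustration of how such scores can be obtained with nltk and rouge-score; the paper does not state which implementation it used, and the tokenization and smoothing choices here are assumptions.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def evaluate_candidate(candidate: str, references: list) -> dict:
    # Tokenize naively on whitespace; the candidate is compared against all
    # reference questions/answers that belong to the same passage
    cand_tokens = candidate.lower().split()
    ref_tokens = [r.lower().split() for r in references]
    smooth = SmoothingFunction().method1
    scores = {}
    for n in range(1, 5):   # BLEU-1 .. BLEU-4
        weights = tuple(1.0 / n for _ in range(n))
        scores[f"BLEU-{n}"] = sentence_bleu(ref_tokens, cand_tokens,
                                            weights=weights,
                                            smoothing_function=smooth)
    # ROUGE-L (longest common subsequence) against the best-matching reference
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores["ROUGE-L"] = max(scorer.score(ref, candidate)["rougeL"].fmeasure
                            for ref in references)
    return scores

print(evaluate_candidate(
    "What kind of climate does southern California have?",
    ["What kind of climate does southern California maintain?"],
))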
By isolating the part of the repository's transformers responsible for answer selection, we could combine it with the rule-based question generation to achieve optimal results. The comparison of the transformers used as answer selection methods is documented in Table 3.
Table 3 Comparison of Answer Selection Methods (transformers)
QG and QA-QG are the answer-aware transformers that handle Question Generation and Question Answering. They differ in how the answer is indicated within the context, with QG Prepend prepending the answer to the given context. As the table shows, QG and QA-QG exhibit comparable outcomes for this task, with little distinction between them. For "All" and "Top 10" they score much higher than the previously tested methods, although with a smaller gap, and their results are more consistent across all metrics, with and without stop words. The highlight format produced better outcomes, but its handling of answer selection was more difficult to isolate from the pipeline than the prepend format. Although QG Prepend did not perform as well as the other method, it still delivered better results in most metrics than the earlier tested methods, so we decided to use this transformer for answer selection in the subsequent experiments.
Automatic Evaluation of Question Generation Methods
We then compared question generation methods. For this, we used not only the answer-aware techniques that had already been evaluated for selecting answers (QG, QA-QG and QG Prepend) but also an answer-agnostic transformer (E2E). These tests were conducted independently. The answer-aware transformer that uses the prepend format and is designed for question generation was also included because, as discussed previously, it was easier to isolate the answer selection part of QG Prepend. The BLEU and ROUGE outcomes are presented in Table 4. We were surprised to find that QG Prepend scored better than the other two answer-aware transformers from the same repository (QG and QA-QG) for all metrics except B-4. We also observed positive results for E2E, which scored slightly lower on almost all metrics. The transformer from [20], with answers selected by QG Prepend, also produced competitive scores across the metrics.
Table 4 Comparison of Question Generation Methods
The rule-based approach, applied to the selected named entities and noun chunks, had the worst values in this evaluation. This was not a surprise, given that transformers are considered state of the art. However, the reference dataset also matters: SQuAD is not comprehensive, so good questions can receive low scores simply because no similar questions exist in the dataset. The transformers also have the advantage of having been trained on SQuAD itself, the dataset used here as the reference. We confirmed once again that using named entities rather than noun chunks was a viable option, as it was the higher-scoring of the two rule-based variants.
3. Conclusions
The focus of this work was on ways to address the challenge of creating multiple-choice questions automatically. The ultimate aim was to create a system that combines various approaches and techniques to generate different types of questions. A pipeline was chosen to implement the system, and it proved to be a good solution for integrating diverse approaches to answer selection, question generation, and distractor selection.
Some approaches were less successful, particularly the rule-based one. Although the rule-based approach was expected to fall behind the Transformers, it allowed for greater control over the question-generation process and provided a baseline for them. Moreover, as an approach developed largely from scratch, with limited assistance from linguistic analysis libraries (such as those used for finding clauses and named entities), it was interesting to implement. On the Answer Selection task, named entities and the expressions chosen by the transformer performed comparably, and both options were satisfactory; however, in a more comprehensive evaluation, the methods with answers chosen by the transformer were slightly better. With respect to distractors, we could create them from both the source text and external sources. There are other areas that could be improved in future work. This includes non-machine-learning approaches to question generation, where we can improve the rules or explore methods that were not tested previously, such as SRL. Regarding distractors, we still need to be able to generate distractors that vary in their level of incorrectness, with some being more incorrect than others. Additionally, our approach did not include any means of validating and ranking the generated questions, which could enhance the quality of the questions provided to the user. Our study compared and combined various NLP techniques used for question generation, leading to an approach that, while still challenging, can be refined through experimentation. The results indicate that the integration of different approaches has produced successful outcomes for the AQG task, in the form of a pipeline that can perform each of the three sub-steps described above. The development of such systems could yield significant benefits in the future. By enhancing existing methods and considering more complex questions, such a system has the potential to reduce the time spent on creating tests and questionnaires and become an additional tool for education and training.
4. References
1. A. Belchikov, "Automated Testing Systems in Education: A Review," International Journal of Emerging Technologies in Learning (iJET), vol. 10, no. 1, pp. 4-12, 2015.
2. B. Tang and W. Lu, "Application of Neural Networks in Education Assessment," in 2019 International Conference on Education Technology and Social Science (ICETSS), Chengdu, China, 2019, pp. 251-254.
3. S. Liu, J. Zhang, and Y. Xie, "An Automated Knowledge Assessment System Based on Deep Learning," in 2020 IEEE 2nd Conference on Multimedia Information Processing and Retrieval (MIPR), Beijing, China, 2020, pp. 214-218.
4. J. Zhang, Y. Cui, and H. Wang, "An Automated Assessment System for Students' Learning Outcome Based on Deep Learning," in 2021 IEEE 13th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, China, 2021, pp. 98-103.
5. M. A. M. Hashim and S. S. Sabirin, "A Neural Network-based Adaptive Assessment System for Mathematics Learning," in 2019 6th International Conference on Research and Innovation in Information Systems (ICRIIS), Kuala Lumpur, Malaysia, 2019, pp. 1-6.
6. A. Pal, A. Saha, and S. Chakraborty, "An Efficient Automated Assessment System for Students using Neural Network," in 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2018, pp. 190-196.
7. N. Thangaraj and M. A. T.
Ramalakshmi, "Neural Network based Assessment System for Evaluating Programming Skill of Students," International Journal of Computer Applications, vol. 174, no. 13, pp. 7-12, 2017.
8. S. P. R. Jha, "An Intelligent Automated Assessment System using Neural Networks," in 2017 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), Chennai, India, 2017, pp. 1-5.
9. R. Zafar, M. A. Majeed, and M. Yaqoob, "Automated Assessment of Student Knowledge in E-learning Systems using Neural Networks," in 2020 IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS), Bangkok, Thailand, 2020, pp. 35-40.
10. X. Wang, J. Gao, and X. Zhang, "Development of an Automated Assessment System for Chemistry Experiments based on Deep Learning," in 2021 13th International Conference on Education Technology and Computers (ICETC), Osaka, Japan, 2021, pp. 58-62.
11. M. C. Martinez-Torres, R. A. Rodriguez-Diaz, and L. C. Castro-Beltran, "A Review of the Use of Neural Networks in Education," IEEE Latin America Transactions, vol. 17, no. 9, pp. 1411-1416, 2019.
12. Y. Liu and Y. Dong, "Automated Assessment System for Students' Learning Outcomes based on Deep Learning," in 2020 IEEE 20th International Conference on Advanced Learning Technologies (ICALT), Tartu, Estonia, 2020, pp. 195-199.
13. S. S. Sabirin and M. A. M. Hashim, "Neural Network-based Assessment System for Learning English as a Second Language," in 2018 IEEE 5th International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), Songkhla, Thailand, 2018, pp. 1-6.
14. A. H. A. H. A. Ghaffar, M. A. R. Abro, and H. A. M. Jamali, "Automated Assessment System for Students' Writing Skill using Artificial Neural Network," in 2021 IEEE 15th International Conference on Innovations in Information Technology (IIT), Abu Dhabi, United Arab Emirates, 2021, pp. 1-6.
15. M. A. Majeed, R. Zafar, and S. B. Qaisar, "An Automated System for Assessing Students' Knowledge in Physics using Neural Networks," in 2021 IEEE 18th International Conference on Smart Communities: Improving Quality of Life using ICT, Lahore, Pakistan, 2021, pp. 104-108.
16. X. Liu, Y. Chen, and Z. Gao, "Design and Implementation of a Knowledge Assessment System for Distance Education based on Neural Network," in 2019 IEEE 5th International Conference on Computer and Communications (ICCC), Chengdu, China, 2019, pp. 10-14.
17. K. Wu and Y. Zhang, "Automated Assessment of Learning Outcomes based on Deep Learning," in 2020 IEEE 20th International Conference on Advanced Learning Technologies (ICALT), Tartu, Estonia, 2020, pp. 5-9.
18. J. Han, J. Wu, and Y. He, "Development of a Knowledge Assessment System for Distance Education based on Deep Learning," in 2021 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Nanchang, China, 2021, pp. 154-158.
19. T. V. Sitkar, "Implementation of an intelligent information testing system with an open-form test task," Scientific Journal of the M. P. Drahomanov National Pedagogical University, Series 5, Pedagogical Sciences: Realities and Prospects, vol. 28, Kyiv, 2011, pp. 231-237 (in Ukrainian).
20. M. Romero, "T5 (base) fine-tuned on SQuAD for QG via AP," https://huggingface.co/mrm8488/t5-base-finetuned-question-generation-ap, 2021.
21. G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, "RACE: Large-scale Reading Comprehension Dataset from Examinations," arXiv preprint arXiv:1704.04683, 2017.
22.
H.-L. Chung, Y.-H. Chan, and Y.-C. Fan, "A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies," arXiv preprint arXiv:2010.05384, 2020.