Resolving duplicates in Large Multiple-Choice Questions Repositories
                                Valentina Albano¹, Donatella Firmani², Luigi Laura³, Jerin George Mathew²,
                                Anna Lucia Paoletti¹ and Irene Torrente⁴
                                ¹ Dip. Funzione Pubblica, Corso Vittorio Emanuele II, 116, 00186 Rome, Italy
                                ² Sapienza University, Piazzale Aldo Moro, 5, 00185 Rome, Italy
                                ³ Uninettuno University, Corso Vittorio Emanuele II, 39, 00186 Rome, Italy
                                ⁴ Formez, Viale Marx, 15, 00137 Rome, Italy


                                                                         Abstract
                                                                         Multiple-choice questions (MCQs) are commonly used in educational assessments and professional
                                                                         certification examinations. However, managing vast collections of MCQs presents numerous challenges,
                                                                         including maintaining their quality and relevance. A notable issue in such repositories is the occurrence
                                                                         of conceptually identical questions presented in varied forms. These duplicates, while different in
                                                                         wording, fail to enhance the value of the repository. In this extended abstract, we present our approach
                                                                         for identifying and handling potential duplicate questions in large MCQ databases. Our proposed method
                                                                         involves three primary stages: initial pre-processing of MCQs, calculation of similarity based on Natural
                                                                         Language Processing (NLP) techniques, and a graph-based method for exploring these similarities.

                                                                         Keywords
                                                                         multiple-choice questions, entity resolution, record linkage, graph communities


                                1. Introduction
                                Multiple-Choice Questions (MCQs) are widely utilized for knowledge assessment across various
                                domains, from university admissions and job evaluations to self-assessment and entertainment,
                                including popular game shows and mobile gaming apps. Large-scale standardized tests typically
                                feature MCQs with four response options: one correct answer and three distractors.
                                   Academic research primarily focuses on the effectiveness of MCQs as evaluation tools.
                                Azevedo, Oliveira, and Damas Beites’ study [1] is exemplary in exploring methods for fair
                                student assessments through MCQ analysis. Learning Analytics, as defined in [2], involves the
                                comprehensive measurement and analysis of learner data to optimize educational environments.
                                Furthermore, the laborious nature of manual MCQ creation has led researchers to develop
                                automatic generation techniques. These range from using resources like WordNet and
                                shallow parsing to advanced methods involving ontologies and deep neural networks [3].


                                IRCDL 2024: 20th conference on Information and Research science Connecting to Digital and Library science (formerly the
                                Italian Research Conference on Digital Libraries), Bressanone, Brixen, Italy, 22-23 February 2024
                                V.Albano@governo.it (V. Albano); donatella.firmani@uniroma1.it (D. Firmani);
                                luigi.laura@uninettunouniversity.net (L. Laura); mathew@diag.uniroma1.it (J. G. Mathew);
                                A.Paoletti@funzionepubblica.it (A. L. Paoletti); itorrente@formez.it (I. Torrente)
                                © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


   Unlike previous research, our approach focuses on the maintenance of large Multiple-Choice
Questions (MCQs) repositories from a data quality perspective. Specifically, we focus on the
identification of conceptually similar questions within these repositories. Our method can
be used by MCQs database administrators to identify overly similar questions and to verify
question coherence with syllabus areas and coverage. Our intuition is rooted in the belief that
redundant or less coherent questions can accumulate in MCQs repositories over time (e.g.,
upon addition and merging operations), which can dilute their quality and effectiveness.
   The primary feature of our method is the utilization of recent Natural Language Processing
(NLP) techniques in an entirely unsupervised environment. These advanced NLP methods,
based on the popular Transformer architecture, are capable of identifying a substantial number
of relevant matches. However, they also carry the risk of generating false positives, where
questions might be inaccurately classified as similar. To mitigate this issue of false positives,
our approach incorporates an innovative graph exploration technique. This technique focuses
on identifying candidate matches from questions that are part of densely connected graph
communities. By doing so, our method not only minimizes the likelihood of false positives but
can also uncover the underlying structure of the MCQs repository.
Outline. Section 2 describes the main components of our approach and a case study from a
large-scale training project by the Italian Government. Section 3 summarizes our experimental
results in the main application scenarios. Section 4 describes related works and Section 5
presents conclusive remarks. For more detailed insights, please refer to the recent journal paper
[4] and to the conference paper in [5] presenting an earlier version of our approach.


2. Overview
In this section, we outline our proposed workflow for managing large Multiple Choice Questions
(MCQs) repositories, which includes the following four key steps:

   1. Similarity Computation: Compute similarity scores between all pairs of questions in
      an unsupervised manner, without the need for a labeled dataset.
   2. Threshold Definition: Establish a similarity threshold, 𝜎, to preliminarily distinguish
      between similar and dissimilar questions. This threshold is adjustable, allowing interactive
      user selection.
   3. Graph Construction: Construct a graph, 𝐺𝜎 , where nodes represent questions, and
      edges connect questions with similarity scores above the threshold, 𝜎.
   4. Graph Exploration: Employ graph visual analysis, focusing on graph communities to
      identify clusters of potentially similar questions.
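
Steps 2 and 3 of the workflow can be illustrated with a minimal sketch in plain Python. It assumes that the pairwise similarity scores are already available as a symmetric matrix; the function names `build_threshold_graph` and `connected_components`, the question identifiers, and the scores are hypothetical and chosen only for illustration. Connected components are used here as the simplest possible grouping, not as the community detection step described later.

```python
from itertools import combinations

def build_threshold_graph(ids, sim, sigma):
    """Return adjacency sets for G_sigma: an edge joins two questions
    whose similarity score is above the threshold sigma."""
    graph = {q: set() for q in ids}
    for i, j in combinations(range(len(ids)), 2):
        if sim[i][j] > sigma:
            graph[ids[i]].add(ids[j])
            graph[ids[j]].add(ids[i])
    return graph

def connected_components(graph):
    """Group nodes that can reach one another (iterative depth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components

# Made-up similarity scores for four hypothetical questions.
ids = ["q1", "q2", "q3", "q4"]
sim = [
    [1.00, 0.92, 0.75, 0.10],
    [0.92, 1.00, 0.40, 0.05],
    [0.75, 0.40, 1.00, 0.20],
    [0.10, 0.05, 0.20, 1.00],
]
graph = build_threshold_graph(ids, sim, sigma=0.7)
# Keep only groups with at least two questions as candidate duplicates.
groups = [c for c in connected_components(graph) if len(c) > 1]
```

Raising or lowering `sigma` changes which edges survive, which is why an interactively adjustable threshold is convenient: a lower value yields larger, noisier groups, while a higher value keeps only near-identical questions.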

Case study. We demonstrate our approach using the MCQs database from the Competenze
Digitali program.1 The program aims to provide non-IT public employees with personalized
e-learning training in basic digital skills. Key elements include (i) a Syllabus outlining minimum
digital skills required for public employees, (ii) an online platform for skill gap analysis, training
course definition, and delivery, and (iii) a catalog of quality training developed in collaboration
    1 Database access is restricted, but the new program is publicly available at https://www.syllabus.gov.it.
     Question: What is the definition of “cookies”?
     A. Small files that store information about online browsing on the user’s computer or device.
     B. Small files that track and collect user’s online behaviors for targeted advertising purposes.
     C. Tokens generated by web applications to authenticate user sessions and enable personalized
     features.
     D. Privacy settings that allow users to control the information shared with websites they visit.
Figure 1: Sample question and answers inspired by a similar question in the MCQ Dataset.


[Figure 2 panels: (a) heatmap of pairwise text similarity (color scale from 0.1 to 1) over the questions bchain#1, bchain#2, netsec#1, netsec#2, email#1, email#2, db#1, db#2, ai#1 and ai#2; (b) graph drawing.]

Figure 2: (Left) Similarity matrix produced by ST-MPNet. We indicate for each question its corresponding
topic, for instance ai#1 and ai#2 are two sentences related to AI. (Right) The largest CC in 𝐺0.7
(ST-MPNet) with node colors highlighting the community structure.


with major public and private entities. The MCQ dataset of the Competenze Digitali program
comprises 798 Italian language questions, each presenting four candidate answers, of which
only one is correct. Every question corresponds to a particular syllabus, which groups together
questions related to the same topic (e.g. computer networks). In total, there are 11 distinct
syllabi in the dataset. We provide a sample question in Figure 1.


3. Applications
3.1. Identifying redundant questions
We describe a method for identifying overly similar questions in the Competenze Digitali
MCQs database. In our experiment, we tested four multilingual models from the Sentence
Transformers library [6], specifically fine-tuned with Italian data: ST-Roberta,
ST-DistilUSE, ST-MiniLM, and ST-MPNet. We assessed their performance on a subset of
our MCQ dataset, which included manually selected question pairs related to the same syllabus
topic, and identified ST-MPNet as the best-performing model. The resulting similarity matrix
from ST-MPNet, illustrated in Figure 2a, displays a pattern of darker 2 × 2 squares along the
diagonal, indicating that each question is matched most strongly with the other question on its topic.
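
The kind of check behind such a similarity matrix can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of the paper: it assumes that each question has already been mapped to an embedding vector and that the similarity score is the cosine of the angle between two vectors, a common choice for sentence embeddings. The three-dimensional vectors below are made up; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """All-pairs cosine similarity as a list of rows."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

def best_match(matrix, i):
    """Index of the item most similar to item i, excluding i itself."""
    others = (j for j in range(len(matrix)) if j != i)
    return max(others, key=lambda j: matrix[i][j])

# Hypothetical toy embeddings, two questions per topic.
embeddings = {
    "ai#1": [0.9, 0.1, 0.0],
    "ai#2": [0.8, 0.2, 0.1],
    "db#1": [0.1, 0.9, 0.2],
    "db#2": [0.0, 0.8, 0.3],
}
labels = list(embeddings)
matrix = similarity_matrix(list(embeddings.values()))
# For each question, the label of its most similar other question.
pairs = {labels[i]: labels[best_match(matrix, i)] for i in range(len(labels))}
```

When every question's best match is its same-topic partner, the matrix shows the block pattern along the diagonal described above; counting how often this holds is one simple way to compare candidate models on manually selected pairs.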
                                    Question       Syllabus area       Similarity
                                    Q-1.1.1.1-1         S1.1            0.415891
                                    Q-1.1.1.1-1         S1.2            0.147890
                                    Q-1.1.1.1-1         S1.3            0.254660
                                    Q-1.1.1.1-1         S2.1            0.158570
                                    Q-1.1.1.1-1         S2.2            0.202694
                                        ...              ...               ...
                                    Q-5.2.3.5-8         S3.2            0.041329
                                    Q-5.2.3.5-8         S4.1            0.096101
                                    Q-5.2.3.5-8         S4.2            0.073620
                                    Q-5.2.3.5-8         S5.1            0.106045
                                    Q-5.2.3.5-8         S5.2            0.175479
Figure 3: The similarity values between the questions and the areas of the Syllabus.


   A critical consideration in our approach is whether to compute similarity using only the
question text, the question with the correct answer, or the question with all answer options.
Our analysis suggests that including all answer options provides the most accurate results,
particularly for questions with identical wording, such as those in our repository that simply
ask, “Which of the following statements is false?”.
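
The three input variants can be made concrete with a small sketch. The `MCQ` class, the `text_for_similarity` function, and the two example questions are hypothetical; the sketch only assumes that the text passed to the similarity model is a plain concatenation of the selected parts.

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: list[str]   # candidate answers, in display order
    correct: int         # index of the correct option

def text_for_similarity(mcq, mode="all"):
    """Build the text to embed: question only, question plus the
    correct answer, or question plus all answer options."""
    if mode == "question":
        parts = [mcq.question]
    elif mode == "correct":
        parts = [mcq.question, mcq.options[mcq.correct]]
    elif mode == "all":
        parts = [mcq.question, *mcq.options]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return " ".join(parts)

# Two made-up questions sharing the same generic wording.
stem = "Which of the following statements is false?"
q1 = MCQ(stem, [
    "A firewall filters network traffic.",
    "Phishing is a kind of hardware failure.",
    "Backups help recover lost data.",
    "Strong passwords reduce the risk of intrusion.",
], correct=1)
q2 = MCQ(stem, [
    "A spreadsheet organizes data in rows and columns.",
    "A browser is used to visit websites.",
    "An email address cannot contain the @ symbol.",
    "A PDF file can contain text and images.",
], correct=2)

same_stem = text_for_similarity(q1, "question") == text_for_similarity(q2, "question")
same_full = text_for_similarity(q1, "all") == text_for_similarity(q2, "all")
```

With the question text alone, the two items are indistinguishable and would receive the maximum similarity score even though they test unrelated topics; including the answer options gives the model the content it needs to separate them.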
   When analyzing the similarity values, we found a spectrum ranging from completely unrelated
questions (similarity 0) to those that were suspiciously similar (similarity values between 0.98
and 0.99). To address the intermediate range of similarities, we implemented a graph-based
method by setting a threshold 𝜎 = 0.7. We examined the graph 𝐺0.7, consisting of 718 nodes and
3,836 edges, and applied the Clauset-Newman-Moore algorithm [7] for community detection.2
The communities identified, as depicted in Figure 2b, are often granular enough to enable
manual inspection and identification of similar question groups. For cases requiring additional
analysis, node-centrality methods such as [8, 9] can be employed to enhance visual exploration.
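
The idea behind greedy modularity-based community detection can be sketched in a few lines of plain Python. The version below is a naive illustration of the agglomerative principle that the Clauset-Newman-Moore algorithm implements efficiently: it starts with one community per node and repeatedly merges the pair of communities with the largest positive modularity gain. It recomputes all gains at every step and omits the data structures that make the original algorithm scale, so it is only suitable for small graphs; the function name and the example graph are hypothetical.

```python
def greedy_modularity(graph):
    """graph: dict mapping each node to the set of its neighbours
    (undirected, no self-loops). Returns a list of node sets."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # number of edges
    if m == 0:
        return [{node} for node in graph]
    communities = {i: {node} for i, node in enumerate(graph)}
    while True:
        member_of = {n: c for c, nodes in communities.items() for n in nodes}
        # Total degree of each community.
        degree = {c: sum(len(graph[n]) for n in nodes)
                  for c, nodes in communities.items()}
        # Number of edges running between each pair of communities.
        between = {}
        for node, nbrs in graph.items():
            for nb in nbrs:
                a, b = member_of[node], member_of[nb]
                if a < b:  # counts each cross-community edge once
                    between[(a, b)] = between.get((a, b), 0) + 1
        # Modularity gain of merging a and b: e_ab/m - d_a*d_b/(2*m^2).
        best_gain, best_pair = 0.0, None
        for (a, b), e_ab in between.items():
            gain = e_ab / m - degree[a] * degree[b] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge improves modularity
            return list(communities.values())
        a, b = best_pair
        communities[a] |= communities.pop(b)

# Two triangles joined by a single bridge edge (c-d).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
result = sorted(sorted(c) for c in greedy_modularity(graph))
```

Although all six nodes form a single connected component, the two triangles are returned as separate communities, which illustrates why communities can be more granular than connected components. In practice a library implementation is preferable; for example, NetworkX provides `greedy_modularity_communities`, though the source does not state which implementation was used.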

3.2. Verifying syllabus coherence
Using the previously described similarity techniques and a structured syllabus, we can efficiently
assess the alignment of questions with specific syllabus areas. For instance, in the Competenze
Digitali program, the syllabus comprises five main areas and 11 sub-areas. Each question in our
database is uniquely identified by the area and sub-area numbers it corresponds to.
   This structure allows us to analyze the similarity between each question and the sub-areas
of the syllabus, including its designated one. In our dataset we have 798 questions and 11 sub-areas, so we
computed 798 · 11 = 8778 similarity values. Figure 3 reports the head and the tail of these
values; the first rows are the similarity scores of the first question in the dataset against the first
sub-areas of the Syllabus. Among the scores shown, the largest one for this question is exactly that of
the sub-area to which it belongs, i.e., Q-1.1.1.1-1 belongs to S1.1. The
same happens for the last question, i.e., Q-5.2.3.5-8 belongs to S5.2. About half of the questions
in our dataset showed the highest similarity with their respective syllabus sub-areas.
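
The coherence check can be sketched as follows, using only the rows reported in Figure 3 (five of the 11 sub-area scores for each of the two questions). The sketch assumes that the first two numbers of a question identifier encode its area and sub-area, which is consistent with the two examples in the text but is not stated as a general rule; the function names are hypothetical.

```python
def designated_subarea(question_id):
    """Assumption: in an ID such as 'Q-1.1.1.1-1', the first two numbers
    after 'Q-' are the area and the sub-area."""
    area, subarea = question_id.split("-")[1].split(".")[:2]
    return f"S{area}.{subarea}"

def best_subarea(scores):
    """Sub-area with the highest similarity score."""
    return max(scores, key=scores.get)

def incoherent_questions(similarities):
    """Questions whose best-scoring sub-area is not the designated one."""
    return [q for q, scores in similarities.items()
            if best_subarea(scores) != designated_subarea(q)]

# Only the rows shown in Figure 3; the remaining sub-areas are omitted there.
similarities = {
    "Q-1.1.1.1-1": {"S1.1": 0.415891, "S1.2": 0.147890, "S1.3": 0.254660,
                    "S2.1": 0.158570, "S2.2": 0.202694},
    "Q-5.2.3.5-8": {"S3.2": 0.041329, "S4.1": 0.096101, "S4.2": 0.073620,
                    "S5.1": 0.106045, "S5.2": 0.175479},
}
flagged = incoherent_questions(similarities)
coherence_rate = 1 - len(flagged) / len(similarities)
```

On the full 798 × 11 table, the flagged list would contain the questions whose highest score falls on a different sub-area than the designated one, which are natural candidates for manual review.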

    2 The results of a baseline approach based on connected components instead of communities are available in [5].
   This outcome highlights the effectiveness of our approach in ensuring that questions are
relevant and aligned with specific curriculum areas. By examining these similarity scores,
educators and administrators of the MCQs repository can identify syllabus areas needing more
focus or refinement and detect potential redundancies in their question sets.


4. Related Works
The exploration of duplicate question detection has been investigated in the Q&A domain,
aiming to efficiently answer queries by linking to similar, previously answered questions in
Q&A forums. Notable recent works include those of Li et al. [10], who focused on medical Q&A
platforms, and Kamienski et al. [11], who concentrated on game development forums. Both
studies developed deep-learning systems for recognizing similar questions. Li et al. trained a
Long Short-Term Memory (LSTM) neural network on pairs of questions, aligning semantically
similar queries within the LSTM’s vector space. Kamienski et al. combined large pre-trained
deep learning models with supervised techniques, using features from models like MPNet to
train a supervised model that predicts similarity scores between sentence pairs.
   In the field of learning analytics, there has been limited investigation into identifying duplicate
questions. A notable study is presented in [12], which developed a machine learning-based
system for managing large question paper databases. It trains an XGBoost model [13] on
manually selected features like structural attributes and word embeddings, using labeled duplicate
question pairs from Quora to identify semantically similar English sentence pairs.
   Unlike these methods, our study leverages pre-trained large language models in an unsupervised
manner, without requiring ground-truth labels. Additionally, our approach incorporates a
graph construction phase to facilitate the identification of duplicates.
Other works. The general problem of identifying duplicate records in databases is known
in the literature as Entity Resolution. Data management applications typically use supervised
methods [14, 15] and external knowledge bases [16] to learn vector representations of records
and attributes. In this paper, we focus instead on unsupervised similarity computation and leave
the final decision to a human expert, in the spirit of oracle-based approaches such as [17, 18].


5. Conclusive remarks
We describe our approach for the maintenance of large Multiple-Choice Questions (MCQs)
repositories, based on our experience with a large-scale training project by the Italian Government.
Our method employs Natural Language Processing (NLP) techniques, particularly
Transformer architectures, to semantically detect similar pairs of questions in MCQs repositories,
aiming to go beyond traditional word co-occurrence analysis while acknowledging the potential
for false positives. To address these false positives, we incorporate a graph exploration strategy
that focuses on community structures within the similarity graph, thus enhancing the precision
of similarity assessments by analyzing relationships within the same community.
   Future work includes exploring algorithms for efficiently merging multiple MCQ repositories
and developing automated tools for aligning questions upon syllabus updates, thereby reducing
the manual effort required for repository maintenance.
Acknowledgements
The authors have been partially supported by SEED PNR Project “FLOWER” “Frontiers in
Linking records: knOWledge graphs, Explainability and tempoRal data” and Sapienza Research
Project B83C22007180001 “Trustworthy Technologies for Augmenting Knowledge Graphs”.


References
 [1] J. M. Azevedo, E. P. Oliveira, P. Damas Beites, Using learning analytics to evaluate the
     quality of multiple-choice questions: A perspective with classical test theory and item
     response theory, The International Journal of Information and Learning Technology 36
     (2019) 322–341. doi:10.1108/IJILT-02-2019-0023.
 [2] P. Long, G. Siemens, G. Conole, D. Gasevic (Eds.), Proceedings of the 1st International
     Conference on Learning Analytics and Knowledge, LAK 2011, Banff, AB, Canada, February
     27 - March 01, 2011, ACM, 2011.
 [3] R. Mitkov, H. Le An, N. Karamanis, A computer-aided environment for generating
     multiple-choice test items, Natural Language Engineering 12 (2006) 177–194.
     doi:10.1017/S1351324906004177.
 [4] V. Albano, D. Firmani, L. Laura, J. G. Mathew, A. L. Paoletti, I. Torrente, NLP-based
     management of large multiple-choice test item repositories, Journal of Learning Analytics
     (2023) 1–16.
 [5] V. Albano, D. Firmani, L. Laura, A. L. Paoletti, I. Torrente, Managing large multiple-choice
     test item repositories, in: Proceedings of the 26th International Conference Information
     Visualisation (IV22), 2022. doi:10.1109/IV56949.2022.00054.
 [6] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks,
     in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,
     Association for Computational Linguistics, 2019. URL: http://arxiv.org/abs/1908.10084.
     doi:10.48550/arXiv.1908.10084.
 [7] A. Clauset, M. E. J. Newman, C. Moore, Finding community structure in very large
     networks, Physical Review E 70 (2004). doi:10.1103/physreve.70.066111.
 [8] G. Ausiello, D. Firmani, L. Laura, The (betweenness) centrality of critical nodes and
     network cores, in: 2013 9th International Wireless Communications and Mobile Computing
     Conference (IWCMC), IEEE, 2013, pp. 90–95. doi:10.1109/IWCMC.2013.6583540.
 [9] G. Ausiello, D. Firmani, L. Laura, Real-time monitoring of undirected networks: Articulation
     points, bridges, and connected and biconnected components, Networks 59 (2012)
     275–288. doi:10.1002/net.21450.
[10] Y. Li, L. Yao, N. Du, J. Gao, Q. Li, C. Meng, C. Zhang, W. Fan, Finding similar medical
     questions from question answering websites, 2018. doi:10.48550/arXiv.1810.05983.
     arXiv:1810.05983.
[11] A. Kamienski, A. Hindle, C.-P. Bezemer, Analyzing techniques for duplicate question
     detection on q&a websites for game developers, Empirical Software Engineering 28 (2023)
     17. doi:10.1007/s10664-022-10256-w.
[12] S. Mukherjee, N. S. Kumar, Duplicate question management and answer verification
     system, in: 2019 IEEE Tenth International Conference on Technology for Education (T4E),
     2019, pp. 266–267. doi:10.1109/T4E.2019.00067.
[13] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the
     22nd acm sigkdd international conference on knowledge discovery and data mining, 2016,
     pp. 785–794. doi:10.1145/2939672.2939785.
[14] U. Brunner, K. Stockinger, Entity matching with transformer architectures-a step forward
     in data integration, in: EDBT 2020, 2020. doi:10.5441/002/edbt.2020.58.
[15] M. Ebraheem, S. Thirumuruganathan, S. R. Joty, M. Ouzzani, N. Tang, Distributed representations
     of tuples for entity resolution, Proc. VLDB Endow. 11 (2018) 1454–1467.
     URL: http://www.vldb.org/pvldb/vol11/p1454-ebraheem.pdf. doi:10.14778/3236187.3236198.
[16] A. S. Andreou, D. Firmani, J. G. Mathew, M. Mecella, M. Pingos, Using knowledge graphs for
     record linkage: Challenges and opportunities, in: M. Ruiz, P. Soffer (Eds.), Advanced Information
     Systems Engineering Workshops - CAiSE 2023 International Workshops, Zaragoza,
     Spain, June 12-16, 2023, Proceedings, volume 482 of Lecture Notes in Business Information
     Processing, Springer, 2023, pp. 145–151. URL: https://doi.org/10.1007/978-3-031-34985-0_15.
     doi:10.1007/978-3-031-34985-0_15.
[17] D. Firmani, S. Galhotra, B. Saha, D. Srivastava, Robust entity resolution using a crowdoracle,
     IEEE Data Eng. Bull. 41 (2018) 91–103.
     URL: http://sites.computer.org/debull/A18june/p91.pdf.
[18] S. Galhotra, D. Firmani, B. Saha, D. Srivastava, Efficient and effective er with progressive
     blocking, The VLDB Journal 30 (2021) 537–557. doi:10.1007/s00778-021-00656-7.