<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Survey on Dataset Development Techniques for QA Systems ⋆</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Aicha</forename><surname>Aggoune</surname></persName>
							<email>aggoune.aicha@univ-guelma.dz</email>
							<affiliation key="aff0">
								<orgName type="department">Computer science department</orgName>
								<orgName type="institution">University 8th May</orgName>
								<address>
									<postCode>1945</postCode>
									<settlement>Guelma</settlement>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">LabSTIC Laboratory</orgName>
								<orgName type="institution">University 8th May</orgName>
								<address>
									<postCode>1945</postCode>
									<settlement>Guelma</settlement>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Survey on Dataset Development Techniques for QA Systems ⋆</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">93381075D3F5C54836B8EF10581C58D2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>QA systems</term>
					<term>Dataset development</term>
					<term>Metrics</term>
					<term>Techniques</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Question-answering (QA) systems are pivotal in natural language processing, driving advancements in conversational AI, virtual assistants, and automated knowledge retrieval. The quality and structure of datasets play a critical role in the performance, reliability, and adaptability of these systems. This paper presents a comprehensive review of dataset development techniques for QA systems. We classify these techniques into three categories: manual techniques, based on domain expertise and crowdsourcing; automatic techniques, divided into knowledge-based and machine learning-based methods; and innovative techniques based on data augmentation. We compare several important QA datasets according to different criteria, with a special focus on the evaluation metrics used to assess dataset quality. This study can guide practitioners in developing robust, high-quality datasets for future QA systems.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Natural language processing (NLP) has seen remarkable advancements in recent years, with question-answering (QA) systems emerging as one of the most impactful applications. QA systems, designed to retrieve precise answers from vast textual information, are now integral to technologies such as search engines, virtual assistants, and knowledge-based systems. The performance of these systems hinges not only on sophisticated algorithms and model architectures but also on the quality and relevance of the datasets used to train them. High-quality datasets provide the essential foundation for these models to understand complex language structures, reason over context, and accurately respond to user queries <ref type="bibr" target="#b0">[1]</ref>.</p><p>Developing robust datasets for QA is a complex and resource-intensive process. Key challenges in dataset development include ensuring data diversity and balancing language complexity. Various techniques have emerged to address these challenges, ranging from traditional manual annotation to innovative approaches based on data augmentation.</p><p>This paper aims to provide a comprehensive review of the techniques used in developing datasets for QA systems, focusing on their strengths, limitations, and areas of application. By systematically examining these methods, we seek to illuminate best practices and emerging trends in QA dataset development. Furthermore, this review addresses the importance of dataset validation and quality metrics, highlighting how they contribute to the reliability and effectiveness of QA systems. Ultimately, our goal is to guide researchers and practitioners in creating datasets that better serve the needs of future QA models, fostering continued innovation and performance improvements in the field.</p><p>The remainder of this paper is organized as follows: Section 2 introduces the theoretical foundations. Section 3 reviews techniques for dataset development. Section 4 presents a comparison between dataset structures. Section 5 describes important metrics for assessing datasets. Conclusions are drawn in the last section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Theoretical foundations</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Question-Answering systems</head><p>Question-answering (QA) systems offer an intuitive interface for querying vast stores of information across diverse data formats, including both structured and unstructured data in natural languages. These systems play a crucial role in transforming raw data into usable knowledge, enabling users to retrieve specific answers to questions rather than sifting through large documents or databases <ref type="bibr" target="#b1">[2]</ref>. QA systems are increasingly employed in applications ranging from customer support and virtual assistants to research and education, where they can quickly extract insights from sources such as documents, databases, and even multimedia content.</p><p>To operate effectively, QA systems need to handle the variability and complexity of natural language, requiring them to interpret nuanced questions and extract relevant answers accurately. This involves the integration of techniques from fields such as natural language processing (NLP), information retrieval (IR), and machine learning (ML). Additionally, QA systems must accommodate the inherent diversity in question formulations and adapt to different data types, including text documents, tables, knowledge graphs, and multimodal data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Closed-domain Question-Answering systems</head><p>Closed-domain Question-answering systems (CQA) are specialized to respond to queries within defined subject areas, such as sports, healthcare, education, or entertainment <ref type="bibr" target="#b2">[3]</ref>. These systems leverage domain-specific knowledge, often structured in detailed ontologies or databases, to streamline information retrieval and enhance accuracy in answering questions. The focus on a particular domain simplifies the task for natural language processing (NLP) models, as the system can utilize a well-defined vocabulary, set of concepts, and relationships unique to that domain. For example, in a medical QA system, structured knowledge about diseases, symptoms, and treatments can help the system precisely interpret and respond to health-related inquiries.</p><p>Unlike closed-domain systems, open-domain QA systems rely on vast, unstructured sources of information, such as large text corpora, encyclopedic databases (like Wikipedia), or even the internet itself, rather than predefined, domain-specific knowledge structures. This allows them to provide answers on diverse subjects, from historical events and scientific concepts to general trivia and current events.</p><p>Closed-domain QA systems are specifically tailored to operate in contexts where general-purpose, open-domain solutions may lack the required depth, precision, or contextual understanding <ref type="bibr" target="#b3">[4]</ref>. The development of high-quality datasets specifically tailored for QA systems is essential to training models that are reliable, accurate, and generalizable across domains. These datasets need to account for linguistic diversity, context sensitivity, and a wide range of question types, from simple fact-based queries to complex, reasoning-based questions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Techniques of dataset development for CQA systems</head><p>A variety of techniques have been developed to construct datasets for question-answering (QA) systems, each designed to address particular challenges in generating comprehensive and high-quality data for training and evaluation purposes. In this survey, we categorize these techniques into three main types: manual methods, automated methods, and innovative approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Manual methods</head><p>Manual methods refer to dataset creation techniques that rely on human effort for data collection, question generation, and answer annotation <ref type="bibr" target="#b4">[5]</ref>. These methods are highly valuable for ensuring data quality, relevance, and contextual accuracy, as they allow human annotators to apply their expertise and judgment in curating the dataset. However, manual methods are often labor-intensive, time-consuming, and costly, especially for large-scale datasets. Human annotators create question-answer pairs based on a given text or knowledge source. Annotators carefully read through documents, extract meaningful information, and formulate questions that can be answered directly from the content <ref type="bibr" target="#b5">[6]</ref>. Another method is based on crowdsourcing, which involves outsourcing the task of question and answer generation to a large pool of workers on platforms like Amazon Mechanical Turk or Figure Eight <ref type="bibr" target="#b6">[7]</ref>. This approach allows for rapid data collection from a diverse group of contributors.</p><p>In specialized fields, such as medicine, law, or finance, domain experts are employed to create or validate question-answer pairs. Their expertise ensures that the information is accurate, contextually relevant, and adheres to domain-specific standards.</p></div>
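A simple automated sanity check is often used alongside manual annotation of this kind. The sketch below is illustrative only (the passage and pairs are invented, and it assumes an extractive-style dataset where answers must be text spans): it keeps only the QA pairs whose answer occurs verbatim in the source passage.

```python
# Illustrative sketch (not from the paper): a minimal quality check for
# manually annotated QA pairs in an extractive-style dataset, keeping
# only pairs whose answer appears verbatim in the source passage.

def validate_qa_pairs(passage, qa_pairs):
    """Return the subset of QA pairs whose answer occurs in the passage."""
    valid = []
    for pair in qa_pairs:
        answer = pair["answer"].strip()
        if answer and answer.lower() in passage.lower():
            valid.append(pair)
    return valid

passage = "Guelma is a city in northeastern Algeria, known for its Roman ruins."
pairs = [
    {"question": "Where is Guelma located?", "answer": "northeastern Algeria"},
    {"question": "What is Guelma known for?", "answer": "its beaches"},  # not in text
]
print(validate_qa_pairs(passage, pairs))  # keeps only the first pair
```

Checks like this catch annotation slips early, before they propagate into training data.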
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Automated methods</head><p>These methods significantly reduce the time and cost required to produce vast amounts of question-answer pairs, making it possible to construct datasets for training and evaluating models on a large scale. Automatic techniques for creating question-answering (QA) datasets can be broadly divided into two main classes: knowledge-based methods and machine learning-based methods.</p><p>Knowledge-based methods rely on structured information sources, such as ontologies, knowledge graphs, and databases, to automatically generate question-answer pairs <ref type="bibr" target="#b7">[8]</ref>. These methods use predefined rules, templates, and structured data to produce questions and identify corresponding answers.</p><p>Machine learning-based methods, especially those using natural language processing (NLP) and deep learning, have transformed QA dataset creation by automating the generation of complex, context-rich question-answer pairs <ref type="bibr" target="#b8">[9]</ref>. These methods use trained models to generate or extract questions and answers from unstructured text, offering greater flexibility and adaptability <ref type="bibr" target="#b9">[10]</ref>.</p><p>More advanced automated approaches involve using machine learning models, particularly large pre-trained language models (e.g., GPT-3, BERT, T5), to generate question-answer pairs synthetically <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>. These models are trained on extensive text corpora, enabling them to produce realistic and contextually varied questions based on input content.</p></div>
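The knowledge-based class described above can be sketched in a few lines. The relation names and templates below are invented for illustration and are not taken from the cited systems: each (subject, relation, object) fact with a known template yields one question-answer pair, with the object serving as the gold answer.

```python
# Sketch of template-based QA generation from structured facts.
# The relations and templates are toy assumptions, not the cited systems.
TEMPLATES = {
    "has_capital": "What is the capital of {subject}?",
    "directed_by": "Who directed {subject}?",
}

def generate_qa(facts):
    """Yield one (question, answer) pair per fact with a known template."""
    for subject, relation, obj in facts:
        template = TEMPLATES.get(relation)
        if template:
            yield template.format(subject=subject), obj

facts = [
    ("Algeria", "has_capital", "Algiers"),
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Algiers", "located_in", "Algeria"),  # no template: skipped
]
for question, answer in generate_qa(facts):
    print(question, "->", answer)
```

Real systems of this kind draw facts from a knowledge graph or database and use far larger template banks, but the pipeline shape is the same.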
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Innovative approaches</head><p>In recent years, data augmentation techniques have gained traction as a way to enhance and diversify QA datasets without the need for entirely new data sources. These techniques manipulate existing question-answer pairs to create new, varied versions, expanding the dataset and exposing models to a wider range of language patterns, contexts, and question types <ref type="bibr" target="#b12">[13]</ref>. Data augmentation approaches are particularly useful for improving model generalization and robustness, helping QA systems perform better in real-world scenarios <ref type="bibr" target="#b13">[14]</ref>.</p><p>Data augmentation techniques like synonym substitution, paraphrasing, and entity replacement are used to increase dataset size and diversity automatically <ref type="bibr" target="#b14">[15]</ref>. By modifying existing question-answer pairs, these methods create variations that expose models to different phrasings and vocabulary without needing new data sources.</p></div>
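Two of the augmentation methods mentioned above, synonym substitution and entity replacement, can be sketched as follows. This is a toy illustration: the synonym and entity tables are assumptions, and real pipelines would typically use a lexical resource such as WordNet and a named entity recognizer instead of fixed dictionaries.

```python
# Toy sketch of two augmentation methods: synonym substitution and
# entity replacement. The word tables below are illustrative assumptions.
SYNONYMS = {"movie": "film", "actor": "performer"}
ENTITIES = {"Paris": "London"}

def synonym_substitution(question):
    # Replace each whitespace-separated token that has a known synonym.
    return " ".join(SYNONYMS.get(w, w) for w in question.split())

def entity_replacement(question, answer):
    # Swap a named entity consistently in both question and answer so
    # the augmented pair stays internally coherent.
    for old, new in ENTITIES.items():
        question = question.replace(old, new)
        answer = answer.replace(old, new)
    return question, answer

print(synonym_substitution("Which actor starred in the movie filmed in Paris?"))
print(entity_replacement("Which landmarks are in Paris?", "Paris has many landmarks"))
```

Note the design point the second function illustrates: an entity must be replaced in the question and the answer together, otherwise the augmented pair becomes inconsistent.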
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Comparison between dataset structures</head><p>When evaluating QA datasets, it is crucial to consider the structure of the dataset and the type of question-answer (Q&amp;A) pairs it contains. Different datasets follow various organizational structures based on their intended use.</p><p>Most existing QA datasets consist of pairs of questions and corresponding answers. For example, in SQuAD (the Stanford Question Answering Dataset), questions are based on a paragraph, and answers are specific spans of text from that paragraph <ref type="bibr" target="#b15">[16]</ref>. TriviaQA, similar to SQuAD, contains questions with answers that are directly extracted from documents or web pages <ref type="bibr" target="#b16">[17]</ref>. Natural Questions (NQ) contains questions whose answers are extracted from long documents.</p><p>Another innovative approach involves query generation from natural language questions. This structure focuses on generating queries that can be used to retrieve answers from a database, knowledge graph, or other structured data sources <ref type="bibr" target="#b17">[18]</ref>. This type of dataset emphasizes the process of converting a natural language question into a structured query, such as SQL, that can be executed on a database or other structured system. WikiSQL <ref type="bibr" target="#b1">[2]</ref> is a large-scale dataset for natural language to SQL query generation. It contains questions based on data tables from Wikipedia and includes SQL queries that extract answers from these tables.</p><p>More recent work focuses on the generation of MongoDB queries from natural language questions with the application of three data augmentation techniques: paraphrasing, back translation, and named entity substitution <ref type="bibr" target="#b18">[19]</ref>. 
An extended work aims to generate more complex queries with auto-validation of the augmented data <ref type="bibr" target="#b19">[20]</ref>.</p><p>Query generation-based datasets are a valuable tool for developing information retrieval systems that bridge the gap between natural language and structured data. By converting natural language questions into executable queries (e.g., SQL, SPARQL, MQL), these datasets enable systems to access and retrieve information from structured sources.</p><p>Table <ref type="table" target="#tab_0">1</ref> outlines the key criteria used to assess various datasets for question-answering (QA) systems. </p></div>
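The question-to-query structure described above can be illustrated with a minimal sketch. The patterns and schema below are hypothetical (not the WikiSQL or M2Q2 pipelines from the cited work): a natural language question is matched against a regular-expression template and converted into a MongoDB-style filter document.

```python
# Minimal sketch of natural language to structured query conversion.
# Patterns, collection, and field names are hypothetical examples.
import re

PATTERNS = [
    (re.compile(r"which movies were released in (\d{4})\??", re.I),
     lambda m: {"collection": "movies", "filter": {"year": int(m.group(1))}}),
    (re.compile(r"who directed (.+?)\??$", re.I),
     lambda m: {"collection": "movies", "filter": {"title": m.group(1)},
                "projection": {"director": 1}}),
]

def question_to_query(question):
    """Return a MongoDB-style query document, or None if no pattern matches."""
    for pattern, build in PATTERNS:
        match = pattern.fullmatch(question.strip())
        if match:
            return build(match)
    return None

print(question_to_query("Which movies were released in 1999?"))
```

Template banks like this are also how such datasets are often seeded before data augmentation expands the phrasings, as in the work cited above.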
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Metrics for Assessing Datasets</head><p>For datasets designed for generative QA, where the model must generate queries or answers in natural language, several metrics are used to evaluate the quality of the generated output.</p><p>BLEU is a widely recognized metric in the field of machine translation <ref type="bibr" target="#b22">[23]</ref>, while ROUGE is commonly used for evaluating text summarization and other natural language generation tasks <ref type="bibr" target="#b22">[23]</ref>. A higher score on these metrics indicates greater similarity to the reference and thus a more accurate generation.</p><formula xml:id="formula_0">BLEU = BP × exp(∑_{n=1}^{N} w_n · log P_n).<label>(1)</label></formula><p>Where:</p><p>• N is the maximum n-gram size (usually up to 4).</p><p>• P_n is the modified precision for n-grams.</p><p>• w_n is the weight assigned to each precision, usually set to 1/N.</p><p>• BP (brevity penalty) lowers the score of translations shorter than the reference.</p><p>ROUGE evaluates the n-gram overlap between the generated summary and one or more reference summaries <ref type="bibr" target="#b23">[24]</ref>. The ROUGE measure combines the following components:</p><formula xml:id="formula_1">ROUGE = ROUGE_N/m + ROUGE_L/m + ROUGE_S/m.<label>(2)</label></formula><p>Where:</p><formula xml:id="formula_2">ROUGE_N = (Number of overlapping n-grams) / (Total number of n-grams in the reference).<label>(3)</label></formula><formula xml:id="formula_3">ROUGE_L = ∑_{ref summaries}(longest common subsequence) / ∑_{ref summaries}(summary length).<label>(4)</label></formula><formula xml:id="formula_4">ROUGE_S = ∑_{ref summaries} ∑_{skip bigrams} count_match(skip bigram) / ∑_{ref summaries} ∑_{skip bigrams} count(skip bigram).<label>(5)</label></formula><p>METEOR (Metric for Evaluation of Translation with Explicit ORdering) <ref type="bibr" target="#b24">[25]</ref> evaluates text generation based on synonyms, stemming, and word order. It is more flexible than BLEU, as it rewards synonyms and paraphrased text. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.</p><p>The METEOR score is calculated as follows:</p><formula xml:id="formula_5">METEOR = F_mean × (1 − Penalty).<label>(6)</label></formula><p>where the weighted harmonic mean of precision and recall is:</p><formula xml:id="formula_6">F_mean = (10 · Precision · Recall) / (9 · Precision + Recall),<label>(7)</label></formula><formula xml:id="formula_7">Penalty = γ · (chunks / matches)^β,<label>(8)</label></formula><p>matches: total number of matched unigrams; chunks: number of contiguous groups of matches in the same order; γ and β: tunable parameters controlling the penalty's impact (default values are usually γ = 0.5 and β = 3.0).</p><p>Finally, a key indicator is how well a model performs on the dataset. Training loss and accuracy reflect how well the model learns from the dataset during training: a lower loss and higher accuracy indicate a model that fits the data well. A low training loss and high accuracy on tasks such as extractive QA or question answering over a knowledge base suggest that the dataset is well constructed and provides enough relevant information.</p></div>
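The BLEU definition in equation (1) can be computed directly. The following is a self-contained sketch using clipped n-gram precision, uniform weights w_n = 1/N, and the brevity penalty; it omits the smoothing used by production toolkits such as sacreBLEU, so it is for illustration rather than benchmarking.

```python
# Self-contained sketch of BLEU as in equation (1): clipped n-gram
# precisions combined geometrically, times a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped (modified) n-gram precision P_n.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(1, len(cand) - n + 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    # Brevity penalty BP: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Uniform weights w_n = 1/N, as in the formula above.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # identical: 1.0
```

Without smoothing, any candidate missing all 4-grams scores zero, which is why toolkits add smoothing for short sentences.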
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>This paper has surveyed various techniques for dataset creation and validation in the field of question-answering (QA) systems. These techniques are essential for advancing the effectiveness of QA systems across multiple domains and ensuring that they can handle a diverse set of questions and answer types. This survey offers valuable insights into the diversity of datasets available for training and evaluating QA systems. The datasets reviewed here span a wide range of domains, question types, and answer formats, each designed to address specific challenges in QA. While progress has been made in creating large-scale, diverse, and specialized datasets, challenges related to scalability, dataset quality, and domain generalization remain. As QA systems continue to evolve, the development of new datasets and evaluation metrics will play a crucial role in advancing the capabilities of these systems, allowing them to handle increasingly complex tasks in real-world applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Declaration on Generative AI</head><p>During the preparation of this work, the author used ChatGPT and Grammarly in order to check grammar and spelling and to paraphrase and reword text. After using these tools, the author reviewed and edited the content as needed and takes full responsibility for the publication's content.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Review of some popular datasets</figDesc><table><row><cell>Ref</cell><cell>Dataset</cell><cell>Source</cell><cell>Field</cell><cell>Methodology</cell><cell>Data size</cell></row><row><cell>[16]</cell><cell>SQuAD</cell><cell>Wikipedia</cell><cell>Diverse</cell><cell>Selection of articles, question generation, answer annotation</cell><cell>+100K</cell></row><row><cell>[21]</cell><cell>DBPal</cell><cell>Synthetic</cell><cell>Diverse</cell><cell>Generator, data augmentation, lemmatizer</cell><cell>3 million</cell></row><row><cell>[18]</cell><cell>NarratiQA</cell><cell>Books</cell><cell>Movies</cell><cell>Data collection, question generation</cell><cell>46,765</cell></row><row><cell>[22]</cell><cell>BabiMovie</cell><cell>Wikipedia</cell><cell>Movies</cell><cell>Data collection, data structuring, dialog generation, question formulation</cell><cell>10,000</cell></row><row><cell>[19]</cell><cell>M2Q2</cell><cell>Mflix</cell><cell>Movies</cell><cell>Creating templates, data augmentation, data revision</cell><cell>88,100</cell></row><row><cell>[20]</cell><cell>M2Q2+</cell><cell>Mflix</cell><cell>Movies</cell><cell>Creating templates, data augmentation, auto-validation</cell><cell>100K</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Text-to-sql generation for question answering on electronic medical records</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Reddy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Web Conference 2020</title>
				<meeting>The Web Conference 2020</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="350" to="361" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1709.00103</idno>
		<title level="m">Seq2sql: Generating structured queries from natural language using reinforcement learning</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.06983</idno>
		<title level="m">Rasat: Integrating relational structures into pretrained seq2seq model for text-to-sql</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Cobert: Covid-19 question answering system using bert</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Alzubi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Parwekar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gupta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Arabian journal for science and engineering</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="page" from="11003" to="11013" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Review and analysis of synthetic dataset generation methods and techniques for application in computer vision</title>
		<author>
			<persName><forename type="first">G</forename><surname>Paulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ivasic-Kos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial intelligence review</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="9221" to="9265" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Optimizing dataset creation: A general purpose data filtering system for training large language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Gu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Clotho-aqa: A crowdsourced dataset for audio question answering</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lipping</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sudarsanam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Drossos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Virtanen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2022 30th European Signal Processing Conference (EUSIPCO)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1140" to="1144" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Systematic review of question answering over knowledge bases</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pereira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Trifan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Lopes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Oliveira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IET Software</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="1" to="13" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Deep learning based active learning technique for data annotation and improve the overall performance of classification models</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">U</forename><surname>Amin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hussain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Seo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">228</biblScope>
			<biblScope unit="page">120391</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Transformer models used for text-based question answering systems</title>
		<author>
			<persName><forename type="first">K</forename><surname>Nassiri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Akhloufi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Intelligence</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="page" from="10602" to="10635" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">BERT model-based natural language to NoSQL query conversion using deep learning approach</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Hossen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Uddin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Arefin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Uddin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Advanced Computer Science and Applications</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Towards User-Friendly NoSQL: A Synthetic Dataset Approach and Large Language Models for Natural Language Query Translation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tola</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
		<respStmt>
			<orgName>Politecnico di Torino</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">An empirical survey of data augmentation for limited data learning in NLP</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="191" to="211" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Gotta: Generative few-shot question answering by prompt-based cloze data augmentation</title>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), SIAM</title>
				<meeting>the 2023 SIAM International Conference on Data Mining (SDM), SIAM</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="909" to="917" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Data augmentation techniques in natural language processing</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F A O</forename><surname>Pellicer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Ferreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H R</forename><surname>Costa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Soft Computing</title>
		<imprint>
			<biblScope unit="volume">132</biblScope>
			<biblScope unit="page">109803</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1806.03822</idno>
		<title level="m">Know what you don&apos;t know: Unanswerable questions for SQuAD</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Weld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1705.03551</idno>
		<title level="m">TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The NarrativeQA reading comprehension challenge</title>
		<author>
		<persName><forename type="first">T</forename><surname>Kočiský</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schwarz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Blunsom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Hermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Melis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grefenstette</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="317" to="328" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">M2Q2: A text-to-MQL dataset for movie QA systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Aggoune</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Mihoubi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th Mediterranean Conference on Pattern Recognition and Artificial Intelligence (MedPRAI)</title>
				<meeting>the 6th Mediterranean Conference on Pattern Recognition and Artificial Intelligence (MedPRAI)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="10" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Towards efficient dataset development: A case study of M2Q2+ in movie QA systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Aggoune</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Mihoubi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th Edition of the International Conference on Advanced Aspects of Software Engineering (ICAASE)</title>
				<meeting>the 6th Edition of the International Conference on Advanced Aspects of Software Engineering (ICAASE)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="15" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">DBPal: A fully pluggable NL2SQL training pipeline</title>
		<author>
			<persName><forename type="first">N</forename><surname>Weir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Utama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Galakatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Crotty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ilkhechi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ramaswamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bhushan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Geisler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hättasch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Eger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data</title>
				<meeting>the 2020 ACM SIGMOD International Conference on Management of Data</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2347" to="2361" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Querying NoSQL with deep learning to answer natural language questions</title>
		<author>
			<persName><forename type="first">S</forename><surname>Blank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wilhelm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-P</forename><surname>Zorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rettinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="9416" to="9421" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">BLEU: A method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 40th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text summarization branches out</title>
				<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</title>
				<meeting>the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
