<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">GECKO: A Question Answering System for Official Statistics</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lucas</forename><surname>Lageweg</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Statistics Netherlands</orgName>
								<address>
									<addrLine>Henri Faasdreef 312</addrLine>
									<settlement>Den Haag</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<addrLine>Science Park 900</addrLine>
									<postCode>1098 XH</postCode>
									<settlement>Amsterdam</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jonas</forename><surname>Kouwenhoven</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Benno</forename><surname>Kruit</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Vrije Universiteit Amsterdam</orgName>
								<address>
									<addrLine>De Boelelaan 1105</addrLine>
									<settlement>Amsterdam</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">GECKO: A Question Answering System for Official Statistics</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0DC3FB185DE3AFD56F83501EBC04AB28</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents GECKO, a knowledge graph-based statistical question answering system currently in beta deployment. GECKO aims to facilitate the retrieval of single statistical values from an extensive database containing over a billion values across more than 4,000 tables. The system integrates a comprehensive framework including data augmentation, entity retrieval, and large language model (LLM)-based query generation. A key feature of the beta deployment is the collection of user feedback, which is critical for improving system performance and accuracy. This feedback mechanism allows users to report issues directly, ensuring continuous improvement based on real-world use.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Statistics Netherlands (Centraal Bureau voor de Statistiek; CBS) is an independent administrative body of the Dutch government tasked with the creation of statistics over a broad spectrum of social topics and the responsibility to make them accessible to the general public. However, in-house studies have shown that users struggle to find the correct tables for their needs in the vast amount of data available. This research aims to develop a Question Answering (QA) system to provide specific statistical observations from this data as responses to natural-language user questions.</p><p>QA systems can take several forms, with most recently free-form generative Large Language Models (LLMs) like ChatGPT and GPT4 <ref type="bibr" target="#b0">[1]</ref> getting much attention. Due to the nature of these models, they are able to generalize very well on a large range of topics, but have shown to be prone to 'hallucinations', where plausible but incorrect or even nonsensical answers are generated <ref type="bibr" target="#b1">[2]</ref>. Especially for official data like governmental statistics, this is highly undesirable behavior.</p><p>Knowledge Graph Question Answering (KGQA) is a field where knowledge graphs (KGs) containing real-world facts and relations in structured form are used as a basis for QA systems. Answers of such systems should always adhere to the KG. Therefore, assuming it contains correct information, answering by returning parts of the KG, or reasoning over it, cannot lead to nonsensical answers. In this paper, we introduce an end-to-end pipeline for a generation-based KGQA system of CBS data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SEMANTICS 2024, Demo Track</head><p>Envelope l.lageweg@cbs.nl (L. Lageweg); jonaskouwenhoven@live.nl (J. Kouwenhoven); b.b.kruit@vu.nl (B. Kruit) Our approach introduces a data augmentation process for enhancing model training, explores various encoder architectures for entity retrieval, and proposes a new query generator mechanism enhanced by Low Rank Adaptation (LoRA) <ref type="bibr" target="#b2">[3]</ref>. Additionally, we propose a new prompting technique that utilizes dynamic prompts, constructing specific prompts based on the generation phase. These improvements help the process of generating symbolic expressions for querying a KG, thereby enhancing the overall performance of the QA system.</p><p>This paper details the beta deployment of GECKO and its feedback collection mechanism, emphasizing the role of user input in refining the system. The beta phase is critical for identifying and addressing potential issues, ultimately enhancing the system's robustness and reliability.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Query generation systems, particularly those involving text-to-SQL and KGQA, have made significant strides <ref type="bibr" target="#b3">[4]</ref>. Recent work <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref> focus on grounding queries in knowledge graphs to avoid hallucinations. Recent advancements <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref> highlight the use of LLMs in generating logical forms for querying databases. Data augmentation techniques <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>, are essential for creating diverse and realistic training datasets. Entity retrieval methods including sparse and dense retrieval approaches <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>, play a crucial role in identifying relevant data within vast datasets.</p><p>Compared to existing KBQA or text-to-SQL systems, we provide a hybrid solution where statistical tabular data can be represented as knowledge graphs, to which the techniques for symbolic expression generation instead of more complex query language generation (SQL or SPARQL) can be applied. With this approach, we propose a novel system that can help find relevant information in official statistics and similar systems, which is vital for governmental decision making and all fields of research utilising and relying on these statistics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">System Design</head><p>In processing a question, GECKO performs four core steps: entity retrieval, filters retrieval, constrained S-expression decoding (i.e. symbolic expression generation) and observation validation. We restrict the querying space by performing entity retrieval based on the input question to determine the closest KB nodes. This is done through either sparse retrieval using BM25+ <ref type="bibr" target="#b13">[14]</ref> as a baseline method, using a trained dual encoder <ref type="bibr" target="#b11">[12]</ref> or a finetuned ColBERT model <ref type="bibr" target="#b12">[13]</ref>.</p><p>After obtaining the closest matching entities based on the query, we retrieve all possible filters for tables by exploding a subgraph using the entities found. The result of the subgraph exploding is a graph containing all table nodes and their related measures and dimensions having nodes intersecting with the retrieved entities from the previous step. The subgraph contains all relevant nodes to the query, connected to one or more tables.</p><p>The query and subgraph are used as input for the constrained S-expression decoding. The S-expressions are generated token-by-token such that, given the subgraph, admissible tokens are returned at every step. A rule-based baseline was created using the entity retrieval scores to greedily determine what token from the admissible tokens to select. The second method uses a transformer-based decoder-only seq2seq model and dynamic prompting.</p><p>When generating a token at a given timestep, the model evaluates the sequences in the list of admissible/constrained tokens and selects the sequence with the highest assigned score. For example, when 7425eng is given to the decoder as one of the admissible next tokens, but only a decomposition of sub-tokens can be embedded by the model (e.g. 
7425 followed by ##eng), the summed log probability for these subtokens will determine the total probability of selecting this identifier for generation.</p><p>The novelty in this constraining method is the introduction of dynamic prompting, which, instead of calculating the likelihood of a token sequence based on a static prompt (i.e. text input for the model), adjusts the prompts according to the generation phase. For example, when generating a table ID, the prompt is altered to only include the most relevant table IDs and their descriptions. Similarly, when measures are generated in the next phase, it retrieves the measures related to the previously generated table ID, and using those to construct a new prompt. This method applies to the different dimension groups as well. Figure <ref type="figure" target="#fig_2">2</ref> contains a schematic overview of the dynamic prompting technique.</p></div>
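The sub-token scoring step can be sketched as follows. Everything here is a toy stand-in: the log-probability table, the tokenizer, and the candidate identifiers are hypothetical illustrations, whereas a real decoder would condition each sub-token's probability on the full prompt and the previously generated tokens.

```python
import math

# Hypothetical sub-token log-probabilities, standing in for the scores a
# real decoder-only LM would assign; invented for illustration only.
SUBTOKEN_LOGPROBS = {
    "7425": math.log(0.6),
    "##eng": math.log(0.5),
    "8372": math.log(0.3),
    "##ned": math.log(0.4),
}

def tokenize(identifier):
    """Toy sub-word split: 4-character prefix, remainder as a '##' piece."""
    head, tail = identifier[:4], identifier[4:]
    return [head] + (["##" + tail] if tail else [])

def score_candidate(identifier):
    """Sum the sub-token log probs; the total scores the whole identifier.

    Unknown sub-tokens get a large negative log prob. In the real model
    each sub-token's probability is conditioned on the running prefix.
    """
    return sum(SUBTOKEN_LOGPROBS.get(t, math.log(1e-9))
               for t in tokenize(identifier))

def pick_admissible(candidates):
    """One constrained decoding step: only the admissible identifiers
    supplied by the subgraph are scored, and the best one is emitted."""
    return max(candidates, key=score_candidate)
```

Because log(0.6) + log(0.5) exceeds log(0.3) + log(0.4), `pick_admissible(["7425eng", "8372ned"])` emits the identifier 7425eng in this toy setting.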
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Model training &amp; beta deployment</head><p>For creating training data, we developed a method for manual data annotation. This method involves annotators writing queries that can be answered by a specific table cell. Annotators were instructed to write their questions both as full sentences and in a more casual style, aiming to simulate the formulation of questions posed by users in a search engine. The data obtained from this manual annotation process contains queries and their corresponding S-expression, resulting in 2300 annotated pairs. The annotated queries were distributed over random tables from the CBS datapool, and contained a strong class imbalance towards tables that were more easily annotated. This class imbalance and random distribution motivates extending this study with data augmentation. In this extension, annotated S-expressions and their associated queries are used to fine-tune a GPT-3.5 model through the OpenAI fine-tuning services. The query-expression pairs were transformed into prompts using the descriptions of the IDs for various measures, dimensions, and table <ref type="table">IDs</ref>. Training such a model reduces the need for additional manual annotation, while also significantly increasing the amount of annotated data.</p><p>The initial GECKO model (𝑣 1 ) and model containing the improvements discussed here (𝑣 2 ) were evaluated using a selected sample of this dataset. This was done by evaluation exact table matches of generated S-expressions (𝑣 1 0.35; 𝑣 2 0.63), F1-scores for selected dimensions in said S-expressions (𝑣 1 0.62; 𝑣 2 0.71) and by manually annotation answer relevancy, as an answer can be a non-exact match but still be relevant to the question (𝑣 1 0.38; 𝑣 2 0.71).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>The beta deployment of GECKO<ref type="foot" target="#foot_0">1</ref> , a generation-based KGQA system for CBS data, marks a significant milestone in improving user interaction with governmental statistics. This phase includes mechanisms for feedback collection, which will play a crucial role in refining and enhancing the system based on user input. The feedback gathered during the beta deployment will help identify and address potential issues, ensuring the system's robustness and reliability. This process is essential for developing a reliable QA system capable of providing accurate and relevant statistical observations in response to natural-language user questions.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>P</head><label></label><figDesc>Pe er ri io od ds s D Da ai ir ry y p pr ro od du uc ct ti io on n</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: (a) Overview of our pipeline from query to answer. Candidate table nodes (green) are retrieved for the query, after which the measure and dimension filter candidates are retrieved (blue), resulting in a complete subgraph for the table candidates. The subgraph is used as input for the constrained S-expression decoding by either the baseline method or trained model. (b) Example CBS table fragment (from 7425eng), showing one dimension (time periods) and two measures.</figDesc><graphic coords="2,90.13,218.09,229.16,165.31" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Schematic overview of dynamic promting for a decoder-only model architecture. If there are multiple tokens that can be generated, a custom prompt is given to the decoder with labels for all the possible options.</figDesc><graphic coords="4,110.13,84.18,375.00,194.47" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://gecko.cbs.nl</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<title level="m">GPT-4</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
		<respStmt>
			<orgName>OpenAI</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">How Language Model Hallucinations Can Snowball</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Press</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Merrill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.13534</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">LoRA: Lowrank adaptation of large language models</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wallis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Allen-Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2208.13629</idno>
		<title level="m">A survey on text-to-sql parsing: Concepts, methods, and future directions</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Three levels of generalization for question answering on knowledge bases</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vanni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sadler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Su</surname></persName>
		</author>
		<idno type="DOI">10.1145/3442381.3449992</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Web Conference 2021</title>
				<meeting>the Web Conference 2021</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Decaf: Joint decoding of answers and logical forms for question answering over knowledge bases</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xiang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.00063</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Evaluating large language models in semantic parsing for conversational question answering over knowledge graphs</title>
		<author>
			<persName><forename type="first">P</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Klettner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jokinen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Simperl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Matthes</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.01711</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Enhancing text-to-SQL capabilities of large language models: A study on prompt design strategies</title>
		<author>
			<persName><forename type="first">L</forename><surname>Nan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Radev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">InPars: Unsupervised dataset generation for information retrieval</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bonifacio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Abonizio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fadaee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<idno type="DOI">10.1145/3477495.3531863</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;22</title>
				<meeting>the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;22<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2387" to="2392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Jeronymo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bonifacio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Abonizio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fadaee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lotufo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zavrel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.01820</idno>
		<title level="m">Inpars-v2: Large language models as efficient dataset generators for information retrieval</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Recent trends in deep learning based open-domain textual question answering systems</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="94341" to="94356" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Dense passage retrieval for open-domain question answering</title>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Oguz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.550</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="6769" to="6781" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Colbert: Efficient and effective passage search via contextualized late interaction over bert</title>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval</title>
				<meeting>the 43rd International ACM SIGIR conference on research and development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="39" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Lower-bounding term frequency normalization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
		<idno type="DOI">10.1145/2063576.2063584</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM &apos;11</title>
				<meeting>the 20th ACM International Conference on Information and Knowledge Management, CIKM &apos;11<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="7" to="16" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
