Retrieval-Augmented Generation for Query Target Type Identification

Darío Garigliotti, University of Bergen, Norway

Abstract

The paradigm shift unleashed by Entity-Oriented Search still characterizes a vast space of the dynamics with which users engage in digital information access, from Web search to e-commerce and social networks. Research progress on Entity Retrieval tasks has, in particular, shown the benefit of incorporating type-based information about entities into methods that provide relevant answers to queries. Since types are typically accessible in a reference ontology within the knowledge base where their assigned entities live, automatically identifying query target types is a relevant problem to tackle. In this work, we address the task of Query Target Type Identification by assessing the capabilities of Large Language Models, which have recently shown widespread success. Our experimentation with methods based on Retrieval-Augmented Generation over a purposely built test collection from the literature challenges a well-established closed LLM by presenting it with entity type information from a core hub resource of Linked Open Data.

1. Introduction

The emergence and consolidation of Entity-Oriented Search (EOS) represent a milestone in the evolution of web search paradigms [1]. The space of services provided by search engines was subtly yet decisively upgraded by offering more direct responses to the user beyond the traditional "blue links." Indeed, the cognitive workload of finding the actual desired information within relevant documents is alleviated in several scenarios where Search Engine Results Pages (SERPs) provide focused replies, among others, via widgets (like the ones about weather information), entity cards (infoboxes summarizing factoids, for example, about an artist or a city) and direct answers. In a virtuous cycle, users demand this more often and in more scenarios by issuing more entity-centric queries; hence the paradigm is reinforced, and it is nowadays as expected as it is ubiquitous.

Interwoven with these industrial developments, the research focus on entities was largely established around studying a paradigmatic type of entity, people, in tasks such as expert finding [2]. It then broadened to all kinds of entities alongside the increasing efforts in the Semantic Web to build knowledge bases (KBs) for more and more domains of knowledge, where uniquely identified entities are first-class citizens described by properties and related to other entities. Addressing Entity Retrieval (that is, the problem of ranking relevant entities for an entity-centric query) has enabled a considerable body of research where methods often incorporate information about entities that is stored in KBs [3, 4]. Besides the properties and relationships just mentioned, a characteristic knowledge item for an entity is its set of semantic classes or types, typically assigned from an ontology or type taxonomy that exists in correspondence to the KB [5]. As has been studied, Entity Retrieval (ER) benefits from integrating type-based information about the entities expected for the query into the ER approach [6]. Hence, a related significant problem in Entity-Oriented Search is that of automatically identifying query target types, i.e. the types of all the entities expected to be relevant for answering the query [7]. We refer to this problem as Target Type Identification (TTI), and it is the task that we address in this paper.
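To make the task concrete, the following minimal sketch shows what a TTI instance looks like; the queries and their DBpedia ontology target types are illustrative examples of our own, not drawn from the test collection used later.

```python
# Minimal sketch of TTI instances: an entity-centric query is mapped to the
# DBpedia ontology types of the entities expected to answer it.
# These query/type pairs are illustrative, not taken from the actual test collection.

tti_examples = {
    "movies directed by christopher nolan": ["dbo:Film"],
    "universities in norway": ["dbo:University"],
    "who founded tesla motors": ["dbo:Person"],
}

for query, target_types in tti_examples.items():
    print(f"query: {query!r} -> target types: {target_types}")
```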
To the best of our knowledge, the state-of-the-art TTI approach over the most recently developed benchmark is a Learning-to-Rank (LtR) method, in which a combination of several kinds of features is manually selected and their respective importance is then learned in a supervised manner [8]. These features complement each other, as this particular LtR instance brings together query attributes, type attributes, and the scores of multiple rankings obtained with both traditional lexical matching and Language Models that capture similarity based on distributional semantics in latent space.

An attempt to improve upon its results and establish a new state of the art for TTI could exploit modern Large Language Models (LLMs) to replace the more primitive LMs previously used, or, going further, replace many or all of the previous features with a feature set made only of LLM-based signals. However, although this may indeed improve performance and, in a way, absorb several of the non-(L)LM features into a few modeled by representation (or deep) learning, the feature engineering of the underlying LtR method remains.

An alternative consists in re-approaching the TTI task by replacing the feature design altogether with a mechanism that directly uses a single, powerful LLM while integrating external knowledge that is expected to be poorly represented in, or not captured at all by, the billions of parameters of the LLM. Our work experiments with this alternative by means of a Retrieval-Augmented Generation (RAG) framework [9], which by design integrates explicit knowledge (here, types in the ontology of a widely known KB) with the parametric knowledge that has shown successful performance in a variety of tasks [10, 11, 12, 13]. In a series of experimental configurations for RAG-based methods, we assess the abilities exhibited by an established, commercial, closed LLM in identifying target types. Throughout these analyses, we thus bring a task such as TTI, grounded in structured entries from a key hub resource of Linked Open Data, into the vast space of information technology built around LLMs.

2. Query Target Type Identification

Given an entity-centric (or entity-oriented) query 𝑞, that is, a query to be answered with relevant entities, and a type taxonomy 𝑇, Hierarchical Target Type Identification (TTI, or Query TTI) [7, 8] is the problem of returning a ranking of all the main target types of the query, that is, the types such that (i) they are the most specific entity classes that are relevant to 𝑞, and (ii) pairwise, they do not lie on the same path from the root of the tree 𝑇.

Since this definition imposes a number of criteria on how a TTI method is to be assessed, we adopt a slightly different set of criteria during evaluation. First, we evaluate a TTI output with set-based rather than rank-based metrics, as we do not request the LLM to provide a ranking of the correct target types for a query. Also, although the LLM is given this definition during augmentation (i.e. the requirement of highest specificity within non-overlapping taxonomy branches), we do not allow for leniency when evaluating this hierarchical aspect; the criterion is binary, and hence we do not give any sort of discounted scoring to wrongly predicted types that are taxonomically close to the correct one(s).
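As an illustration of condition (ii) in the definition above, the following minimal sketch checks that a candidate set of target types contains no two types lying on the same root path. It assumes the taxonomy is given as a child-to-parent map rooted at owl:Thing; the small excerpt of DBpedia-style types is only for illustration.

```python
# Minimal sketch: check condition (ii) of the TTI definition, i.e. that no two
# candidate target types lie on the same path from the root of the taxonomy.
# The taxonomy is assumed to be a child -> parent map (tree rooted at owl:Thing).

PARENT = {
    "dbo:Agent": "owl:Thing",
    "dbo:Person": "dbo:Agent",
    "dbo:Artist": "dbo:Person",
    "dbo:Organisation": "dbo:Agent",
}

def root_path(type_id: str) -> list[str]:
    """Return the path from the root down to the given type."""
    path = [type_id]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))

def pairwise_valid(candidate_types: set[str]) -> bool:
    """True iff no candidate type is a strict ancestor of another candidate."""
    for t in candidate_types:
        ancestors = set(root_path(t)[:-1])  # strict ancestors of t
        if ancestors & candidate_types:
            return False
    return True

print(pairwise_valid({"dbo:Artist", "dbo:Organisation"}))  # True: different branches
print(pairwise_valid({"dbo:Person", "dbo:Artist"}))        # False: same root path
```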
2.1. Retrieval-Augmented Generation

We consider each of our parameter configurations as a TTI method based on Retrieval-Augmented Generation (RAG) [9]. Following established practices in the literature [14, 15], we instantiate the common naive RAG framework with its three distinctive stages, in a similar fashion as done previously for other tasks [13, 16].

The first component performs retrieval to obtain a ranked list of target types for each query. This crucial step selects the items of external information to be provided explicitly during the following stage, augmentation. Since in this work we are interested in assessing the ability of LLMs to integrate this explicit knowledge with their parametric knowledge, we assume two possible (near-)optimal retrievers. First, an oracle retriever takes the ground-truth ranked target types from the test collection (cf. Section 2.3) as a perfectly informed type ranking. Second, an extension of this oracle, which we refer to as pseudo-oracle, is obtained for each query by appending, after the oracle-given ranked types, all new types found in the union of the ontology type sets of the top-1,000 entities retrieved with BM25, a solid first-pass retriever [17].

As we describe below in Section 2.3, the relevance score of each query-type pair in the collection allows us to use the ground-truth target types of each query as an oracle ranking in the retrieval stage of our experimentation. This order is also the by-ranking oracle order of passages in the augmentation phase, and the order of the opening sub-sequence (corresponding to this oracle ranking) of the pseudo-oracle retrieval.
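The sketch below illustrates how the two retrievers could be assembled. The data and helper names (ORACLE, ENTITY_TYPES, bm25_top_entities) are toy stand-ins for, respectively, the ground-truth type ranking, the DBpedia ontology type assignments, and a first-pass BM25 entity retriever; they are assumptions for illustration, not our actual implementation.

```python
# Sketch of the oracle and pseudo-oracle retrievers described above.

ORACLE = {"universities in norway": ["dbo:University"]}  # toy ground truth

ENTITY_TYPES = {  # toy ontology type assignments of entities
    "dbr:University_of_Bergen": ["dbo:University", "dbo:EducationalInstitution"],
    "dbr:Bergen": ["dbo:City"],
}

def bm25_top_entities(query: str, k: int = 1000) -> list[str]:
    # Placeholder for a first-pass BM25 entity retriever over the KB.
    return list(ENTITY_TYPES)[:k]

def oracle_retrieval(query: str) -> list[str]:
    """Ground-truth target types of the query, ordered by relevance score."""
    return ORACLE.get(query, [])

def pseudo_oracle_retrieval(query: str, k: int = 1000) -> list[str]:
    """Oracle types first, then any new type of the top-k BM25-retrieved entities."""
    ranked = list(oracle_retrieval(query))
    seen = set(ranked)
    for entity in bm25_top_entities(query, k=k):
        for t in ENTITY_TYPES.get(entity, []):
            if t not in seen:
                ranked.append(t)
                seen.add(t)
    return ranked

print(pseudo_oracle_retrieval("universities in norway"))
# ['dbo:University', 'dbo:EducationalInstitution', 'dbo:City']
```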
In the second RAG stage, augmentation, we engineer a prompt that provides the retrieved target types for the LLM to assess. The prompt starts with a header that describes the task at hand, the format of the input data, and the expected output:

“You are an assistant for a question-answering task. You are provided with entity types (or classes) from DBpedia ontology version 2015-10, each type preceded with its corresponding identifier or type ID. Then, you are provided with a user query that has been issued to a search engine. This query is best answered by ranking entities that are relevant to the query. Use the types and the query that you are provided with, to ANSWER the QUESTION to the best of your ability. If you don’t know the answer, just say that you don’t know. Keep the answer concise. Always mention one or more corresponding type IDs (which must be among the given TYPES) in their correct format (this is, for a type with name X). Examples are given below, each example between the ’’ and ’’ tags. After that, you are given the query with types so that you answer the question.”

In the zero-shot prompting scenarios, the part about providing examples (from “as it’s done in each example” to “After that,”) is omitted from the prompt. Otherwise, the header above is followed by one or more examples in the same format as the main item of the task. The prompt ends with this main item, i.e. the actual set of retrieved types (provided in a particular order according to an experimental parameter), followed by the query, the main task question for the LLM, and an open field “Answer:” to be completed with the generated answer. Each type is given by its ID, a pseudo-URI (Uniform Resource Identifier) for the camel-cased type name X, together with its full type name as a valid expression in natural language. As an augmentation-phase parameter, we also experiment with providing a short description of the type, if available for its URI in DBpedia 2016-04 via the comment predicate. The task question asks the generator:

“Which one(s), if any, of the provided entity type(s) are the main target types of the query, this is, such that (i) they are the most specific category of entities that are relevant to the query, and (ii) they are not on the same path from the root in the tree induced by the DBpedia 2015-10 ontology?”

The generation phase completes the RAG pipeline. In all our experiments, we feed the prompt built during augmentation into GPT-3.5 (gpt-3.5-turbo-0125) [10].

2.2. Research Questions

The research questions that guide our experimental analysis in Section 3 are the following:

• RQ1: How does the LLM perform at assessing the target types given by an optimal or near-optimal retriever?
• RQ2: What is the impact of the order in which these retrieved types appear in the prompt?
• RQ3: Does adding a textual description to each type name help identify target types?
• RQ4: How much do the few-shot examples provided in the prompt contribute?
• RQ5: What performance differences are observed across query groups?

2.3. Experimental Setup

Test collection. The dataset used to evaluate TTI performance is based on DBpedia-Entity, an Entity Retrieval (ER) collection [4]. DBpedia-Entity contains 485 user queries aggregated from multiple ER benchmarks, for which the relevance of entities from DBpedia 2015-10 was judged by annotators. The TTI collection [8] builds on top of it by assigning human judgements of target types from the DBpedia 2015-10 ontology for 479 of the 485 DBpedia-Entity queries (the remaining 6 did not obtain annotated types). Specifically, 7 annotators judged types pooled by first-pass lexical retrievers. Each query then gets a set of relevant target types, where the relevance score of a type is the number of annotators who selected it; these scores add up to 7 across all the types of a query. The scores allow us to treat the TTI ground truth as a ranking.

Evaluation metrics. We measure the performance of a method on each query in terms of traditional set-based metrics used in Information Retrieval, specifically Precision, Recall, and their harmonic mean, the F-score. We report the average performance across a query set, for a few query sets of interest: the entire set of 479 queries in the TTI collection, and each of the four query groups in its partition (SemSearch-ES, INEX-LD, QALD2 and ListSearch).

Method configurations. The following is a summary of our experimental parameters:

• (Retrieval) method: oracle or pseudo-oracle retriever;
• (Augmentation) type information (only the name, or name and description); type order (as given by the ranking, or random); and few-shot learning (zero-shot or one-shot).
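For concreteness, the following is a minimal end-to-end sketch of the augmentation, generation, and evaluation steps described above. It assumes the OpenAI Python client configured via OPENAI_API_KEY; the prompt header and question are abbreviated with respect to the ones quoted in Section 2.1, and extracting type IDs from the free-text answer with a regular expression is a simplification of the actual output handling, so all names here are illustrative rather than our exact implementation.

```python
# End-to-end sketch: build the augmented prompt, generate with GPT-3.5,
# parse predicted type IDs, and score them with set-based metrics.
import re
from openai import OpenAI

client = OpenAI()

HEADER = (
    "You are an assistant for a question-answering task. You are provided with "
    "entity types (or classes) from the DBpedia ontology version 2015-10, each "
    "preceded by its type ID, and then with a user query issued to a search engine. "
    "Answer concisely, always mentioning one or more of the given type IDs."
)
QUESTION = (
    "Which one(s), if any, of the provided entity type(s) are the main target "
    "types of the query?"
)

def build_prompt(retrieved_types: list[str], query: str) -> str:
    type_lines = "\n".join(f"TYPE: {t}" for t in retrieved_types)
    return f"{HEADER}\n\n{type_lines}\n\nQUERY: {query}\n\n{QUESTION}\nAnswer:"

def generate_answer(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def predicted_types(answer: str) -> set[str]:
    # Keep any DBpedia-style type IDs mentioned in the generated answer.
    return set(re.findall(r"dbo:[A-Za-z]+", answer))

def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    # Set-based metrics with binary matching (no credit for taxonomically close types).
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    tp = len(predicted & gold)
    p, r = tp / len(predicted), tp / len(gold)
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f
```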
3. Experimental Results

Table 1 presents the experimental results for all the RAG-based TTI methods given by the parameter configurations described in the previous section.

RQ1: Assessing near-perfect type retrieval. We first observe a very high precision in most cases, and especially in the one-shot augmentation setting. Yet, these same results show that the LLM still falls short of the perfect retrievers, and that performance degrades further with pseudo-oracle retrieval. Recall, in turn, clearly lags behind.

RQ2: Impact of the order of retrieved types in the prompt. In the majority of method and metric combinations, the scenario with types ordered by ranking performs best, lending weight to observations in related work about artefacts whereby the LLM favours the typical by-ranking order used in RAG. However, for some metrics the best-performing method within an entire query group, and in particular for every metric when measured over all queries in the collection, performs best with a random order of types.

RQ3: Importance of the type description alongside the type name in the prompt. In most cases we observe that, for oracle-based methods, providing the type description is slightly detrimental, arguably introducing confusion rather than positively complementing the type name. Instead, for methods using the pseudo-oracle retriever, the type descriptions help improve performance, possibly acting as a correction for the misleading types.

RQ4: Contribution of examples in few-shot learning. As expected, one-shot learning in the prompt benefits almost all scenarios across methods and metrics, often substantially, and in several cases yields performance very close to perfect.

RQ5: Breakdown by query groups. INEX-LD (general keyword queries) is, overall and as expected, the most challenging group, while ListSearch (entity list queries) and SemSearch-ES (named entity queries), the groups that lend themselves most naturally to entity types, are the best-performing ones.

4. Conclusion and Future Work

In this work we have addressed Target Type Identification via LLM-powered RAG methods. In future work, we aim to experiment with (i) alternative evaluation criteria that take into account the hierarchical nature of TTI, (ii) sub-optimal retrievers, to assess performance in more realistic conditions, and (iii) incorporating more aspects of the graph-based nature of both the type system and the knowledge base that hosts the entities.

Table 1
Experimental results over all the queries in the test collection (first block), as well as over each of its query groups (second to fifth blocks). In each block, the best performance on a metric is shown in bold. Prec., Rec. and F-Sco. denote Precision, Recall and F-score.

Retriever | Type Information | Type Order | Zero-shot Prec. | Zero-shot Rec. | Zero-shot F-Sco. | One-shot Prec. | One-shot Rec. | One-shot F-Sco.

Query group: All queries in the test collection
Oracle    | Name               | By ranking | 0.9355 | 0.7022 | 0.7719 | 0.9833 | 0.7215 | 0.8001
Oracle    | Name               | Random     | 0.9436 | 0.7105 | 0.7811 | 0.9854 | 0.7273 | 0.8049
Oracle    | Name + Description | By ranking | 0.9315 | 0.6931 | 0.7637 | 0.9791 | 0.7213 | 0.7986
Oracle    | Name + Description | Random     | 0.9292 | 0.6920 | 0.7608 | 0.9854 | 0.7258 | 0.8036
Pseudo-O. | Name               | By ranking | 0.8697 | 0.6698 | 0.7225 | 0.8925 | 0.6666 | 0.7334
Pseudo-O. | Name               | Random     | 0.8500 | 0.6473 | 0.7030 | 0.8789 | 0.6567 | 0.7214
Pseudo-O. | Name + Description | By ranking | 0.8826 | 0.6913 | 0.7395 | 0.9134 | 0.6722 | 0.7432
Pseudo-O. | Name + Description | Random     | 0.8624 | 0.6692 | 0.7188 | 0.8982 | 0.6644 | 0.7324

Query group: SemSearch-ES
Oracle    | Name               | By ranking | 0.9766 | 0.7349 | 0.8073 | 0.9844 | 0.7238 | 0.8024
Oracle    | Name               | Random     | 0.9688 | 0.7346 | 0.8049 | 0.9844 | 0.7275 | 0.8052
Oracle    | Name + Description | By ranking | 0.9766 | 0.7171 | 0.7932 | 0.9844 | 0.7238 | 0.8024
Oracle    | Name + Description | Random     | 0.9531 | 0.7105 | 0.7815 | 0.9844 | 0.7171 | 0.7958
Pseudo-O. | Name               | By ranking | 0.8750 | 0.6604 | 0.7146 | 0.8945 | 0.6611 | 0.7297
Pseudo-O. | Name               | Random     | 0.8307 | 0.6292 | 0.6823 | 0.8516 | 0.6207 | 0.6878
Pseudo-O. | Name + Description | By ranking | 0.8698 | 0.6669 | 0.7161 | 0.8984 | 0.6507 | 0.7229
Pseudo-O. | Name + Description | Random     | 0.8646 | 0.6415 | 0.7029 | 0.9102 | 0.6624 | 0.7346

Query group: INEX-LD
Oracle    | Name               | By ranking | 0.9192 | 0.6524 | 0.7327 | 0.9697 | 0.6574 | 0.7492
Oracle    | Name               | Random     | 0.9091 | 0.6397 | 0.7226 | 0.9697 | 0.6709 | 0.7589
Oracle    | Name + Description | By ranking | 0.8956 | 0.6423 | 0.7172 | 0.9495 | 0.6549 | 0.7418
Oracle    | Name + Description | Random     | 0.8939 | 0.6153 | 0.6993 | 0.9697 | 0.6684 | 0.7582
Pseudo-O. | Name               | By ranking | 0.8367 | 0.6120 | 0.6690 | 0.8081 | 0.5648 | 0.6330
Pseudo-O. | Name               | Random     | 0.8013 | 0.5640 | 0.6300 | 0.8182 | 0.5682 | 0.6380
Pseudo-O. | Name + Description | By ranking | 0.8418 | 0.6279 | 0.6795 | 0.8535 | 0.5850 | 0.6613
Pseudo-O. | Name + Description | Random     | 0.8148 | 0.5901 | 0.6495 | 0.8283 | 0.5581 | 0.6347

Query group: QALD2
Oracle    | Name               | By ranking | 0.9065 | 0.7243 | 0.7772 | 0.9855 | 0.7732 | 0.8365
Oracle    | Name               | Random     | 0.9275 | 0.7448 | 0.7995 | 0.9928 | 0.7804 | 0.8437
Oracle    | Name + Description | By ranking | 0.9094 | 0.7134 | 0.7708 | 0.9928 | 0.7804 | 0.8437
Oracle    | Name + Description | Random     | 0.9246 | 0.7454 | 0.7912 | 0.9928 | 0.7829 | 0.8459
Pseudo-O. | Name               | By ranking | 0.9203 | 0.7388 | 0.7911 | 0.9565 | 0.7581 | 0.8179
Pseudo-O. | Name               | Random     | 0.9239 | 0.7370 | 0.7882 | 0.9384 | 0.7472 | 0.8034
Pseudo-O. | Name + Description | By ranking | 0.9529 | 0.7714 | 0.8215 | 0.9638 | 0.7605 | 0.8215
Pseudo-O. | Name + Description | Random     | 0.9143 | 0.7629 | 0.7988 | 0.9529 | 0.7514 | 0.8099

Query group: ListSearch
Oracle    | Name               | By ranking | 0.9386 | 0.6820 | 0.7599 | 0.9912 | 0.7120 | 0.7977
Oracle    | Name               | Random     | 0.9649 | 0.7032 | 0.7830 | 0.9912 | 0.7120 | 0.7977
Oracle    | Name + Description | By ranking | 0.9386 | 0.6857 | 0.7626 | 0.9825 | 0.7047 | 0.7892
Oracle    | Name + Description | Random     | 0.9386 | 0.6732 | 0.7541 | 0.9912 | 0.7164 | 0.8006
Pseudo-O. | Name               | By ranking | 0.8311 | 0.6469 | 0.6950 | 0.8860 | 0.6506 | 0.7225
Pseudo-O. | Name               | Random     | 0.8246 | 0.6316 | 0.6865 | 0.8904 | 0.6645 | 0.7325
Pseudo-O. | Name + Description | By ranking | 0.8472 | 0.6769 | 0.7187 | 0.9211 | 0.6652 | 0.7424
Pseudo-O. | Name + Description | Random     | 0.8385 | 0.6557 | 0.7000 | 0.8794 | 0.6535 | 0.7211

Acknowledgments

This work was funded by the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI.

References

[1] K. Balog, Entity-Oriented Search, volume 39 of The Information Retrieval Series, Springer, 2018.
[2] K. Balog, L. Azzopardi, M. de Rijke, A language modeling framework for expert finding, Information Processing & Management 45 (2009) 1–19.
[3] K. Balog, M. Bron, M. de Rijke, Query modeling for entity search based on terms, categories, and examples, ACM Trans. Inf. Syst. 29 (2011) 1–31.
[4] K. Balog, R. Neumayer, A test collection for entity search in DBpedia, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’13, 2013, pp. 737–740.
[5] D. Garigliotti, K. Balog, On type-aware entity retrieval, in: Proceedings of the 2017 ACM International Conference on Theory of Information Retrieval, ICTIR ’17, 2017, pp. 27–34.
[6] D. Garigliotti, F. Hasibi, K. Balog, Identifying and exploiting target entity type information for ad hoc entity retrieval, Information Retrieval Journal 22 (2019) 285–323.
[7] K. Balog, R. Neumayer, Hierarchical target type identification for entity-oriented queries, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, 2012, pp. 2391–2394.
[8] D. Garigliotti, F. Hasibi, K. Balog, Target type identification for entity-bearing queries, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, 2017, pp. 845–848.
[9] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 9459–9474.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, 2019.
[11] H. Touvron, et al., Llama 2: Open Foundation and Fine-Tuned Chat Models, ArXiv abs/2307.09288 (2023).
[12] Q. Si, T. Wang, Z. Lin, X. Zhang, Y. Cao, W. Wang, An empirical study of instruction-tuning large language models in Chinese, in: Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 4086–4107.
[13] D. Garigliotti, SDG target detection in environmental reports using retrieval-augmented generation with LLMs, in: Proceedings of ClimateNLP, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 241–250.
[14] T. Gao, H. Yen, J. Yu, D. Chen, Enabling large language models to generate text with citations, in: H. Bouamor, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 6465–6488.
[15] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[16] D. Garigliotti, B. Johansen, J. V. Kallestad, S.-E. Cho, C. Ferri, EquinorQA: Large Language Models for Question Answering over proprietary data, in: ECAI 2024 - 27th European Conference on Artificial Intelligence - Including 13th Conference on Prestigious Applications of Intelligent Systems (PAIS 2024), IOS Press, 2024.
[17] S. Robertson, The probability ranking principle in IR, Journal of Documentation 33 (1977) 294–304.