<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Framework for Graph Database Query Generation Leveraging Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Volker Tresp</string-name>
          <email>tresp@dbs.ifi.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bailan He</string-name>
          <email>bailan.he@siemens.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yushan Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcel Hildebrandt</string-name>
          <email>marcel.hildebrandt@siemens.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zifeng Ding</string-name>
          <email>zifeng.ding@siemens.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaomengxi Han</string-name>
          <email>yaomengxi.han@siemens.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ludwig Maximilian University of Munich</institution>
          ,
          <addr-line>Geschwister-Scholl-Platz 1, 80539 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Siemens AG</institution>
          ,
          <addr-line>Otto-Hahn-Ring 6, 81739 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technical University of Munich</institution>
          ,
          <addr-line>Arcisstraße 21, 80333 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Large language models (LLMs) have attracted considerable attention in academia and industry due to their superior performance compared to classical machine learning models across various applications. In particular, prompt engineering and in-context learning enable LLMs to operate effectively in scenarios with minimal training data, where they demonstrate proficiency given only precise instructions or a few examples. The advanced reasoning abilities of LLMs have been instrumental in the development of intelligent assistants. These assistants often rely on accessing information from comprehensive databases such as knowledge graphs (KGs) through natural language. The process of converting natural language requests into query language to retrieve information from databases is known as query generation (QG). One challenge in QG is the evaluation of the LLMs' performance due to the absence of standardized evaluation frameworks and datasets. To tackle this challenge, we introduce an automated evaluation framework tailored for QG, featuring three key metrics: Gold Query Accuracy, Execution Accuracy, and Execution Rate. We focus on the exemplary use case of accessing a graph database in a supply chain management (SCM) setting via a natural language interface. Our results demonstrate the efficacy of our framework and metrics in accurately evaluating model performance for QG tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Query generation</kwd>
        <kwd>Knowledge graph</kwd>
        <kwd>Large language model</kwd>
        <kwd>Supply chain management</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) have recently gained prominence in natural language
processing, demonstrating exceptional proficiency across various tasks [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. Unlike conventional
machine learning models that rely heavily on labeled data, LLMs exhibit reasoning capabilities,
allowing them to perform diverse tasks without extensive labeled datasets. For instance,
Perez-Beltrachini et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] developed a semantic parser using LLMs with an RDF dataset. This system
effectively understands natural language questions within a user's dialogue and generates
corresponding SPARQL queries to address them. Given the capacity of LLMs to process and generate
vast amounts of text data, manual evaluation of LLMs becomes impractical, while automatic
frameworks ensure a swift and efficient assessment. Automated evaluation is also preferred due
to its ability to mitigate biases and inconsistencies introduced by human evaluators, thereby
enhancing reliability and reproducibility. As the utilization of LLMs expands across various
domains, scalable evaluation methods are imperative. Objective metrics such as perplexity [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
BLEU score [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and ROUGE score [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] offer quantifiable performance measures in general
scenarios. However, accurately evaluating the performance of LLMs in domain-specific settings
remains challenging. In many cases, anecdotal evidence and manual inspection have been used
as substitutes for more rigorous evaluation methods. Furthermore, in many industrial contexts,
the developers of artificial intelligence tools may not have the necessary domain expertise to
evaluate the performance of LLMs, leading to a need for an automatic and systematic approach
to evaluating LLMs in the absence of in-depth domain knowledge. In this paper, we focus on
an application in the context of supply chain management (SCM), where the lack of standard
evaluation datasets poses a significant challenge, complicating the comparison of different
LLMs.
      </p>
      <p>
        Supply chain management (SCM) involves continuously monitoring supply chains to ensure
their operability and proactively making them sufficiently resilient to withstand disruptive
events such as pandemics, natural disasters, or political and economic crises [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. With the
growing volume and complexity of available data, it is essential for supply chain managers to
efficiently manage, store, and retrieve the relevant information. This ensures that relevant data
can be accessed and analyzed in a timely manner to make well-informed strategic decisions,
identify critical risks, and accurately react to current events [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The management across the
whole supply chain necessitates extensive databases for comprehensive analysis. Knowledge
graphs (KGs) represent networks of real-world entities, encompassing objects, events, situations,
and concepts, elucidate the connections between them, and have emerged as invaluable tools
for offering an interconnected perspective on SCM data. For instance, within SCM, entities like
suppliers, smelters, and components can be represented as nodes in a KG, while relationships
between different types of nodes, such as "located in", serve as the edges within the KG. Several
studies have developed SCM-related KGs and applied reasoning techniques to improve supply
chain management [11, 12, 13]. For example, the CoyPu KG (https://coypu.org/ergebnisse/knowledge-graph) integrates macroeconomic data
and global crisis events to enhance transparency in SCM operations. These KGs are typically
processed using graph database management systems like Neo4j (https://neo4j.com/). Querying these systems,
e.g., with Cypher queries for Neo4j, assists in extracting relevant information for further
analysis. However, mastering the optimal querying of graph databases requires significant time
and effort from SCM professionals. Therefore, developing a natural language interface that uses
LLM-based approaches to interpret database schemas and translate requests into graph database
queries (e.g., SPARQL or Cypher) can significantly aid SCM professionals in understanding the
data and making informed decisions. This translation task is referred to as query generation
(QG).
      </p>
      <p>Despite the emergence of QG solutions [14, 15], their real-world performance on
proprietary domain data and schemas remains uncertain because no readily available
datasets exist to evaluate model performance. To address this challenge, we present a novel automated
evaluation framework (code available at https://github.com/4hebailanc/Automated-Evaluation-Framework-for-GraphDatabase-Query-Generation) tailored for practical applications, exemplified in the industrial use
case of SCM QG. Grounded on KGs, our framework first harnesses the power of GPT-3.5 [16]
to generate an evaluation dataset. This dataset is then used to objectively measure model
performance, enabling users to modify prompts according to their specific needs and data
requirements.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Definitions</title>
      <p>
        Definition 1 (Knowledge Graph). A KG is defined as a collection of triples 𝒦 ⊆ ℰ × ℛ × ℰ,
where ℰ denotes the set of entities and ℛ the set of relation types. In our use case, elements in
ℰ correspond to supply chain-related entities, e.g., suppliers, smelters, and components, and are
represented as nodes in the graph. Every entity has a unique entity type, which is defined by
the mapping 𝜏 ∶ ℰ → 𝒯, where 𝒯 stands for the set of entity types. The entities are connected
via relations specified in ℛ, represented as typed directed edges in the graph.
Definition 2 (Query Generation). Given a user's natural language request X = (x₁, x₂, ..., xₙ)
and a corresponding query Y = (y₁, y₂, ..., yₘ), where each xᵢ and yⱼ represents a token in the
request and query, respectively, the aim is to accurately discern the underlying intent of the
request X and generate a corresponding query Y [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The execution of the query Y on the
KG database yields retrieval results denoted as R, where R corresponds to the user's desired
information from the KG.
      </p>
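      <p>The two definitions can be instantiated directly; the entity and relation names below are invented for illustration and do not come from the paper's actual schema:</p>

```python
# Toy model of Definition 1; all names are illustrative.
E = {"SupplierA", "Smelter1", "Part42", "Munich"}       # entity set E
R = {"supplies", "located_in"}                          # relation types R
tau = {"SupplierA": "Supplier", "Smelter1": "Smelter",  # type mapping tau: E -> T
       "Part42": "ManufacturerPart", "Munich": "City"}
KG = {                                                  # triples, a subset of E x R x E
    ("SupplierA", "supplies", "Part42"),
    ("Smelter1", "located_in", "Munich"),
}
# Every triple must respect the definition: head and tail in E, relation in R.
assert all(h in E and r in R and t in E for (h, r, t) in KG)
```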
    </sec>
    <sec id="sec-4">
      <title>3. Framework Overview</title>
      <p>Our proposed framework consists of two main processes: (1) a query dataset creation process
that generates an evaluation dataset containing diverse natural language requests and queries,
and (2) a QG process that evaluates the model performance of different prompts based on the
generated evaluation dataset.</p>
      <sec id="sec-4-1">
        <title>3.1. Query dataset creation</title>
        <p>For the scope of our study, the query dataset consists of structured queries alongside their
corresponding natural language requests. Creating such datasets presents notable challenges,
particularly due to the potential lack of existing datasets containing relevant queries tailored to
specific scenarios or domains. The task of annotating data for specific domains, such as SCM,
requires the involvement of domain experts, which incurs significant costs and manual efforts.</p>
        <p>To optimize efficiency, we present an innovative framework harnessing the generative
potential of language models. This framework aims to automate the creation of query datasets,
thereby facilitating the quantitative evaluation of models. As depicted in Figure 1, the query
generation process encompasses four sequential steps:
a. Initial Query Template: The process begins by creating a general query template T, using
placeholders for specific information elements like nodes or relations. An example of a query
template T in the query language Cypher is MATCH (n) -[r]-&gt; (m:Label1) RETURN n, r, where
m:Label1 acts as a placeholder for a node labeled with 'Label1'. In the context of this query
template, 'Label1' represents a specific label assigned to nodes within the graph database. The
placeholder allows for dynamic substitution with actual node labels during query generation.
b. Placeholder Substitution: Once the template is crafted, specific information replaces the
designated placeholders, resulting in a complete gold query G. A gold query G is the final query
created by substituting specific information into a template, and it acts as the standard against
which we evaluate the accuracy of our model's query generation capabilities. In this context,
the placeholder m:Label1 is replaced with m:ManufacturerPart. Consequently, the placeholder
substitution produces the following gold query G: MATCH (n) -[r]-&gt; (m:ManufacturerPart)
RETURN n, r. This query, G, is tailored to retrieve nodes (identified as 'n') and their associated
relationships (referred to as 'r') connected to nodes labeled as 'ManufacturerPart' within the
database's graph structure. After substitution, each generated gold query is executed once to
ensure its functionality and accuracy in retrieving pertinent information from the database.
c. Request Generation: The LLM employs diverse prompting formats to generate a range of
query requests in natural language. As shown in Figure 2, three distinct prompt templates have
been devised: the simple prompt, the schema prompt, and the in-context prompt. The simple
prompt directly instructs the model to generate natural language requests. The schema prompt
incorporates the KG schema to guide the model. The in-context prompt employs in-context
learning [17], integrating multiple query and natural language request pairs as demonstrations
within the template. Each query results in the generation of three distinct natural language
requests.
d. Human Evaluation: The concluding stage necessitates human intervention to validate the
quality and precision of the generated queries (see Section 4.2 for details). This step acts as
an important quality control measure, confirming that the queries effectively represent the
intended information retrieval process.</p>
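        <p>Steps (a)-(c) can be sketched in a few lines of Python; the function names and prompt wording below are our own illustrative choices, not the released implementation:</p>

```python
from string import Template

# Step (a): a query template with a $label placeholder (illustrative).
QUERY_TEMPLATE = "MATCH (n) -[r]-> (m:$label) RETURN n, r"

def substitute(template, **placeholders):
    """Step (b): fill placeholders to obtain a gold query."""
    return Template(template).substitute(placeholders)

def request_prompt(query, style="simple", schema="", demos=None):
    """Step (c): build one of the three prompt styles that ask an LLM to
    phrase the query as a natural language request (LLM call omitted)."""
    prompt = "Rewrite this Cypher query as a natural language request:\n" + query
    if style == "schema":
        prompt = "KG schema:\n" + schema + "\n" + prompt   # schema prompt
    elif style == "in-context":
        examples = "\n".join("Query: %s\nRequest: %s" % (q, r)
                             for q, r in (demos or []))    # in-context prompt
        prompt = examples + "\n" + prompt
    return prompt

gold = substitute(QUERY_TEMPLATE, label="ManufacturerPart")
# gold == "MATCH (n) -[r]-> (m:ManufacturerPart) RETURN n, r"
```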
        <p>It is noteworthy that only steps (a) and (d) necessitate human intervention within the proposed
methodology. Step (a) consists of creating a query template, a process that is simplified by using
the reference base query syntax (https://neo4j.com/docs/cypher-manual/current/queries/basic/). The human validation undertaken in step (d) serves to ensure
the quality and accuracy of queries, thereby augmenting the overall reliability of the system.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Query generation with LLMs</title>
        <p>
          Our study centers on the task of query generation within the framework of Neo4j databases.
Specifically, our objective is to develop a model capable of producing queries based on natural
language requests, enabling the retrieval of pertinent information from the database. To assess
the model’s performance quantitatively, in line with prior research [
          <xref ref-type="bibr" rid="ref5">18, 5</xref>
          ], we employ
execution-based automatic metrics. Three primary metrics drive our evaluation: Execution Rate
(ER), which evaluates the executability of the output queries within the database; Gold Query
Accuracy (GQA), which assesses the similarity between the output query and the gold query;
and Execution Accuracy (EA), which measures the correspondence of the retrieval results to those
of the gold query. Both GQA and EA are calculated by averaging the scores of all gold and
output query pairs using BERTScore [19], which measures the similarity between two text
sequences by computing the cosine similarity between the contextual embeddings of their
tokens, as produced by the language model BERT [20]. The key idea behind BERTScore is to
take not only exact word matches into account but also semantic similarity and contextual
appropriateness (in our case, identical Cypher queries return the requested information in a
fixed order), which makes it more robust than traditional evaluation metrics.
Higher values are preferred for all three metrics: a high ER signifies correct execution of output
queries in the database, while higher GQA and EA scores indicate strong resemblance between
the output and gold queries, along with accurate retrieval results aligning with those of the
gold query.
        </p>
        <p>
1. Execution Rate (ER): This metric evaluates the executability of the output queries
within the database. It is defined as the ratio of executable queries to the total number of
output queries:
ER = (Number of executable queries) / (Total number of output queries). (1)
2. Gold Query Accuracy (GQA): This metric evaluates the similarity between the output
query and the gold query. It is calculated using the BERTScore metric [19] as follows:
GQA = (1/N) ∑ᵢ₌₁ᴺ BERTScore(Yᵢ, Gᵢ), (2)
where N is the total number of output and gold query pairs, and Yᵢ and Gᵢ represent the
output query and gold query of the i-th pair, respectively.
3. Execution Accuracy (EA): This metric measures the similarity between the retrieval
results of the output query and those of the gold query:
EA = (1/N) ∑ᵢ₌₁ᴺ BERTScore(Rᵢ, R′ᵢ), (3)
where Rᵢ and R′ᵢ represent the retrieval results for the output query and gold query of the
i-th pair, respectively.
        </p>
        <p>Figure caption: Example model output under each type of prompt. Wrong or undesired query
outputs are marked in red. Without the schema information, the model with the simple prompt
uses "associated with" as a relation type, which is not defined in the KG schema. The prompt
with schema solves this problem but does not use the desired "collect" function. With in-context
demonstrations, the model shows the best performance.</p>
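        <p>The aggregation behind the three metrics can be sketched as follows; the <code>similarity</code> callback stands in for BERTScore (in practice supplied by the bert-score package), and the exact-match stand-in here is purely illustrative:</p>

```python
def execution_rate(executable):
    """ER: fraction of output queries that execute without error."""
    return sum(executable) / len(executable)

def mean_pair_score(outputs, golds, similarity):
    """GQA when applied to query strings; EA when applied to retrieval results."""
    assert len(outputs) == len(golds)
    return sum(similarity(o, g) for o, g in zip(outputs, golds)) / len(outputs)

# Stand-in for BERTScore: exact match. BERTScore would give graded semantic scores.
exact = lambda a, b: 1.0 if a == b else 0.0

er = execution_rate([True, True, False, True])   # 3 of 4 queries executable -> 0.75
gqa = mean_pair_score(["MATCH (n) RETURN n", "MATCH (m) RETURN m"],
                      ["MATCH (n) RETURN n", "MATCH (x) RETURN x"], exact)  # 0.5
```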
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>
        We illustrate the practical application of our framework by utilizing real SCM data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
emphasizing its effectiveness in assessing the performance of LLMs for QG in real-world scenarios.
      </p>
      <sec id="sec-5-1">
        <title>4.1. Supply chain knowledge graph</title>
        <p>The supply chain knowledge graph consolidates information from internal sources of the
company Siemens. This comprehensive dataset includes insights such as tier-1 suppliers,
business scopes, and specific parts associated with a company. Additionally, it incorporates
external data, such as publicly available information on smelters and substances. Specifics
regarding tier-2 and tier-3 suppliers are primarily derived from customs data. This information
is further supplemented by a smaller portion obtained from both private customs records and
public media sources. In total, there are 16,910 tier-1 suppliers, 43,759 tier-2 suppliers, and
49,775 tier-3 suppliers of Siemens. All entity and relation types and corresponding numbers of
nodes and edges are listed in Table 1. It is important to note that suppliers at different tier levels
are not mutually exclusive. To facilitate access to this complex network of information, the
graph is structured using the graph database platform Neo4j.</p>
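        <p>Queries against such a Neo4j-hosted graph can be issued with the official Python driver; the connection details and the node label below are placeholders, not the paper's actual deployment or schema:</p>

```python
def build_supplier_query(label, limit=25):
    """Compose a Cypher query for a given node label (label is illustrative;
    the actual schema in Table 1 may use different names)."""
    return "MATCH (s:%s)-[r]->(m) RETURN s, r, m LIMIT %d" % (label, limit)

def run_query(uri, user, password, cypher):
    """Execute a Cypher query and return the rows as dictionaries."""
    from neo4j import GraphDatabase  # pip install neo4j
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(cypher)]
    finally:
        driver.close()

# Usage (requires a running Neo4j instance; credentials are placeholders):
# rows = run_query("bolt://localhost:7687", "neo4j", "password",
#                  build_supplier_query("Supplier"))
```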
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Supply chain query dataset</title>
        <p>We compiled a set of 60 query templates, categorized into six main groups as detailed in
Table 2. Each template underwent placeholder substitution, as outlined in Section 3.1, yielding
five distinct queries per template. This process resulted in the creation of 300 gold queries.
Subsequently, we generated three natural language requests for each gold query, using three
different prompts. These query-request pairs were then evaluated by three reviewers
within our organization to determine their feasibility. A query-request pair is labeled as
reasonable if the reviewers agree that the query is effective in retrieving information from the
database to answer the corresponding request; otherwise, it is marked as unreasonable. Any
pair receiving unreasonable annotations from the reviewers was filtered out. Following this
filtering process, we retained 825 query-request pairs to constitute our query dataset.
Inter-annotator agreement was measured using Fleiss' kappa [21], which yielded a value of 0.72 in
our annotations, indicating substantial agreement among the reviewers regarding the alignment
of the generated query and request pairs. The high number of retained pairs underscores
the effectiveness of our dataset generation methodology. The time spent annotating is estimated
at 3 hours per reviewer.</p>
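        <p>Fleiss' kappa [21] can be computed directly from an items-by-categories count matrix; a minimal implementation (our sketch, not the paper's code) looks like this:</p>

```python
def fleiss_kappa(counts):
    """counts[i][j]: number of raters assigning item i to category j.
    Each row must sum to the same number of raters n."""
    N = len(counts)               # number of items
    n = sum(counts[0])            # number of raters per item
    # Mean per-item agreement P_bar.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # Chance agreement P_e from overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    P_e = sum((t / (N * n)) ** 2 for t in totals)
    return (P_bar - P_e) / (1 - P_e)

# Two items, three raters, perfect agreement on different categories -> 1.0
print(fleiss_kappa([[3, 0], [0, 3]]))
```

A kappa of 0.72, as reported above, falls in the range commonly interpreted as substantial agreement.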
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Query generation performance</title>
        <p>In this section, we show how the generated dataset can be used to evaluate the QG capabilities
of a model and to help refine the model's corresponding task prompts. As shown in Figure 3,
we evaluate the performance of three different types of QG prompts, using two state-of-the-art
language models, GPT-3.5 and GPT-4: simple prompts, prompts with schema, and in-context
prompts. In a simple prompt, the model is directed to generate a corresponding query without
prior knowledge of the schema information within the KG. This may result in the use of
relation types not present in the KG; for instance, the relation type "associated with" is not
defined in the KG schema. However, when additional schema data is integrated, the model
shows improved effectiveness by employing precise schema properties in queries. Nevertheless,
without specific user directives, such as indicating the desired output format using the "collect"
function, the generated query may not perfectly match the gold query. The in-context prompt
goes a step further by providing multiple request and query pairs, along with the schema, to
guide the model toward producing correct results. When presented with in-context demonstrations,
the model shows the best performance in QG tasks.</p>
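        <p>The three QG prompt variants evaluated above can be assembled along these lines; the wording is our paraphrase, not the paper's exact templates:</p>

```python
def qg_prompt(request, schema="", demos=None):
    """Build a simple / schema / in-context prompt for request-to-Cypher
    generation, depending on which optional parts are supplied."""
    parts = []
    if schema:                       # schema prompt: expose labels and relation types
        parts.append("Graph schema:\n" + schema)
    for req, query in demos or []:   # in-context prompt: request/query demonstrations
        parts.append("Request: %s\nCypher: %s" % (req, query))
    parts.append("Request: %s\nCypher:" % request)
    return "\n\n".join(parts)

simple = qg_prompt("List all parts of supplier X.")
with_schema = qg_prompt("List all parts of supplier X.",
                        schema="(:Supplier)-[:SUPPLIES]->(:ManufacturerPart)")
```

The model's completion after the final "Cypher:" would then be executed against the database and scored with ER, GQA, and EA.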
        <p>The results detailed in Table 3 consistently demonstrate GPT-4's superior performance over
GPT-3.5 across all three prompt types. Notably, employing schema and in-context
demonstrations consistently results in higher GQA, ER, and EA compared to direct instructions to the
model. The provision of a schema significantly enhances model performance, with in-context
demonstrations exhibiting the highest efficacy for both GPT-3.5 and GPT-4. These results provide
valuable insights into the efficacy of various prompting methods and highlight the advancements
from GPT-3.5 to GPT-4 in QG tasks. Furthermore, they validate the utility of query datasets
generated through the pipeline for prompt tuning in real-world applications.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>In conclusion, our paper proposed a comprehensive automated evaluation framework for
QG, addressing the challenge of assessing the performance of LLMs in specific domains due
to the lack of standardized evaluation criteria and datasets. Our framework comprises two
main steps: first, the creation of an evaluation dataset leveraging the reasoning abilities
of LLMs through query templates, and second, the evaluation of QG model performance
based on the generated evaluation dataset. To illustrate the efficacy of our framework, we
apply it to a concrete industry use case: QG in SCM KGs. The creation of a diverse supply
chain query dataset, as delineated in Table 2, establishes the groundwork for assessing the
model performance of different prompts for QG. Subsequent analysis, summarized in Table 3,
consistently demonstrates the superiority of GPT-4 over GPT-3.5 across various conditions.
Overall, our framework enhances the comprehension of QG within the realm of SCM graphs,
offering practical implications for improving query efficiency and accuracy in real-world SCM
applications.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Future Directions</title>
      <p>
        In future research, expanding the experimental configuration to include other prominent LLMs
like Gemini [22] from Google and Llama [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] from Meta would be beneficial. This expansion
would broaden the scope of the study, allowing for a more comprehensive evaluation of the
proposed framework's effectiveness across a wider range of LLMs. Additionally, the current
single-round question answering format is constraining; hence, exploring multi-round dialogues
represents a promising avenue for future work. By incorporating these future directions, the
research can continue to advance our understanding of LLMs and their applications in
real-world scenarios.
      </p>
      <p>Acknowledgments. This work has been supported by the German Federal Ministry for
Economic Affairs and Climate Action (BMWK) as part of the project CoyPu under grant number
01MK21007K and has also been supported by the DAAD programme Konrad Zuse Schools
of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and
Research.</p>
      <p>[10] …ternational Workshop on Linked Data-driven Resilience Research 2023 co-located with
Extended Semantic Web Conference 2023 (ESWC 2023), Hersonissos, Greece, May 28, 2023,
volume 3401 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3401/paper3.pdf.
[11] J. Deng, C. Chen, X. Huang, W. Chen, L. Cheng, Research on the construction of event
logic knowledge graph of supply chain management, Adv. Eng. Informatics 56 (2023)
101921. URL: https://doi.org/10.1016/j.aei.2023.101921. doi:10.1016/j.aei.2023.101921.
[12] X. Huang, L. Cheng, J. Deng, T. Wang, Binocular attention-based stacked BiLSTM NER
model for supply chain management event knowledge graph construction, in: Proceedings
of the 15th International Conference on Machine Learning and Computing, ICMLC 2023,
Zhuhai, China, February 17-20, 2023, ACM, 2023, pp. 40-46. URL: https://doi.org/10.1145/3587716.3587723. doi:10.1145/3587716.3587723.
[13] W. Chen, L. Cheng, T. Wang, J. Deng, Knowledge graph construction for supply chain
management in manufacturing industry, in: D. Huang, P. Premaratne, B. Jin, B. Qu, K. Jo,
A. Hussain (Eds.), Advanced Intelligent Computing Technology and Applications - 19th
International Conference, ICIC 2023, Zhengzhou, China, August 10-13, 2023, Proceedings,
Part IV, volume 14089 of Lecture Notes in Computer Science, Springer, 2023, pp. 682-693. URL:
https://doi.org/10.1007/978-981-99-4752-2_56. doi:10.1007/978-981-99-4752-2_56.
[14] T. Bratanic, Generating Cypher queries with ChatGPT-4 on any graph schema, 2023.
URL: https://neo4j.com/developer-blog/generating-cypher-queries-with-chatgpt-4-on-any-graph-schema/, accessed: 15.10.2023.
[15] X. Li, R. Zhao, Y. K. Chia, B. Ding, L. Bing, S. R. Joty, S. Poria, Chain of knowledge:
A framework for grounding large language models with structured knowledge bases,
CoRR abs/2305.13269 (2023). URL: https://doi.org/10.48550/arXiv.2305.13269. doi:10.48550/arXiv.2305.13269. arXiv:2305.13269.
[16] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in
Neural Information Processing Systems 33 (2020) 1877-1901.
[17] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, Z. Sui, A survey on
in-context learning, arXiv preprint arXiv:2301.00234 (2022).
[18] A. Saha, V. Pahuja, M. Khapra, K. Sankaranarayanan, S. Chandar, Complex sequential
question answering: Towards learning to converse over linked question answer pairs with
a knowledge graph, in: Proceedings of the AAAI Conference on Artificial Intelligence,
volume 32, 2018.
[19] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text
generation with BERT, in: 8th International Conference on Learning Representations,
ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL:
https://openreview.net/forum?id=SkeHuCVFDr.
[20] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.),
Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019,
Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for
Computational Linguistics, 2019, pp. 4171-4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
[21] J. L. Fleiss, Measuring nominal scale agreement among many raters, Psychological Bulletin
76 (1971) 378.
[22] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk,
A. M. Dai, A. Hauth, et al., Gemini: A family of highly capable multimodal models, arXiv
preprint arXiv:2312.11805 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>ArXiv abs/2302.13971</source>
          (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:257219404.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          , E. Buchatskaya,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          , D. de las Casas,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hennigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Noland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          , G. van den Driessche,
          <string-name>
            <given-names>B.</given-names>
            <surname>Damoc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <article-title>An empirical analysis of compute-optimal large language model training</article-title>
          , in:
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=iBBcRUlOAPR.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schuh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsvyashchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gur-Ari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Michalewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spiridonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sepassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Omernick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Pillai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pellat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lewkowycz</surname>
          </string-name>
          , E. Moreira,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Polozov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saeta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Catasta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meier-Hellstern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fiedel</surname>
          </string-name>
          ,
          <article-title>PaLM: Scaling language modeling with pathways</article-title>
          ,
          <source>J. Mach. Learn. Res.</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>240:1</fpage>
          -
          <lpage>240:113</lpage>
          . URL: http://jmlr.org/papers/v24/22-1144.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cosgrove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Acosta-Navas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zelikman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Durmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Orr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yüksekgönül</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Chatterji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Icard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. Koreeda,
          <article-title>Holistic evaluation of language models</article-title>
          ,
          <source>CoRR abs/2211.09110</source>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.48550/arXiv.2211.09110. doi:10.48550/ARXIV.2211.09110. arXiv:2211.09110.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Perez-Beltrachini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          , E. Monti,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <article-title>Semantic parsing for conversational question answering over knowledge graphs</article-title>
          (
          <year>2023</year>
          )
          <fpage>2499</fpage>
          -
          <lpage>2514</lpage>
          . URL: https://doi.org/10.18653/v1/2023.eacl-main.184. doi:10.18653/V1/2023.EACL-MAIN.184.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>Perplexity - a measure of the difficulty of speech recognition tasks</article-title>
          ,
          <source>The Journal of the Acoustical Society of America</source>
          <volume>62</volume>
          (
          <year>1977</year>
          )
          <fpage>S63</fpage>
          -
          <lpage>S63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          , in:
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA</source>
          , ACL,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in:
          <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics, Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <article-title>Power, value and supply chain management</article-title>
          ,
          <source>Supply Chain Management: An International Journal</source>
          <volume>4</volume>
          (
          <year>1999</year>
          )
          <fpage>167</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hildebrandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Buchner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Inzko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wernert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Weigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Berbalk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tresp</surname>
          </string-name>
          ,
          <article-title>A knowledge graph perspective on supply chain resilience</article-title>
          , in: S. Tramp,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arndt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Holze</surname>
          </string-name>
          , S. Auer (Eds.), Proceedings of the Second In-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>