On Constructing Biomedical Text-to-Graph Systems
with Large Language Models

Lorenzo Bertolini1, Roel Hulsman1, Sergio Consoli1, Antonio Puertas-Gallardo1 and Mario Ceresa1

1 European Commission, Joint Research Centre (JRC), Ispra, Italy


Abstract
Knowledge graphs and ontologies represent symbolic and factual information that can offer structured
and interpretable knowledge. Extracting and manipulating this type of information is a crucial step in
complex processes such as human reasoning. While Large Language Models (LLMs) are known to be
useful for extracting and enriching knowledge graphs and ontologies, previous work has largely focused
on comparing architecture-specific models (e.g. encoder-decoder only) across benchmarks from similar
domains. In this work, we provide a large-scale comparison of the impact of certain LLM features (e.g.
model architecture and size) and task learning methods (fine-tuning vs. in-context learning (iCL)) on
text-to-graph benchmarks in the biomedical domain. Our experiments suggest that, while a simple
truncation-based heuristic can notably boost the performance of decoder-only models used with iCL,
small fine-tuned encoder-decoder models produce the most stable and strong performance. Moreover, we
found that massive out-of-domain text-graph pre-training has a positive impact on fine-tuned models,
while pre-training data and model size have only a marginal impact on decoder-only iCL models.


Keywords
Information Extraction, Knowledge Graphs, Word Embeddings, Large Language Models, In-Context
Learning

                                1. Introduction
                                Acquiring structured knowledge from text is a fundamental step in a complex process like
                                reasoning and answering questions, whether such a process is carried out by a human or
                                an artificial intelligence (AI) system [1]. In natural language processing (NLP), structured
                                knowledge is often handled via ontologies or knowledge graphs [2, 3, 4]. Knowledge graphs
                                are typically organised as collections of [(head # relation # tail)] triplets, such as
                                [(dog # isA # animal)], or [(Rome # CapitalOf # Italy)]. Knowledge graphs and
                                ontologies play a pivotal role in representing knowledge across various domains, facilitating


Third International Workshop on Knowledge Graph Generation from Text, May 26-30, 2024, co-located with the
Extended Semantic Web Conference (ESWC), Hersonissos, Greece
lorenzo.bertolini@ec.europa.eu (L. Bertolini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




intelligent applications such as chatbots [5], recommendation systems [6], question answering
systems [7, 8] and more [9, 1].
Knowledge graphs have seen a surge in their application in recent years [10, 11]. However,
building them can be laborious and costly [8, 4]. This has led to the development of numerous
methods aimed at auto-generation of these graphs from text sources in various fields [12, 9, 11, 4].
Until recently, extracting and manipulating knowledge graphs and other forms of graphs has
been largely dealt with by small knowledge graph embedding models (KGEs) [13], which are
lightweight but limited in capabilities, or different types of graph neural networks (GNNs)
[14, 15], such as graph convolutional networks (GCNs) [16] or graph attention networks
(GATs) [17]. Recently, many of these architectures have been replaced by
transformer-based large language models (LLMs) [18], which have shown great potential in
modelling graph-based data.
Despite these advancements, current techniques still suffer from significant limitations concern-
ing accuracy, completeness, privacy, bias, and scalability [19, 20, 4]. Therefore, generating a
large-scale knowledge graph automatically from text corpora remains an open challenge [3, 9, 4].
As shown by a consistent body of evidence [21, 22, 23], LLMs can be adapted to both extract
knowledge graphs from a reference text (text-to-graph task), as well as to convert knowledge
graphs into natural language while maintaining the semantic meaning (graph-to-text task). We
are interested in the former.
To adapt an LLM to a particular task, two popular task learning methods are fine-tuning and
in-context learning (iCL) [24]. Given a training dataset pertaining to the new task at hand,
fine-tuning an LLM amounts to an additional training phase to update a subset of learnable
model parameters to adapt to the new task. In-context learning, on the other hand, consists of
including a few task examples in the model prompt at inference time - a special case of few-shot
learning. Typically, iCL provides weaker performance than fine-tuning and is computationally
more expensive at inference time [25, 24], yet it is highly flexible as it does not require any
parameter updates. Both options involve a vast amount of design choices, from the quality and
quantity of available training data to the amount of in-context examples to include in iCL.
While most work on knowledge graph extraction has focused on pushing the state-of-the-art
in terms of performance [26, 27] or summarising the field in terms of different applications
[21, 22, 23] and formulations of scenarios and tasks [28, 29, 30], it remains unclear to the general
AI practitioner what would be, given a specific dataset and available computational resources, the best
way to approach a text-to-graph task formulated as an end-to-end LLM-based solution.
This work is directed at the general AI practitioner in the biomedical domain aiming to develop
an end-to-end LLM-based knowledge graph extraction system from textual sources. We inves-
tigate how to best approach such a task by examining various combinations of model design
choices, assuming a fixed and accessible computational resource of a single RTX 8000 GPU.
The main variables under investigation are model architecture (encoder-decoder, decoder-only),
model family (T5, BART, Mistral-v0.1, Llama-2), model size (from small (60M) to mid-sized (13B
learnable parameters)), task learning method (fine-tuning, iCL) and additional pre-training data
(relation extraction data, conversation data, instruction data, (bio)medical data). In brief, the main
insights of this paper encompass the following:
   1. We provide tentative evidence that biomedical knowledge graphs can be hard to model.
      Mid-sized decoder-only models adopting iCL show weak performance, while performance
      of small fine-tuned encoder-decoder models is robust compared to the general domain.
   2. For small fine-tuned encoder-decoder models we observe power-law scaling in model size,
      while for mid-sized decoder-only models adopting iCL we instead observe power-law
      scaling in the number of in-context examples. This is in line with known results [24].
   3. Only additional pre-training data on relation extraction tasks boosts model performance;
      observing conversation data, instruction data or (bio)medical data during pre-training
      makes no notable difference.
   4. We propose and experimentally demonstrate the effectiveness of a simple truncation-based
      heuristic on model output to control for a specific type of hallucination in in-context
      learning, avoiding expensive prompt tuning and prompt design.


2. Material and Methods
Knowledge graph structure To ensure a stable and fair comparison across models with
different pre-training, we pre-process the selected dataset to match the following linearised
text-graph structure. Formally, a dataset consists of two sets of strings 𝑇 and 𝐺, where each
reference text 𝑡𝑖 ∈ 𝑇 and knowledge graph 𝑔𝑖 ∈ 𝐺 are assumed to be semantically identical
representations that differ syntactically. For example, given a reference text “The pencil is on the
table.”, we represent the corresponding knowledge graph as containing one linearised triplet
“[(pencil # IsOn # table)]”. In the coming paragraphs, we present a detailed example
of the proposed linearisation, in the context of the prompt used for the in-context learning set-up.
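This linearised structure can be produced and parsed with a few lines of Python. The sketch below is illustrative rather than the exact code used in our experiments; the function names are placeholders.

    def linearise(triplets):
        # Turn a list of (head, relation, tail) tuples into the linearised string format.
        return "[" + " | ".join(f"({h} # {r} # {t})" for h, r, t in triplets) + "]"

    def delinearise(graph_string):
        # Parse a linearised graph string back into (head, relation, tail) tuples.
        body = graph_string.strip().lstrip("[").rstrip("]")
        triplets = []
        for chunk in body.split("|"):
            parts = [p.strip(" ()") for p in chunk.split("#")]
            if len(parts) == 3:
                triplets.append(tuple(parts))
        return triplets

    print(linearise([("pencil", "IsOn", "table")]))   # [(pencil # IsOn # table)]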


Dataset We use BioEvent [30], a benchmark that aggregates 10 popular biomedical datasets.
We adopt a simple strategy to clean the data: we remove duplicate pairs, breaking ties in favour
of the text-graph pair with the longest linearised knowledge graph, under the assumption that
the longest knowledge graph is the most complete description of the entities and relations
described. Finally, we obtain train/validation/test sets using an 80/10/10% split.
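A minimal sketch of this clean-up and splitting strategy is shown below, assuming the aggregated pairs are available as dictionaries with "text" and "graph" keys; the function name, keys and random seed are illustrative assumptions.

    import random

    def clean_and_split(pairs, seed=42):
        # Keep, for each reference text, only the pair with the longest linearised graph
        # (this removes duplicates and breaks ties in one pass).
        best = {}
        for pair in pairs:
            text = pair["text"]
            if text not in best or len(pair["graph"]) > len(best[text]["graph"]):
                best[text] = pair
        data = list(best.values())
        random.Random(seed).shuffle(data)
        n_train, n_val = int(0.8 * len(data)), int(0.1 * len(data))
        return (data[:n_train],                        # 80% train
                data[n_train:n_train + n_val],         # 10% validation
                data[n_train + n_val:])                # 10% test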


Metrics We evaluate performance with Rouge scores [31], namely Rouge-𝑛 (𝑛 = 1, 2) and
Rouge-L. The former is based on 𝑛-grams, while the latter is based on the longest common
sub-sequence (LCS) between two strings, as implemented in the Hugging Face evaluate library.
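As an illustration, the scores can be computed as follows with the evaluate library (which relies on the rouge_score package); the strings are toy examples.

    import evaluate

    rouge = evaluate.load("rouge")
    scores = rouge.compute(
        predictions=["[(pencil # IsOn # table)]"],    # model output
        references=["[(pencil # IsOn # table)]"],     # gold linearised graph
    )
    print(scores["rouge1"], scores["rouge2"], scores["rougeL"])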


Models We adopt models from two families of pre-trained encoder-decoder models: T5 and BART.
We adopt three sizes in the T5 family, namely 60.5M parameters (t5-small), 223M (t5-base)
and 738M (t5-large). As for BART, we use the BART model (bart-large) introduced in
[32], as well as a version from [33] tuned on the REBEL dataset [34], a large relation extraction
dataset designed for text-to-text modelling.
We then opt for two sets of pre-trained decoder-only LLMs: Mistral-v0.1 and Llama-2. We
include two sizes, one with 7B parameters (Llama-2-chat-7b-hf) and the largest model in
our analysis with 13B parameters (Llama-2-chat-13b-hf), as well as a Llama-2 model fine-
tuned on biomedical knowledge and question answering (meditron-7b) [35], to investigate
the effect of domain-specific training in the biomedical domain.
As for the Mistral family, we adopt three models: the original 7B model (Mistral-v0.1), a ver-
sion fine-tuned on a variety of open-source conversation datasets (Mistral-Instruct-v0.1),
and finally a version fine-tuned by OpenOrca (Mistral-OpenOrca) [36] on a reproduction
attempt of the Orca dataset [37], leveraging the Flan Collection for effective instruction-tuning
[38]. Importantly, all models adopted in this work are fully open-source and accessible through
Hugging Face via the transformers library [39].
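As an illustration, checkpoints from both architectures can be loaded through their public Hugging Face identifiers; we show one encoder-decoder and one decoder-only model.

    from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

    # Encoder-decoder checkpoint (used with fine-tuning).
    t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
    t5_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # Decoder-only checkpoint (used with in-context learning).
    mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    mistral_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")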


Learning methods We adopt two distinct task learning methods, fine-tuning for the smaller
encoder-decoder models and iCL [24] for the larger decoder-only models. All fine-tuning
experiments are based on the trainer class implementation from Hugging Face. Given a
text-graph pair (𝑡𝑖 , 𝑔𝑖 ) in the pre-defined training set, each model undergoes an additional
fine-tuning phase where it is trained to generate the graph 𝑔𝑖 as output, using the text 𝑡𝑖 as
input. All models are tuned end-to-end for up to ten epochs, selecting the best model based on
the validation Rouge-1 score, as per standard practice in NLP and knowledge graph literature
[40, 41]. The training hyper-parameters are given in Table 2 of Appendix A.
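A minimal sketch of this fine-tuning set-up on top of the Hugging Face trainer classes is given below; the preprocessing helper, dataset variables and metric function are illustrative placeholders, and the hyper-parameter values are those of Table 2.

    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    def preprocess(example):
        # The reference text t_i is the input; the linearised graph g_i is the label.
        inputs = tokenizer(example["text"], truncation=True)
        inputs["labels"] = tokenizer(example["graph"], truncation=True)["input_ids"]
        return inputs

    args = Seq2SeqTrainingArguments(output_dir="text2graph-t5-small")  # values as in Table 2
    # trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
    #                          train_dataset=train_split.map(preprocess),
    #                          eval_dataset=validation_split.map(preprocess),
    #                          compute_metrics=rouge_metrics)  # validation Rouge-1
    # trainer.train()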
For the iCL setting, each pre-trained model is queried with a simple prompt, containing a set
of 𝑁 solved text-graph examples taken from the available training set. To limit the impact of
selecting a set of poor examples, we sample 𝑁 examples randomly from the training set for each
test instance at inference time. Moreover, we omit time-consuming prompt engineering and
computationally expensive prompt tuning to resemble the common practice of end-users.
However, we highlight the importance of such practices to prevent model hallucinations and,
more generally, to mitigate sensitivity to spurious features in prompt design, along the lines of [42]. To
provide a fair estimate of iCL performance, we introduce a simple post-hoc hallucination-control
heuristic to determine the end of the desired structured output (i.e. the end of a knowledge
graph). Simply put, we truncate model output at the appearance of the tokens “)]”, signalling
the end of a knowledge graph in our graph structure. An example of the finalised iCL prompt
(with 𝑁 = 2) is presented in Figure 1.
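The following sketch illustrates the prompt construction and the truncation-based heuristic described above; the function names are placeholders and the prompt template mirrors Figure 1.

    import random

    def build_prompt(train_pairs, query_text, n=8, seed=None):
        # Sample n solved text-graph examples and append the query text (cf. Figure 1).
        examples = random.Random(seed).sample(train_pairs, n)
        prompt = "Convert the text into a sequence of triplets:\n\n"
        for example in examples:
            prompt += f"Text: {example['text']}\nGraph: {example['graph']}\n\n"
        return prompt + f"Text: {query_text}\nGraph:"

    def truncate_graph(generated_text):
        # Hallucination-control heuristic: keep the output only up to the first ")]",
        # which signals the end of a knowledge graph in our linearised format.
        end = generated_text.find(")]")
        return generated_text[:end + 2] if end != -1 else generated_text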


Experimental setup The main experiment is designed to unveil the approximate overall
power of the selected models and task learning methods, as well as to understand what impacts
and shapes their performance. The models are implemented in Python version 3.8.5, and all
computations run on a single RTX 8000 GPU in a machine with an AMD EPYC 7282 16-core
64-bit processor at 1.50GHz and 512GB of RAM. We recognise that larger computational
resources could significantly improve results, especially by including larger decoder-only
models or by fine-tuning the mid-sized Mistral-v0.1 and Llama-2 families, which is out of the
computational reach of the current setup.
Task:
    Convert the text into a sequence of triplets:

Context:
    Text: Further investigation using inhibition or genetic deletion of Erbb2 in vitro
    revealed reduced Cdc25a levels and increased S-phase arrest in UV-irradiated
    cells lacking Erbb2 activity.
    Graph: [(reduced # Theme # Cdc25a) | (reduced # Cause # genetic deletion) |
    (genetic deletion # Theme # Erbb2)]

    Text: In this study, we showed that iNOS was ubiquitinated and degraded
    dependent on CHIP (COOH terminus of heat shock protein 70-interacting
    protein), a chaperone-dependent ubiquitin ligase.
    Graph: [(dependent # Theme # ubiquitinated) | (ubiquitinated # Theme # iNOS)
    | (dependent # Cause # CHIP)]

Query:
    Text: Such activity was abolished in mechanically stimulated mouse MRTF-A(-/-)
    cells or upon inhibition of CREB-binding protein (CBP)
    Graph:

Figure 1: Example prompt with 𝑁 = 2 in-context examples. The prompt ends with a query for a knowledge
graph pertaining to the last reference text.


The main goal of the experiment is to understand how LLM characteristics and task-learning
methods perform in our text-to-graph task, under fixed computational resources. Throughout,
we aim to guide the general AI practitioner to understand which combination is most suited
for such a task and to showcase how to navigate (part of) the vast and complex spectrum of
model design choices. Given the fixed computational resources, we fine-tune the previously
introduced set of smaller encoder-decoder models and compare performance to the set of larger
decoder-only models in combination with iCL. This choice is framed in the context of a fixed
computational budget under which fine-tuning is computationally infeasible for the larger models.
At the same time, the short context window of the T5 and BART families (1k tokens or below)
makes iCL unsuitable for them. Following [43], we adopt 𝑁 = 8 as the number of in-context examples.


3. Results
The overall results of various combinations of model architecture, family, size, relevant pre-
training data and task learning method are shown in Table 1. First, we can observe a clear
benefit in fine-tuning smaller encoder-decoder models. We hypothesise this relates to the
difficulty of the benchmark: BioEvent presents a large number of unique entities and triplets,
creating a complex distribution of patterns in reference texts that is difficult to infer correctly
from just 8 in-context examples. Overall, the best performance across metrics is reached with
fine-tuned encoder-decoder models, led by the REBEL-tuned BART model and the largest
model in the T5 family.

Table 1
General experiment results. We report Rouge scores obtained by various combinations of model
architecture, family, size (learnable parameters), relevant additional data seen during pre-training and
task learning method (fine-tuning or iCL).

                                                                               BioEvent
  Arch.             Family         Size    Pre-training   Learning Method    R-1   R-2   R-L
  Encoder-decoder   T5            60.5M    –              Fine-tuning        .57   .42   .53
                    T5             223M    –              Fine-tuning        .62   .48   .59
                    T5             738M    –              Fine-tuning        .66   .54   .63
                    BART           406M    –              Fine-tuning        .53   .38   .51
                    BART           406M    REBEL          Fine-tuning        .67   .56   .64
  Decoder-only      Mistral-v0.1     7B    –              iCL + Heuristic    .43   .25   .37
                    Mistral-v0.1     7B    –              iCL                .16   .07   .13
                    Mistral-v0.1     7B    Conversation   iCL + Heuristic    .43   .24   .36
                    Mistral-v0.1     7B    Conversation   iCL                .18   .08   .14
                    Mistral-v0.1     7B    OpenOrca       iCL + Heuristic    .44   .25   .36
                    Mistral-v0.1     7B    OpenOrca       iCL                .24   .12   .19
                    Llama-2          7B    Instruction    iCL + Heuristic    .44   .24   .37
                    Llama-2          7B    Instruction    iCL                .17   .08   .14
                    Llama-2         13B    Instruction    iCL + Heuristic    .44   .24   .37
                    Llama-2         13B    Instruction    iCL                .16   .07   .13
                    Llama-2          7B    Meditron       iCL + Heuristic    .40   .21   .34
                    Llama-2          7B    Meditron       iCL                .17   .08   .14

Moreover, within the T5 family we find a clear positive correlation between model size and
performance, in line with the well-documented phenomenon of power-law scaling of LLM
performance in the number of model parameters [44]; within the Llama-2 family, in contrast,
moving from 7B to 13B parameters makes only a marginal difference. Focusing on the
BART family, we see that adopting an additional relation extraction dataset during pre-training
(REBEL) yields universally superior results. This is in sharp contrast to the other pre-training
additions, since neither the conversation, instruction, OpenOrca nor Meditron datasets seem
to affect performance on the benchmark. We hypothesise that none are particularly relevant to
our text-to-graph task, although this is most surprising for the biomedical knowledge
in the Meditron pre-training data.
Table 1 also shows that our hallucination-control heuristic for iCL models yields a large per-
formance boost, independently of model family, size, or pre-training data. To briefly reiterate,
this heuristic was put in place to avoid computationally and experimentally demanding prompt
engineering or tuning, and is implemented by truncating model output after the tokens signalling
a graph’s end (i.e., “)]”). The jump in performance can reach more than 20 points and is
consistent across all Rouge scores. Concerning specific metrics, we found Rouge-1 (R-1) scores
to be consistently higher, especially for decoder-only models, indicating stronger entity and
relation recognition. We also found R-L scores to be systematically above Rouge-2 (R-2) scores
and closer to R-1. This suggests that the identified entities and relations are often in the right
order, but certain entities or relations are missing, such that correct bigrams are lacking.


4. Discussion
This work is directed at biomedical researchers and practitioners aiming to develop an end-to-
end LLM-based automatic graph extraction system from textual sources. Assuming a realistic
computational baseline, our large-scale comparison contributed to the development of a more
effective and efficient pipeline for biomedical knowledge extraction and representation tasks by
highlighting the impact of a plethora of design choices and providing several empirical insights.
Indeed, off-the-shelf LLMs together with a task learning method can achieve strong entity
and relation recognition, and reach moderate yet promising overall results on knowledge
graph completion. The optimal performance of LLMs is likely higher than displayed here,
e.g. due to prompt engineering/tuning, hyper-parameter tuning, more computational power
and more model parameters. Our results indicate that, without fine-tuning, LLMs might not
be directly suitable for biomedical text-to-graph tasks. Fine-tuning has proven more robust
than iCL, since mid-sized decoder-only models adopting iCL show weak performance, while
small fine-tuned encoder-decoder models achieve robust moderate results. We hypothesise that
expert knowledge contained in reference texts in the biomedical domain poses a more difficult
knowledge extraction problem, such that iCL with a small number of in-context examples is
not sufficient to correctly learn the task. That is, knowledge graphs in the biomedical
domain might require knowledge obtained across a large set of examples. However, we provide
strong and consistent evidence that our simple truncation-based heuristic is highly effective in
boosting model performance without time-consuming prompt engineering or computationally
expensive prompt tuning, the latter of which is not necessarily generalisable across subsets of the same task
[45]. Crucially, this suggests that when the output of a model follows a constrained structure,
simple rule-based heuristics can be an efficient method to limit undesired output.


5. Conclusions
This work examined the ability of LLMs to generate biomedical knowledge graphs from refer-
ence texts, comparing end-to-end fine-tuned encoder-decoder models against decoder-only
models used with in-context learning (iCL). Our results showed that small fine-tuned encoder-
decoder models consistently outperform mid-sized decoder-only models adopting iCL. We found
evidence that our simple heuristic to control for model hallucination has a consistently positive
impact on the performance of decoder-only models, but found no connection between performance
and the inclusion of additional pre-training datasets that are not directly linked to the text-to-
graph task, such as conversation data, instruction data and biomedical expert knowledge.
On the contrary, we found that including a relation extraction dataset like REBEL yields a
notable boost in the performance of encoder-decoder models, for which we also observed a
power-law connection between model size and performance.
References
 [1] S. Tiwari, F. Ortíz-Rodriguez, S. B. Abbés, P. U. Usip, R. Hantach, Semantic AI in Knowledge
     Graphs, Taylor & Francis, Boca Raton, US, 2023. doi:10.1201/9781003313267.
 [2] H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation
     methods, Semantic Web 8 (2017).
 [3] A. Hogan, E. Blomqvist, M. Cochez, C. D’Amato, G. D. Melo, C. Gutierrez, S. Kirrane,
     J. E. L. Gayo, R. Navigli, S. Neumaier, A.-C. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula,
     L. Schmelzeisen, J. Sequeda, S. Staab, A. Zimmermann, Knowledge graphs, ACM Comput-
     ing Surveys 54 (2021). doi:10.1145/3447772.
 [4] C. Peng, F. Xia, M. Naseriparsa, F. Osborne, Knowledge Graphs: Opportunities and Chal-
     lenges, Artificial Intelligence Review 56 (2023). doi:10.1007/s10462-023-10465-9.
 [5] A. Ait-Mlouk, L. Jiang, KBot: A Knowledge Graph Based ChatBot for Natural Language
     Understanding over Linked Data, IEEE Access 8 (2020). doi:10.1109/ACCESS.2020.
     3016142.
 [6] Y. Xian, Z. Fu, S. Muthukrishnan, G. De Melo, Y. Zhang, Reinforcement Knowledge Graph
     Reasoning for Explainable Recommendation, Association for Computing Machinery, New
     York, NY, USA, 2019. doi:10.1145/3331184.3331203.
 [7] X. Huang, J. Zhang, D. Li, P. Li, Knowledge Graph Embedding Based Question Answering,
     Association for Computing Machinery, New York, NY, USA, 2019. doi:10.1145/3289600.
     3290956.
 [8] M. Kejriwal, J. Sequeda, V. Lopez, Knowledge Graphs: Construction, Management and
     Querying, Semantic Web 10 (2019). doi:10.3233/SW-190370.
 [9] M. Kejriwal, Knowledge Graphs: A Practical Review of the Research Landscape, Informa-
     tion 13 (2022). doi:10.3390/info13040161.
[10] X. Chen, S. Jia, Y. Xiang, A Review: Knowledge Reasoning Over Knowledge Graph, Expert
     Systems with Applications 141 (2020). doi:10.1016/j.eswa.2019.112948.
[11] S. Ji, S. Pan, E. Cambria, P. Marttinen, P. S. Yu, A Survey on Knowledge Graphs: Rep-
     resentation, Acquisition, and Applications, IEEE Transactions on Neural Networks and
     Learning Systems 33 (2022). doi:10.1109/TNNLS.2021.3070843.
[12] Q. Liu, Y. Li, H. Duan, Y. Liu, Z. Qin, Knowledge graph construction techniques, Journal of
     Computer Research and Development 53 (2016). doi:10.7544/issn1000-1239.2016.
     20148228.
[13] Q. Wang, Z. Mao, B. Wang, L. Guo, Knowledge Graph Embedding: A Survey of Approaches
     and Applications, IEEE Transactions on Knowledge and Data Engineering 29 (2017).
     doi:10.1109/TKDE.2017.2754499.
[14] Z. Ye, Y. J. Kumar, G. O. Sing, F. Song, J. Wang, A comprehensive survey of graph neural
     networks for knowledge graphs, IEEE Access 10 (2022). doi:10.1109/ACCESS.2022.
     3191784.
[15] L. Wu, Y. Chen, K. Shen, X. Guo, H. Gao, S. Li, J. Pei, B. Long, Graph Neural Networks for
     Natural Language Processing: A Survey, Foundations and Trends in Machine Learning 16
     (2023). doi:10.1561/2200000096.
[16] S. Zhang, H. Tong, J. Xu, R. Maciejewski, Graph Convolutional Networks: Algorithms,
     Applications and Open Challenges, Springer International Publishing, Cham, 2018. doi:10.
     1007/978-3-030-04648-4\_7.
[17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph Attention
     Networks, in: International Conference on Learning Representations, 2018. URL: https:
     //openreview.net/forum?id=rJXMpikCZ.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polo-
     sukhin, Attention is all you need, Advances in Neural Information Processing Systems 30
     (2017).
[19] F. Radulovic, N. Mihindukulasooriya, R. García-Castro, A. Gómez-Pérez, A comprehensive
     quality model for linked data, Semantic Web 9 (2018). doi:10.3233/SW-170267.
[20] M. R. A. Rashid, G. Rizzo, M. Torchiano, N. Mihindukulasooriya, O. Corcho, R. García-
     Castro, Completeness and consistency analysis for evolving knowledge bases, Journal of
     Web Semantics 54 (2019). doi:10.1016/j.websem.2018.11.004.
[21] B. Jin, G. Liu, C. Han, M. Jiang, H. Ji, J. Han, Large language models on graphs: A
     comprehensive survey, arXiv preprint arXiv:2312.02783 (2023).
[22] J. Liu, C. Yang, Z. Lu, J. Chen, Y. Li, M. Zhang, T. Bai, Y. Fang, L. Sun, P. S. Yu, et al., Towards
     graph foundation models: A survey and beyond, arXiv preprint arXiv:2310.11829 (2023).
[23] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying large language models and
     knowledge graphs: A roadmap, IEEE Transactions on Knowledge and Data Engineering
     (2024). doi:10.1109/TKDE.2024.3352100.
[24] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan,
     R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
     S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei,
     Language models are few-shot learners, Advances in Neural Information Process-
     ing Systems 33 (2020). URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/
     1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[25] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, C. A. Raffel, Few-shot
     parameter-efficient fine-tuning is better and cheaper than in-context learning, Advances
     in Neural Information Processing Systems 35 (2022).
[26] Q. Guo, Z. Jin, X. Qiu, W. Zhang, D. Wipf, Z. Zhang, CycleGT: Unsupervised graph-to-
     text and text-to-graph generation via cycle training, in: T. Castro Ferreira, C. Gardent,
     N. Ilinykh, C. van der Lee, S. Mille, D. Moussallem, A. Shimorina (Eds.), Proceedings of
     the 3rd International Workshop on Natural Language Generation from the Semantic Web
     (WebNLG+), Association for Computational Linguistics, Dublin, Ireland (Virtual), 2020.
     URL: https://aclanthology.org/2020.webnlg-1.8.
[27] Z. Jin, Q. Guo, X. Qiu, Z. Zhang, GenWiki: A dataset of 1.3 million content-sharing
     text and graphs for unsupervised graph-to-text generation, in: D. Scott, N. Bel, C. Zong
     (Eds.), Proceedings of the 28th International Conference on Computational Linguistics,
     International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020.
     doi:10.18653/v1/2020.coling-main.217.
[28] L. Wang, Y. Li, O. Aslan, O. Vinyals, WikiGraphs: A Wikipedia text - knowledge graph
     paired dataset, in: A. Panchenko, F. D. Malliaros, V. Logacheva, A. Jana, D. Ustalov,
     P. Jansen (Eds.), Proceedings of the Fifteenth Workshop on Graph-Based Methods for
     Natural Language Processing (TextGraphs-15), Association for Computational Linguistics,
     Mexico City, Mexico, 2021. doi:10.18653/v1/2021.textgraphs-1.7.
[29] A. Colas, A. Sadeghian, Y. Wang, D. Z. Wang, Eventnarrative: A large-scale event-centric
     dataset for knowledge graph-to-text generation, in: Thirty-fifth Conference on Neural
     Information Processing (NeurIPS 2021) Track on Datasets and Benchmarks, 2021.
[30] G. Frisoni, G. Moro, L. Balzani, Text-to-text extraction and verbalization of biomedical
     event graphs, in: Proceedings of the 29th International Conference on Computational
     Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic
     of Korea, 2022. URL: https://aclanthology.org/2022.coling-1.238.
[31] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summa-
     rization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004.
     URL: https://www.aclweb.org/anthology/W04-1013.
[32] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettle-
     moyer, BART: Denoising sequence-to-sequence pre-training for natural language genera-
     tion, translation, and comprehension, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.),
     Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
     Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.
     703.
[33] G. Rossiello, M. F. M. Chowdhury, N. Mihindukulasooriya, O. Cornec, A. M. Gliozzo,
     Knowgl: Knowledge generation and linking from text, in: The Thirty-Seventh AAAI
     Conference on Artificial Intelligence, AAAI Press, 2023, pp. 16476–16478. doi:10.1609/
     aaai.v37i13.27084.
[34] P.-L. Huguet Cabot, R. Navigli, REBEL: Relation extraction by end-to-end language
     generation, in: Findings of the Association for Computational Linguistics: EMNLP 2021,
     Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021. URL:
     https://aclanthology.org/2021.findings-emnlp.204.
[35] Z. Chen, A. H. Cano, A. Romanou, A. Bonnet, K. Matoba, F. Salvi, M. Pagliardini, S. Fan,
     A. Köpf, A. Mohtashami, et al., Meditron-70b: Scaling medical pretraining for large
     language models, arXiv preprint arXiv:2311.16079 (2023).
[36] W. Lian, B. Goodson, G. Wang, E. Pentland, A. Cook, C. Vong, "Teknium", MistralOrca:
     Mistral-7B Model Instruct-tuned on Filtered OpenOrcaV1 GPT-4 Dataset, HuggingFace
     repository (2023).
[37] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, A. Awadallah, Orca: Progressive
     learning from complex explanation traces of gpt-4, arXiv preprint arXiv:2306.02707 (2023).
[38] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph,
     J. Wei, et al., The flan collection: Designing data and methods for effective instruction
     tuning, arXiv preprint arXiv:2301.13688 (2023).
[39] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language
     processing, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on
     Empirical Methods in Natural Language Processing: System Demonstrations, Association
     for Computational Linguistics, 2020. doi:10.18653/v1/2020.emnlp-demos.6.
[40] I. Balazevic, C. Allen, T. Hospedales, Multi-relational poincaré graph embeddings, Ad-
     vances in Neural Information Processing Systems 32 (2019). URL: https://proceedings.
     neurips.cc/paper_files/paper/2019/file/f8b932c70d0b2e6bf071729a4fa68dfc-Paper.pdf.
[41] I. Chami, A. Wolf, D.-C. Juan, F. Sala, S. Ravi, C. Ré, Low-dimensional hyperbolic knowledge
     graph embeddings, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of
     the 58th Annual Meeting of the Association for Computational Linguistics, Association
     for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.617.
[42] M. Sclar, Y. Choi, Y. Tsvetkov, A. Suhr, Quantifying language models’ sensitivity to spurious
     features in prompt design or: How i learned to start worrying about prompt formatting,
     arXiv preprint arXiv:2310.11324 (2023).
[43] J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou,
     Chain-of-thought prompting elicits reasoning in large language models, Advances in
     Neural Information Processing Systems 35 (2022). URL: https://proceedings.neurips.cc/
     paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[44] J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Pat-
     wary, Y. Yang, Y. Zhou, Deep learning scaling is predictable, empirically, arXiv preprint
     arXiv:1712.00409 (2017).
[45] L. Bertolini, J. Weeds, D. Weir, Testing large language models on compositionality and
     inference with phrase-level adjective-noun entailment, in: N. Calzolari, C.-R. Huang,
     H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji,
     S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H.
     Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics,
     International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022.
     URL: https://aclanthology.org/2022.coling-1.359.


A. Fine-tuning hyper-parameters
Hyper-parameters, with their respective values, adopted for our experiments with the encoder-
decoder models.
Table 2
Hyper-parameters used in fine-tuning. Where unspecified, default values in Hugging Face’s trainer
class apply.
                     Hyper-parameter                                Value
                     Seed                                             42
                     Evaluation Strategy                            epoch
                     Epochs                                           10
                     Warm-up steps                                    10
                     Validation metric                           eval_rouge1
                     Calculate generative metrics (i.e. Rouge)       True
                     Optimizer                                     AdamW
                     Learning rate                                  5e-05
                     Weight decay                                    0.01
                     ADAM 𝛽1                                          0.9
                     ADAM 𝛽2                                        0.999
                     ADAM 𝜖                                         1e-08
                     Label smoothing factor                           0.1
                     Train batch size                                 24
                     Validation batch size                            24
                     Group samples of similar length                 True
                     16-bit precision training                       True
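For reference, the sketch below shows how the values in Table 2 can be mapped onto Hugging Face's Seq2SeqTrainingArguments; the output directory is a placeholder, and the save-strategy and best-model flags are assumptions implied by the model selection described in Section 2 rather than entries of Table 2.

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="finetuning-output",        # placeholder
        seed=42,
        evaluation_strategy="epoch",
        num_train_epochs=10,
        warmup_steps=10,
        metric_for_best_model="eval_rouge1",
        predict_with_generate=True,            # calculate generative metrics (i.e. Rouge)
        learning_rate=5e-05,
        weight_decay=0.01,
        adam_beta1=0.9,
        adam_beta2=0.999,
        adam_epsilon=1e-08,
        label_smoothing_factor=0.1,
        per_device_train_batch_size=24,
        per_device_eval_batch_size=24,
        group_by_length=True,                  # group samples of similar length
        fp16=True,                             # 16-bit precision training
        save_strategy="epoch",                 # assumption: implied by best-model selection (Section 2)
        load_best_model_at_end=True,           # assumption: implied by best-model selection (Section 2)
    )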