<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CEAR: Automatic construction of a knowledge graph of chemical entities and roles from scientific literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Langer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Neuhaus</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Nürnberger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Otto-von-Guericke University Magdeburg</institution>
          ,
          <addr-line>Universitätsplatz 2, 39106 Magdeburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Ontologies are formal representations of knowledge in specific domains that provide a structured framework for organizing and understanding complex information. Creating ontologies, however, is a complex and time-consuming endeavor. ChEBI is a well-known ontology in the field of chemistry, which provides a comprehensive resource for defining chemical entities and their properties. However, it covers only a small fraction of the rapidly growing knowledge in chemistry and does not provide references to the scientific literature. To address this, we propose a methodology that involves augmenting existing annotated text corpora with knowledge from ChEBI and fine-tuning a large language model (LLM) to recognize chemical entities and their roles in scientific text. Our experiments demonstrate the effectiveness of our approach. By combining ontological knowledge and the language understanding capabilities of LLMs, we achieve high precision and recall rates in identifying both the chemical entities and roles in scientific literature. Furthermore, we extract them from a set of 8,000 ChemRxiv articles, and apply a second LLM to create a knowledge graph (KG) of chemical entities and roles (CEAR), which provides complementary information to ChEBI, and can help to extend it.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge graphs</kwd>
        <kwd>ontologies</kwd>
        <kwd>large language models</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>ChEBI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our approach involves automatically augmenting manually annotated text corpora with
information from ChEBI, using two distinct LLMs to identify and associate chemical roles and
entities, and creating a knowledge graph based on ChEBI that contains information from
research texts which is not annotated in ChEBI. We make the methodology and the resulting
knowledge graph (KG) available to the research community as a basis for developing utilities to
efficiently explore and structure any given set of chemistry research texts and to help with the
task of extending ChEBI.</p>
      <p>This paper is organized as follows: In Section 2 we provide an overview of ChEBI, and
methods used to create biochemical knowledge graphs and scholarly knowledge graphs, which
are both relevant to our research. Section 3 outlines the steps involved in creating the KG.
Here we explain our approach, providing a clear and reproducible process for others in the
community to follow. In Section 4, we discuss our results for the different steps in the KG creation
process and the final KG. Finally, Section 5 proposes some applications of our methods and
outlines future work on this project.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The SmartProSys research initiative aims to replace fossil raw materials in chemical production
with renewable carbon sources, thus contributing to a carbon-neutral society. The transition to
sustainable and circular production processes requires research into novel chemical reaction
pathways that lead from renewable raw materials via energy-efficient and low-CO2 synthesis
processes to green products. The task of identifying such pathways requires the collective
chemical knowledge of the world to be searched and structured in a methodical, systematic and
targeted manner. This knowledge is growing rapidly: the ChemRxiv platform, launched in 2017,
already contains more than 20,000 research papers on chemistry. In addition, there are journals
such as the International Journal of Molecular Sciences, which published more than 20,000
scientific articles in 2022, of which about 30-35% are in the field of biochemistry [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] emphasizes that the first step in designing an effective knowledge representation system
and vocabulary is to perform an effective ontological analysis of the field or domain, and that
ontologies enable knowledge sharing.
      </p>
      <p>
        ChEBI is a database and ontology for chemical entities of biological interest. In its November
2012 release, it contained nearly 30,000 fully annotated entities, all of which were added by
expert annotators [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In 2024, ChEBI contains almost 218,000 entities, of which more than
60,000 were fully annotated by ChEBI curators. However, the content of ChEBI is still very
limited, when compared to data sources like PubChem with information on nearly 317 million
substances and 118 million compounds<fn id="fn1"><label>1</label><p>https://pubchem.ncbi.nlm.nih.gov/docs/statistics, accessed on April 16, 2024</p></fn>.
      </p>
      <p>Knowledge graphs, on the other hand, are a powerful tool for representing and querying
complex, interrelated data. They are essentially a network of entities (nodes) and their
interrelations (edges). The relationship between ontologies and knowledge graphs is complementary.
Ontologies provide a well-defined, interconnected vocabulary, while knowledge graphs populate
this vocabulary with specific real-world data instances.</p>
      <p>
        Scholarly Knowledge Graphs (SKG) are structured, semantic representations of scientific data.
In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a comprehensive review is given on the field of applying machine learning, rule-based
learning, and natural language processing tools and approaches to both construct SKGs and
utilize them. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses a semi-supervised extraction approach to construct a KG from scientific
text. It contains nodes of research papers with edges for citations between them. Relevant
(candidate) sentences from the represented research papers are classified as aim, method or
result and added as nodes to the SKG. Relations connect the corresponding paper nodes to the
extracted sentences, using the classified type of the sentences as type for the relations. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
constructs knowledge graphs on COVID-19 related scientific text and creates nodes for drugs,
diseases, genes and organisms. For entity extraction they use CORD-NER, a dataset with entities
of the Unified Medical Language System (UMLS) annotated using distant supervision [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Other existing KGs are closely related to biomedical sciences. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] describes a method to
construct a knowledge graph in four steps: triple extraction, triple filtering, concept linking,
merging of vertices and KG population. The main principle for the triple extraction is to split
the text into sentences and use a supervised open information extraction system. Triple filtering
uses term frequencies to determine important concepts and remove redundant or uninformative
information. The remaining concepts are annotated to clinical concepts in UMLS. The resulting
KG merges vertices and links the concepts to scientific papers. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] presents KGen, a
semi-automatic method that generates KGs from scientific biomedical text using a preprocessing step
that splits the text into sentences and resolves co-references and abbreviations. After a simplification process,
RDF-triples are generated using part-of-speech (POS) tagging and dependency parsing. An
existing model for Named Entity Recognition (NER) is used together with SPARQL to link
entities to medical ontologies. The resulting KG is manually evaluated by two physicians.
FORUM is a KG that links chemical entities to biomedical concepts [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It is built from
life science databases and ontologies like ChEBI, ChemOnt and PubChem and uses ontological
knowledge for automated reasoning and inference of relations between entities. Co-occurrence
analysis in scientific literature repositories like PubMed is used to estimate the strength of the
association.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>
        In our work, we create a KG for chemical entities and roles as defined in ChEBI. Chemical entities
are atoms, substances, groups and molecules and are classified as such based on shared structural
features, while roles are classified based on their activities in biological or chemical systems
or their use in applications [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Figure 1 outlines the method we use to create the KG: First,
we extract the full text from research papers and then fine-tune an LLM to identify chemical
entities and roles. Candidate sentences containing both are collected and a different LLM is
used to validate the relationship between the two. Finally, we de-duplicate and normalize both
chemical entities and roles, link them to the ChEBI ontology and create the KG. The following
subsections explain each step in detail.
      </p>
      <p>Figure 2 shows the different types of information provided by our approach. The information
that is extracted from the papers has the form &lt;chemical entity&gt; has_Role &lt;chemical
role&gt;, together with additional information about the text location that supports this triple.
Each text location consists of a specific paper, the page number in the paper, and the character
position of the sentence relative to that page number. RDF is not ideal for modeling these relations
because it does not allow annotating a triple with its source without clumsy workarounds (e.g.,
reification of triples). Thus, we plan to release a KG built using RDF-star. The current RDF
version does not include any text locations.</p>
      <sec id="sec-3-1">
        <title>3.1. Text extraction from research papers</title>
        <p>Research papers are a rich source of information. They contain author names, images, tables,
citations, bibliographies, and more. To address the challenge of extracting the papers’ full texts
efficiently, we chose a very simple approach that uses a Linux utility called
pdftotext. While it cannot identify floating objects in plain text, such as image and table
captions or footers and page numbers, it can reliably extract text from different layouts, ranging from
one-column to two-column styles.</p>
        <p>We downloaded a set of 8,000 chemistry research papers from various categories of ChemRxiv
and extracted their full text as JSON documents, including information about the page it was
extracted from. Content-based checksums ensure that no duplicates are processed, even when
crawling other sources for research papers. The checksums are also used as identifiers between
the original PDF file and the associated JSON document.</p>
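        <p>The checksum-based identification and the per-page JSON layout described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hash function (SHA-256), the JSON field names and the function names are assumptions. It relies only on the fact that pdftotext separates pages in its output with a form feed character.</p>
        <preformat><![CDATA[
```python
import hashlib
import json


def content_checksum(pdf_bytes: bytes) -> str:
    """Return a hex digest that identifies a paper by its content.

    SHA-256 is an assumption; the paper does not name the hash function.
    """
    return hashlib.sha256(pdf_bytes).hexdigest()


def pages_to_json(pdf_bytes: bytes, extracted_text: str) -> str:
    """Build one JSON document with per-page text, keyed by checksum.

    pdftotext separates pages with a form feed character, so splitting
    on it recovers the page each passage was extracted from.
    The field names ("id", "pages", "page", "text") are hypothetical.
    """
    pages = extracted_text.split("\f")
    # The output usually ends with a form feed; drop the empty tail.
    if pages and pages[-1].strip() == "":
        pages.pop()
    document = {
        "id": content_checksum(pdf_bytes),
        "pages": [
            {"page": number, "text": text}
            for number, text in enumerate(pages, start=1)
        ],
    }
    return json.dumps(document, ensure_ascii=False)


def is_duplicate(pdf_bytes: bytes, seen: set) -> bool:
    """Report whether this content was processed before, and record it."""
    digest = content_checksum(pdf_bytes)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```
]]></preformat>
        <p>Because the identifier is derived from the file content rather than from its name or URL, the same paper obtained from two different sources maps to the same JSON document.</p>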
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Chemical entity and role recognition</title>
        <p>
          Transformer-based Large Language Models (LLMs) have proven effective in understanding
language patterns and thus in Natural Language Processing (NLP) tasks such as Named Entity
Recognition (NER), which we use to identify chemical entities and roles. Approaches
such as RoBERTa or BERT use masked language modeling (MLM), where some tokens in an
input sequence are randomly masked and the model is trained to predict the original token [<xref ref-type="bibr" rid="ref12">12</xref>].
Electra models use a pre-training task called replaced token detection or token discrimination,
where instead of predicting a masked token, a discriminative model is trained to predict whether
a token in the corrupted input sequence was replaced by a generator sample. We chose this
approach because it is more sample-efficient [<xref ref-type="bibr" rid="ref13">13</xref>], and fine-tuned a pre-trained Electra model
on three different datasets:
          <list list-type="bullet">
            <list-item><p>The BC5CDR dataset consists of human annotations of chemicals, diseases and their
interactions from 1,500 PubMed articles [<xref ref-type="bibr" rid="ref14">14</xref>].</p></list-item>
            <list-item><p>The NLM-Chem corpus contains 150 full-text articles from the biomedical literature, carefully
selected for containing chemical entities which are difficult for NER tools to find. Ten
domain experts annotated the chemical entities in three annotation rounds [<xref ref-type="bibr" rid="ref15">15</xref>].</p></list-item>
            <list-item><p>The CRAFT corpus contains 97 full-text open access articles from the PubMed Central
Open Access subset. It identifies all mentions of nearly all concepts from nine prominent
biomedical ontologies, including ChEBI [<xref ref-type="bibr" rid="ref16">16</xref>].</p></list-item>
          </list>
        </p>
        <p>
          A fourth manually annotated dataset, EnzChemRED [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], provides chemical entities and proteins,
as well as conversions during chemical reactions. It is highly relevant to our NER task, but
given its recent availability, it has not yet been used for fine-tuning.
        </p>
        <p>The CRAFT corpus annotates all entities according to nine different ontologies from different
areas of interest. Chemical annotations, including chemical entities and roles, are provided
according to an extension of an older version of the ChEBI ontology. Although the NLM-Chem corpus
and the BC5CDR dataset also annotate all chemicals in the provided full texts, and although
BC5CDR annotates diseases, they do not include any chemical roles, such as ligand, acid, buffer,
or catalyst. To overcome this limitation, we used a semi-supervised approach and automatically
annotated all roles defined by their label and synonyms in ChEBI using a lexical approach. We
ignore all role strings that are shorter than four characters to avoid mislabeling identical strings
with different meanings (homonyms).</p>
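        <p>A lexical annotation of this kind can be sketched as follows. The sketch is illustrative only: the case-insensitive, whole-word matching and the names annotate_roles and MIN_ROLE_LENGTH are assumptions, and role_terms stands in for the role labels and synonyms taken from ChEBI.</p>
        <preformat><![CDATA[
```python
import re

MIN_ROLE_LENGTH = 4  # shorter role strings are skipped to avoid homonyms


def annotate_roles(text, role_terms):
    """Return (start, end, matched_text) spans of role terms in text.

    role_terms stands in for the labels and synonyms of ChEBI roles.
    Matching is case-insensitive and restricted to whole words; both
    choices are assumptions of this sketch.
    """
    terms = sorted(
        {t for t in role_terms if len(t) >= MIN_ROLE_LENGTH},
        key=len,
        reverse=True,  # prefer the longest term when several could match
    )
    if not terms:
        return []
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]
```
]]></preformat>
        <p>With the term list ["dye", "buffer", "catalyst"], the three-character term "dye" is never annotated, and a plural such as "buffers" is matched only if it is itself listed as a label or synonym.</p>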
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Link Validation</title>
        <p>We applied the fine-tuned Electra model to the extracted text of the chemistry research papers,
collecting all sentences that contained at least one chemical entity and at least one chemical
role (Figure 3). For each sentence, we store the exact text location and the inferred chemical
entities and roles.</p>
        <p>
          The co-occurrence of chemical entities and roles within the same text block suggests that the
chemical entity may have this specific role. However, this correlation alone is not sufficient to
draw a definitive conclusion. To address this, we use another LLM to verify the role of a chemical
entity based on the given contextual information. LLAMA 2 is a collection of pre-trained and
fine-tuned LLMs ranging in size from 7 billion to 70 billion parameters. LLAMA 2-CHAT is
specifically trained for conversational tasks using reinforcement learning with human feedback
(RLHF) [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>In this paper we used LLAMA-2-7b-CHAT and split the prompt into:
          <list list-type="bullet">
            <list-item><p>a system prompt that defines the role of the LLM and makes sure that it simply confirms
or rejects the relation between chemical entity and chemical role without any further
explanations or other context that could complicate the parsing of the answer. For this
paper we used:
              <preformat>system_prompt = "Do you agree with the provided question? Please answer with one word, either 'yes' or 'no'."</preformat></p></list-item>
            <list-item><p>a user prompt that presents the context to the LLM along with the question whether,
according to the given context, a specific chemical entity has a specific role. For this paper
we used:
              <preformat>user_prompt = f"In the sentence '{sentence}': Is {chemical} explicitly described as {role}?"</preformat></p></list-item>
          </list>
A temperature hyperparameter of 0.1 and a top-p of 0.95 keep the sampling close to deterministic,
which makes the results largely reproducible. All confirmed relations, as well as the associated sentence
location, the chemical entity, and the role, are collected for the construction of the KG, while the
remaining discarded relations are stored for analysis. Figure 4 shows how LLAMA-2 answers
the questions whether trans-b-methylstyrene or NAOH is described as cofactor in the given
sentence (see the first sentence in Figure 3 for a visualization of the same sentence with its
chemical entities and roles rendered in red and blue).</p>
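        <p>The validation loop over a candidate sentence can be sketched as follows. The two prompts are the ones quoted in this section; everything else is an assumption made for illustration. In particular, ask_llm is a hypothetical callable that wraps the chat model, and treating an unparseable reply as a rejection is a design choice of this sketch rather than something stated in the paper.</p>
        <preformat><![CDATA[
```python
from itertools import product

SYSTEM_PROMPT = (
    "Do you agree with the provided question? Please answer with one "
    "word, either 'yes' or 'no'."
)


def build_user_prompt(sentence, chemical, role):
    """Fill the user prompt template for one chemical/role pair."""
    return (
        f"In the sentence '{sentence}': Is {chemical} explicitly described "
        f"as {role}?"
    )


def parse_answer(answer):
    """Map a model reply to True (yes), False (no) or None (unparseable)."""
    words = answer.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!:;'\"")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def validate_links(sentence, chemicals, roles, ask_llm):
    """Check every chemical/role combination found in one sentence.

    ask_llm is a hypothetical callable (system_prompt, user_prompt) -> str.
    Replies that are not a clear "yes" are counted as rejections.
    """
    confirmed, rejected = [], []
    for chemical, role in product(chemicals, roles):
        reply = ask_llm(SYSTEM_PROMPT, build_user_prompt(sentence, chemical, role))
        if parse_answer(reply) is True:
            confirmed.append((chemical, role))
        else:
            rejected.append((chemical, role))
    return confirmed, rejected
```
]]></preformat>
        <p>Since all combinations are checked, a sentence with two chemical entities and three roles produces six questions, which is why the number of checked relations can far exceed the number of candidate sentences.</p>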
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Knowledge graph creation</title>
        <p>From the confirmed relationships, we group all chemical entities and roles using the labels and
synonyms from ChEBI. If an entity is not part of ChEBI, we use its original appearance in the
text. For this we only use chemical entities and roles with a character length of at least 2. For
each pair of chemical entity and role, we count how many references to specific text locations
exist. A higher frequency of occurrence of a relation increases our confidence in both its correct
identification in the research text and the correctness of its meaning. At the same time, it also
reduces the novelty of the identified information. A hyperparameter minRef, which discards
all relations that are referenced fewer than minRef times, can be used to increase precision at the
expense of recall or vice versa.</p>
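        <p>The counting and filtering step can be sketched as follows. This is a minimal illustration under two assumptions: entity and role names have already been normalized to one label per group, and each text location is counted at most once per relation. The function name filter_relations is hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def filter_relations(confirmed, min_ref):
    """Count text locations per (entity, role) pair and apply minRef.

    confirmed is an iterable of (entity, role, location) tuples, where
    location is any hashable description of the supporting sentence,
    for example (paper_id, page, character_position).
    """
    # Count each supporting location once per relation (an assumption).
    unique = set(confirmed)
    counts = Counter((entity, role) for entity, role, _ in unique)
    # Keep only relations referenced at least min_ref times.
    return {pair: n for pair, n in counts.items() if n >= min_ref}
```
]]></preformat>
        <p>With min_ref set to 1 every confirmed relation is kept; larger values keep only relations that are supported by several text locations.</p>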
        <p>The knowledge graph consists of the described relations. It is stored using the Terse
RDF Triple Language (Turtle). All chemical entities (obo:CHEBI_24431) and roles
(obo:CHEBI_50906) that are known to ChEBI are defined using their ChEBI identifier.</p>
        <p>Chemical entities or roles that are unknown to ChEBI are defined in the cear: namespace
(@prefix cear: &lt;https://wwwiti.cs.uni-magdeburg.de/iti_dke/cear/&gt; .). The property
obo:RO_0000087 (has role) is used in ChEBI to define roles of chemical entities.</p>
        <p>The following listing shows an example for two chemical entities, ethylene glycol
bis(2-aminoethyl)tetraacetate and PBS, both of which have the role buffer:
          <preformat>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix obo: &lt;http://purl.obolibrary.org/obo/&gt; .
@prefix cear: &lt;https://wwwiti.cs.uni-magdeburg.de/iti_dke/cear/&gt; .

obo:CHEBI_35225 rdf:type obo:CHEBI_50906 .
obo:CHEBI_35225 rdfs:label "buffer" .

obo:CHEBI_30741 rdf:type obo:CHEBI_24431 .
obo:CHEBI_30741 rdfs:label "ethylene glycol bis(2-aminoethyl)tetraacetate" .
obo:CHEBI_30741 obo:RO_0000087 obo:CHEBI_35225 .

cear:chem_4023 rdf:type obo:CHEBI_24431 .
cear:chem_4023 rdfs:label "PBS" .
cear:chem_4023 obo:RO_0000087 obo:CHEBI_35225 .</preformat></p>
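        <p>Triples of the shape shown in the listing can be generated with a few string templates. The following sketch is an illustration only and assumes that identifiers are already available as prefixed names; the helper names are hypothetical, and labels containing line breaks would need additional escaping.</p>
        <preformat><![CDATA[
```python
CHEMICAL_ENTITY = "obo:CHEBI_24431"  # class used for chemical entities
ROLE = "obo:CHEBI_50906"             # class used for roles
HAS_ROLE = "obo:RO_0000087"          # property linking an entity to a role


def escape_literal(label):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return label.replace("\\", "\\\\").replace('"', '\\"')


def node_triples(identifier, label, node_class):
    """Return the type and label triples for one entity or role node."""
    return [
        f"{identifier} rdf:type {node_class} .",
        f'{identifier} rdfs:label "{escape_literal(label)}" .',
    ]


def relation_triple(entity_id, role_id):
    """Return the triple stating that an entity has a role."""
    return f"{entity_id} {HAS_ROLE} {role_id} ."
```
]]></preformat>
        <p>For example, node_triples("cear:chem_4023", "PBS", CHEMICAL_ENTITY) followed by relation_triple("cear:chem_4023", "obo:CHEBI_35225") yields the last three lines of the listing.</p>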
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Chemical entity and role recognition</title>
        <p>
          In section 3.2, we discussed how we used the BC5CDR corpus, the NLM-Chem corpus and the
CRAFT corpus to fine-tune our Electra model for NER. As in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], we counted a prediction as a
true positive only if both the start and end locations of the characters of the complete entity
exactly matched. This is a very strict definition, since the complexity of chemical entities makes
it difficult to identify exact boundaries of entities or word tokens, for example: dipotassium
2-alkylbenzotriazolyl bis(trifluoroborate)s, 4,7-dibromo-2-octyl-2,1,3-benzotriazole [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
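        <p>The strict matching criterion can be sketched as follows. This is a minimal illustration, not the evaluation code used for the experiments: annotations are assumed to be (start, end, label) tuples, and requiring the label to match in addition to the boundaries is an assumption of this sketch.</p>
        <preformat><![CDATA[
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_match_scores(gold, predicted):
    """Precision, recall and F1 over (start, end, label) spans.

    A prediction is a true positive only if start, end and label all
    match a gold annotation exactly; a span that is off by a single
    character counts as both a false positive and a false negative.
    """
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall, f1_score(precision, recall)
```
]]></preformat>
        <p>The F1 values reported below follow directly from this definition, for example a precision of 85.2 % and a recall of 77.5 % give an F1-measure of 81.2 %.</p>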
        <p>Table 1 shows the precision, recall and F-measure of models fine-tuned on one or more
of the corpora. We have included cross-corpus evaluation data, and we can see that a model
fine-tuned on the NLM-Chem or BC5CDR corpus performs very poorly when evaluated on the
CRAFT corpus. Similarly, when a model fine-tuned using CRAFT is evaluated on NLM-Chem,
the results are very poor. This indicates a lack of generalizability across datasets. Table 2 shows
the reason by listing the ten most frequent misclassifications. All of the text corpora were
manually curated to annotate all chemical entities contained in the texts. However, despite
their common goal, they show discrepancies in annotation. For example, the chemical entities
"DNA", "RNA" and "mRNA" are annotated in the CRAFT corpus, but not in the NLM-Chem
corpus, hence the false negatives. The character "b", which appears as a false positive when a
model fine-tuned on NLM-Chem is evaluated on CRAFT, is used in genetics to describe base
pairs of DNA or RNA. Similarly, "PBS" is marked as a chemical entity in the NLM-Chem corpus,
but it is not annotated in CRAFT. This illustrates how, depending on the context or background of
the annotators, or depending on their research goals, there may be disagreement about which
entities are considered chemical and which are not. While, for instance, a person working
in criminal forensics might treat DNA as a chemical and focus on ways to identify it in a
given substance, a biologist might treat DNA more as a biological concept. Therefore, the poor
out-of-distribution performance is unavoidable: Even if a model could be created that perfectly
aligns with the opinions of the NLM-Chem experts, the CRAFT experts might not agree.</p>
        <p>
          [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] reports a precision of 81.0 %, a recall of 71.1 % and an F1-measure of 75.7 % in identifying
chemical entities when fine-tuned on both the NLM-Chem and the BC5CDR corpus and
evaluated on NLM-Chem using BlueBERT (italic results in the table). Our results demonstrate a better
precision of 85.2 %, a better recall of 77.5 %, and consequently, a better F1-measure of 81.2 %
(bold results in the table). However, when all corpora are employed for fine-tuning the LLM, the
recall rate drops to 71.2 %. We attribute this deterioration to the described disagreement between
different groups of annotators. Since we want to provide a comprehensive understanding of
chemical entities and their roles in our KG, we still use the model fine-tuned on all corpora for the subsequent steps.
        </p>
        <p>Since we only lexically annotated roles from ChEBI in the NLM-Chem corpus with a minimum
length of 4 characters (see Section 3.2), "dye" is one of the most common false positive roles
when evaluating a model fine-tuned with CRAFT on the NLM-Chem corpus.<fn id="fn2"><label>2</label><p>Experiments with a minimum length of 3 characters led to a large drop in both precision and recall when evaluated on the CRAFT corpus.</p></fn> From the results
we can still see high precision and recall rates for roles when a model that is fine-tuned on
NLM-Chem and BC5CDR is evaluated on CRAFT. The same applies to models fine-tuned using
all corpora. This demonstrates that the described semi-supervised lexical approach is effective.</p>
        <p>In CRAFT, chemical roles are annotated only if they appear as nouns, but not if they are
paraphrased with other words. Similarly, our lexical approach for both the BC5CDR and the
NLM-Chem corpus considers only nouns. Figure 5 shows some manually annotated text from
the CRAFT corpus, with chemical roles rendered in blue and chemical entities in red. It shows
that solvent is annotated as a role, while dissolved and redissolved are not. While this may be
correct from an annotator’s point of view, it limits the expressiveness of the current version of
our KG.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Link Validation and Knowledge Graph Construction</title>
        <p>After applying the LLAMA-2 model for the validation of links between chemical entities and
roles, and after grouping and applying the minRef hyperparameter as discussed in Section 3.4,
two representations of the resulting KG are available: an RDF representation and a graph
representation for HTML that shows chemical entities and roles as nodes, and the has_role
relation as an edge connecting these nodes. Figure 6 shows a sample graph generated on a small
subset of the actual 8,000 papers, with a minRef hyperparameter of 10. The dark red nodes
represent chemical entities available in ChEBI, while the light red nodes represent additional
chemical entities unknown to ChEBI. Similarly, the dark blue nodes represent chemical roles
available in ChEBI, and the light blue nodes represent other chemical roles. The edges are
labeled with the frequency with which a given relation is mentioned in the literature set. To
improve the visual clarity of the graph, we have adjusted the colors of the edges based on these
numbers. The darker an edge appears, the stronger the relation between the chemical entity
and the role in our literature. Please be aware that due to the settings for minRef, all relations
with a frequency lower than 10 are ignored. Consequently, this graph shows only a very limited
number of very common chemical entities with their roles in a small set of research papers.</p>
        <p>To determine associations between chemical entities and roles, we applied the LLAMA-2
model to 115,537 candidate sentences that contained at least one chemical entity and one role.
During this step, 58,511 relations were confirmed and 272,053 were rejected. The number of
candidate sentences is not the sum of confirmed and rejected relations, because each sentence
can have multiple chemical entities and roles and we check all combinations.</p>
        <p>Table 3 shows the most and the least frequent relations between chemical entities and roles
in our set of texts. For example, water was described as a solvent in 1,085 sentences out of
our 8,000 research papers. We can see that almost all of the chemical entities and roles of the
top relations are already annotated in ChEBI. The least frequent relations mostly show CEAR
chemical entities (which are unknown to ChEBI). For better visibility we have marked them
in bold. Please note that we could not group CEAR entities, because we do not know about
their synonyms. This fact leads to an under-representation of CEAR chemical entities and roles
in the high-frequency relations of our results. Please also note that the role "buffers" was not
identified as a ChEBI role: While some roles, such as "solvent" or "ligand", are annotated with
their plural forms as a synonym in ChEBI, "buffer" is not.</p>
        <p>Table 4 shows some information about the KG when created using different settings for
minRef.<fn id="fn3"><label>3</label><p>All versions of the KG can be downloaded at: https://wwwiti.cs.uni-magdeburg.de/iti_dke/cear/.</p></fn> We can see that if we increase the minRef hyperparameter to only 2, the number
of relations, relevant text positions, distinct chemical entities and roles decreases drastically.
Raising minRef effectively trades recall and novelty for better precision and a higher rate of
well-known facts.</p>
        <p>The prompts used to confirm or reject relationships using LLAMA-2 also have a big impact on
the results. After modifying the system prompt slightly from asking for "one word, either
'yes' or 'no'" as an answer to asking for only "one word", we had 12 times fewer chemical
entities and roles and 2.3 times fewer relationships between them. Adding additional text to
the system prompt, such as "You are an expert in chemistry", sometimes changed the
answer to include long explanations about why the answer was "yes" or "no". Changing the
user prompt to consider only information described in the sentence, which is what we want
when constructing a KG from research papers, resulted in 2.3 times fewer confirmed relations
and 2.1 times fewer chemical entities and roles. For this paper we decided to use very restrictive
questions in the hope of obtaining a KG with higher precision.</p>
        <p>
          In order to evaluate the overall quality of the constructed KG, three methods can be used:
Gold standard-based evaluation, manual evaluation with domain experts and annotators, and
application-based evaluation with competency questions [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The latter involves asking
questions and answering them using the constructed KG.
        </p>
        <p>We are currently assessing the following two ideas:
          <list list-type="bullet">
            <list-item><p>Automatic evaluation using gold standards: An existing KG or ontology can be used as a
gold standard, in combination with automated reasoning. However, to the best of our knowledge,
there are no gold standards in the literature for evaluating triples extracted from unstructured
texts about chemical entities and their roles. Even for existing chemical entities and
roles in ChEBI, the relations between them are not fully annotated. We are currently
researching whether we can use a combination of ChEBI and PubChem or other databases
to get meaningful evaluation results.</p></list-item>
            <list-item><p>Manual evaluation with domain experts: Precision can be determined by letting experts
evaluate the rejected and confirmed relations between chemical entities and their roles
in the collected sentences. To determine recall of the final KG, experts would need to
manually annotate all relations between chemical entities and their roles in a fixed set
of scientific texts. This task is not trivial and involves decisions such as whether to
consider only nouns (as in the CRAFT corpus) or also verbs describing a specific role
(e.g., "dissolved" for "solvent"), or whether to use intrinsic knowledge about chemical
entities.</p></list-item>
          </list>
        </p>
        <p>Although the resulting KG looks very promising, it is not yet possible to provide a reliable
measure of its quality. We are currently annotating true and false relations in a set of candidate sentences.
This will enable the evaluation of different prompts or different LLMs, as well as different settings
for minRef.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we have shown how to create a KG, which is linked to ChEBI, using the same
vocabulary and extending it with knowledge from chemistry research papers. We see several
applications for our approach:</p>
      <p>Our KG can assist in extending the ChEBI ontology by suggesting chemical entities and roles
that are not part of it.<sup>4</sup> Table 5 shows the 10 most frequent relations with chemical entities
and chemical roles not annotated in ChEBI. For example, PBS (phosphate-buffered saline) was
correctly identified as a buffer 249 times in our set of 8,000 research papers. All text locations
(the research paper, the page, and the character position of the relevant sentence) are available
and can be used for reference. Future versions of CEAR will incorporate them using RDF-star.
Extending the scope to larger collections of chemistry research papers can increase the number
of results for chemical entities and relations that are not annotated in ChEBI, thereby enhancing
the usefulness of the KG.</p>
      <p>Furthermore, we are developing exploration utilities for working with chemistry research
papers. By detecting chemical entities and their roles, we can, for example, highlight them in
the papers and direct users to ChEBI or PubChem for additional information.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>This work was supported by the Research Initiative "SmartProSys: Intelligent Process Systems
for the Sustainable Production of Chemicals" funded by the Ministry for Science, Energy, Climate
Protection and the Environment of the State of Saxony-Anhalt.</p>
      <p><sup>4</sup> For enhanced visibility, the corresponding namespace "CEAR" in Table 5 has been highlighted in bold.</p>
      <p>The Turtle representation of the KG (using a minRef hyperparameter of 2) is available
at: https://wwwiti.cs.uni-magdeburg.de/iti_dke/cear/cear.ttl
Other versions with different settings for minRef can be viewed and downloaded at:
https://wwwiti.cs.uni-magdeburg.de/iti_dke/cear/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Supuran</surname>
          </string-name>
          ,
          <article-title>Progress of section “biochemistry” in 2022</article-title>
          ,
          <year>2023</year>
          . URL: https://www.mdpi.com/1422-0067/24/6/5873. doi:10.3390/ijms24065873.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Josephson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Benjamins</surname>
          </string-name>
          ,
          <article-title>What are ontologies, and why do we need them?</article-title>
          ,
          <source>IEEE Intelligent Systems and their Applications</source>
          <volume>14</volume>
          (
          <year>1999</year>
          )
          <fpage>20</fpage>
          -
          <lpage>26</lpage>
          . doi:10.1109/5254.747902.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hastings</surname>
          </string-name>
          , P. de Matos,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dekker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ennis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Harsha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muthukrishnan</surname>
          </string-name>
          , G. Owen,
          <string-name>
            <given-names>S.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Steinbeck</surname>
          </string-name>
          ,
          <article-title>The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>41</volume>
          (
          <year>2012</year>
          )
          <fpage>D456</fpage>
          -
          <lpage>D463</lpage>
          . doi:10.1093/nar/gks1146.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batish</surname>
          </string-name>
          ,
          <article-title>Scholarly knowledge graphs through structuring scholarly communication: a review</article-title>
          ,
          <source>Complex &amp; Intelligent Systems</source>
          <volume>9</volume>
          (
          <year>2023</year>
          )
          <fpage>1059</fpage>
          -
          <lpage>1095</lpage>
          . doi:10.1007/s40747-022-00806-6.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pudi</surname>
          </string-name>
          ,
          <article-title>Scalable, semi-supervised extraction of structured information from scientific literature</article-title>
          , in:
          <string-name><given-names>V.</given-names> <surname>Nastase</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Roth</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Dietz</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>McCallum</surname></string-name>
          (Eds.),
          <source>Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . URL: https://aclanthology.org/W19-2602. doi:10.18653/v1/W19-2602.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parulian</surname>
          </string-name>
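          ,
          <string-name><given-names>G.</given-names> <surname>Han</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tu</surname></string-name>,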
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chauhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. R.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pustejovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Onyshkevych</surname>
          </string-name>
          ,
          <article-title>Covid-19 literature knowledge graph construction and drug repurposing report generation</article-title>
          ,
          <year>2021</year>
          . doi:10.48550/arXiv.2007.00576.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          , J. Han,
          <article-title>Comprehensive named entity recognition on CORD-19 with distant or weak supervision</article-title>
          , CoRR abs/2003.12218 (
          <year>2020</year>
          ). doi:10.48550/arXiv.2003.12218.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Muhammad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kearney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gamble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Coenen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Williamson</surname>
          </string-name>
          ,
          <article-title>Open information extraction for knowledge graph construction</article-title>
          , in: G. Kotsis,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Tjoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Khalil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Moser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mashkoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sametinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fensel</surname>
          </string-name>
          , J. Martinez-Gil (Eds.),
          <source>Database and Expert Systems Applications</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>113</lpage>
          . doi:10.1007/978-3-030-59028-4_10.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rossanez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>dos Reis</surname>
          </string-name>
          , R. d. S. Torres, H. de Ribaupierre,
          <article-title>Kgen: a knowledge graph generator from biomedical scientific literature</article-title>
          ,
          <source>BMC Medical Informatics and Decision Making</source>
          <volume>20</volume>
          (
          <year>2020</year>
          )
          <elocation-id>314</elocation-id>
          . doi:10.1186/s12911-020-01341-5.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Delmas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Filangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paulhe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Duperier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Garrier</surname>
          </string-name>
          , P.-E. Saunier,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pitarch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jourdan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giacomoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Frainay</surname>
          </string-name>
          ,
          <article-title>FORUM: building a Knowledge Graph from public databases and scientific literature to extract associations between chemicals and diseases</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>37</volume>
          (
          <year>2021</year>
          )
          <fpage>3896</fpage>
          -
          <lpage>3904</lpage>
          . doi:10.1093/bioinformatics/btab627.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hastings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Owen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dekker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ennis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muthukrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Swainston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Steinbeck</surname>
          </string-name>
          ,
          <article-title>ChEBI in 2016: Improved services and an expanding collection of metabolites</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>44</volume>
          (
          <year>2015</year>
          )
          <fpage>D1214</fpage>
          -
          <lpage>D1219</lpage>
          . doi:10.1093/nar/gkv1031.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Burstein</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Doran</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Solorio</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers)</source>
          , Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:10.18653/V1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Electra: Pre-training text encoders as discriminators rather than generators</article-title>
          ,
          <year>2020</year>
          . doi:10.48550/arXiv.2003.10555.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>C.-H.</given-names> <surname>Wei</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Peng</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Leaman</surname></string-name>,
          <string-name><given-names>A. P.</given-names> <surname>Davis</surname></string-name>,
          <string-name><given-names>C. J.</given-names> <surname>Mattingly</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>T. C.</given-names> <surname>Wiegers</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Lu</surname></string-name>
          ,
          <article-title>Assessing the state of the art in biomedical relation extraction: overview of the biocreative v chemical-disease relation (cdr) task</article-title>
          ,
          <source>Database</source>
          <volume>2016</volume>
          (
          <year>2016</year>
          ). doi:10.1093/database/baw032.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Islamaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name><given-names>C.-H.</given-names> <surname>Wei</surname></string-name>,
          <string-name><given-names>D. C.</given-names> <surname>Comeau</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Peng</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Cissel</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Coss</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Fisher</surname></string-name>,
          et al.,
          <article-title>Nlm-chem, a new resource for chemical entity recognition in pubmed full text literature</article-title>
          ,
          <source>Scientific Data</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <elocation-id>91</elocation-id>
          . doi:10.1038/s41597-021-00875-1.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>K. B.</given-names> <surname>Cohen</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Verspoor</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Fort</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Funk</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Bada</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Palmer</surname></string-name>,
          <string-name><given-names>L. E.</given-names> <surname>Hunter</surname></string-name>,
          <article-title>The colorado richly annotated full text (craft) corpus: Multi-model annotation in the biomedical domain</article-title>
          ,
          <source>Handbook of Linguistic Annotation</source>
          (
          <year>2017</year>
          )
          <fpage>1379</fpage>
          -
          <lpage>1394</lpage>
          . doi:10.1007/978-94-024-0881-2_53.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>P.-T.</given-names> <surname>Lai</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Coudert</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Aimo</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Axelsen</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Breuza</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>de Castro</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Feuermann</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Morgat</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Pourcel</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Pedruzzi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Poux</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Redaschi</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Rivoire</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Sveshnikova</surname></string-name>,
          <string-name><given-names>C.-H.</given-names> <surname>Wei</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Leaman</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Luo</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Lu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Bridge</surname></string-name>
          ,
          <article-title>Enzchemred, a rich enzyme chemistry relation extraction dataset</article-title>
          ,
          <year>2024</year>
          . doi:10.48550/arXiv.2404.14209.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blecher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cucurull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Esiobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartshorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kardas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kerkez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khabsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kloumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korenev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Koura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liskovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mihaylov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Molybog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poulton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reizenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rungta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saladi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. E.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Kuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kambadur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stojnic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <year>2023</year>
          . doi:10.48550/arXiv.2307.09288.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>