<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Constructing CCEE: an LLM evaluation dataset for Complex Context-aware Event Extraction for gene regulatory networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frederik Labonté</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucie Flek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lamarr Institute for ML and AI</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>b-it/University of Bonn</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a first look at CCEE (Complex Context-aware Event Extraction), a novel evaluation dataset, currently in development, for context-rich gene regulatory network extraction from scientific literature. We propose an annotation scheme for cancer research papers that captures both core gene interactions and extensive contextual information across 10-14 categories per event, addressing limitations in existing datasets in order to test the construction of disease-specific biomedical knowledge graphs. Unlike previous datasets that focus primarily on entity connections of isolated triplets, CCEE links contextual attributes directly to gene regulatory events, providing a more integrated representation of scientific knowledge. We illustrate the annotation on 9 papers manually labeled by multiple experts and give a first impression of the challenges and ways to address them. Additionally, we show first evaluations of an LLM as an annotation system. While it underperforms human experts in interaction type labeling, it matches human performance in attributing entities as context to interactions.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Gene Regulatory Networks</kwd>
        <kwd>Text Annotations</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Cancer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The gene regulatory networks (GRNs) underlying diseases like cancer can be represented through
knowledge graphs (KGs), which can then be used to find therapeutic targets [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. GRNs in particular are
context dependent and highly influenced by external factors [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], making the extraction of experimental
conditions a crucial addition to the entities and relations involved. The automated
construction of GRNs remains challenging [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], even more so the extraction of regulatory events and
their additional biological context. This is due to data scarcity and the focus of existing datasets on the
direct relations between entities rather than the surrounding context, e.g. experimental conditions.
Datasets that capture connections individually can make linking them to the regulatory event
ambiguous.
      </p>
      <p>To address this, we focus on linking contextual information directly to interaction triplets. Here
we give a first look at the work in progress on the evaluation dataset for Complex Context-aware
Event Extraction (CCEE). Together with biomedical experts, we annotated 9 full-text articles on cancer,
identifying more than 200 matched events, each comprising the core interaction triplet and an additional 9 or
13 contextual categories (Table 1). We chose cancer as the domain because changes in GRNs often
lead to the disease and the number of publications is so vast that even human experts cannot keep up
with all new findings. This makes it a well-suited candidate for automated KG construction and yields longer,
more in-depth events than typical triplets.</p>
      <p>
        Since LLMs offer a potential solution for fields lacking extensive training data, including biomedical
annotations [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], we test the performance of Mistral Large 2 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], providing a preliminary qualitative
analysis of which categories of LLM annotations are effective.
      </p>
      <p>
8th International Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, 2025, June 1, Portorož, Slovenia
flabonte@uni-bonn.de (F. Labonté); flek@bit.uni-bonn.de (L. Flek)
ORCID: 0000-0002-5995-8454 (L. Flek)
      </p>
      <p>
        © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>
Multiple datasets have been developed for biomedical event extraction, each with distinct focus and
annotation approach. Examples include MLEE [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], GE11 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], CG [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and BioRED [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>The MLEE (Multi-Level Event Extraction) corpus contains 262 PubMed abstracts on angiogenesis,
with approximately 8,000 entity mentions and 6,000 events. It spans 19 entity types and 40 event types across
molecular to organism levels, capturing multi-scale biological interactions.</p>
      <p>The GE11 (GENIA Event) corpus, developed for BioNLP Shared Task 2011, includes 1,200 MEDLINE
abstracts annotated for gene expression, protein interactions, and biochemical events, serving as
a benchmark for molecular event extraction.</p>
      <p>The CG (Cancer Genetics) corpus, developed for BioNLP Shared Task 2013, contains 600 documents
with more than 17,000 events, 21,683 entity annotations, and 917 relations, focusing on
cancer-related pathological and physiological processes.</p>
      <p>The BioRED (Biomedical Relation Extraction Dataset) spans 600 PubMed abstracts, annotating
gene, disease, and chemical relations. It uniquely distinguishes novel findings from established
knowledge, enhancing relation extraction capabilities.</p>
      <p>While these datasets provide a significant basis for biomedical information extraction, they share
several limitations. With the exception of BioRED, these datasets all focus on abstracts, potentially
leaving out additional information about experimental design. Additionally, none of them focus on
capturing contextual information around annotated events. While BioRED also captures disease, species,
cell line and chemical category, it does not annotate non-chemical treatments, the level of regulation,
the tissue or general associated factors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset Creation and Annotation Methodology</title>
      <p>In contrast to the existing resources, CCEE was designed for applications that require a
deeper understanding of the context and relationships within biomedical research, such as disease KGs, and
specifically to test generative models as information extractors. The events we extract are enriched with
contextual data spanning 13 context categories for Gene-Disease (G-D) and 9 for Gene-Gene (G-G) interactions (Table 1). This allows us to
test the capability of models to detect and extract these events, which form the basis for KG construction
and automated context-aware knowledge linking. The three main differentiating factors from existing
datasets are the following.</p>
      <p>1. Full-text analysis: Unlike the abstract-focused approaches of MLEE, GE11, and PHEE, our
dataset annotates complete research papers, capturing the rich context typically found in methods
and results sections that is absent from abstracts.</p>
      <p>2. Comprehensive contextual integration: We capture two main event types, Gene-Gene
(G-G) and Gene-Disease (G-D) interactions, adding extensive contextual information to each
event to better reflect the contextual nature of GRNs. Our annotation schema connects direct
regulatory relationships with their broader experimental and biological context (Table 1).</p>
      <p>3. Event-centric contextual linking: Rather than treating disease states, experimental models,
and treatments as separate events, our approach explicitly links these elements as contextual
attributes of gene regulatory events, creating a more integrated representation of the scientific
knowledge.</p>
      <p>
        To select our annotation corpus, we filtered the PubMed Open Access database with a specific focus
on cancer research literature. Documents were pre-processed using Hunflair 2 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to identify gene,
species, and disease entities, which enabled us to filter papers based on gene frequency and cancer
mentions. To avoid outliers and papers with no gene-specific information, we
removed the upper and lower 25% of papers by unique gene count and then randomly selected documents
for annotation.
      </p>
      <p>To optimize the annotation process while preserving contextual integrity, we implemented a
windowed information extraction. For each gene identified by Hunflair 2, we extracted the surrounding
sentence in both directions and merged overlapping windows. This split each paper into
11.5 blocks on average, each containing 11.43 sentences on average and receiving its own ID. This strategy maintained critical
contextual information while reducing annotator cognitive load.</p>
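      <p>The windowing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for CCEE: it assumes the paper is already split into sentences and that the indices of gene-bearing sentences are known; the function name build_blocks and the radius parameter are hypothetical.</p>
      <preformat>
```python
def build_blocks(num_sentences, gene_sentence_ids, radius=1):
    """Merge per-gene sentence windows into non-overlapping blocks.

    Returns a list of (block_id, first_sentence, last_sentence) tuples
    with inclusive sentence indices. `radius` is an assumed window size.
    """
    windows = sorted(
        (max(0, i - radius), min(num_sentences - 1, i + radius))
        for i in set(gene_sentence_ids)
    )
    merged = []
    for start, end in windows:
        # Windows that share at least one sentence are fused into one block.
        if merged and merged[-1][1] >= start:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(block_id, s, e) for block_id, (s, e) in enumerate(merged)]


blocks = build_blocks(num_sentences=12, gene_sentence_ids=[1, 2, 7, 10])
print(blocks)
```
      </preformat>
      <p>In this toy input the windows around sentences 1 and 2 overlap and are merged, while the windows around sentences 7 and 10 are merely adjacent and stay separate.</p>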
      <p>Our annotation schema was then developed iteratively through consultation with expert medical
researchers who specialize in lung cancer and radiation treatment research. Due to the absence of
established guidelines for complex event annotation with multiple contextual categories, we are refining
our approach through successive iterations. We recruited two annotators with formal education in
biology and specific experience in genetics and molecular biology. The annotators underwent initial
training through shared annotation exercises before proceeding to independent annotation work.</p>
      <p>The current annotation schema distinguishes between two primary event types: G-G and G-D
relationships. When annotators identified these relationships, they systematically documented the
connection and populated multiple contextual categories, which are listed in Table 1. We use three types
of labels: entities, predefined labels, and free-text fields, the latter capturing hard-to-define categories
such as associated factors. The free-text fields are also included specifically to test LLM performance on
understanding potentially important information that doesn’t fit the annotation scheme.</p>
      <sec id="sec-2-1">
        <title>2.1. Annotation Matching Methodology</title>
        <p>Establishing a reliable matching procedure for evaluating inter-annotator agreement presents significant
challenges in relation extraction tasks, particularly when annotations lack unique identifiers. In our
dataset, annotations are only associated with source sentences and block IDs, neither of which provides
sufficiently granular identification as multiple distinct events may originate from the same textual
source.</p>
        <p>To address this challenge, we implemented a context-based matching approach. For G-D relationships,
two annotations were considered a match when both the disease entity and gene entity overlapped
between annotators within the same block ID. Similarly, for G-G relationships, annotations were
matched when both the primary gene and connected gene aligned between annotators within the same
block ID.</p>
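        <p>A minimal sketch of this matching rule is shown below. The field names (block_id, gene, disease) and the overlap test are assumptions for illustration; in particular, case-insensitive substring containment is only a stand-in for whatever notion of entity overlap is applied to the unnormalized annotations.</p>
        <preformat>
```python
def entities_overlap(a, b):
    """Assumed overlap test: case-insensitive equality or containment."""
    a, b = a.strip().lower(), b.strip().lower()
    return bool(a) and bool(b) and (a in b or b in a)


def match_events(events_1, events_2, keys):
    """Pair events from two annotators that share a block ID and whose
    entities overlap on every field listed in `keys`."""
    matches = []
    for i, e1 in enumerate(events_1):
        for j, e2 in enumerate(events_2):
            same_block = e1["block_id"] == e2["block_id"]
            if same_block and all(entities_overlap(e1[k], e2[k]) for k in keys):
                matches.append((i, j))
    return matches


anno1 = [
    {"block_id": 3, "gene": "TP53", "disease": "lung cancer"},
    {"block_id": 4, "gene": "EGFR", "disease": "NSCLC"},
]
anno2 = [
    {"block_id": 3, "gene": "tp53", "disease": "non-small cell lung cancer"},
    {"block_id": 5, "gene": "EGFR", "disease": "NSCLC"},
]
print(match_events(anno1, anno2, keys=("gene", "disease")))
```
        </preformat>
        <p>For G-G events the same function would be called with the two gene fields as keys. Because one event can pair with several events of the other annotator, the resulting match counts need not be symmetric.</p>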
        <p>More stringent matching criteria that incorporate additional contextual elements could
further increase precision. For instance, when an event appears multiple times with
differing contextual information (e.g., the same G-G interaction mentioned in relation to different cancer
types), our current approach may produce ambiguous matches. However, stricter criteria would reduce the number
of categories available for evaluating the annotation scheme, as categories used for matching are excluded
from the inter-annotator agreement analysis.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. F1 Score Calculation</title>
        <p>
          For each annotation, we determined whether a corresponding one existed in the same assigned block from
the second annotator. The total number of unmatched entries per annotator was subtracted from their
total annotation count to derive true positives. Unmatched annotations formed false positives and
negatives. This approach yielded asymmetrical results due to the fuzzy entity boundaries [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Results
are reported in Table 2.
        </p>
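        <p>One way to read this procedure is sketched below, treating one annotator as the candidate and the other as the reference. The function name and this reading of precision and recall are assumptions; the example uses the G-G counts reported in Section 3.1 and is illustrative only, so it need not reproduce the values in Table 2.</p>
        <preformat>
```python
def precision_recall_f1(matched_candidate, total_candidate,
                        matched_reference, total_reference):
    """Score one annotator (candidate) against the other (reference).

    Unmatched candidate events count as false positives and unmatched
    reference events as false negatives.
    """
    precision = matched_candidate / total_candidate if total_candidate else 0.0
    recall = matched_reference / total_reference if total_reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# G-G counts from Section 3.1: A2 as candidate, A1 as reference.
p, r, f1 = precision_recall_f1(107, 175, 124, 201)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```
        </preformat>
        <p>Swapping the roles of the two annotators swaps precision and recall, which is one way the asymmetry mentioned above shows up.</p>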
        <p>Table 3 further details the matching statistics, indicating that approximately 61.7% of Anno2’s G-G
relationships were matched with Anno1, while 61.1% of Anno1’s G-G relationships were matched
with Anno2. For G-D relationships, the matching rates were 54.1% and 61.2% for Anno2 and Anno1,
respectively.</p>
        <p>
          While F1 scores provide insights into annotation overlap, more robust statistical measures are
required to assess true inter-annotator agreement while accounting for chance agreement. Therefore,
we extended our evaluation using Cohen’s Kappa and Gwet’s AC1 coefficients [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
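        <p>For reference, both coefficients can be computed from two aligned label lists as in the following sketch. It uses the standard two-rater definitions; the function name and the toy labels are hypothetical and are not taken from the dataset.</p>
        <preformat>
```python
from collections import Counter


def agreement_coefficients(labels_1, labels_2):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for two raters."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    categories = sorted(set(labels_1) | set(labels_2))
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)

    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Cohen: chance agreement from the product of each rater's marginals.
    p_e_kappa = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)
    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa) if p_e_kappa != 1 else 1.0

    # Gwet: chance agreement from the averaged marginals pi_c.
    q = len(categories)
    if q > 1:
        pi = [(counts_1[c] + counts_2[c]) / (2 * n) for c in categories]
        p_e_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)
        ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    else:
        ac1 = 1.0
    return p_o, kappa, ac1


rater_1 = ["direct"] * 8 + ["indirect"] * 2
rater_2 = ["direct"] * 7 + ["indirect", "direct", "indirect"]
p_o, kappa, ac1 = agreement_coefficients(rater_1, rater_2)
print(f"p_o={p_o:.2f} kappa={kappa:.3f} AC1={ac1:.3f}")
```
        </preformat>
        <p>With a skewed label distribution such as this one, Kappa is pulled down much more than AC1 even though observed agreement is high, which is a common reason for reporting both.</p>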
      </sec>
      <sec id="sec-2-3">
        <title>2.3. LLM-Assisted Annotation Matching</title>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Description</th><th>Label type</th></tr>
            </thead>
            <tbody>
              <tr><td>Free text, keywords, describing additional info</td><td>text</td></tr>
              <tr><td>If a cell-line was mentioned</td><td>entity</td></tr>
              <tr><td>Perceived confidence of an event</td><td>predefined</td></tr>
              <tr><td>A gene connected to another gene</td><td>entity</td></tr>
              <tr><td>Role of the gene connected to the disease (progression, suppression, connected)</td><td>predefined</td></tr>
              <tr><td>Effect of the treatment</td><td>predefined</td></tr>
              <tr><td>Label to describe connection between genes</td><td>predefined</td></tr>
              <tr><td>The disease that is mentioned in connection to a gene</td><td>entity</td></tr>
              <tr><td>The mechanisms the gene is involved in that affect the disease</td><td>text</td></tr>
              <tr><td>The gene that is mentioned in connection to a disease or another protein</td><td>entity</td></tr>
              <tr><td>In which level of gene regulation does the gene-gene interaction take place</td><td>predefined</td></tr>
              <tr><td>Where in the text did we find the supporting evidence</td><td>number</td></tr>
              <tr><td>The species the connection was observed in</td><td>entity</td></tr>
              <tr><td>The tissue the connection was observed in</td><td>entity</td></tr>
              <tr><td>What treatments were used, e.g. drug tests</td><td></td></tr>
              <tr><td>What does this treatment affect</td><td>text</td></tr>
              <tr><td>What kind of evidence do we have: direct, indirect, citation</td><td>predefined</td></tr>
              <tr><td>If the gene is a marker for the disease or its outcome</td><td>predefined</td></tr>
              <tr><td>If the treatment is targeting a transporter or through acting as [...]</td><td>predefined</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>A significant methodological challenge arose from our use of unnormalized entities and free-text
categorical annotations. Traditional string-matching approaches are inadequate for such data, as
they fail to capture semantic equivalence when annotators use different phrasing to express identical
concepts. To address this limitation, we implemented an LLM-as-a-judge approach for determining
categorical matches. For multi-value categories, we considered annotations to match if at least one entry
overlapped semantically as determined by the LLM; the prompt can be found in the project GitHub repository.
To validate this approach, a subset of categories was also matched manually by human annotators.
This validation revealed that the LLM-based matching achieved consistently high agreement
with human judgment, with Cohen’s Kappa above 0.86 across tested categories (see Appendix), while
marginally underperforming rule-based matching in categories where predefined labels are meant to
be matched.</p>
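        <p>The multi-value rule can be sketched with a pluggable judge function. In the sketch below the judge is a hypothetical placeholder that only checks case-insensitive string equality; in the setup described above this role is played by an LLM call, which is not reproduced here.</p>
        <preformat>
```python
def exact_judge(a, b):
    """Placeholder judge (assumption): case-insensitive string equality."""
    return a.strip().lower() == b.strip().lower()


def multi_value_match(values_1, values_2, judge):
    """Two multi-value annotations match if any pair is judged equivalent."""
    return any(judge(a, b) for a in values_1 for b in values_2)


same = multi_value_match(["hypoxia", "radiation"], ["Radiation", "smoking"], exact_judge)
missed = multi_value_match(["hypoxia"], ["low oxygen"], exact_judge)
print(same, missed)
```
        </preformat>
        <p>The second call shows the limitation that motivates a semantic judge: a purely string-based judge cannot recognise differently phrased but equivalent entries.</p>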
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Sentence-Level Agreement Analysis</title>
        <p>Beyond entity and relationship matching, we also evaluated the consistency of text selection by
examining the specific sentences from which annotators extracted information. This analysis reveals an
interesting pattern: annotators achieved only moderate rates of exact sentence matching (38.9%
for G-G and 24.7% for G-D relationships), while the average matching rate reaches 57% and 51%,
respectively. Agreement improves substantially when considering a window of ±1 sentence around the
selection. The average range match, which calculates the overlap percentage while including adjacent
sentences, reaches 78.8% for G-G and 83.0% for G-D relationships. The higher range-match percentage
for G-D relationships, despite lower exact matching, suggests that disease-related information is often
distributed across adjacent sentences, requiring broader contextual integration.</p>
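        <p>The two sentence-level measures can be illustrated as follows. The exact definitions behind the reported figures are not spelled out in the text, so this sketch rests on assumptions: exact match means identical sets of sentence indices, and range match is the share of one annotator's sentences that lie within ±1 sentence of any sentence chosen by the other. Function names are hypothetical.</p>
        <preformat>
```python
def exact_match(sents_1, sents_2):
    """True when both annotators selected exactly the same sentence indices."""
    return set(sents_1) == set(sents_2)


def range_match(sents_1, sents_2, window=1):
    """Share of sentences in `sents_1` lying within `window` sentences of
    any sentence in `sents_2`."""
    if not sents_1:
        return 0.0
    hits = sum(any(window >= abs(s1 - s2) for s2 in sents_2) for s1 in sents_1)
    return hits / len(sents_1)


a1 = [4, 5, 9]
a2 = [5, 6]
print(exact_match(a1, a2), range_match(a1, a2), range_match(a2, a1))
```
        </preformat>
        <p>Like the event matching, this measure is directional, so it would be computed in both directions and averaged to obtain a single figure.</p>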
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Comparison of Agreement Metrics</title>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Gene-Disease Relationship Agreement Analysis</title>
        <p>
          G-D relationships showed lower inter-annotator agreement than G-G relationships, requiring refinement
of annotation guidelines. "Connection Disease" and "Connection Treatment" failed to meet the 0.6
threshold for reliable biomedical annotations [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], indicating interpretation challenges. Annotator
discussions revealed issues stemming from category underdefinition. For example, when two genes
interact in relation to a specific cancer, it remained unclear whether both genes should be annotated as
connected to the disease, or only one—and if so, which one. The definition of when a gene or treatment is
considered to regulate a disease also needs clarification. For "Connection Disease," annotators frequently
disagreed on directionality, with one assigning directional relationships while the other chose
nondirectional labels. In "Connection Treatment," 13% of cases had mismatched directional labels, and 23%
disagreed on relationship existence, further suggesting definitional problems. Categories like "Species,"
"Tissue," and "Cell Line" also failed to reach acceptable agreement. The consistent low agreement in
G-D relationships, despite acceptable G-G relationship agreement, highlights the greater challenges in
annotating gene-disease interactions, driven by less obvious interaction definitions and by extension
less clarity of which context to include.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Comprehensive Annotation Statistics and Agreement Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Annotation Volume and Matching Overview</title>
        <p>Our annotation corpus comprises a substantial body of biomedical relation extraction annotations
distributed across two major event types. Annotator 1 (A1) produced a total of 410 annotated events,
while Annotator 2 (A2) generated 345 events. These annotations were distributed relatively evenly
between G-G and G-D relationships, with A1 creating 201 G-G and 209 G-D events, and A2 producing
175 G-G and 170 G-D events. The matching analysis revealed that for A1, 124 G-G events and 113 G-D
events had at least one corresponding match in A2’s annotations. Conversely, 107 G-G events and
103 G-D events from A2 had at least one match in A1’s dataset. This yields overall matching rates of
approximately 62% for A1’s annotations and 61% for A2’s annotations, consistent with our earlier F1
score analysis.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Category-Specific Agreement for G-G Relations</title>
        <p>For G-G relationships, we calculated observed agreement across multiple contextual categories. Table
4 presents these agreement scores. These agreement rates reveal substantial variability in annotator
consensus across different contextual dimensions. Particularly strong agreement was observed for
Treatment Exposure (0.91), Cell Line (0.88), and Tissue (0.82), suggesting these categories have well-defined
boundaries that facilitate consistent annotation. Conversely, Associated Factors (0.53), Regulation Level
(0.56), and Connection Type (0.59) demonstrated marginal agreement levels, indicating potential areas
for guideline refinement.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Annotation Complexity and Volume</title>
        <p>The annotation schema implemented in this study captures substantial contextual information beyond
the core entity relationships. For G-D annotations, each event required documentation of the gene,
disease, and 14 additional contextual attributes, resulting in 3,344 total data points from Annotator 1
and 2,720 from Annotator 2. Similarly, G-G annotations captured the primary gene, connected gene,
and 10 contextual attributes, yielding 2,412 data points from Annotator 1 and 2,100 from Annotator 2.</p>
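        <p>These totals follow directly from the event counts in Section 3.1: each event contributes its two core entities plus the contextual attributes. The short check below reproduces the arithmetic; the helper name is hypothetical.</p>
        <preformat>
```python
def data_points(num_events, core_fields, context_fields):
    """Each event contributes one data point per core or contextual field."""
    return num_events * (core_fields + context_fields)


# Event counts from Section 3.1; field counts from the paragraph above.
counts = {
    ("A1", "G-D"): data_points(209, core_fields=2, context_fields=14),
    ("A2", "G-D"): data_points(170, core_fields=2, context_fields=14),
    ("A1", "G-G"): data_points(201, core_fields=2, context_fields=10),
    ("A2", "G-G"): data_points(175, core_fields=2, context_fields=10),
}
for key, value in counts.items():
    print(key, value)
```
        </preformat>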
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Annotation Methodological Approach</title>
        <p>Our annotation framework incorporated three distinct annotation methodologies to capture the full
spectrum of biomedical information:</p>
        <p>1. Named entities: capturing specific biomedical entities such as genes, diseases, and treatments.</p>
        <p>2. Relations: documenting the connections between identified entities.</p>
        <p>3. Free text fields: providing flexibility to capture contextual information that does not conform to
standardized categories.</p>
        <p>The inclusion of free text fields is essential for preserving information that might otherwise be
lost in a more rigid annotation framework, though this approach introduced additional challenges for
inter-annotator agreement assessment as discussed in our methodological section.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Unique Features of the Dataset</title>
        <p>The goal of the finished dataset is to advance biomedical NLP by addressing gaps in context-aware
gene regulatory network extraction. It captures the complex contextual dimensions of G-G and
G-D relationships, providing depth for evaluating automated knowledge graph construction. With
annotations from 9 cancer research papers, it includes over 200 matched events and 2,000 contextual
data points. Despite lower inter-annotator agreement in G-D relationships, this variability serves as a
valuable benchmark for measuring annotation precision and recall.</p>
        <p>The dataset combines structured categorical annotations and free-text fields, capturing nuanced
contextual information. This design is ideal for evaluating large language models and human annotators.
By offering detailed performance metrics across multiple annotation categories, the dataset will help
identify strengths and weaknesses in generative models, providing insights into the challenges of
extracting biological context for precision medicine applications.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. LLM Baseline Experiments</title>
      <p>To establish a technological baseline for our complex biomedical relation extraction task, we employed
Mistral Large 2, providing it with the same annotation instructions and two representative examples
before tasking it with independent annotation, which was then compared against the human baseline.</p>
      <p>The F1 scores are slightly below human inter-annotator agreement, except for G-D annotations
compared to Annotator 1, where the LLM performed similarly. Notably, there is a significant asymmetry
between precision and recall metrics.</p>
      <p>This asymmetry is due to the volume of annotations. Mistral Large 2 generated more G-D and G-G
events than the human annotators, but only 30-40% of LLM-annotated events matched human annotations,
compared to 55-61% agreement between human annotators.</p>
      <p>Sentence selection patterns of both annotators and the LLM showed similar performance.</p>
      <p>These findings suggest that while the LLM may generate more false positives, it can detect many
of the events.</p>
      <p>As illustrated in Figure 2 for G-G data, category-specific performance reveals interesting patterns.
While Mistral Large 2 underperformed in categories requiring relationship interpretation between entities, it
achieved comparable or superior performance to human annotators in entity extraction categories. For
these entity-based categories, the LLM demonstrated higher agreement with both human annotators
than the human annotators achieved with each other, suggesting particular strength in expanding
events with additional contextual information. A similar pattern was observed in G-D.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Our work-in-progress dataset aims to serve as seed and evaluation data for assessing LLM annotation
performance and establishing groundwork for synthetic data generation. A key application is testing
components for automatic KG construction, specifically relation extraction with contextual categories
that enable cross-source data connections.</p>
      <p>Preliminary evaluations using Mistral Large 2 showed the model performing below human annotators
in interaction labeling, yet achieving comparable or superior performance in attributing context to
detected events.</p>
      <sec id="sec-5-1">
        <title>5.1. Limitations &amp; Future Work</title>
        <p>This initial exploration of Complex Context-aware Event Extraction in cancer research has limitations
due to dataset size and the use of only two annotators. Inter-annotator agreement challenges, particularly
in categories like "Connection Disease" and "Species," revealed guideline ambiguities and annotation
environment problems.</p>
        <p>We identified three main issues: (1) underdefined criteria for gene-disease events causing
disagreement, (2) lack of automated highlighting leading to missed entities, and (3) pre-annotated entities
without proper ID linking creating matching problems alongside limited positional data.</p>
        <p>
          To address these challenges, we will: define clear G-D connection criteria similar to BioRED [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
guidelines; redo annotations using software like TeamTat to resolve highlighting, position information,
and normalization issues; and extend guidelines with more examples and case differentiation. Future
plans include dataset expansion and adding a third annotator for the final release.
        </p>
        <p>We hope to stimulate discussion on integrating contextual information for KG construction and
evaluating LLMs as annotators for disease-KG construction.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Data Availability and Ethical Considerations</title>
      <p>The dataset, prompts, and the (revised) annotation guidelines will be made available at:
https://github.com/FMLabonte/CCEE_for_GRNs</p>
      <p>Acknowledgments. We thank the TRA grant of the University of Bonn for making this research possible;
the Lamarr Institute for the ample opportunities for exchange of ideas;
the DLR and Dr. Christine Hellweg for feedback on the usefulness of annotations;
all the people who helped with formatting, feedback, and spell checking: Luna
Meyer, Valerie Dang, Max Brauner, Liliane Hanfeld, and Lukas Grönwoldt; and
Georgina Kowalski for the additional annotation work.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <sec id="sec-7-1">
        <title>A.1. LLM as a judge evaluation</title>
        <p>Notably, in several instances, the LLM demonstrated higher agreement with individual human annotators
than the human annotators achieved with each other, suggesting the LLM’s effectiveness as an impartial
judge for this task.</p>
        <p>For this matching procedure, we employed GPT-4o with a temperature setting of 0 to maximize
deterministic outputs. This approach was primarily applied to categories containing full-text or entity
annotations where semantic rather than exact matching was appropriate.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Mistral and Claude Sonnet 3.5 for grammar and
spelling checking. The authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Madhamshettiwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Maetschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reverter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ragan</surname>
          </string-name>
          ,
          <article-title>Gene regulatory network inference: evaluation and application to ovarian cancer allows the prioritization of drug targets</article-title>
          ,
          <source>Genome Medicine</source>
          <volume>4</volume>
          (
          <year>2012</year>
          )
          <elocation-id>41</elocation-id>
          . URL: https://doi.org/10.1186/gm340. doi:10.1186/gm340.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Raschka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schultz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Altay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ebeling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fröhlich</surname>
          </string-name>
          ,
          <article-title>AI reveals insights into link between CD33 and cognitive impairment in Alzheimer's disease</article-title>
          ,
          <source>PLOS Computational Biology</source>
          <volume>19</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          . URL: https://doi.org/10.1371/journal.pcbi.1009894. doi:10.1371/journal.pcbi.1009894.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Emmert-Streib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haibe-Kains</surname>
          </string-name>
          ,
          <article-title>Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks</article-title>
          ,
          <source>Frontiers in Cell and Developmental Biology</source>
          <volume>2</volume>
          (
          <year>2014</year>
          )
          <elocation-id>38</elocation-id>
          . URL: https://doi.org/10.3389/fcell.2014.00038. doi:10.3389/fcell.2014.00038.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Juan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Context-dependent gene regulatory network reveals regulation dynamics and cell trajectories using unspliced transcripts</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <elocation-id>bbac633</elocation-id>
          . URL: https://doi.org/10.1093/bib/bbac633. doi:10.1093/bib/bbac633.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Geißler</surname>
          </string-name>
          ,
          <article-title>The Kairntech Sherpa - an ML platform and API for the enrichment of (not only) scientific content</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Rehm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bontcheva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vasiļjevs</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 1st International Workshop on Language Technology Platforms</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>58</lpage>
          . URL: https://aclanthology.org/2020.iwltp-1.9/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gueta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gilon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Erell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jaber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kartha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Laish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feder</surname>
          </string-name>
          ,
          <article-title>LLMs accelerate annotation for medical information extraction</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Hegselmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parziale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shanmugam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Asiedu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hartvigsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Singh</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 3rd Machine Learning for Health Symposium</source>
          , volume
          <volume>225</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2023</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>100</lpage>
          . URL: https://proceedings.mlr.press/v225/goel23a.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Babaiha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schultz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hofmann-Apitius</surname>
          </string-name>
          ,
          <article-title>Rationalism in the face of GPT hypes: Benchmarking the output of large language models against human expert-curated biomedical knowledge graphs</article-title>
          ,
          <source>Artificial Intelligence in the Life Sciences</source>
          <volume>5</volume>
          (
          <year>2024</year>
          )
          <elocation-id>100095</elocation-id>
          . URL: https://www.sciencedirect.com/science/article/pii/S2667318524000023. doi:10.1016/j.ailsci.2024.100095.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <source>Mistral Large 2</source>
          . URL: https://mistral.ai/news/mistral-large-2407.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Miwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <article-title>Event extraction across multiple levels of biological organization</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>28</volume>
          (
          <year>2012</year>
          )
          <fpage>i575</fpage>
          -
          <lpage>i581</lpage>
          . URL: https://doi.org/10.1093/bioinformatics/bts407. doi:10.1093/bioinformatics/bts407.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.-D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Takagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yonezawa</surname>
          </string-name>
          ,
          <article-title>Overview of Genia event task in BioNLP shared task 2011</article-title>
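          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          (Eds.),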
          <source>Proceedings of BioNLP Shared Task 2011 Workshop</source>
          , Association for Computational Linguistics, Portland, Oregon, USA,
          <year>2011</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>15</lpage>
          . URL: https://aclanthology.org/W11-1802/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rowley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-W.</given-names>
            <surname>Chun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-P.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <article-title>Overview of the cancer genetics and pathway curation tasks of BioNLP Shared Task 2013</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <elocation-id>S2</elocation-id>
          . URL: https://doi.org/10.1186/1471-2105-16-S10-S2. doi:10.1186/1471-2105-16-S10-S2.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-T.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Arighi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>BioRED: a rich biomedical relation extraction dataset</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <elocation-id>bbac282</elocation-id>
          . URL: https://doi.org/10.1093/bib/bbac282. doi:10.1093/bib/bbac282.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sänger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Garda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Weber-Genzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Droop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Leser</surname>
          </string-name>
          ,
          <article-title>HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>40</volume>
          (
          <year>2024</year>
          ). URL: http://dx.doi.org/10.1093/bioinformatics/btae564. doi:10.1093/bioinformatics/btae564.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Wilbur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rzhetsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shatkay</surname>
          </string-name>
          ,
          <article-title>New directions in biomedical text annotation: definitions, guidelines and corpus construction</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>7</volume>
          (
          <year>2006</year>
          )
          <elocation-id>356</elocation-id>
          . URL: https://doi.org/10.1186/1471-2105-7-356. doi:10.1186/1471-2105-7-356.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Gwet</surname>
          </string-name>
          ,
          <article-title>Computing inter-rater reliability and its variance in the presence of high agreement</article-title>
          ,
          <source>British Journal of Mathematical and Statistical Psychology</source>
          (
          <year>2010</year>
          ). URL: https://doi.org/10.1348/000711006X126600. doi:10.1348/000711006X126600.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Cibulka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Strube</surname>
          </string-name>
          ,
          <article-title>The conundrum of kappa and why some musculoskeletal tests appear unreliable despite high agreement: A comparison of Cohen kappa and Gwet AC to assess observer agreement when using nominal and ordinal data</article-title>
          ,
          <source>Physical Therapy</source>
          <volume>101</volume>
          (
          <year>2021</year>
          )
          <elocation-id>pzab150</elocation-id>
          . URL: https://doi.org/10.1093/ptj/pzab150. doi:10.1093/ptj/pzab150.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>McHugh</surname>
          </string-name>
          ,
          <article-title>Interrater reliability: the kappa statistic</article-title>
          ,
          <source>Biochemia Medica</source>
          <volume>22</volume>
          (
          <year>2012</year>
          )
          <fpage>276</fpage>
          -
          <lpage>282</lpage>
          . URL: https://www.biochemia-medica.com/en/journal/22/3/10.11613/BM.2012.031. doi:10.11613/BM.2012.031.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>