<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AMATU@Simpletext2024: Are LLMs Any Good for Scientific Leaderboard Extraction?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Moritz Staudinger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alaa El-Ebshihy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annisa Maulida Ningtyas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florina Piroi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Studios Austria</institution>
          ,
          <addr-line>Data Science Studio</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität Wien</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <issue>0</issue>
      <fpage>09</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>In this paper, we present our approach to solving the SOTA challenge of the SimpleText shared task at CLEF 2024. The objective of the challenge is to extract all (Task, Dataset, Metric, Score) tuples from scientific papers that report model scores on benchmark datasets. In this work, we propose a rule-based classification model to identify papers that report score information. We then apply different methods to extract TDMS using: (1) a baseline model from the literature, and (2) two Large Language Models (LLMs), GPT-3.5 and Mistral. Results show that the baseline model outperforms the LLMs in most cases, especially in zero-shot settings, with improvements seen in few-shot settings. Manual investigation shows that extracting TDMS from paper text is challenging, particularly for "Dataset" and "Score" extraction.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Text Extraction</kwd>
        <kwd>State-of-the-art</kwd>
        <kwd>Entity Extraction</kwd>
        <kwd>Relation Extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In our data-driven world, the volume of published literature, including newspaper articles, social media
posts, and scientific publications, is rapidly increasing. Since technological and scientific advancements
are generally communicated through scientific publications, it is important to find and keep track of
significant advances and challenges in various scientific fields. With the never-ending flow of new
scientific publications (e.g. 1,000 new ML publications per month on arXiv 1 alone), it is becoming
increasingly difficult to keep up to date with the state-of-the-art for a given scientific task and compare
new research contributions with previous ones.</p>
      <p>
        In the particular computer science area of experimentation and evaluation of ML/IR models, assessing
the effectiveness of a new model or algorithm is difficult due, not least, to heterogeneous reporting
styles. To address this, one possibility is to create machine-readable results by either (1) creating machine-actionable
publications (that is, publications prepared in such a way that they contain further, specifically
formatted data which can be automatically processed and harvested correctly by algorithms) or (2)
standardizing the evaluation and experimentation environment. The first path, if established, would
allow creating and extracting comparable results with minimal overhead for researchers, as the results
could be collected in a standardized format along with the written submissions. This is, though, unlikely
to happen in the near future, as it requires a critical mass of scientists to alter their documentation
habits, as stated by Kabongo et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Standardizing experimentation and evaluation environments is done through shared tasks or
evaluation labs/tracks, which standardize the evaluation environment to evaluate the state-of-the-art
performance of predefined tasks, with given metrics and hidden evaluation datasets. Although the results of
these challenges are valuable to the community, they only provide comparable results for a few selected
tasks and datasets, but not for the vast variety of research. Therefore, many research publications are
not comparable, as they do not follow standardized evaluation strategies, use variations of datasets [? ]
or propose new tasks and datasets.</p>
      <p>
        Therefore, extracting scientific entities from scholarly articles is currently the best option for enabling
comparable and machine-actionable results throughout the scientific community. Although platforms
like PapersWithCode, AI-metrics, NLP-Progress, and the Open Research Knowledge Graph [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] enable
the comparison of research results on given datasets, they are manually curated, therefore limited by
the crowd-sourcing resources of the community and subject to human error.
      </p>
      <p>
        Automated extraction of predefined scientific entities, such as Task-Dataset-Metric-(Score) (TDMS) [
        <xref ref-type="bibr" rid="ref1 ref3">3,
1</xref>
        ], can support automatic knowledge base population and the comparability of research contributions.
      </p>
      <p>
        In this work, we present our approaches for the SimpleText State-of-the-Art (SOTA) Extraction
Challenge at CLEF 2024 [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The task is to determine whether a given scientific paper reports model
scores on benchmark datasets and, if so, extract all TDMS tuples. We use a large language model
(LLM)-based approach to extract and combine dependent scientific entities (e.g. a score depends on
a given dataset, task, and metric) in terms of the TDMS objective, and discuss improving this task’s
performance with two distinct rule-based prefiltering systems for faster and more accurate extraction.
As a baseline, we use the extraction tool presented by Kardas et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The remainder of this work is structured as follows. In Section 2, we discuss related work in the area
of scientific entity extraction. In Section 3, we discuss which methods we applied to approach the task
challenge. In Section 4, we present our results for the shared task test sets. In Section 5, we discuss our
approaches and limitations, which we then follow up with a summary and a brief outlook in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        In the area of IR and ML evaluation, one way to follow advancements is the automatic
generation of leaderboards by extracting data from scholarly articles, with the CLEF SimpleText
State-of-the-Art extraction task contributing to the evaluation of these efforts. Augenstein et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] organized a
Shared Task at the SemEval Workshop 2017, where participants had to extract three types of entities
from scientific paragraphs: Task, Method, and Material. This work was extended by Gabor et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
where additionally to analyzing paragraphs, the authors have also used annotated abstracts.
      </p>
      <p>
        Independent of an evaluation lab or shared task, research on information extraction from publication
texts usually creates specific datasets on which the proposed models are experimented with and evaluated. Jain et al.
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced the SciREX dataset and model, which extracts dataset, metric, task, and method entities
from a corpus of 1,170 ML articles from PapersWithCode. Hou et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] published the first datasets
for Task, Dataset, Metric, and Score (TDMS) extraction in the NLP domain using distant supervision
annotations. Kardas et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] developed the AxCell pipeline to process LaTeX source code and extract
TDMS. Kabongo et al. [
        <xref ref-type="bibr" rid="ref1 ref10 ref2">1, 10, 2</xref>
        ] focused on mining for TDMS tuples, with the goal of automatically
populating the Open Research Knowledge Graph (ORKG) with this information. Automating this process
has the potential to accelerate the growth of the ORKG, permitting an easier comparison of research
results across scholarly articles. Yang et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] analyzed existing solutions assessing their limitations
and proposed an approach that does not require LaTeX sources, is not limited to a closed taxonomy
(e.g. not limited to extracting TDMS), and requires less supervision than previous solutions, like, for
example, Kardas et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>The SOTA challenge aims to extract all TDMS tuples from a given arXiv scientific paper, labeling the
article as “unanswerable” if no TDMS tuple is found. We first inspected the training and validation sets to
understand the distribution of articles that contain TDMS tuples versus “unanswerable” articles. The
training dataset includes 9,352 articles with LaTeX sources and annotations: 5,274 with TDMS tuples
and 4,078 labeled as “unanswerable”. Similarly, the validation set is composed of 100 articles: 50 articles
with TDMS tuples and 50 articles labeled as “unanswerable”. Both sets are thus roughly evenly
distributed between articles that contain TDMS tuples and “unanswerable” ones.</p>
      <p>
        Therefore, our approach to solving the challenge consists of two modules: (1) filtering unanswerable
documents by applying a rule-based binary classification model to identify papers that do not contain
TDMS, and (2) TDMS extraction to identify all TDMS tuples in a given paper. For the TDMS extraction we
experiment with AxCell [
experiment with AxCell [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as a baseline model, and with GPT-3.5 and Mistral LLMs.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Filtering out unanswerable documents</title>
        <p>We apply a rule-based binary classification method to recognize papers which are classified as
“unanswerable”. These papers are excluded from further processing, thereby reducing the cost of
running the advanced models. For this, we evaluated three rule-based settings with similar
configurations but different outcomes, aiming for high recall to ensure only clearly unanswerable documents
are filtered out. Each setting was evaluated based on Precision, Recall, and Accuracy in identifying
“unanswerable” articles in the validation dataset. Table 1 shows the configurations tested. The first
two methods assess section titles to determine if a paper is “unanswerable”. The first approach (Result
Section Exists) checks if any section title includes the terms result, experiment, or evaluation, indicating
the presence of scores. The second approach (Result Section Exists with add. terms) extends this
idea with the two additional terms comparison and performance. To build on the idea of section title
detection, we further scan for any tables in the result section (Result Table Exists). This was done by
heuristically checking whether the result section of the LaTeX source code contains the phrase \begin{table,
instead of only looking at the section name. Although this improved Precision and Accuracy, the Recall
dropped significantly.</p>
        <p>As a result, we chose the first method presented, which filters on the basis of the three section names.
This method yields a Recall similar to that of the second approach, with only three papers of the validation
dataset classified differently.</p>
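        <p>The first rule can be sketched as follows. This is a minimal re-implementation with our own function names and regular expression, assuming plain LaTeX sources as input; it is an illustration, not the shared-task code itself.</p>
<preformat>
```python
import re

# Terms whose presence in a section title suggests the paper reports scores
# (the "Result Section Exists" rule described above).
RESULT_TERMS = ("result", "experiment", "evaluation")

def section_titles(latex_source: str) -> list:
    """Extract \\section{...} / \\subsection{...} titles from LaTeX source."""
    return re.findall(r"\\(?:sub)*section\*?\{([^}]*)\}", latex_source)

def is_answerable(latex_source: str) -> bool:
    """A paper is kept ('answerable') if any section title mentions a
    result-related term; otherwise it is filtered out as 'unanswerable'."""
    return any(
        term in title.lower()
        for title in section_titles(latex_source)
        for term in RESULT_TERMS
    )
```
</preformat>
        <p>The stricter "Result Table Exists" variant would additionally check the matching section body for the phrase \begin{table before keeping the paper.</p>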
      </sec>
      <sec id="sec-3-2">
        <title>3.2. TDMS extraction models</title>
        <p>
          We experimented with TDMS extraction using different models (see Table 2). We divide the
experiments into two types: a baseline model and Large Language Models (LLMs). As a baseline model
we utilize AxCell, presented by Kardas et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. For the LLM models, we utilize GPT-3.5 and
Mistral (an open-source LLM) in zero-shot and few-shot settings, with different sources as input for the
prompt text. We detail the implementation in the following sections.
        </p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Configuration of the TDMS extraction models: filtering of unanswerable papers, zero- or few-shot prompting, fulltext or az input, and PwC information.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Id</th><th>Filtered</th><th>Shot setting</th><th>Input</th><th>PwC information</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>✓</td><td>–</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>2</td><td>✗</td><td>zero</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>3</td><td>✓</td><td>zero</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>4</td><td>✗</td><td>few</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>5</td><td>✓</td><td>few</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>6</td><td>✗</td><td>few</td><td>fulltext</td><td>✓</td></tr>
              <tr><td>7</td><td>✗</td><td>few</td><td>az</td><td>✗</td></tr>
              <tr><td>8</td><td>✗</td><td>few</td><td>az</td><td>✓</td></tr>
              <tr><td>9</td><td>✓</td><td>zero</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>10</td><td>✓</td><td>zero</td><td>fulltext</td><td>✓</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-2-2">
          <title>3.2.1. Baseline Model - AxCell</title>
          <p>
            We use the implementation of the AxCell system [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] (model Id 1 in Table 2). AxCell is a machine
learning pipeline which extracts TDMS tuples from scientific papers by combining neural networks
for text extraction and table extraction. The extracted linking candidates are then verified through
caption classification, mention lookup, and table segmentation, before being merged into TDMS tuples.
          </p>
          <p>To use AxCell for this shared task, we first download the eprint version of the arXiv publications
from the test datasets to create the same data structure as in the original paper. Then we filter out
"unanswerable" papers using the previously mentioned rule-based classification system (Section 3.1).
The eprint is then processed by the AxCell extraction script to obtain all the unpacked LaTeX sources,
graphics, and an HTML version of the article. The processed articles are fed to a neural network which
first extracts scientific entity candidates (Task, Dataset, Metric) and then tries to link them with the
scores extracted from the tables. Each table cell is thereby annotated with meta-information, namely the
TDM combination used to obtain a given score.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.2. LLMs</title>
          <p>We use two LLMs in our experiments: GPT-3.5 (models 2-8) and Mistral (models 9-10), the latter as an
open-source alternative to OpenAI’s GPT model. We divide our experiments using different criteria.</p>
          <p>Instructions to the model: We set up models in a zero-shot setting, where we give the LLM
instructions only (models 2, 3, 9, and 10), and models in a few-shot setting, where we give instructions
and a small number of examples (models 4 to 8).</p>
          <p>Filtering out unanswerable papers: In some of our models (ids 3, 5, 9 and 10), we filter out
unanswerable papers with our selected rule-based classification (Section 3.1). In the others, we use the
complete test sets without filtering, letting the model decide if the paper contains TDMS or is unanswerable
(models 2, 4, and 6 to 8).</p>
          <p>Input text to the model: In models 2 to 6, 9 and 10, we use the full paper text as input to the LLM to
extract the TDMS. For models 7 and 8, inspired by Argumentative Zoning (AZ) [12], which defines the main
rhetorical structure of scientific articles, we extract only the text from sections referring to experiments
and results, in addition to the abstract. We believe that these sections contain the TDMS information
and can thus avoid processing the full paper text. We refer to the models that utilize only the text of the
experiments and results sections as using az input, while the others use full text input (Column 6 in Table 2).
For GPT-3.5, we used the "gpt-3.5-turbo-0125" variant in our experiments.</p>
          <p>Additional information from Papers with Code (PwC): We use PwC as a knowledge base to
collect lists of dataset and task names. These are exhaustive lists of available dataset and
task names, which we use to check whether any of these datasets or tasks are mentioned in
the full text of the paper. Although these lists contain all names of datasets and tasks of the test set (as
it was not specified at the time of evaluation that PwC was used as ground truth), they do not contain
any information on whether a dataset or task was used in a specific paper, information which can also be part
of the training data of LLMs. We used PwC in particular as it is one of the most frequently used platforms
for the comparison of research results and provides an API to easily access the data.</p>
          <p>After extracting the matching datasets and tasks, they are then appended to the input prompt as
helping materials for the LLMs.</p>
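          <p>This matching step can be sketched as follows. The helper name and the example lists are our own illustrative assumptions; in the actual pipeline, the lists are retrieved from the PwC API.</p>
<preformat>
```python
# Match PwC dataset/task names against the paper text (case-insensitive,
# verbatim substring match). Illustrative sketch, not the pipeline code.
def matching_pwc_entries(paper_text: str, pwc_names: list) -> list:
    """Return the PwC dataset/task names mentioned in the paper."""
    lowered = paper_text.lower()
    return [name for name in pwc_names if name.lower() in lowered]

# Hypothetical excerpts of the PwC name lists.
datasets = ["ImageNet", "CUB-200-2011", "iNaturalist"]
tasks = ["Image Classification", "Semantic Segmentation"]

text = "We evaluate image classification on ImageNet and iNaturalist."
hint = (matching_pwc_entries(text, datasets), matching_pwc_entries(text, tasks))
# hint -> (['ImageNet', 'iNaturalist'], ['Image Classification'])
```
</preformat>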
          <p>We construct our prompts to the LLMs from the following components: «task description», «examples»,
«output format», «additional instructions», «input text», and «PwC information». In «task description»,
we describe the task, inputs, and outputs. «examples» are provided only in the few-shot setting, showing
input and expected output. In «output format», we describe the expected output format, which is a list
of JSON objects covering the TDMS in the paper. «additional instructions» emphasize key points (e.g.,
Score should be a numerical value). «input text» is a placeholder for the paper text, either the full text
or sections describing results and experiments. «PwC information» includes lists of datasets and tasks
found in the paper using PwC. The prompts used in the experiments are in Appendix A.</p>
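          <p>The assembly of these components can be sketched as follows. This is a simplified illustration with our own function name and placeholder texts; the actual prompt wordings are those given in Appendix A.</p>
<preformat>
```python
# Assemble a prompt from the components described above; empty components
# (examples in zero-shot runs, PwC hints when unused) are simply skipped.
def build_prompt(task_description: str,
                 output_format: str,
                 additional_instructions: str,
                 input_text: str,
                 examples: str = "",
                 pwc_information: str = "") -> str:
    parts = [
        task_description,
        examples,               # empty in the zero-shot setting
        output_format,
        additional_instructions,
        pwc_information,        # empty unless PwC hints are used
        input_text,
    ]
    return "\n\n".join(p for p in parts if p)

# Zero-shot example with placeholder component texts.
prompt = build_prompt(
    task_description="Extract all (Task, Dataset, Metric, Score) tuples "
                     "from the given paper, or answer 'unanswerable'.",
    output_format="Return a list of JSON objects with keys "
                  "Task, Dataset, Metric, Score.",
    additional_instructions="Score must be a numerical value.",
    input_text="[paper full text or AZ sections here]",
)
```
</preformat>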
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment and Results</title>
      <p>In this section, we describe the experiments submitted during each phase of the competition using the
models described in Section 3. The competition consists of two phases: Few-shot Phase (Phase 1) and
Zero-shot Phase (Phase 2). We give information about the test datasets, experiments, and results for
each phase.</p>
      <sec id="sec-4-1">
        <title>4.1. Test datasets</title>
        <p>We run the experiments in each phase on the test datasets provided by the competition organizers. These
test datasets comprise the LaTeX sources of articles from arXiv. The test data includes 1,401 articles for
Phase 1 and 789 articles for Phase 2.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiments</title>
        <p>In Table 3, we show the models used for the experiments submitted in each phase of the competition (see
Section 3). For both phases, we consider the AxCell system as a baseline model. We use two and nine
LLM variations for Phase 1 and Phase 2, respectively.</p>
        <p>Phase 1: Our aim in this phase is to compare the performance of the GPT-3.5 LLM in zero-shot settings
against the AxCell baseline. We also consider the effect of filtering “unanswerable” papers (Section 3.1)
to observe performance changes when reducing the number of papers processed by GPT-3.5.</p>
        <p>Phase 2: In this phase, we compare the performance of LLM models in few-shot settings against
zero-shot settings, specifically comparing GPT35-fil-zero with GPT35-fil-few. We also compare
performance using the full paper text as input (i.e. GPT35-few) versus only sections referring to
experiments and results (i.e. GPT35-az-few). Additionally, we compare different LLM models (i.e.
GPT35-fil-zero vs. Mistral-fil-zero). Lastly, we observe the performance of the LLMs when providing
external helping materials representing datasets and tasks (i.e. GPT35-info-few, GPT35-az-info-few,
and Mistral-fil-info-zero).</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>In this section, we present the results of our submissions for each phase. The submissions are evaluated
based on: (1) Accuracy – measures whether the system can distinguish articles containing TDMS, (2)
Summary – measures quality of the extracted TDMS using Rouge1, Rouge2, RougeL, and RougeLsum,
and (3) precision, recall, and F1 – for each element in the TDMS (Task, Dataset, Metric, Score) tuples and
the overall average3. In the following, for each phase, we report the performance of our submissions
for Accuracy, Summary measures, and overall Precision, Recall, and F1.</p>
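        <p>For illustration, the per-element scoring can be sketched as follows. This is a simplified exact-match version under our own assumptions; the official evaluation script may normalize entities and match tuples differently.</p>
<preformat>
```python
# Precision/recall/F1 for one TDMS element, comparing the sets of values
# predicted for that element against the gold values (exact string match).
def element_prf(gold: list, pred: list, idx: int):
    """idx selects the element: 0=Task, 1=Dataset, 2=Metric, 3=Score."""
    gold_vals = {t[idx] for t in gold}
    pred_vals = {t[idx] for t in pred}
    tp = len(gold_vals.intersection(pred_vals))
    p = tp / len(pred_vals) if pred_vals else 0.0
    r = tp / len(gold_vals) if gold_vals else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("Image Classification", "ImageNet", "Top 1 Accuracy", "68.29")]
pred = [("Image Classification", "ImageNet", "Accuracy", "68.29")]
# Task element matches exactly:   element_prf(gold, pred, 0) -> (1.0, 1.0, 1.0)
# Metric element does not match:  element_prf(gold, pred, 2) -> (0.0, 0.0, 0.0)
```
</preformat>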
        <p>Phase 1 In Table 4 and Table 5, we show the performance of our submissions for Phase 1. Values in
bold are above the average for all submissions. The results in Table 4 show that the AxCell system
outperforms GPT-3.5 submissions in all measures except Accuracy, indicating that GPT-3.5 better
identifies papers containing TDMS. Generally, in Table 5, the Precision, Recall, and F1 for GPT35-zero
are lower than those of AxCell.</p>
      <p>Phase 2: In Table 6 and Table 7, we show the performance of our submissions for Phase 2. Values
in bold are above the average for all submissions. Similar to the previous phase, the AxCell system
outperforms the LLM submissions in all measures except for some cases in Accuracy. (Due to space constraints, we
report only the overall values; the full results can be found at
https://docs.google.com/spreadsheets/d/1k82FmlztEBiNkKuskAZsNaovkblHeqmpKzD5C63Mn5Q/.)
With regard to our aimed experiments, we notice the following:</p>
      <list list-type="order">
        <list-item>
          <p>Zero-shot vs. few-shot settings (GPT35-fil-zero vs. GPT35-fil-few): The performance of
GPT35-fil-few is better than that of GPT35-fil-zero, showing that providing examples helps the model
detect TDMS. This is confirmed by comparing GPT35-few with AxCell, where GPT35-few is
better than AxCell in most cases.</p>
        </list-item>
        <list-item>
          <p>Full text input vs. AZ input only (GPT35-few vs. GPT35-az-few): Generally, the performance
of the GPT35-few model is better than that of GPT35-az-few, except for the Inexact
Precision. The significance of the difference in the results needs to be verified by further experiments.</p>
        </list-item>
        <list-item>
          <p>Providing the LLM with helpful material (GPT35-few vs. GPT35-info-few, GPT35-az-few
vs. GPT35-az-info-few, and Mistral-fil-zero vs. Mistral-fil-info-zero): Models given helpful
materials about datasets and tasks (GPT35-info-few, GPT35-az-info-few, and Mistral-fil-info-zero)
perform worse than their counterparts (GPT35-few, GPT35-az-few, and Mistral-fil-zero). We
suspect that the helpful materials may mislead the models.</p>
        </list-item>
      </list>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, we present a discussion based on a manual investigation comparing the
output of the proposed models with each other and with the ground truth annotations.</p>
      <p>AxCell vs. GPT-3.5 The AxCell system generally outperforms the LLM submissions, particularly the
GPT-3.5 submissions, in all metrics except for Accuracy. This suggests that GPT35-zero and GPT35-few
are better at identifying papers containing TDMS.</p>
      <p>Through manual analysis, we identify common causes of TDMS extraction errors for GPT35-zero and
compare them to AxCell and the ground truth data in Table 8. From the analysis, we found that GPT35-zero
extracts a broad range of information, but with inconsistencies and potential errors. For the “Task”
entity, it predicts a mix of task names, some matching the ground truth and others deviating. As a result,
further investigation of potential hallucination is needed. For the “Dataset” entity, it sometimes
combines multiple datasets despite correctly identifying individual ones. GPT35-zero also predicts
unconventional “Metric” values not present in the ground truth data, such as Epoch Divergence. The “Score”
entities are a mix of percentages, raw scores, and string values.</p>
      <p>While AxCell outperforms GPT35-zero on the “Task” and “Dataset” entities, showing consistency
with the ground truth data, it struggles with accurately predicting the “Metric” entity. Both AxCell
and GPT35-zero often predict different metric names, such as Percentage error and Accuracy instead of
Percentage correctness. Additionally, AxCell’s predicted “Score” entities differ from the ground truth data.
Despite these drawbacks, it is evident that the AxCell system produces better results in the entity-level
evaluation compared to the GPT35-zero model.</p>
      <p>We noticed that both models extracted new tuples with respect to the ground truth data.
Thus, we checked whether these tuples exist in the original paper. AxCell and GPT35-zero
could accurately predict the task name mentioned in the paper, such as Semantic Segmentation, along
with the correct associated “Metric” name. However, for AxCell, the predicted “Dataset” name was
incorrect, which might be due to a parsing error that requires further investigation. On the other
hand, GPT35-zero could partially predict the correct “Dataset” name (StreetHazard). Nevertheless, the
“Score” entity remained challenging for both models. Additionally, according to our observations, the
annotated ground truth data is based on data collected from the results on PapersWithCode. As a result,
we consider that further work needs to be done to expand the annotated ground truth data sources.</p>
      <p>GPT-3.5 vs. Mistral: Our experiments revealed that the Mistral-fil-zero language model outperforms
the GPT35-zero model across all evaluation metrics during Phase 2. Through manual analysis, we
observed that Mistral-fil-zero could more effectively identify TDMS from the input text which are also
present in the ground truth data, compared to GPT35-fil-zero. For this experiment, we utilized the
same prompt for both LLM models. However, we noticed that in many cases GPT35-fil-zero returned
“unanswerable” results, as presented in Table 9. We hypothesize that these cases, where GPT35-fil-zero
failed to extract the TDMS, caused its overall score to be lower than Mistral-fil-zero’s.</p>
      <p>From the analysis, Mistral-fil-zero accurately predicts the “Dataset” entities according to the ground
truth. However, inconsistencies arise with the “Task”, “Metric”, and “Score” entities. Although
Mistral-fil-zero can extract some “Score” entities similar to the ground truth, there were differences in its
“Score” predictions compared to the ground truth. Potential hallucinations were observed, such as in the
last row of the table, where it extracted “Metric” and “Score” entities not found in the original paper.</p>
      <p>Fine-Grained Image Classification - iNaturalist - Top 1 Accuracy - 68.2
Fine-Grained Image Classification - Stanford Cars - Accuracy - 93.8%</p>
      <p>Fine-Grained Image Classification - CUB-200-2011 - Accuracy - 87.9</p>
      <p>Fine-grained image recognition - CUB-200-2011 - Recognition accuracy - 85.3
Image Classification - iNaturalist - Top 1 Accuracy - 31.12%
Image Classification - iNaturalist - Top 5 Accuracy - 52.76%
Image Classification - ImageNet - Top 1 Accuracy - 68.29%
k-Nearest-Neighbor (kNN) search - ImageNet - Top-1 accuracy - 68.29
k-Nearest-Neighbor (kNN) search - ImageNet - Top-5 accuracy - 87.75
k-Nearest-Neighbor (kNN) search - iNaturalist - Top-5 accuracy - 52.76
k-Nearest-Neighbor (kNN) search - iNaturalist - Top-1 accuracy - 31.12</p>
      <p>Triplet evaluation - PIT - Triplet accuracy (m=0.2) - 87.16</p>
      <p>Further investigation is needed to understand this issue. Nevertheless, Mistral-fil-zero consistently
predicts the “Task” and “Dataset” entities.</p>
      <sec id="sec-5-1">
        <title>GPT-3.5 Zero-shot vs. Few-shot Settings</title>
        <p>The performance of the GPT35-fil-zero and GPT35-fil-few models is comparable, as presented in
Tables 6 and 7. While the GPT35-fil-few model performs slightly better than GPT35-fil-zero in all
evaluation metrics, further experiments might be needed to generalize the results. Table 10 presents
a sample of the output generated by both models for the arXiv ID 1711.05225v3, in comparison to
the ground truth data.</p>
        <p>We observe that both models extract a wide range of information, including some inconsistent and
inaccurate information. This likely contributes to their lower performance, in addition to errors from the
filtering process. For the “Task” entity, both models predict different task names, one of which matches
the ground truth data. Both models could extract the “Dataset” and the “Metric” entities accurately.
However, similar to the previous discussion, extracting the “Score” entity is challenging, often resulting
in string values. Besides, we argue that the filtering process may have misled both models,
affecting their performance compared to the other models proposed in Phase 2.</p>
        <p>
          To sum up: Extracting TDMS entities is challenging, particularly the “Score” and
“Dataset” entities. The proposed models struggle with extracting the “Score” entity. This difficulty arises
from the diverse formats authors use to present results, such as tables, graphs, or plain text. Additionally,
the variability in naming conventions for the “Dataset” entity across papers poses a challenge, as noted
by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, our models exhibit less variability in extracting the “Dataset” entity compared to
the ground truth, which may use naming conventions not found in the papers. Further investigation
is needed to understand the construction of the ground truth data. Consistent with [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], extracting
the “Task” entity is relatively more straightforward, as it is rarely referenced differently across papers
addressing the task.
        </p>
        <p>Overall, the AxCell system shows better performance compared to the LLM
submissions, both GPT-3.5 and Mistral, in all submissions. However, the two variants of the GPT-3.5 model,
GPT35-zero and GPT35-few, surpass AxCell in terms of Accuracy, indicating that these models are
better at identifying papers containing TDMS information. While AxCell consistently extracts each
entity of the TDMS, particularly “Task”, “Metric”, and “Dataset”, there was an error where the dataset name
was not present in the publication.</p>
        <p>This could be due to the underlying taxonomy, which maps the dataset to the best fitting dataset in
this taxonomy. Additionally, we discovered after the deadline that there is a small overlap (around 5%)
between AxCell’s training dataset and the test data of the shared task, as both used PapersWithCode as
ground-truth data. Such contamination is quite frequent nowadays, as many LLMs do not disclose
their training data, and therefore many baselines are already compromised [13, 14]. In the future, we
plan to investigate how this bias affects results and explore ways to mitigate such contamination.</p>
        <p>Moreover, Large Language Models (LLMs) show performance comparable to the AxCell system.
Incorporating one or more examples (few-shot learning) into the prompt improves TDMS extraction
quality. Nevertheless, the experiments heavily rely on prompts, which may influence the models’
output during evaluation. This observation aligns with findings that the quality of outputs
from conversational LLMs is directly influenced by the quality of the prompts provided by users [15].
Therefore, further investigation and refinement of the prompt engineering process are essential.</p>
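        <p>For instance, the few-shot prompts in Appendix A are assembled from fixed instruction text plus worked input/output examples; a minimal sketch of such splicing (the helper name and the simplified delimiters are ours, not the exact templates used):

```python
def build_few_shot_prompt(instructions, examples, article_text):
    """Splice numbered input/output examples between instructions and the new input."""
    parts = [instructions]
    for i, (inp, out) in enumerate(examples, start=1):
        parts.append(f"Example {i}\nInput: {inp}\nExpected Output Format:\n{out}")
    parts.append(f"INPUT:\n{article_text}\nOUTPUT:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Extract (task, dataset, metric, score) tuples as a JSON list.",
    [("toy abstract A", '[{"Task": "t"}]'), ("toy abstract B", "[]")],
    "new article text",
)
print(prompt.count("Example"))  # -> 2
```

Changing only the examples or delimiters in such a template changes what the model returns, which is why the zero-shot and few-shot runs behave differently.</p>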
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Summary and Future Work</title>
      <p>We have presented our approach to the SOTA challenge of the SimpleText shared task, composed
of two modules: (1) applying rule-based classification to verify whether a paper contains TDMS, and (2)
extracting TDMS from papers with result information. We used the AxCell implementation as a baseline
for TDMS extraction and experimented with GPT-3.5 and Mistral as LLMs in zero-shot and few-shot
settings with different input information. The results show that AxCell outperforms the LLMs when the
zero-shot prompting paradigm is applied. The LLMs, on the other hand, surpass AxCell’s performance in
few-shot settings. We conducted a manual investigation and showed that the LLM instructions can be
misleading for the “Dataset” and “Score” extraction in zero-shot settings, with improvements seen in
few-shot settings. We argue that the LLM outputs are sensitive to the instructions given through the
prompts.</p>
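      <p>The two-module design can be sketched as follows (the stand-in classifier and extractor below are illustrative only, not our actual rule set or models):

```python
def extract_tdms(paper_text, has_tdms, extract):
    """Two-stage pipeline: (1) rule-based TDMS check, (2) TDMS tuple extraction."""
    if not has_tdms(paper_text):   # stage 1: does the paper report results at all?
        return []                  # papers without TDMS get an empty answer
    return extract(paper_text)     # stage 2: AxCell- or LLM-based extraction

# Illustrative stand-ins for the two modules.
has_tdms = lambda text: "accuracy" in text.lower()
extract = lambda text: [{"Task": "classification", "Dataset": "toy",
                         "Metric": "Accuracy", "Score": "0.90"}]
print(extract_tdms("No experiments here.", has_tdms, extract))  # -> []
```
</p>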
      <p>Our discussion (Section 5) points to several directions for future work, including expanding the ground
truth dataset with data from the papers and investigating potential hallucinations in the
LLM extractions. Additionally, our findings suggest that the performance of LLMs given only the
sections of the paper text referring to experiments and results is comparable to that of LLMs given
the full paper text. Furthermore, the results of LLMs in few-shot settings are comparable to, and
sometimes better than, the AxCell system, and the open-source Mistral model outperforms the GPT-3.5
model. To verify these assumptions, we plan to repeat the experiments and conduct statistical analysis.
</p>
      <p>[12] S. Teufel, Argumentative zoning: Information extraction from scientific text, Ph.D. thesis, Citeseer, 1999.
[13] S. Balloccu, P. Schmidtová, M. Lango, O. Dusek, Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 67–93. URL: https://aclanthology.org/2024.eacl-long.5.
[14] O. Sainz, J. Campos, I. García-Ferrero, J. Etxaniz, O. L. de Lacalle, E. Agirre, NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 10776–10787. URL: https://aclanthology.org/2023.findings-emnlp.722. doi:10.18653/v1/2023.findings-emnlp.722.
[15] X. Ma, J. Li, M. Zhang, Chain of thought with explicit evidence reasoning for few-shot relation extraction, 2024. arXiv:2311.05922.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Prompts for LLMs</title>
      <sec id="sec-7-1">
        <title>A.1. Zero-Shot Prompt</title>
        <p>&lt;&lt; FORMATTING &gt;&gt;
Answer in the form of a list of JSON objects as follows.</p>
        <p>The output should be a markdown code snippet formatted as a list of JSON objects in
the following schema, including the leading and trailing "```json" and "```":
```json
[
{
}
]
```
"Task": string // Extract the research problem or focus mentioned in the paper. Use '' if not available.
"Dataset": string // Extract the dataset(s) used for the machine learning experiments. Use '' if not available.
"Metric": string // Extract the evaluation measure(s) used to assess the models' performance. Use '' if not available.
"Score": string // Extract the best numeric value(s) representing the model's performance on a specific metric. Use '' if not available.
{{</p>
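        <p>A reply in this format can be post-processed by stripping the markdown fence before JSON parsing; a minimal sketch (our illustration, not the exact post-processing used in the submissions):

```python
import json
import re

FENCE = "`" * 3  # the markdown code fence, built here to avoid writing it literally

def parse_llm_json(reply):
    """Pull the JSON list out of a reply fenced with json code markers."""
    match = re.search(FENCE + r"json\s*(.*?)\s*" + FENCE, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply  # fall back to the raw reply
    return json.loads(payload)

reply = FENCE + 'json\n[{"Task": "Keyword Spotting", "Score": "0.45"}]\n' + FENCE
print(parse_llm_json(reply)[0]["Score"])  # -> 0.45
```
</p>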
      </sec>
      <sec id="sec-7-2">
        <title>A.2. Few-Shot Prompt</title>
        <p>"task": "Keyword Spotting",
"dataset": "Hey Siri",
"metric": "Error Rate",
"score": "0.45"
&lt;&lt; Example 3 &gt;&gt;
Input: This paper is concerned with the form of typed name binding used by the
FreshML family of languages. Its characteristic feature is that a name binding
is represented by an abstract (name, value)-pair that may only be deconstructed
via the generation of fresh bound names. The paper proves a new result about
what operations on names can co-exist with this construct. In FreshML the only
observation one can make of names is to test whether or not they are equal.</p>
        <p>This restricted amount of observation was thought necessary to ensure that there
is no observable difference between alpha-equivalent name binders. Yet from an
algorithmic point of view it would be desirable to allow other operations and
relations on names, such as a total ordering. This paper shows that, contrary
to expectations, one may add not just ordering, but almost any relation or
numerical function on names without disturbing the fundamental correctness
result about this form of typed name binding (that object-level alpha-equivalence
precisely corresponds to contextual equivalence at the programming meta-level),
so long as one takes the state of dynamically created names into account.</p>
        <p>Expected Output Format:
[ ]</p>
        <p>Additional formatting instructions:
Provide the output as a list of valid JSON objects, using the above description.
If a value is not available or applicable, return an empty list.</p>
        <p>Make sure the JSON output is properly formatted.</p>
        <p>Don't extract text or information in the 'Score' field. Extract only the numeric
values that indicate the 'Score'.</p>
        <p>Be strict when returning the score. Only include numeric values.</p>
        <p>Do not combine multiple datasets or metrics or scores for the same task into a
single JSON object.</p>
        <p>Now, given a new scholarly article, your task is to determine if it contains
mentions of all four elements (task, dataset, metric, score) together. If the
same task is evaluated on multiple datasets or with multiple metrics, each
combination should be represented as a separate JSON object within the array, as
long as the task, dataset, and metric are related to each other. Generate your
own output based on the input text, using the examples only as a reference for
the desired format:
&lt;&lt; INPUT &gt;&gt;
{input_content}
&lt;&lt; OUTPUT &gt;&gt;
"""</p>
      </sec>
      <sec id="sec-7-3">
        <title>A.3. Few-Shot Prompt with data from PwC</title>
        <p>"""
You are an expert in machine learning who can identify if a research paper contains
certain key elements related to tasks, datasets, metrics, and scores.</p>
        <p>Specifically, you need to look for the following:
Task: A phrase describing the research problem or focus, often found in the title,
abstract, introduction, or results tables/discussion.</p>
        <p>Dataset: A mention of the dataset(s) used for the machine learning experiments,
usually located near the task mentions.
Metric: Phrases referring to the evaluation measures used to assess the models'
performance on the given task and dataset. These are commonly found in results
tables/figures and the discussion section.</p>
        <p>Score: The numeric value(s) representing the model's performance on a specific
metric. Multiple scores may be reported for a single metric, in which case the
best score should be identified.</p>
        <p>Here are a few examples of the input and expected output format:
&lt;&lt; Example 1 &gt;&gt;
Input: Streaming keyword spotting is a widely used solution for activating voice
assistants. We apply our method for 'hey Siri' detection. Compared to the best
of the two prior works, our method reduces the FRR from 1.7% to 0.45%, which
yields about 73% relative FRR improvement.</p>
        <p>Expected Output Format:
[
&lt;&lt; Example 3 &gt;&gt;
Input: This paper is concerned with the form of typed name binding used by the
FreshML family of languages. Its characteristic feature is that a name binding
is represented by an abstract (name, value)-pair that may only be deconstructed
via the generation of fresh bound names. The paper proves a new result about
what operations on names can co-exist with this construct. In FreshML the only
observation one can make of names is to test whether or not they are equal.
This restricted amount of observation was thought necessary to ensure that there
is no observable difference between alpha-equivalent name binders. Yet from an
algorithmic point of view it would be desirable to allow other operations and
relations on names, such as a total ordering. This paper shows that, contrary
to expectations, one may add not just ordering, but almost any relation or
numerical function on names without disturbing the fundamental correctness
result about this form of typed name binding (that object-level alpha-equivalence
precisely corresponds to contextual equivalence at the programming meta-level),
so long as one takes the state of dynamically created names into account.</p>
        <p>Expected Output Format:
&lt;&lt; FORMATTING &gt;&gt;
Answer in the form of a list of JSON objects as follows. The output should be a
markdown code snippet formatted as a list of JSON objects in the following
schema, including the leading and trailing "```json" and "```":</p>
        <p>Now, given: (1) a new scholarly article, (2) a list of datasets, and (3) a list of
tasks that we manually identified in the article as helping materials for you.
Your task is to determine if the article contains mentions of all four elements
(task, dataset, metric, score) together using the lists of the datasets and the
tasks. You can also extract other datasets and tasks outside the given lists. If
the same task is evaluated on multiple datasets or with multiple metrics, each
combination should be represented as a separate JSON object within the array, as
long as the task, dataset, and metric are related to each other. Generate your
own output based on the input text, using the examples only as a reference for
the desired format:
&lt;&lt; INPUT &gt;&gt;
scholarly article:
{input_tex_content}
datasets list: {dataset_list}
tasks list: {tasks_list}</p>
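        <p>The prompt's placeholders (the article text plus the dataset and task lists taken from PapersWithCode) can be filled by plain string substitution; a minimal sketch (the helper name and example values are ours), using replace rather than str.format so the literal JSON braces in the template survive:

```python
def fill_prompt(template, article, datasets, tasks):
    """Substitute the placeholders without disturbing literal braces elsewhere."""
    return (template
            .replace("{input_tex_content}", article)
            .replace("{dataset_list}", ", ".join(datasets))
            .replace("{tasks_list}", ", ".join(tasks)))

template = ("scholarly article:\n{input_tex_content}\n"
            "datasets list: {dataset_list}\n"
            "tasks list: {tasks_list}")
print(fill_prompt(template, "toy article text", ["SQuAD"], ["Question Answering"]))
```
</p>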
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kabongo</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Automated mining of leaderboards for empirical AI research</article-title>
          , in: H.
          <string-name>
            <surname>-R. Ke</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Sugiyama (Eds.),
          <source>Towards Open and Trustworthy Digital Societies</source>
          , Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>453</fpage>
          -
          <lpage>470</lpage>
          . doi:10.1007/978-3-030-91669-5_35.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kabongo</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>ORKG-leaderboards: a systematic workflow for mining leaderboards as a knowledge graph</article-title>
          ,
          <source>International Journal on Digital Libraries</source>
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <fpage>41</fpage>
          -
          <lpage>54</lpage>
          . URL: https://doi.org/10.1007/s00799-023-00366-1. doi:10.1007/s00799-023-00366-1.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          , M. van Zuylen,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Beltagy</surname>
          </string-name>
          ,
          <article-title>SciREX: A challenge dataset for documentlevel information extraction</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7506</fpage>
          -
          <lpage>7516</lpage>
          . URL: https://aclanthology.org/2020.acl-main.670. doi:10.18653/v1/2020.acl-main.670.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , E. SanJuan, S. Huet,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 simpletext track - improving access to scientific texts for everyone</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science (LNCS)</source>
          , Springer, Heidelberg, Germany,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kabongo</surname>
            ,
            <given-names>H. B.</given-names>
          </string-name>
          <string-name>
            <surname>Giglou</surname>
            ,
            <given-names>Y. Zhang,</given-names>
          </string-name>
          <article-title>Overview of the CLEF 2024 simpletext task 4: SOTA? tracking the state-of-the-art in scholarly publications</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings</source>
          , CEUR-WS,
          <year>Online</year>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kardas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Czapla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , R. Stojnic,
          <article-title>AxCell: Automatic extraction of results from machine learning papers</article-title>
          , in: B.
          <string-name>
            <surname>Webber</surname>
            , T. Cohn,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>8580</fpage>
          -
          <lpage>8594</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.692. doi:10.18653/v1/2020.emnlp-main.692.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Riedel</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Vikraman</surname>
            ,
            <given-names>A. McCallum,</given-names>
          </string-name>
          <article-title>SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications</article-title>
          , in: S. Bethard,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carpuat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Apidianaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mohammad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          , D. Jurgens (Eds.),
          <source>Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>546</fpage>
          -
          <lpage>555</lpage>
          . URL: https://aclanthology.org/S17-2091. doi:10.18653/v1/S17-2091.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gábor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.-K. Schumann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>QasemiZadeh</surname>
          </string-name>
          , H. Zargayouna, T. Charnois,
          <article-title>SemEval2018 task 7: Semantic relation extraction and classification in scientific papers</article-title>
          , in: M.
          <string-name>
            <surname>Apidianaki</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>May</surname>
            , E. Shutova,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
          </string-name>
          , M. Carpuat (Eds.),
          <source>Proceedings of the 12th International Workshop on Semantic Evaluation</source>
          , Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>679</fpage>
          -
          <lpage>688</lpage>
          . URL: https://aclanthology.org/S18-1111. doi:10.18653/v1/S18-1111.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jochim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gleize</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Identification of tasks, datasets, evaluation metrics, and numeric scores for scientific leaderboards construction</article-title>
          , in:
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>5203</fpage>
          -
          <lpage>5213</lpage>
          . URL: https://www.aclweb.org/anthology/P19-1513. doi:10.18653/v1/P19-1513.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kabongo</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Zero-shot entailment of leaderboards for empirical AI research</article-title>
          ,
          <source>in: 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>241</lpage>
          . URL: https://ieeexplore.ieee.org/document/10265895/. doi:10.1109/JCDL57899.2023.00042.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tensmeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wigington</surname>
          </string-name>
          ,
          <article-title>TELIN: Table entity LINker for extracting leaderboards from machine learning publications</article-title>
          , in: T. Ghosal,
          <string-name>
            <given-names>S.</given-names>
            <surname>Blanco-Cuaresma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Accomazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Patton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grezes</surname>
          </string-name>
          , T. Allen (Eds.),
          <source>Proceedings of the first Workshop on Information Extraction from Scientific Publications</source>
          , Association for Computational Linguistics, 2022, pp. 20-25. URL: https://aclanthology.org/2022.wiesp-1.3.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>