<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AMATU@Simpletext2024: Are LLMs Any Good for Scientific Leaderboard Extraction?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Moritz Staudinger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alaa El-Ebshihy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annisa Maulida Ningtyas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florina Piroi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Studios Austria</institution>
          ,
          <addr-line>Data Science Studio</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität Wien</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <issue>0</issue>
      <fpage>09</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>In this paper, we present our approach to solving the SOTA challenge of the SimpleText shared task at CLEF 2024. The objective of the challenge is to extract all (Task, Dataset, Metric, Score) tuples from scientific papers that report model scores on benchmark datasets. In this work, we propose a rule-based classification model to identify papers that report score information. We then apply different methods to extract TDMS using: (1) a baseline model from the literature, and (2) two Large Language Models (LLMs), GPT-3.5 and Mistral. Results show that the baseline model outperforms the LLMs in most cases, especially in zero-shot settings, with improvements seen in few-shot settings. Manual investigation shows that extracting TDMS from paper text is challenging, particularly for "Dataset" and "Score" extraction.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Text Extraction</kwd>
        <kwd>State-of-the-art</kwd>
        <kwd>Entity Extraction</kwd>
        <kwd>Relation Extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In our data-driven world, the volume of published literature, including newspaper articles, social media
posts, and scientific publications, is rapidly increasing. Since technological and scientific advancements
are generally communicated through scientific publications, it is important to find and keep track of
significant advances and challenges in various scientific fields. With the never-ending flow of new
scientific publications (e.g. 1,000 new ML publications per month on arXiv 1 alone), it is becoming
increasingly difficult to keep up to date with the state-of-the-art for a given scientific task and compare
new research contributions with previous ones.</p>
      <p>
        In the particular computer science area of experimentation and evaluation of ML/IR models, assessing
the effectiveness of a new model or algorithm is difficult due, not least, to heterogeneous reporting
styles. To address this, one possibility is to create machine-readable results by either (1) creating machine-actionable
publications (that is, publications prepared in such a way that they contain further, specifically
formatted data which can be automatically processed and harvested correctly by algorithms) or (2)
standardizing the evaluation and experimentation environment. The first path, if established, would
allow creating and extracting comparable results with minimal overhead for researchers, as the results
could be collected in a standardized format along with the written submissions. This is, though, unlikely
to happen in the near future, as it requires a critical mass of scientists to alter their documentation
habits, as stated by Kabongo et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Standardizing experimentation and evaluation environments is done through shared tasks or
evaluation labs/tracks, which standardize the evaluation environment to evaluate the state-of-the-art
performance of predefined tasks, with given metrics and hidden evaluation datasets. Although the results of
these challenges are valuable to the community, they only provide comparable results for a few selected
tasks and datasets, but not for the vast variety of research. Therefore, many research publications are
not comparable, as they do not follow standardized evaluation strategies, use variations of datasets [? ]
or propose new tasks and datasets.</p>
      <p>
        Therefore, extracting scientific entities from scholarly articles is currently the best option for enabling
comparable and machine-actionable results throughout the scientific community. Although platforms
like PapersWithCode, AI-metrics, NLP-Progress, and the Open Research Knowledge Graph [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] enable
the comparison of research results on given datasets, they are manually curated, therefore limited by
the crowd-sourcing resources of the community and subject to human error.
      </p>
      <p>
        Automated extraction of predefined scientific entities, such as Task-Dataset-Metric-(Score) (TDMS) [
        <xref ref-type="bibr" rid="ref1 ref3">3,
1</xref>
        ], can support automatic knowledge base population and the comparability of research contributions.
      </p>
      <p>
        In this work, we present our approaches for the SimpleText State-of-the-Art (SOTA) Extraction
Challenge at CLEF 2024 [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The task is to determine whether a given scientific paper reports model
scores on benchmark datasets and, if so, extract all TDMS tuples. We use a large language model
(LLM)-based approach to extract and combine dependent scientific entities (e.g. a score depends on
a given dataset, task, and metric) in terms of the TDMS objective, and discuss improving this task’s
performance with two distinct rule-based prefiltering systems for faster and more accurate extraction.
As a baseline, we use the extraction tool presented by Kardas et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The remainder of this work is structured as follows. In Section 2, we discuss related work in the area
of scientific entity extraction. In Section 3, we discuss which methods we applied to approach the task
challenge. In Section 4, we present our results for the shared task test sets. In Section 5, we discuss our
approaches and limitations, which we then follow up with a summary and a brief outlook in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        In the area of IR and ML evaluation, one way to follow advancements is the automatic
generation of leaderboards by extracting data from scholarly articles, with the CLEF SimpleText
State-of-the-Art extraction task contributing to the evaluation of these efforts. Augenstein et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] organized a
Shared Task at the SemEval Workshop 2017, where participants had to extract three types of entities
from scientific paragraphs: Task, Method, and Material. This work was extended by Gabor et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
where additionally to analyzing paragraphs, the authors have also used annotated abstracts.
      </p>
      <p>
        Independent of an evaluation lab or shared task, research on information extraction from publication
texts usually creates specific datasets on which the proposed models are experimented with and evaluated. Jain et al.
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced the SciREX dataset and model, which extracts dataset, metric, task, and method entities
from a corpus of 1,170 ML articles from PapersWithCode. Hou et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] published the first datasets
for Task, Dataset, Metric, and Score (TDMS) extraction in the NLP domain using distant supervision
annotations. Kardas et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] developed the AxCell pipeline to process LaTeX source code and extract
TDMS. Kabongo et al. [
        <xref ref-type="bibr" rid="ref1 ref10 ref2">1, 10, 2</xref>
        ] focused on mining for TDMS tuples, with the goal of automatically
populating the Open Research Knowledge Graph (ORKG) with this information. Automating this process
has the potential to accelerate the growth of the ORKG, permitting an easier comparison of research
results across scholarly articles. Yang et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] analyzed existing solutions assessing their limitations
and proposed an approach that does not require LaTeX sources, is not limited to a closed taxonomy
(e.g. not limited to extracting TDMS), and requires less supervision than previous solutions, like, for
example, Kardas et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>The SOTA challenge aims to extract all TDMS tuples from a given arXiv scientific paper, labeling the
article as “unanswerable” if no TDMS tuple is found. We first inspected the training and validation sets to
understand the distribution of articles that contain TDMS tuples versus “unanswerable” articles. The
training dataset includes 9,352 articles with LaTeX sources and annotations: 5,274 with TDMS tuples
and 4,078 labeled as “unanswerable”. Similarly, the validation set is composed of 100 articles: 50 articles
with TDMS tuples and 50 articles labeled as “unanswerable”. Both sets are thus roughly evenly
distributed between articles that contain TDMS tuples and “unanswerable” ones.</p>
      <p>
        Therefore, our approach to solving the challenge consists of two modules: (1) filtering unanswerable
documents by applying a rule-based binary classification model to identify papers that do not contain
TDMS, and (2) TDMS extraction to identify all TDMS tuples in a given paper. For the TDMS extraction we
experiment with AxCell [
experiment with AxCell [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as a baseline model, and with GPT-3.5 and Mistral LLMs.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Filtering out unanswerable documents</title>
        <p>We apply a rule-based binary classification method to recognize papers which are classified as
“unanswerable”. These papers are excluded from further processing, thereby reducing the cost of
running the advanced models. For this, we evaluated three rule-based settings with similar
configurations but different outcomes, aiming for high recall to ensure only clearly unanswerable documents
are filtered out. Each setting was evaluated based on Precision, Recall, and Accuracy in identifying
“unanswerable” articles in the validation dataset. Table 1 shows the configurations tested. The first
two methods assess section titles to determine if a paper is “unanswerable”. The first approach (Result
Section Exists) checks if any section title includes the terms result, experiment, or evaluation, indicating
the presence of scores. The second approach (Result Section Exists with add. terms) extends this
idea with the two additional terms comparison and performance. To build on the idea of section title
detection, we further scan for any tables in the result section (Result Table Exists). This was done by
heuristically checking whether the result section of the LaTeX source code contains the phrase \begin{table,
instead of only looking at the section name. Although this improved Precision and Accuracy, the Recall
dropped significantly.</p>
        <p>As a result, we chose the first method presented, which filters on the basis of the three section names.
This method yields a Recall similar to that of the second approach, with only three papers of the validation
dataset classified differently.</p>
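        <p>The first rule can be sketched as follows. This is a minimal re-implementation with our own function names and regular expression, assuming plain LaTeX sources as input; it is an illustration, not the shared-task code itself.</p>
<preformat>
```python
import re

# Terms whose presence in a section title suggests the paper reports scores
# (the "Result Section Exists" rule described above).
RESULT_TERMS = ("result", "experiment", "evaluation")

def section_titles(latex_source: str) -> list:
    """Extract \\section{...} / \\subsection{...} titles from LaTeX source."""
    return re.findall(r"\\(?:sub)*section\*?\{([^}]*)\}", latex_source)

def is_answerable(latex_source: str) -> bool:
    """A paper is kept ('answerable') if any section title mentions a
    result-related term; otherwise it is filtered out as 'unanswerable'."""
    return any(
        term in title.lower()
        for title in section_titles(latex_source)
        for term in RESULT_TERMS
    )
```
</preformat>
        <p>The stricter "Result Table Exists" variant would additionally check the matching section body for the phrase \begin{table before keeping the paper.</p>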
      </sec>
      <sec id="sec-3-2">
        <title>3.2. TDMS extraction models</title>
        <p>
          We experimented with TDMS extraction using different models (see Table 2). We divide the
experiments into two types: a baseline model and Large Language Models (LLMs). As a baseline model
we utilize AxCell, presented by Kardas et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. For the LLM models, we utilize GPT-3.5 and
Mistral (an open-source LLM) in zero-shot and few-shot settings, with different sources as input for the
prompt text. We detail the implementation in the following sections.
        </p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Configuration of the TDMS extraction models: filtering of unanswerable papers, zero- or few-shot prompting, fulltext or az input, and PwC information.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Id</th><th>Filtered</th><th>Shot setting</th><th>Input</th><th>PwC information</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>✓</td><td>–</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>2</td><td>✗</td><td>zero</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>3</td><td>✓</td><td>zero</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>4</td><td>✗</td><td>few</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>5</td><td>✓</td><td>few</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>6</td><td>✗</td><td>few</td><td>fulltext</td><td>✓</td></tr>
              <tr><td>7</td><td>✗</td><td>few</td><td>az</td><td>✗</td></tr>
              <tr><td>8</td><td>✗</td><td>few</td><td>az</td><td>✓</td></tr>
              <tr><td>9</td><td>✓</td><td>zero</td><td>fulltext</td><td>✗</td></tr>
              <tr><td>10</td><td>✓</td><td>zero</td><td>fulltext</td><td>✓</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-2-2">
          <title>3.2.1. Baseline Model - AxCell</title>
          <p>
            We use the implementation of the AxCell system [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] (model Id 1 in Table 2). AxCell is a machine
learning pipeline which extracts TDMS tuples from scientific papers by combining neural networks
for text extraction and table extraction. The extracted linking candidates are then verified through
caption classification, mention lookup, and table segmentation, before being merged into TDMS tuples.
          </p>
          <p>To use AxCell for this shared task, we first download the eprint version of the arXiv publications
from the test datasets to create the same data structure as in the original paper. Then we filter out
"unanswerable" papers using the previously mentioned rule-based classification system (Section 3.1).
The eprint is then processed by the AxCell extraction script to obtain all the unpacked LaTeX sources,
graphics, and an HTML version of the article. The processed articles are fed to a neural network which
first extracts scientific entity candidates (Task, Dataset, Metric) and then tries to link them with the
scores extracted from the tables. Each table cell is thereby annotated with meta-information, namely the
TDM combination used to obtain a given score.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.2. LLMs</title>
          <p>We use two LLMs in our experiments: GPT-3.5 (models 2-8) and Mistral (models 9-10), the latter as an
open-source alternative to OpenAI’s GPT model. We divide our experiments using different criteria.</p>
          <p>Instructions to the model: We set up models in a zero-shot setting, where we give the LLM
instructions only (models 2, 3, 9, and 10), and models in a few-shot setting, where we give instructions
and a small number of examples (models 4 to 8).</p>
          <p>Filtering out unanswerable papers: In some of our models (ids 3, 5, 9 and 10), we filter out
unanswerable papers with our selected rule-based classification (Section 3.1). In the others, we use the
complete test sets without filtering, letting the model decide if the paper contains TDMS or is unanswerable
(models 2, 4, and 6 to 8).</p>
          <p>Input text to the model: In models 2 to 6, 9 and 10, we use the full paper text as input to the LLM to
extract the TDMS. For models 7 and 8, inspired by Argumentative Zoning (AZ) [12], which defines the main
rhetorical structure of scientific articles, we extract only the text from sections referring to experiments
and results, in addition to the abstract. We believe that these sections contain the TDMS information
and can thus avoid processing the full paper text. We refer to the models that utilize only the text of the
experiments and results sections as using az input, while the others use full text input (Column 6 in Table 2).
For GPT-3.5, we used the "gpt-3.5-turbo-0125" variant in our experiments.</p>
          <p>Additional information from Papers with Code (PwC): We use PwC as a knowledge base to
collect lists of dataset and task names. These are exhaustive lists of available dataset and
task names, which we use to check whether any of these datasets or tasks are mentioned in
the full text of the paper. Although these lists contain all names of datasets and tasks of the test set (as
it was not specified at the time of evaluation that PwC was used as ground truth), they do not contain
any information on whether a dataset or task was used in a specific paper, information which can also be part
of the training data of LLMs. We used PwC in particular as it is one of the most frequently used platforms
for the comparison of research results and provides an API to easily access the data.</p>
          <p>After extracting the matching datasets and tasks, they are then appended to the input prompt as
helping materials for the LLMs.</p>
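          <p>This matching step can be sketched as follows. The helper name and the example lists are our own illustrative assumptions; in the actual pipeline, the lists are retrieved from the PwC API.</p>
<preformat>
```python
# Match PwC dataset/task names against the paper text (case-insensitive,
# verbatim substring match). Illustrative sketch, not the pipeline code.
def matching_pwc_entries(paper_text: str, pwc_names: list) -> list:
    """Return the PwC dataset/task names mentioned in the paper."""
    lowered = paper_text.lower()
    return [name for name in pwc_names if name.lower() in lowered]

# Hypothetical excerpts of the PwC name lists.
datasets = ["ImageNet", "CUB-200-2011", "iNaturalist"]
tasks = ["Image Classification", "Semantic Segmentation"]

text = "We evaluate image classification on ImageNet and iNaturalist."
hint = (matching_pwc_entries(text, datasets), matching_pwc_entries(text, tasks))
# hint -> (['ImageNet', 'iNaturalist'], ['Image Classification'])
```
</preformat>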
          <p>We construct our prompts to the LLMs from the following components: «task description», «examples»,
«output format», «additional instructions», «input text», and «PwC information». In «task description»,
we describe the task, inputs, and outputs. «examples» are provided only in the few-shot setting, showing
input and expected output. In «output format», we describe the expected output format, which is a list
of JSON objects covering the TDMS in the paper. «additional instructions» emphasize key points (e.g.,
Score should be a numerical value). «input text» is a placeholder for the paper text, either the full text
or sections describing results and experiments. «PwC information» includes lists of datasets and tasks
found in the paper using PwC. The prompts used in the experiments are in Appendix A.</p>
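          <p>The assembly of these components can be sketched as follows. This is a simplified illustration with our own function name and placeholder texts; the actual prompt wordings are those given in Appendix A.</p>
<preformat>
```python
# Assemble a prompt from the components described above; empty components
# (examples in zero-shot runs, PwC hints when unused) are simply skipped.
def build_prompt(task_description: str,
                 output_format: str,
                 additional_instructions: str,
                 input_text: str,
                 examples: str = "",
                 pwc_information: str = "") -> str:
    parts = [
        task_description,
        examples,               # empty in the zero-shot setting
        output_format,
        additional_instructions,
        pwc_information,        # empty unless PwC hints are used
        input_text,
    ]
    return "\n\n".join(p for p in parts if p)

# Zero-shot example with placeholder component texts.
prompt = build_prompt(
    task_description="Extract all (Task, Dataset, Metric, Score) tuples "
                     "from the given paper, or answer 'unanswerable'.",
    output_format="Return a list of JSON objects with keys "
                  "Task, Dataset, Metric, Score.",
    additional_instructions="Score must be a numerical value.",
    input_text="[paper full text or AZ sections here]",
)
```
</preformat>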
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment and Results</title>
      <p>In this section, we describe the experiments submitted during each phase of the competition using the
models described in Section 3. The competition consists of two phases: Few-shot Phase (Phase 1) and
Zero-shot Phase (Phase 2). We give information about the test datasets, experiments, and results for
each phase.</p>
      <sec id="sec-4-1">
        <title>4.1. Test datasets</title>
        <p>We run the experiments in each phase on the test datasets provided by the competition organizers. These
test datasets comprise the LaTeX sources of articles from arXiv. The test data includes 1,401 articles for
Phase 1 and 789 articles for Phase 2.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiments</title>
        <p>In Table 3, we show the models used for the experiments submitted in each phase of the competition (see
Section 3). For both phases, we consider the AxCell system as a baseline model. We use two and nine
LLM variations for Phase 1 and Phase 2, respectively.</p>
        <p>Phase 1: Our aim in this phase is to compare the performance of the GPT-3.5 LLM in zero-shot settings
against the AxCell baseline. We also consider the effect of filtering “unanswerable” papers (Section 3.1)
to observe performance changes when reducing the number of papers processed by GPT-3.5.</p>
        <p>Phase 2: In this phase, we compare the performance of LLM models in few-shot settings against
zero-shot settings, specifically comparing GPT35-fil-zero with GPT35-fil-few. We also compare
performance using the full paper text as input (i.e. GPT35-few) versus only sections referring to
experiments and results (i.e. GPT35-az-few). Additionally, we compare different LLM models (i.e.
GPT35-fil-zero vs. Mistral-fil-zero). Lastly, we observe the performance of the LLMs when providing
external helping materials representing datasets and tasks (i.e. GPT35-info-few, GPT35-az-info-few,
and Mistral-fil-info-zero).</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>In this section, we present the results of our submissions for each phase. The submissions are evaluated
based on: (1) Accuracy – measures whether the system can distinguish articles containing TDMS, (2)
Summary – measures quality of the extracted TDMS using Rouge1, Rouge2, RougeL, and RougeLsum,
and (3) precision, recall, and F1 – for each element in the TDMS (Task, Dataset, Metric, Score) tuples and
the overall average3. In the following, for each phase, we report the performance of our submissions
for Accuracy, Summary measures, and overall Precision, Recall, and F1.</p>
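        <p>For illustration, the per-element scoring can be sketched as follows. This is a simplified exact-match version under our own assumptions; the official evaluation script may normalize entities and match tuples differently.</p>
<preformat>
```python
# Precision/recall/F1 for one TDMS element, comparing the sets of values
# predicted for that element against the gold values (exact string match).
def element_prf(gold: list, pred: list, idx: int):
    """idx selects the element: 0=Task, 1=Dataset, 2=Metric, 3=Score."""
    gold_vals = {t[idx] for t in gold}
    pred_vals = {t[idx] for t in pred}
    tp = len(gold_vals.intersection(pred_vals))
    p = tp / len(pred_vals) if pred_vals else 0.0
    r = tp / len(gold_vals) if gold_vals else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("Image Classification", "ImageNet", "Top 1 Accuracy", "68.29")]
pred = [("Image Classification", "ImageNet", "Accuracy", "68.29")]
# Task element matches exactly:   element_prf(gold, pred, 0) -> (1.0, 1.0, 1.0)
# Metric element does not match:  element_prf(gold, pred, 2) -> (0.0, 0.0, 0.0)
```
</preformat>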
        <p>Phase 1 In Table 4 and Table 5, we show the performance of our submissions for Phase 1. Values in
bold are above the average for all submissions. The results in Table 4 show that the AxCell system
outperforms GPT-3.5 submissions in all measures except Accuracy, indicating that GPT-3.5 better
identifies papers containing TDMS. Generally, in Table 5, the Precision, Recall, and F1 for GPT35-zero
are lower than those of AxCell.</p>
      <p>Phase 2: In Table 6 and Table 7, we show the performance of our submissions for Phase 2. Values
in bold are above the average for all submissions. Similar to the previous phase, the AxCell system
outperforms the LLM submissions in all measures except for some cases in Accuracy. (Due to space constraints, we
report only the overall values; the full results can be found at
https://docs.google.com/spreadsheets/d/1k82FmlztEBiNkKuskAZsNaovkblHeqmpKzD5C63Mn5Q/.)
With regard to our aimed experiments, we notice the following:</p>
      <list list-type="order">
        <list-item>
          <p>Zero-shot vs. few-shot settings (GPT35-fil-zero vs. GPT35-fil-few): The performance of
GPT35-fil-few is better than that of GPT35-fil-zero, showing that providing examples helps the model
detect TDMS. This is confirmed by comparing GPT35-few with AxCell, where GPT35-few is
better than AxCell in most cases.</p>
        </list-item>
        <list-item>
          <p>Full text input vs. AZ input only (GPT35-few vs. GPT35-az-few): Generally, the performance
of the GPT35-few model is better than that of GPT35-az-few, except for the Inexact
Precision. The significance of the difference in the results needs to be verified by further experiments.</p>
        </list-item>
        <list-item>
          <p>Providing the LLM with helpful material (GPT35-few vs. GPT35-info-few, GPT35-az-few
vs. GPT35-az-info-few, and Mistral-fil-zero vs. Mistral-fil-info-zero): Models given helpful
materials about datasets and tasks (GPT35-info-few, GPT35-az-info-few, and Mistral-fil-info-zero)
perform worse than their counterparts (GPT35-few, GPT35-az-few, and Mistral-fil-zero). We
suspect that the helpful materials may mislead the models.</p>
        </list-item>
      </list>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, we present a discussion based on a manual investigation comparing the
output of the proposed models with each other and with the ground truth annotations.</p>
      <p>AxCell vs. GPT-3.5 The AxCell system generally outperforms the LLM submissions, particularly the
GPT-3.5 submissions, in all metrics except for Accuracy. This suggests that GPT35-zero and GPT35-few
are better at identifying papers containing TDMS.</p>
      <p>Through manual analysis, we identify common causes of TDMS extraction errors for GPT35-zero and
compare them to AxCell and the ground truth data in Table 8. From the analysis, we found that GPT35-zero
extracts a broad range of information, but with inconsistencies and potential errors. For the “Task”
entity, it predicts a mix of task names, some matching the ground truth and others deviating. As a result,
further investigation of potential hallucination is needed. For the “Dataset” entity, it sometimes
combines multiple datasets despite correctly identifying individual ones. GPT35-zero also predicts
unconventional “Metric” values not present in the ground truth data, such as Epoch Divergence. The “Score”
entities are a mix of percentages, raw scores, and string values.</p>
      <p>While AxCell outperforms GPT35-zero on the “Task” and “Dataset” entities, showing consistency
with the ground truth data, it struggles with accurately predicting the “Metric” entity. Both AxCell
and GPT35-zero often predict different metric names, such as Percentage error and Accuracy instead of
Percentage correctness. Additionally, AxCell’s predicted “Score” entities differ from the ground truth data.
Despite these drawbacks, it is evident that the AxCell system produces better results in the entity-level
evaluation compared to the GPT35-zero model.</p>
      <p>We noticed that both models extracted new tuples with respect to the ground truth data.
Thus, we checked whether these tuples exist in the original paper. AxCell and GPT35-zero
could accurately predict the task name mentioned in the paper, such as Semantic Segmentation, along
with the correct associated “Metric” name. However, for AxCell, the predicted “Dataset” name was
incorrect, which might be due to a parsing error that requires further investigation. On the other
hand, GPT35-zero could partially predict the correct “Dataset” name (StreetHazard). Nevertheless, the
“Score” entity remained challenging for both models. Additionally, according to our observations, the
annotated ground truth data is based on data collected from the results on PapersWithCode. As a result,
we consider that further work needs to be done to expand the annotated ground truth data sources.</p>
      <p>GPT-3.5 vs. Mistral: Our experiments revealed that the Mistral-fil-zero language model outperforms
the GPT35-zero model across all evaluation metrics during Phase 2. Through manual analysis, we
observed that Mistral-fil-zero could more effectively identify TDMS from the input text which are also
present in the ground truth data, compared to GPT35-fil-zero. For this experiment, we utilized the
same prompt for both LLM models. However, we noticed that in many cases GPT35-fil-zero returned
“unanswerable” results, as presented in Table 9. We hypothesize that these cases, where GPT35-fil-zero
failed to extract the TDMS, caused its overall score to be lower than Mistral-fil-zero’s.</p>
      <p>From the analysis, Mistral-fil-zero accurately predicts the “Dataset” entities according to the ground
truth. However, inconsistencies arise with the “Task”, “Metric”, and “Score” entities. Although
Mistral-fil-zero can extract some “Score” entities similar to the ground truth, there were differences in its
“Score” predictions compared to the ground truth. Potential hallucinations were observed, such as in the
last row of the table, where it extracted “Metric” and “Score” entities not found in the original paper.</p>
      <p>Fine-Grained Image Classification - iNaturalist - Top 1 Accuracy - 68.2
Fine-Grained Image Classification - Stanford Cars - Accuracy - 93.8%</p>
      <p>Fine-Grained Image Classification - CUB-200-2011 - Accuracy - 87.9</p>
      <p>Fine-grained image recognition - CUB-200-2011 - Recognition accuracy - 85.3
Image Classification - iNaturalist - Top 1 Accuracy - 31.12%
Image Classification - iNaturalist - Top 5 Accuracy - 52.76%
Image Classification - ImageNet - Top 1 Accuracy - 68.29%
k-Nearest-Neighbor (kNN) search - ImageNet - Top-1 accuracy - 68.29
k-Nearest-Neighbor (kNN) search - ImageNet - Top-5 accuracy - 87.75
k-Nearest-Neighbor (kNN) search - iNaturalist - Top-5 accuracy - 52.76
k-Nearest-Neighbor (kNN) search - iNaturalist - Top-1 accuracy - 31.12</p>
      <p>Triplet evaluation - PIT - Triplet accuracy (m=0.2) - 87.16</p>
      <p>Further investigation is needed to understand this issue. Nevertheless, Mistral-fil-zero consistently
predicts the “Task” and “Dataset” entities.</p>
      <sec id="sec-5-1">
        <title>GPT-3.5 Zero-shot vs. Few-shot Settings</title>
        <p>The performance of the GPT35-fil-zero and GPT35-fil-few models is comparable, as presented in
Tables 6 and 7. While the GPT35-fil-few model performs slightly better than GPT35-fil-zero in all
evaluation metrics, further experiments might be needed to generalize the results. Table 10 presents
a sample of the output generated by both models for the arXiv ID 1711.05225v3, in comparison to
the ground truth data.</p>
        <p>We observe that both models extract a wide range of information, including some inconsistent and
inaccurate information. This likely contributes to their lower performance, in addition to errors from the
filtering process. For the “Task” entity, both models predict different task names, one of which matches
the ground truth data. Both models could extract the “Dataset” and the “Metric” entities accurately.
However, similar to the previous discussion, extracting the “Score” entity is challenging, often resulting
in string values. Besides, we argue that the filtering process may have misled both models,
affecting their performance compared to the other models proposed in Phase 2.</p>
        <p>
          To sum up: Extracting TDMS entities is challenging, particularly the “Score” and
“Dataset” entities. The proposed models struggle with extracting the “Score” entity. This difficulty arises
from the diverse formats authors use to present results, such as tables, graphs, or plain text. Additionally,
the variability in naming conventions for the “Dataset” entity across papers poses a challenge, as noted
by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, our models exhibit less variability in extracting the “Dataset” entity compared to
the ground truth, which may use naming conventions not found in the papers. Further investigation
is needed to understand the construction of the ground truth data. Consistent with [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], extracting
the “Task” entity is relatively more straightforward, as it is rarely referenced differently across papers
addressing the task.
        </p>
        <p>Overall, the AxCell system shows better performance compared to the LLM
submissions, both GPT-3.5 and Mistral, in all submissions. However, the two variants of the GPT-3.5 model,
GPT35-zero and GPT35-few, surpass AxCell in terms of Accuracy, indicating that these models are
better at identifying papers containing TDMS information. While AxCell consistently extracts each
entity of the TDMS, particularly “Task”, “Metric”, and “Dataset”, there was an error where the dataset name
was not present in the publication.</p>
        <p>This could be due to the underlying taxonomy, which maps the dataset to the best fitting dataset in
this taxonomy. Additionally, we discovered after the deadline that there is a small overlap (around 5%)
between AxCell’s training dataset and the test data of the shared task, as both used PapersWithCode as
ground-truth data. Such contamination is quite frequent nowadays, as many LLMs do not disclose
their training data, and therefore many baselines are already compromised [13, 14]. In the future, we
plan to investigate how this bias affects results and explore ways to mitigate such contamination.</p>
        <p>Moreover, Large Language Models (LLMs) show performance comparable to the AxCell system.
Incorporating one or more examples (few-shot learning) into the prompt improves TDMS extraction
quality. Nevertheless, the experiments heavily rely on prompts, which may influence the models’
output during evaluation. This observation aligns with findings that the quality of outputs
from conversational LLMs is directly influenced by the quality of the prompts provided by users [15].
Therefore, further investigation and refinement of the prompt engineering process are essential.</p>
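        <p>For instance, the few-shot prompts in Appendix A are assembled from fixed instruction text plus worked input/output examples; a minimal sketch of such splicing (the helper name and the simplified delimiters are ours, not the exact templates used):

```python
def build_few_shot_prompt(instructions, examples, article_text):
    """Splice numbered input/output examples between instructions and the new input."""
    parts = [instructions]
    for i, (inp, out) in enumerate(examples, start=1):
        parts.append(f"Example {i}\nInput: {inp}\nExpected Output Format:\n{out}")
    parts.append(f"INPUT:\n{article_text}\nOUTPUT:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Extract (task, dataset, metric, score) tuples as a JSON list.",
    [("toy abstract A", '[{"Task": "t"}]'), ("toy abstract B", "[]")],
    "new article text",
)
print(prompt.count("Example"))  # -> 2
```

Changing only the examples or delimiters in such a template changes what the model returns, which is why the zero-shot and few-shot runs behave differently.</p>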
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Summary and Future Work</title>
      <p>We have presented our approach to the SOTA challenge of the SimpleText shared task, composed
of two modules: (1) applying rule-based classification to verify whether a paper contains TDMS, and (2)
extracting TDMS from papers with result information. We used the AxCell implementation as a baseline
for TDMS extraction and experimented with GPT-3.5 and Mistral as LLMs in zero-shot and few-shot
settings with different input information. The results show that AxCell outperforms the LLMs when the
zero-shot prompting paradigm is applied. The LLMs, on the other hand, surpass AxCell’s performance in
few-shot settings. We conducted a manual investigation and showed that the LLM instructions can be
misleading for the “Dataset” and “Score” extraction in zero-shot settings, with improvements seen in
few-shot settings. We argue that the LLM outputs are sensitive to the instructions given through the
prompts.</p>
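      <p>The two-module design can be sketched as follows (the stand-in classifier and extractor below are illustrative only, not our actual rule set or models):

```python
def extract_tdms(paper_text, has_tdms, extract):
    """Two-stage pipeline: (1) rule-based TDMS check, (2) TDMS tuple extraction."""
    if not has_tdms(paper_text):   # stage 1: does the paper report results at all?
        return []                  # papers without TDMS get an empty answer
    return extract(paper_text)     # stage 2: AxCell- or LLM-based extraction

# Illustrative stand-ins for the two modules.
has_tdms = lambda text: "accuracy" in text.lower()
extract = lambda text: [{"Task": "classification", "Dataset": "toy",
                         "Metric": "Accuracy", "Score": "0.90"}]
print(extract_tdms("No experiments here.", has_tdms, extract))  # -> []
```
</p>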
      <p>Our discussion (Section 5) points to several directions for future work, including expanding the ground
truth dataset with data from the papers and investigating potential hallucinations in the
LLM extractions. Additionally, our findings suggest that the performance of LLMs given only the
sections of the paper text referring to experiments and results is comparable to that of LLMs given
the full paper text. Furthermore, the results of LLMs in few-shot settings are comparable to, and
sometimes better than, the AxCell system, and the open-source Mistral model outperforms the GPT-3.5
model. To verify these assumptions, we plan to repeat the experiments and conduct statistical analysis.
</p>
      <p>[12] S. Teufel, Argumentative zoning: Information extraction from scientific text, Ph.D. thesis, Citeseer, 1999.
[13] S. Balloccu, P. Schmidtová, M. Lango, O. Dusek, Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 67–93. URL: https://aclanthology.org/2024.eacl-long.5.
[14] O. Sainz, J. Campos, I. García-Ferrero, J. Etxaniz, O. L. de Lacalle, E. Agirre, NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 10776–10787. URL: https://aclanthology.org/2023.findings-emnlp.722. doi:10.18653/v1/2023.findings-emnlp.722.
[15] X. Ma, J. Li, M. Zhang, Chain of thought with explicit evidence reasoning for few-shot relation extraction, 2024. arXiv:2311.05922.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Prompts for LLMs</title>
      <sec id="sec-7-1">
        <title>A.1. Zero-Shot Prompt</title>
        <p>&lt;&lt; FORMATTING &gt;&gt;
Answer in the form of a list of JSON objects as follows.</p>
        <p>The output should be a markdown code snippet formatted as a list of JSON objects in
the following schema, including the leading and trailing "```json" and "```":
```json
[
{
}
]
```
"Task": string // Extract the research problem or focus mentioned in the paper. Use '' if not available.
"Dataset": string // Extract the dataset(s) used for the machine learning experiments. Use '' if not available.
"Metric": string // Extract the evaluation measure(s) used to assess the models' performance. Use '' if not available.
"Score": string // Extract the best numeric value(s) representing the model's performance on a specific metric. Use '' if not available.
{{</p>
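        <p>A reply in this format can be post-processed by stripping the markdown fence before JSON parsing; a minimal sketch (our illustration, not the exact post-processing used in the submissions):

```python
import json
import re

FENCE = "`" * 3  # the markdown code fence, built here to avoid writing it literally

def parse_llm_json(reply):
    """Pull the JSON list out of a reply fenced with json code markers."""
    match = re.search(FENCE + r"json\s*(.*?)\s*" + FENCE, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply  # fall back to the raw reply
    return json.loads(payload)

reply = FENCE + 'json\n[{"Task": "Keyword Spotting", "Score": "0.45"}]\n' + FENCE
print(parse_llm_json(reply)[0]["Score"])  # -> 0.45
```
</p>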
      </sec>
      <sec id="sec-7-2">
        <title>A.2. Few-Shot Prompt</title>
        <p>"task": "Keyword Spotting",
"dataset": "Hey Siri",
"metric": "Error Rate",
"score": "0.45"
&lt;&lt; Example 3 &gt;&gt;
Input: This paper is concerned with the form of typed name binding used by the
FreshML family of languages. Its characteristic feature is that a name binding
is represented by an abstract (name, value)-pair that may only be deconstructed
via the generation of fresh bound names. The paper proves a new result about
what operations on names can co-exist with this construct. In FreshML the only
observation one can make of names is to test whether or not they are equal.</p>
        <p>This restricted amount of observation was thought necessary to ensure that there
is no observable difference between alpha-equivalent name binders. Yet from an
algorithmic point of view it would be desirable to allow other operations and
relations on names, such as a total ordering. This paper shows that, contrary
to expectations, one may add not just ordering, but almost any relation or
numerical function on names without disturbing the fundamental correctness
result about this form of typed name binding (that object-level alpha-equivalence
precisely corresponds to contextual equivalence at the programming meta-level),
so long as one takes the state of dynamically created names into account.</p>
        <p>Expected Output Format:
[ ]</p>
        <p>Additional formatting instructions:
Provide the output as a list of valid JSON objects, using the above description.
If a value is not available or applicable, return an empty list.</p>
        <p>Make sure the JSON output is properly formatted.</p>
        <p>Don't extract text or information in the 'Score' field. Extract only the numeric
values that indicate the 'Score'.</p>
        <p>Be strict when returning the score. Only include numeric values.</p>
        <p>Do not combine multiple datasets or metrics or scores for the same task into a
single JSON object.</p>
        <p>Now, given a new scholarly article, your task is to determine if it contains
mentions of all four elements (task, dataset, metric, score) together. If the
same task is evaluated on multiple datasets or with multiple metrics, each
combination should be represented as a separate JSON object within the array, as
long as the task, dataset, and metric are related to each other. Generate your
own output based on the input text, using the examples only as a reference for
the desired format:
&lt;&lt; INPUT &gt;&gt;
{input_content}
&lt;&lt; OUTPUT &gt;&gt;
"""</p>
      </sec>
      <sec id="sec-7-3">
        <title>A.3. Few-Shot Prompt with data from PwC</title>
        <p>"""
You are an expert in machine learning who can identify if a research paper contains
certain key elements related to tasks, datasets, metrics, and scores.</p>
        <p>Specifically, you need to look for the following:
Task: A phrase describing the research problem or focus, often found in the title,
abstract, introduction, or results tables/discussion.</p>
        <p>Dataset: A mention of the dataset(s) used for the machine learning experiments,
usually located near the task mentions.
Metric: Phrases referring to the evaluation measures used to assess the models'
performance on the given task and dataset. These are commonly found in results
tables/figures and the discussion section.</p>
        <p>Score: The numeric value(s) representing the model's performance on a specific
metric. Multiple scores may be reported for a single metric, in which case the
best score should be identified.</p>
        <p>Here are a few examples of the input and expected output format:
&lt;&lt; Example 1 &gt;&gt;
Input: Streaming keyword spotting is a widely used solution for activating voice
assistants. We apply our method for 'hey Siri' detection. Compared to the best
of the two prior works, our method reduces the FRR from 1.7% to 0.45%, which
yields about 73% relative FRR improvement.</p>
        <p>Expected Output Format:
[
&lt;&lt; Example 3 &gt;&gt;
Input: This paper is concerned with the form of typed name binding used by the
FreshML family of languages. Its characteristic feature is that a name binding
is represented by an abstract (name, value)-pair that may only be deconstructed
via the generation of fresh bound names. The paper proves a new result about
what operations on names can co-exist with this construct. In FreshML the only
observation one can make of names is to test whether or not they are equal.
This restricted amount of observation was thought necessary to ensure that there
is no observable difference between alpha-equivalent name binders. Yet from an
algorithmic point of view it would be desirable to allow other operations and
relations on names, such as a total ordering. This paper shows that, contrary
to expectations, one may add not just ordering, but almost any relation or
numerical function on names without disturbing the fundamental correctness
result about this form of typed name binding (that object-level alpha-equivalence
precisely corresponds to contextual equivalence at the programming meta-level),
so long as one takes the state of dynamically created names into account.</p>
        <p>Expected Output Format:
&lt;&lt; FORMATTING &gt;&gt;
Answer in the form of a list of JSON objects as follows. The output should be a
markdown code snippet formatted as a list of JSON objects in the following
schema, including the leading and trailing "```json" and "```":</p>
        <p>Now, given: (1) a new scholarly article, (2) a list of datasets, and (3) a list of
tasks that we manually identified in the article as helping materials for you.
Your task is to determine if the article contains mentions of all four elements
(task, dataset, metric, score) together using the lists of the datasets and the
tasks. You can also extract other datasets and tasks outside the given lists. If
the same task is evaluated on multiple datasets or with multiple metrics, each
combination should be represented as a separate JSON object within the array, as
long as the task, dataset, and metric are related to each other. Generate your
own output based on the input text, using the examples only as a reference for
the desired format:
&lt;&lt; INPUT &gt;&gt;
scholarly article:
{input_tex_content}
datasets list: {dataset_list}
tasks list: {tasks_list}</p>
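        <p>The prompt's placeholders (the article text plus the dataset and task lists taken from PapersWithCode) can be filled by plain string substitution; a minimal sketch (the helper name and example values are ours), using replace rather than str.format so the literal JSON braces in the template survive:

```python
def fill_prompt(template, article, datasets, tasks):
    """Substitute the placeholders without disturbing literal braces elsewhere."""
    return (template
            .replace("{input_tex_content}", article)
            .replace("{dataset_list}", ", ".join(datasets))
            .replace("{tasks_list}", ", ".join(tasks)))

template = ("scholarly article:\n{input_tex_content}\n"
            "datasets list: {dataset_list}\n"
            "tasks list: {tasks_list}")
print(fill_prompt(template, "toy article text", ["SQuAD"], ["Question Answering"]))
```
</p>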
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kabongo</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Automated mining of leaderboards for empirical AI research</article-title>
          , in: H.
          <string-name>
            <surname>-R. Ke</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Sugiyama (Eds.),
          <source>Towards Open and Trustworthy Digital Societies</source>
          , Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>453</fpage>
          -
          <lpage>470</lpage>
          . doi:10.1007/978-3-030-91669-5_35.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kabongo</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>ORKG-leaderboards: a systematic workflow for mining leaderboards as a knowledge graph</article-title>
          ,
          <source>International Journal on Digital Libraries</source>
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <fpage>41</fpage>
          -
          <lpage>54</lpage>
          . URL: https://doi.org/10.1007/s00799-023-00366-1. doi:10.1007/s00799-023-00366-1.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          , M. van Zuylen,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Beltagy</surname>
          </string-name>
          ,
          <article-title>SciREX: A challenge dataset for documentlevel information extraction</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7506</fpage>
          -
          <lpage>7516</lpage>
          . URL: https://aclanthology.org/2020.acl-main.670. doi:10.18653/v1/2020.acl-main.670.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , E. SanJuan, S. Huet,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 simpletext track - improving access to scientific texts for everyone</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science (LNCS)</source>
          , Springer, Heidelberg, Germany,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kabongo</surname>
            ,
            <given-names>H. B.</given-names>
          </string-name>
          <string-name>
            <surname>Giglou</surname>
            ,
            <given-names>Y. Zhang,</given-names>
          </string-name>
          <article-title>Overview of the CLEF 2024 simpletext task 4: SOTA? tracking the state-of-the-art in scholarly publications</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings</source>
          , CEUR-WS,
          <year>Online</year>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kardas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Czapla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , R. Stojnic,
          <article-title>AxCell: Automatic extraction of results from machine learning papers</article-title>
          , in: B.
          <string-name>
            <surname>Webber</surname>
            , T. Cohn,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>8580</fpage>
          -
          <lpage>8594</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.692. doi:10.18653/v1/2020.emnlp-main.692.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Riedel</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Vikraman</surname>
            ,
            <given-names>A. McCallum,</given-names>
          </string-name>
          <article-title>SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications</article-title>
          , in: S. Bethard,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carpuat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Apidianaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mohammad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          , D. Jurgens (Eds.),
          <source>Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>546</fpage>
          -
          <lpage>555</lpage>
          . URL: https://aclanthology.org/S17-2091. doi:10.18653/v1/S17-2091.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gábor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.-K. Schumann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>QasemiZadeh</surname>
          </string-name>
          , H. Zargayouna, T. Charnois,
          <article-title>SemEval2018 task 7: Semantic relation extraction and classification in scientific papers</article-title>
          , in: M.
          <string-name>
            <surname>Apidianaki</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>May</surname>
            , E. Shutova,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
          </string-name>
          , M. Carpuat (Eds.),
          <source>Proceedings of the 12th International Workshop on Semantic Evaluation</source>
          , Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>679</fpage>
          -
          <lpage>688</lpage>
          . URL: https://aclanthology.org/S18-1111. doi:10.18653/v1/S18-1111.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jochim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gleize</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Identification of tasks, datasets, evaluation metrics, and numeric scores for scientific leaderboards construction</article-title>
          , in:
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>5203</fpage>
          -
          <lpage>5213</lpage>
          . URL: https://www.aclweb.org/anthology/P19-1513. doi:10.18653/v1/P19-1513.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kabongo</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Zero-shot entailment of leaderboards for empirical AI research</article-title>
          ,
          <source>in: 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>241</lpage>
          . URL: https://ieeexplore.ieee.org/document/10265895/. doi:10.1109/JCDL57899.2023.00042.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tensmeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wigington</surname>
          </string-name>
          ,
          <article-title>TELIN: Table entity LINker for extracting leaderboards from machine learning publications</article-title>
          , in: T. Ghosal,
          <string-name>
            <given-names>S.</given-names>
            <surname>Blanco-Cuaresma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Accomazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Patton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grezes</surname>
          </string-name>
          , T. Allen (Eds.),
          <source>Proceedings of the first Workshop on Information Extraction from Scientific Publications</source>
          , Association for Computational Linguistics, 2022, pp. 20-25. URL: https://aclanthology.org/2022.wiesp-1.3.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>