<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Dataset for Extracting Relations in the Climate-Change Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrija Poleksić</string-name>
          <email>andrija.poleksic@uniri.hr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanda Martinčić-Ipšić</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics and Digital Technologies (University of Rijeka)</institution>
          ,
          <addr-line>Radmile Matejčić 2, Rijeka, 51000</addr-line>
          ,
          <country country="HR">Croatia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The impacts of global warming and climate change on ecosystems, weather patterns and human societies pose a significant threat to biodiversity and the sustainability of our planet. Despite the widespread scientific consensus, climate change denial persists among a segment of the population, either due to misconceptions or vested interests. Recent research shows that progress is being made in addressing climate denial, as a majority acknowledges man-made climate change. However, the spread of misinformation remains a challenge, often perpetuated by corporate interests. To overcome these challenges, we propose constructing a dataset tailored for automated extraction and structuring of climate change-related scientific findings, focusing on relation extraction (RE) from scientific papers. Our research outlines the steps involved, including the preparation of the dataset for further training of a BERT-based model and the formulation of the downstream relation extraction task. We discuss the process of data collection, preprocessing techniques and preliminary dataset analysis. Additionally, we highlight the need for a specialized Named Entity Recognition model for the climate-change domain and underline the need for annotation of domain-specific relations.</p>
      </abstract>
      <kwd-group>
        <kwd>dataset</kwd>
        <kwd>climate change</kwd>
        <kwd>relation extraction</kwd>
        <kwd>scientific papers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Global warming and climate change have profound and far-reaching effects on global ecosystems,
weather patterns, sea levels, and human societies, constituting a critical threat to the planet’s
biodiversity and the prospect of a sustainable future [1]. Despite the widespread acceptance
and scientific backing of climate change concepts, there remains a segment of the population
that denies human impact on climate change, referred to as climate denial. Climate denial is
driven either by misguided beliefs [2] or vested corporate interests [
        <xref ref-type="bibr" rid="ref1 ref38">3</xref>
        ]. A study by Areni et
al. [2] investigates the dynamics between supporter and denier groups of Reddit users. They
observe that supporters frequently reference scientific work, whereas deniers tend to rely more
on alternative media and sources. Recent comprehensive research conducted by Andre et al. [
        <xref ref-type="bibr" rid="ref2">4</xref>
        ]
demonstrates significant strides in addressing the issue of climate change denial. Their findings
reveal that up to 86% of individuals acknowledge the reality of human-induced climate change
and endorse measures aimed at mitigating human impact on the climate. Substantial climate
denial stems from the dissemination of misinformation by large companies, often driven by
vested interests, such as oil companies [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ] and from the creation of false scientific doubt, as elaborated by
Oreskes and Conway [
        <xref ref-type="bibr" rid="ref4">6</xref>
        ]. Furthermore, the ever-increasing amount of data and information,
including scientific papers, propels the need for automated information processing to speed up
informed research decisions and facilitate fact-checking.
      </p>
      <p>
        Motivated by both of these challenges - the information deluge and climate change - in this paper, we
propose steps to construct a dataset fit for automatically extracting and structuring climate
change-related scientific findings using information extraction (IE) methods. Specifically, we
focus on the preliminary steps for relation extraction (RE) from scientific papers. Relation
extraction (RE) is tasked with the identification of relations between entities in sentences,
paragraphs or larger units of text. Sentence-level relation extraction involves identifying and
classifying relations between entities in a single sentence. The goal is to determine the relation
or association between two entities, typically represented by nouns or noun phrases such as
people, organizations, or locations - named entities [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ]. Our overall research plan consists of
several steps:
• Preparation of the dataset of scientific papers for a climate-change domain suitable for
the training of a BERT-like model;
• Additional pretraining (training with available pretrained weights) of the BERT-like model
to adapt to the climate-change domain;
• Definition of relation types for relation extraction and construction of the dataset for the
fine-tuning of the newly trained model(s) on the task of sentence-level relation extraction;
• Construction and curation of the climate-change knowledge graph from a high-quality
journal.
      </p>
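      <p>As an illustration of the sentence-level RE formulation described above, the following minimal Python sketch shows how a single training instance could be represented: a sentence, two entity spans and a relation label. The field names and the relation label are our own illustrative choices, not a fixed schema.</p>
      <preformat>
from dataclasses import dataclass

@dataclass
class RelationInstance:
    """One sentence-level relation extraction example (illustrative schema)."""
    sentence: str   # the full sentence
    head: tuple     # (start, end) character offsets of the first entity
    tail: tuple     # (start, end) character offsets of the second entity
    relation: str   # relation label to be predicted by the classifier

# Example drawn from the climate-change domain (hypothetical relation label):
s = "Atlantic cyclones have been well documented as causing high surge levels."
head_text, tail_text = "Atlantic cyclones", "high surge levels"
example = RelationInstance(
    sentence=s,
    head=(s.index(head_text), s.index(head_text) + len(head_text)),
    tail=(s.index(tail_text), s.index(tail_text) + len(tail_text)),
    relation="cause",
)
      </preformat>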
      <p>Section 2 gives a short overview of the related work on pretrained language models,
relation extraction datasets and relation annotation. Section 3 elaborates on data collection,
preprocessing and a preliminary analysis of the data. The final Sections 4 and 5 cover the
results, discussion and conclusions, respectively.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Recent research efforts [
        <xref ref-type="bibr" rid="ref10 ref6 ref7 ref8 ref9">8, 9, 10, 11, 12</xref>
        ] report using pretrained models for text classification
and sequence labelling tasks. One of the prominent ones is BERT (Bidirectional Encoder
Representations from Transformers), an encoder-only transformer model trained on masked
language modelling (MLM) task [
        <xref ref-type="bibr" rid="ref11">13</xref>
        ]. Although it has been shown that encoder-decoder architecture
models such as BART [
        <xref ref-type="bibr" rid="ref12">14</xref>
        ] and T5 [
        <xref ref-type="bibr" rid="ref13">15</xref>
        ] provide comparable and sometimes better results [
        <xref ref-type="bibr" rid="ref14">16</xref>
        ],
they require the training of a larger number of parameters, which ultimately requires a larger
amount of data and computational resources.
      </p>
      <p>
        Lee et al. [
        <xref ref-type="bibr" rid="ref6">8</xref>
        ] perform additional training of the original BERT deep neural model [
        <xref ref-type="bibr" rid="ref11">13</xref>
        ]
for the biomedical domain - BioBERT. They report that no new WordPiece vocabulary is
needed, ensuring the compatibility of the two pretrained models (BioBERT and BERT). BioBERT
achieves new SOTA results on benchmarks for relation extraction and named entity recognition.
ClinicalBERT model [
        <xref ref-type="bibr" rid="ref9">11</xref>
        ] follows the same principle and further trains the BERT and BioBERT
models on a large multicenter dataset.
      </p>
      <p>
        The other line of research by Beltagy et al. [
        <xref ref-type="bibr" rid="ref7">9</xref>
        ] is training a new model SciBERT from scratch,
which is also based on the BERT architecture [
        <xref ref-type="bibr" rid="ref11">13</xref>
        ], using scientific papers as the training data.
For SciBERT, they construct a new vocabulary, SciVocab. An overall improvement of 0.61
F1 score on the downstream tasks is achieved using SciVocab compared to the original BERT
vocabulary. Additionally, several SOTA results are reported, also surpassing the BioBERT
results on the ChemProt [
        <xref ref-type="bibr" rid="ref15">17</xref>
        ] benchmark by a fairly large margin. A similar strategy is applied
in Chalkidis et al. [
        <xref ref-type="bibr" rid="ref10">12</xref>
        ], where a family of LegalBERT models is trained to support legal NLP
research, computer-assisted law and legal technology applications.
      </p>
      <p>
        Webersinke et al. in [
        <xref ref-type="bibr" rid="ref8">10</xref>
        ] train the RoBERTa model [
        <xref ref-type="bibr" rid="ref16">18</xref>
        ], which was adapted using a distillation
process [
        <xref ref-type="bibr" rid="ref17">19</xref>
        ], on the climate-change domain - ClimateBERT. The model is trained on
climate-related news articles and posts on social media.
      </p>
      <p>
        In our research we will extend our previous work [
        <xref ref-type="bibr" rid="ref18">20</xref>
        ] by performing additional training
on two models: SciBERT will receive additional training for the climate-change domain using
scientific papers, while ClimateBERT's parametrized domain knowledge will be extended with a
carefully curated high-quality dataset. This addresses their respective drawbacks: SciBERT's
out-of-climate-change-domain vocabulary and ClimateBERT's reliance on media-collected information,
which we complement with scientifically obtained facts. To this end, in this paper, we propose the
construction of a new dataset for the climate-change domain obtained from scientific papers published in high-quality
journals.
      </p>
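      <p>A minimal sketch of how such additional masked language modelling (MLM) pretraining could be set up with the Hugging Face transformers and datasets libraries is given below; the checkpoint name, corpus file and hyperparameters are illustrative assumptions, not our final training configuration.</p>
      <preformat>
# Sketch of additional MLM pretraining on the collected corpus.
# Checkpoint, corpus path and hyperparameters are placeholders, not final choices.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "allenai/scibert_scivocab_uncased"   # could equally be a ClimateBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# one paper paragraph or sentence per line in a plain-text file
corpus = load_dataset("text", data_files={"train": "climate_papers.txt"})
tokenized = corpus.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                       batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="scibert-climate", per_device_train_batch_size=16,
                         num_train_epochs=1, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator).train()
      </preformat>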
      <p>
        For joint entity and relation extraction downstream tasks [
        <xref ref-type="bibr" rid="ref19">21</xref>
        ] the model is trained to perform
both tasks simultaneously while benefiting from the use of interrelated signals. Relation
extraction can be set as a supervised task and requires a huge amount of labelled (i.e. annotated)
training data. To speed up the process, many researchers are turning to the idea of distant
supervision1 [
        <xref ref-type="bibr" rid="ref20">22</xref>
        ]. This includes datasets such as FewRel [
        <xref ref-type="bibr" rid="ref21">23</xref>
        ] and T-REx [
        <xref ref-type="bibr" rid="ref22">24</xref>
        ] for RE at sentence
level and datasets such as DocRED [
        <xref ref-type="bibr" rid="ref23">25</xref>
        ] and Wiki20m [
        <xref ref-type="bibr" rid="ref24">26</xref>
        ] for RE on larger text sections.
      </p>
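      <p>The distant supervision idea can be illustrated with a short sketch: a sentence is automatically labelled with a relation whenever an entity pair stored in a knowledge base co-occurs in it. The tiny knowledge base and sentences below are made-up illustrations, not part of our dataset.</p>
      <preformat>
# Distant supervision sketch: label a sentence with a KG relation whenever a known
# entity pair co-occurs in it. KG content and sentences are made-up illustrations.
knowledge_base = {
    ("Atlantic cyclones", "heavy precipitation"): "causes",
    ("ENSO", "winter temperature"): "affects",
}

sentences = [
    "Atlantic cyclones have been well documented as causing heavy precipitation.",
    "ENSO is another important factor for winter temperature in China.",
]

def distant_labels(sentences, knowledge_base):
    """Yield (sentence, head, tail, relation) for every KG pair found in a sentence."""
    for sent in sentences:
        for (head, tail), relation in knowledge_base.items():
            if head in sent and tail in sent:
                yield sent, head, tail, relation

for example in distant_labels(sentences, knowledge_base):
    print(example)
      </preformat>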
      <p>
        Recently, the use of Large Language Models (LLMs) for the annotation of relations and entities
has been reported [
        <xref ref-type="bibr" rid="ref25">27</xref>
        ], either to augment and speed up the annotation process for human
annotators [
        <xref ref-type="bibr" rid="ref26 ref27">28, 29</xref>
        ] or to completely replace human efforts [
        <xref ref-type="bibr" rid="ref28">30</xref>
        ]. Besides annotation, LLMs are
considered as synthetic data generators [
        <xref ref-type="bibr" rid="ref29 ref30">31, 32</xref>
        ] or for assessing the LLM-annotation quality
[
        <xref ref-type="bibr" rid="ref31">33</xref>
        ]. In our research, we plan to engage LLMs for the relation annotation subtask, leveraging
off-the-shelf pretrained LLMs to speed up the process, as opposed to training specialised in-house
LLMs and using them directly for RE.
      </p>
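      <p>As a preview of how LLM-enabled annotation could be guided by previously detected candidates, the sketch below builds a prompt from a sentence and its candidate entities; call_llm is a hypothetical stand-in for whichever off-the-shelf model API is eventually used, and the prompt wording is illustrative only.</p>
      <preformat>
# Sketch of LLM-guided relation annotation. `call_llm` is a hypothetical placeholder
# for an off-the-shelf LLM client; the prompt wording is illustrative only.
PROMPT_TEMPLATE = (
    "Sentence: {sentence}\n"
    "Candidate entities: {entities}\n"
    "Return one triple (entity1, relation, entity2) expressed in the sentence, "
    "or 'none' if no relation holds."
)

def build_prompt(sentence, candidate_entities):
    return PROMPT_TEMPLATE.format(sentence=sentence, entities=", ".join(candidate_entities))

def annotate(sentence, candidate_entities, call_llm):
    """Ask the LLM for a relation triple; the caller supplies the model client."""
    return call_llm(build_prompt(sentence, candidate_entities))

# Usage, with any callable that maps a prompt string to a model response:
# triple = annotate("ENSO is another important factor for winter temperature in China.",
#                   ["ENSO", "winter temperature in China"], call_llm=my_client)
      </preformat>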
    </sec>
    <sec id="sec-4">
      <title>3. Dataset Preparation</title>
      <p>Adapting one of the BERT models to the RE task in the climate-change domain requires the
construction of an appropriate dataset (i.e. a scientific and high-quality source). To this end
we selected the highest-ranked scientific journals on climate change based on the Scimago
Journal &amp; Country Rank (SJR)2 and ScienceWatch Rank3, as well as open access MDPI journals that
are associated with the topic of climate change, offer a substantial quantity of available papers and
have a consistent format for parsing. Table 2 (Appendix A) lists information on 194,673 retrieved
research papers from the selected journals, where 77.35% (150,583) are available in HTML format,
while the remaining 22.65% (44,090) are only available in PDF format.</p>
      <p>1Distant supervision assumes that the presence of a given entity pair in a given text implies a relation between them
such that it is found in a Knowledge Graph/Base.</p>
      <p>
        The PDF documents were first processed with the pdfminer.six4 library [
        <xref ref-type="bibr" rid="ref32">34</xref>
        ] to extract their
content. They were converted to HTML format, retaining the available
information for each parsed element, including position, font and font size. This information was
obtained with the layout analysis algorithm5, which groups characters into words and lines, lines
into boxes and finally into text boxes, hierarchically, based on the position of each character. Hence,
we developed a parser fine-tuned to each journal's formatting style and position information,
enabling correct and complete text extraction. For navigation through HTML files, we used the
BeautifulSoup6 library [
        <xref ref-type="bibr" rid="ref33">35</xref>
        ].
      </p>
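      <p>The sketch below shows, in simplified form, how the position and font information used by our parsers can be read out with pdfminer.six; the journal-specific rules built on top of it (e.g. which font sizes mark section titles) are not shown and differ per journal.</p>
      <preformat>
# Read text lines from a PDF together with their position and font information,
# the raw material for the journal-specific parsing rules (simplified sketch).
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

def layout_lines(pdf_path):
    """Yield (page_no, text, bbox, fonts) for every text line in the document."""
    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        fonts = {(ch.fontname, round(ch.size, 1))
                                 for ch in line if isinstance(ch, LTChar)}
                        yield page_no, line.get_text().strip(), line.bbox, fonts

# for page_no, text, bbox, fonts in layout_lines("paper.pdf"):
#     print(page_no, bbox, fonts, text)
      </preformat>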
      <p>As already mentioned, a specific parser was needed for each journal. Next, we drew a random
sample of 100 papers for each journal to evaluate the parsing procedure. Based on the random
sample, we created a parser that successfully extracts the content of the papers in 100% of the
cases, ranging from pure content to metadata such as authors, affiliations, references and DOI
information. The parsing procedure allows extracting the data to the full extent. This is manually
validated on a random sample of 10 papers per journal by comparing the texts from PDF/HTML
with the data stored in pandas dataframes7. Table 3 (Appendix C) highlights some of the most
common problems encountered during PDF and HTML parsing. Still, despite many problems,
we obtained a well-documented, comprehensive dataset, which is appropriate for further model
training. Table 1 reports the comparison of the total training data used for each of the neural models
(BERT, SciBERT and ClimateBERT). Our dataset contains ∼35% of the tokens used
for training SciBERT, and surpasses the number of tokens used for ClimateBERT by six times.
The average number of sentences per paper in our dataset is ∼160% of the average reported
for SciBERT. These numbers are encouraging, suggesting that we have collected sufficient
high-quality texts for training a BERT-based model.</p>
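      <p>For the HTML part of the collection, each per-journal parser essentially reduces to selecting the right elements. The sketch below illustrates the idea with hypothetical CSS selectors (the real selectors are tuned to each journal's formatting) and collects the results into a pandas dataframe.</p>
      <preformat>
# Per-journal HTML parsing sketch. The CSS selectors below are hypothetical examples;
# the real selectors differ per journal and are validated on random samples of papers.
import pandas as pd
from bs4 import BeautifulSoup

def parse_paper(html):
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one("h1.article-title").get_text(strip=True),
        "doi": soup.select_one("a.doi-link").get_text(strip=True),
        "authors": [a.get_text(strip=True) for a in soup.select("span.author-name")],
        "paragraphs": [p.get_text(" ", strip=True) for p in soup.select("div.article-body p")],
    }

# records = [parse_paper(open(path, encoding="utf-8").read()) for path in html_files]
# papers = pd.DataFrame(records)   # one row per paper, ready for validation and statistics
      </preformat>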
      <p>
        To further explore the dataset content we report statistics using a readily available
part-of-speech (POS) tagger and a named entity recognition (NER) model from the flair8 framework [
        <xref ref-type="bibr" rid="ref34">36</xref>
        ].
First, we take a random sample of 10,000 research papers to perform the analysis. Then we
tokenize into sentences and perform POS tagging9 and NER. In each POS-tagged sentence, we
determine noun and verb phrases. Non-traditionally, we define heuristic noun and verb
phrases as sequences of words with specific POS tags, as listed (a condensed sketch of the pipeline follows the footnotes below):
• Noun phrase: Cardinal number (CD), Adjective (JJ), Determiner (DT), Noun (NN),
Foreign word (FW), Possessive ending (POS), Hyphen (HYPH), Symbol (SYM).
      </p>
      <p>2https://www.scimagojr.com/journalrank.php?category=2306
3http://archive.sciencewatch.com/ana/st/climate/journals/
4https://github.com/pdfminer/pdfminer.six/tree/master
5https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#id1
6https://www.crummy.com/software/BeautifulSoup/bs4/doc/
7https://pandas.pydata.org/
8https://github.com/flairNLP/flair
9The full list of POS tags for the model used can be found here: https://huggingface.co/flair/pos-english.</p>
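      <p>A condensed sketch of this analysis pipeline (sentence splitting, POS tagging, NER and the heuristic noun-phrase chunking) is given below, assuming the flair pos-english and ner-english taggers; treating all NN* tag variants alike is our simplifying assumption.</p>
      <preformat>
# Sketch of the analysis pipeline: sentence splitting, POS tagging, NER and the
# heuristic noun-phrase chunking. Treating all NN* tags alike is a simplifying assumption.
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.splitter import SegtokSentenceSplitter

splitter = SegtokSentenceSplitter()
pos_tagger = SequenceTagger.load("flair/pos-english")
ner_tagger = SequenceTagger.load("flair/ner-english")

NOUN_TAGS = {"CD", "JJ", "DT", "NN", "NNS", "NNP", "NNPS", "FW", "POS", "HYPH", "SYM"}

def noun_phrases(sentence):
    """Group consecutive tokens whose POS tag is in NOUN_TAGS into heuristic noun phrases."""
    phrases, current = [], []
    for token in sentence:
        if token.get_label("pos").value in NOUN_TAGS:
            current.append(token.text)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

text = "El Niño–Southern Oscillation (ENSO) is another important factor for winter temperature in China."
sentences = splitter.split(text)
pos_tagger.predict(sentences)
ner_tagger.predict(sentences)
for sentence in sentences:
    print(noun_phrases(sentence))
    print([(span.text, span.get_label("ner").value) for span in sentence.get_spans("ner")])
      </preformat>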
      <p>
        This modification, despite being imperfect, allows for analysis of the most frequent verb and
noun phrases, providing insights into possible types of relations between entities, possible
named entities and entity types (e.g. person, organization, location, etc.). With this
approximation, we further estimated the number of total and unique triples. Figure 1 shows the total
number of verb phrases, noun phrases, entities (tagged by the NER model) and possible triples
occurring in the sample of 10,000 papers. The sample consists of 2,406,799 sentences, from
which we extracted a total of 15,238,265 noun phrases and 1,790,745 entities. The ratio of noun
phrases to extracted entities (∼8:1) indicates the need for a NER model that is better fitted to the
climate-change domain vocabulary. Table 4 (Appendix D) lists the top noun phrases consisting
of 1, 2 and 3 words respectively. Table 5 (Appendix E) lists the top entities for three entity
types: Location Name (LOC), Organization Name (ORG) and Other Name (MISC). The number
of entity types will be addressed in future work, employing more recent methods such
as GLiNER [
        <xref ref-type="bibr" rid="ref35">37</xref>
        ]. Since the list contains many acronyms and abbreviations, the expansion and
disambiguation problem needs to be addressed as well.
      </p>
      <p>Similarly, we analyze the occurrence of verb phrases: a total of 5,934,949 verb phrases forming
486,632 unique expressions. Although this is promising, the number of unique expressions
needs to be reduced to a feasible set enabling the training of a classifier to extract relations in
downstream tasks. Moreover, this is an indication that many climate-change-specific relations
are present, which need to be addressed in the downstream training as well. Table 6 (Appendix
F) reports the 30 most frequently occurring verb phrases by number of words (1, 2 and 3
respectively). We observe a high similarity between many unique verb phrases, such as: “is
shown”, “shows”, “are shown” and “has been shown”, indicating that the obvious next step of data
quality improvement is deduplication.</p>
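      <p>A first data-quality step in this direction could be to normalise verb phrases before counting them, as sketched below; stripping auxiliaries and lemmatising the remaining verbs with NLTK is our illustrative choice, not the final deduplication method.</p>
      <preformat>
# Illustrative verb-phrase normalisation: drop auxiliaries and lemmatise the remaining
# verbs, so that "is shown", "are shown" and "has been shown" all collapse to "show".
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being", "has", "have", "had"}

def normalise(verb_phrase):
    content = [w for w in verb_phrase.lower().split() if w not in AUXILIARIES]
    if not content:                       # the phrase consisted only of auxiliaries
        return verb_phrase.lower()
    return " ".join(lemmatizer.lemmatize(w, pos="v") for w in content)

for phrase in ["is shown", "shows", "are shown", "has been shown"]:
    print(phrase, "->", normalise(phrase))   # all four map to "show"
      </preformat>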
    </sec>
    <sec id="sec-5">
      <title>4. Relation Annotation</title>
      <p>
        To effectively train and evaluate supervised relation extraction models, annotated data is
needed [
        <xref ref-type="bibr" rid="ref22">24</xref>
        ]. To this end, we plan to leverage advanced LLM capabilities in the context of
automatic or enhanced annotation of relation triples. With POS tagging and NER on the sample
of 10,000 papers, we have established the foundation for possible triple detection. We anticipate
that a relation exists if there is a verb between two entities, where entities are
either approximated by noun phrases that we have heuristically recognised or named entities
recognised by the flair model. Moreover, we hypothesize that this will allow guided annotation
by providing better context to LLM-enabled annotation. In the remainder of this section, we
preview some examples of possible entities and relations in the climate-change domain10, which
remain an open question to be addressed in the future (a sketch of the candidate-detection heuristic follows the examples):
• 'For example, Atlantic cyclones have been well documented as causing high surge levels
and heavy precipitation.' - (Atlantic cyclones, cause, high surge levels)
• 'El Niño–Southern Oscillation (ENSO) is another important factor for
winter temperature in China.' - (ENSO, affects, winter temperature in China)
• 'The concentration map captured a significantly high hazard of groundwater arsenic in
the north and northeast India, particularly in Assam and West Bengal, ... .' - (West
Bengal, high hazard of, groundwater arsenic)
      </p>
      <p>10Underlined words are suggested entities in the sentence, where the bold parts are recognized by the flair NER
model. Each sentence has a suggested triple in the form: (entity1, relation, entity2).</p>
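      <p>The candidate-detection heuristic described above can be summarised in a few lines: a triple is proposed whenever a verb phrase lies between two entity or noun-phrase spans of the same sentence. The sketch below works on character offsets and is only an approximation of the planned procedure.</p>
      <preformat>
# Heuristic triple candidates: propose (entity1, verb phrase, entity2) whenever a verb
# phrase lies between two entity/noun-phrase spans of the same sentence (approximation).
def candidate_triples(entity_spans, verb_spans):
    """Spans are (start, end, text) tuples with character offsets into one sentence."""
    triples = []
    entities = sorted(entity_spans)
    for verb_start, verb_end, verb_text in verb_spans:
        left = [e for e in entities if verb_start >= e[1]]    # entities ending before the verb
        right = [e for e in entities if e[0] >= verb_end]     # entities starting after the verb
        if left and right:
            # take the closest entity on each side of the verb phrase
            triples.append((left[-1][2], verb_text, right[0][2]))
    return triples

sentence = "Atlantic cyclones have been well documented as causing high surge levels."
entities = [(0, 17, "Atlantic cyclones"), (55, 72, "high surge levels")]
verbs = [(47, 54, "causing")]
print(candidate_triples(entities, verbs))
# [('Atlantic cyclones', 'causing', 'high surge levels')]
      </preformat>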
    </sec>
    <sec id="sec-6">
      <title>5. Discussion and Conclusion</title>
      <p>In this paper, we report on the first steps towards creating a dataset suitable for training
a BERT-like model that will subsequently be used for downstream climate-change relation
extraction tasks. We have collected and analyzed a set of almost 200,000 carefully selected scientific
papers as high-quality content of the climate-change domain. We discuss technical details
and common pitfalls in parsing PDF and HTML documents as the first steps needed to obtain
a sufficient quantity of domain-specific data to train a BERT-based model. Next, we report
preliminary statistics of the dataset to ensure its appropriateness for downstream relation
extraction. During preliminary analysis, we identified a high number of possible different
relations, indicating that further distilling of relations and relation types should be implemented.
Moreover, our preliminary findings suggest that a new NER model tailored to the vocabulary
of the climate-change domain is required.</p>
      <p>
        With these preliminary results, we open several research directions. First, the collected
dataset will be used for additional training of the SciBERT and ClimateBERT models involving
different configurations of masked language modelling (MLM) principles. Second, to reduce the
abundance of different but similar domain-specific relations, we will need to develop a method
for refining the annotated relations used to fine-tune a sentence-level relation extraction (RE) model.
This will involve the disambiguation of related relations and relation types and LLM-enabled
annotation. Finally, the main goal of this research is the construction and curation of a
knowledge graph for the climate-change content captured in a high-quality journal. In future
work, we plan to address KG construction-related challenges, relying on existing literature,
such as the work of Dessi et al. [
        <xref ref-type="bibr" rid="ref36">38</xref>
        ] and Chessa et al. [
        <xref ref-type="bibr" rid="ref37">39</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the University of Rijeka under project number
uniri-drustv-18-20. The Croatian Science Foundation supports AP under the project DOK-2021-02.</p>
    </sec>
    <sec id="sec-7b">
      <title>References</title>
      <p>[1] O. Hoegh-Guldberg et al., Impacts of 1.5°C global warming on natural and human systems, in: Global
Warming of 1.5°C. An IPCC Special Report on the impacts of global warming of 1.5°C
above pre-industrial levels and related global greenhouse gas emission pathways, in the
context of strengthening the global response to the threat of climate change, sustainable
development, and efforts to eradicate poverty, Cambridge University Press, Cambridge,
UK and New York, NY, USA, 2018, pp. 175–312. doi:10.1017/9781009157940.005.</p>
      <p>[2] C. S. Areni, Motivated reasoning and climate change: Comparing news sources,
politicization, intensification, and qualification in denier versus believer subreddit comments,
Applied Cognitive Psychology 38 (2024). doi:10.1002/acp.4167.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Data statistics</title>
      <p>Table 2 lists the number of retrieved research papers per selected journal.</p>
    </sec>
    <sec id="sec-8b">
      <title>B. Training data comparison calculations</title>
      <p>
• a: Calculated from the reported average number of words [
        <xref ref-type="bibr" rid="ref8">10</xref>
        ].
• b: Approximation from a tokenizer trained on the 10,000 papers sample according to The
Tokenization pipeline (https://huggingface.co/docs/tokenizers/python/latest/pipeline.html).
• c: Approximation from flair/splitter.py (https://github.com/flairNLP/flair/blob/master/flair/splitter.py).
      </p>
    </sec>
    <sec id="sec-9">
      <title>C. Common extraction problems</title>
      <p>Table 3 lists common problems encountered during PDF and HTML parsing, such as the first line of a paragraph missing.</p>
    </sec>
    <sec id="sec-10">
      <title>D. Most common noun phrases</title>
    </sec>
    <sec id="sec-11">
      <title>E. Most common entities</title>
    </sec>
    <sec id="sec-12">
      <title>F. Most common verb phrases</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Farrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McConnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brulle</surname>
          </string-name>
          ,
          <article-title>Evidence-based strategies to combat scientific misinformation</article-title>
          ,
          <source>Nature Climate Change</source>
          <volume>9</volume>
          (
          <year>2019</year>
          )
          <fpage>191</fpage>
          -
          <lpage>195</lpage>
          . doi:
          <volume>10</volume>
          .1038/s41558-018-0368-6.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Andre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Boneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Falk</surname>
          </string-name>
          ,
          <article-title>Globally representative evidence on the actual and perceived support for climate action</article-title>
          ,
          <source>Nature Climate Change</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .1038/ s41558-024-01925-3.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Debnath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ebanks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mohaddes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Roulet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <article-title>Do fossil fuel firms reframe online climate and sustainability communication? a data-driven analysis</article-title>
          ,
          <source>npj Climate Action</source>
          <volume>2</volume>
          (
          <year>2023</year>
          )
          <article-title>47</article-title>
          . doi:
          <volume>10</volume>
          .1038/s44168-023-00086-x.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Oreskes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Conway</surname>
          </string-name>
          , Merchants of Doubt:
          <article-title>How a Handful of Scientists Obscured the Truth on Issues From Tobacco Smoke to Global Warming</article-title>
          , Bloomsbury Press,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pawar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Palshikar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          , Relation extraction : A survey,
          <year>2017</year>
          . arXiv:
          <volume>1712</volume>
          .
          <fpage>05191</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          . doi:
          <volume>10</volume>
          .1093/bioinformatics/btz682.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Cohan,
          <article-title>SciBERT: A pretrained language model for scientific text</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3615</fpage>
          -
          <lpage>3620</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D19</fpage>
          -1371.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Webersinke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kraus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bingler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leippold</surname>
          </string-name>
          ,
          <article-title>Climatebert: A pretrained language model for climate-related text</article-title>
          ,
          <source>SSRN</source>
          (
          <year>2022</year>
          ). URL: https://ssrn.com/abstract=4229146. doi:
          <volume>10</volume>
          .2139/ssrn.4229146.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alsentzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Boag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-H.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. B. A. McDermott</surname>
          </string-name>
          ,
          <source>Publicly available clinical bert embeddings</source>
          ,
          <year>2019</year>
          . arXiv:
          <year>1904</year>
          .03323.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>Legal-bert: The muppets straight out of law school</article-title>
          ,
          <year>2020</year>
          . arXiv:
          <year>2010</year>
          .02559.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N19</fpage>
          -1423.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          , L. Zettlemoyer, BART:
          <article-title>Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .acl-main.
          <volume>703</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .acl-main.
          <volume>703</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <year>1910</year>
          .10683.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L. N.</given-names>
            <surname>Phan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Anibal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bahadroglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peltekian</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>AltanBonnet, Scifive: a text-to-text transformer model for biomedical literature</article-title>
          ,
          <year>2021</year>
          . arXiv:
          <volume>2106</volume>
          .
          <fpage>03598</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Kringelum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Kjaerulf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brunak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. I.</given-names>
            <surname>Oprea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Taboureau</surname>
          </string-name>
          ,
          <article-title>ChemProt-3.0: a global chemical biology diseases mapping</article-title>
          ,
          <source>Database</source>
          (Oxford)
          <year>2016</year>
          (
          <year>2016</year>
          )
          <article-title>bav123</article-title>
          . doi:
          <volume>10</volume>
          .1093/database/bav123.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , T. Wolf,
          <article-title>Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter</article-title>
          , ArXiv abs/
          <year>1910</year>
          .01108 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Poleksić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Martinčić-Ipšić</surname>
          </string-name>
          ,
          <article-title>Effects of pretraining corpora on scientific relation extraction using bert and scibert</article-title>
          , in: Joint Workshop Proceedings of 5th (
          <article-title>Sem4Tra) and 2nd NLP4KGC: Natural Language Processing for Knowledge Graph Construction co-located with the 19th</article-title>
          <source>International Conference on Semantic Systems (SEMANTiCS</source>
          <year>2023</year>
          ), volume Vol-
          <volume>3510</volume>
          <source>of CEUR Workshop Proceedings</source>
          , Leipzig, Germany,
          <year>2023</year>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3510</volume>
          /paper_nlp_3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Cheng, W. Lam,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on deep learning for relation extraction: Recent advances</article-title>
          and new frontiers,
          <year>2023</year>
          . arXiv:
          <fpage>2306</fpage>
          .
          <year>02051</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mintz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Snow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <article-title>Distant supervision for relation extraction without labeled data</article-title>
          , in: K.
          <string-name>
            <surname>-Y. Su</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wiebe</surname>
          </string-name>
          , H. Li (Eds.),
          <source>Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Suntec, Singapore,
          <year>2009</year>
          , pp.
          <fpage>1003</fpage>
          -
          <lpage>1011</lpage>
          . URL: https://aclanthology.org/P09-1113.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , M. Sun,
          <article-title>FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>4803</fpage>
          -
          <lpage>4809</lpage>
          . URL: https://aclanthology.org/D18-1514. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D18</fpage>
          - 1514.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Elsahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vougiouklis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Remaci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gravier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Laforest</surname>
          </string-name>
          , E. Simperl, T-REx:
          <article-title>A large scale alignment of natural language with knowledge base triples</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Hasida</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Piperidis</surname>
          </string-name>
          , T. Tokunaga (Eds.),
          <source>Proceedings of the LREC</source>
          <year>2018</year>
          ,
          <article-title>European Language Resources Association (ELRA), Miyazaki</article-title>
          , Japan,
          <year>2018</year>
          . URL: https://aclanthology.org/L18-1544.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , M. Sun,
          <article-title>DocRED: A large-scale document-level relation extraction dataset, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>764</fpage>
          -
          <lpage>777</lpage>
          . URL: https://aclanthology.org/P19-1074. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P19</fpage>
          - 1074.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>T</surname>
          </string-name>
          . Gao,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>More data, more relations, more context and more openness: A review and outlook for relation extraction</article-title>
          ,
          <source>in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</source>
          , Association for Computational Linguistics, Suzhou, China,
          <year>2020</year>
          , pp.
          <fpage>745</fpage>
          -
          <lpage>758</lpage>
          . URL: https://aclanthology.org/2020.aacl-main.75.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beigi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , L. Cheng, H. Liu,
          <article-title>Large language models for data annotation: A survey</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.13446.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gueta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gilon</surname>
          </string-name>
          , C. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Erell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jaber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kartha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Laish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feder</surname>
          </string-name>
          ,
          <article-title>LLMs accelerate annotation for medical information extraction</article-title>
          ,
          <year>2023</year>
          . arXiv:2312.02296.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Semi-automatic data enhancement for document-level relation extraction with distant supervision from large language models</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>5495</fpage>
          -
          <lpage>5505</lpage>
          . doi:10.18653/v1/2023.emnlp-main.334.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , L. Zou,
          <article-title>LLMaAA: Making large language models as active annotators</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          Association for Computational Linguistics
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>13088</fpage>
          -
          <lpage>13103</lpage>
          . doi:10.18653/v1/2023.findings-emnlp.872.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Does synthetic data generation of LLMs help clinical text mining?</article-title>
          ,
          <year>2023</year>
          . arXiv:2303.04360.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Improving unsupervised relation extraction by augmenting diverse sentence pairs</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>12136</fpage>
          -
          <lpage>12147</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.745. doi:10.18653/v1/2023.emnlp-main.745.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>H.</given-names>
            <surname>Khorashadizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Groppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Groppe</surname>
          </string-name>
          ,
          <article-title>Exploring in-context learning capabilities of foundation models for generating knowledge graphs from text</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.08804.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shinyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Guglielmetti</surname>
          </string-name>
          , P. Marsman, pdfminer.six,
          <year>2018</year>
          . URL: https://pdfminersix.readthedocs.io/.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>L.</given-names>
            <surname>Richardson</surname>
          </string-name>
          , Beautiful Soup documentation,
          <year>2007</year>
          . URL: https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schweter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>FLAIR: An easy-to-use framework for state-of-the-art NLP</article-title>
          , in:
          <source>NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>U.</given-names>
            <surname>Zaratiana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tomeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holat</surname>
          </string-name>
          , T. Charnois,
          <article-title>GLiNER: Generalist model for named entity recognition using bidirectional transformer</article-title>
          ,
          <year>2023</year>
          . arXiv:2311.08526.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dessí</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Reforgiato</given-names>
            <surname>Recupero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          , E. Motta,
          <article-title>SCICERO: A deep learning and NLP approach for generating scientific knowledge graphs in the computer science domain</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>258</volume>
          (
          <year>2022</year>
          )
          <fpage>109945</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0950705122010383. doi:10.1016/j.knosys.2022.109945.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chessa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fenu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Reforgiato</given-names>
            <surname>Recupero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Secchi</surname>
          </string-name>
          ,
          <article-title>Data-driven methodology for knowledge graph generation within the tourism domain</article-title>
          ,
          <source>IEEE Access</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>67567</fpage>
          -
          <lpage>67599</lpage>
          . doi:10.1109/ACCESS.2023.3292153.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [Misplaced table fragment: data sources and per-source article counts, covering Ecological Applications, Ecosystem Health and Sustainability, Journal of Climate, Climate Dynamics, Journal of Geophysical Research: Atmospheres, NPJ Climate and Atmospheric Science, NPJ Climate Action, Nature Climate Change, PNAS, and the MDPI journals Water, Atmosphere, Climate, Ecologies, Energies, Forests, Fuels, Meteorology, Sustainable Chemistry, and Oceans.]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>