=Paper=
{{Paper
|id=Vol-3747/paper9
|storemode=property
|title=Towards Dataset for Extracting Relations in the Climate-Change Domain
|pdfUrl=https://ceur-ws.org/Vol-3747/text2kg_paper9.pdf
|volume=Vol-3747
|authors=Andrija Poleksić,Sanda Martinčić-Ipšić
|dblpUrl=https://dblp.org/rec/conf/text2kg/PoleksicM24
}}
==Towards Dataset for Extracting Relations in the Climate-Change Domain==
Andrija Poleksić1,2,∗ , Sanda Martinčić-Ipšić1,2
¹ Faculty of Informatics and Digital Technologies (University of Rijeka), Radmile Matejčić 2, Rijeka, 51000, Croatia
² Center for Artificial Intelligence and Cybersecurity
Abstract
The impacts of global warming and climate change on ecosystems, weather patterns and human societies
pose a significant threat to biodiversity and the sustainability of our planet. Despite the widespread
scientific consensus, climate change denial persists among a segment of the population, either due to
misconceptions or vested interests. Recent research shows that progress is being made in addressing
climate denial as a majority acknowledges man-made climate change. However, the spread of misinfor-
mation remains a challenge, often perpetuated by corporate interests. To overcome these challenges,
we propose constructing a dataset tailored for automated extraction and structuring of climate change-
related scientific findings, focusing on relation extraction (RE) from scientific papers. Our research
outlines the steps involved, including the preparation of the dataset for further training of the BERT-based
model and downstream relation extraction task formulation. We discuss the process of data collection,
preprocessing techniques and preliminary dataset analysis. Additionally, we highlight the need for a
specialized Named Entity Recognition model for the climate-change domain and underline the need for
annotation of domain-specific relations.
Keywords
dataset, climate change, relation extraction, scientific papers
1. Introduction
Global warming and climate change have profound and far-reaching effects on global ecosystems,
weather patterns, sea levels, and human societies, constituting a critical threat to the planet’s
biodiversity and the prospect of a sustainable future [1]. Despite the widespread acceptance
and scientific backing of climate change concepts, there remains a segment of the population
that denies human impact on climate change, referred to as climate denial. Climate denial is
driven either by misguided beliefs [2] or vested corporate interests [3]. A study by Areni et
al. [2] investigates the dynamics between supporter and denier groups of Reddit users. They
observe that supporters frequently reference scientific work, whereas deniers tend to rely more
on alternative media and sources. Recent comprehensive research conducted by Andre et al. [4]
demonstrates significant strides in addressing the issue of climate change denial. Their findings
reveal that up to 86% of individuals acknowledge the reality of human-induced climate change
and endorse measures aimed at mitigating human impact on the climate. Substantial climate
TEXT2KG 2024: Third International Workshop on Knowledge Graph Generation from Text, May 26 – May 30, 2024,
co-located with Extended Semantic Web Conference (ESWC), Hersonissos, Greece
∗ Corresponding author.
Envelope-Open andrija.poleksic@uniri.hr (A. Poleksić); smarti@uniri.hr (S. Martinčić-Ipšić)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
denial stems from the dissemination of misinformation by large companies driven by vested interests, such as oil companies [5], and from the deliberate creation of false scientific doubt, as elaborated by Oreskes and Conway [6]. Furthermore, the ever-increasing amount of data and information,
including scientific papers, propels the need for automated information processing to speed up
informed research decisions and facilitate fact-checking.
Motivated by both of these challenges, the information deluge and climate change, in this paper we propose steps to construct a dataset suited to automatically extracting and structuring climate change-related scientific findings using information extraction (IE) methods. Specifically, we
focus on the preliminary steps for relation extraction (RE) from scientific papers. Relation
extraction (RE) is tasked with the identification of relations between entities in sentences,
paragraphs or larger units of text. Sentence-level relation extraction involves identifying and
classifying relations between entities in a single sentence. The goal is to determine the relation
or association between two entities, typically represented by nouns or noun phrases such as
people, organizations, or locations - named entities [7]. Our overall research plan consists of
several steps:
• Preparation of the dataset of scientific papers for a climate-change domain suitable for
the training of a BERT-like model;
• Additional pretraining (training with available pretrained weights) of the BERT-like model
to adapt to the climate-change domain;
• Definition of relation types for relation extraction and construction of the dataset for the
fine-tuning of the newly trained model(s) on the task of sentence-level relation extraction;
• Construction and curation of the climate-change knowledge graph from a high-quality
journal.
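The sentence-level RE formulation described above can be sketched as a classification problem over a sentence with a marked entity pair. The marker tokens and relation labels below are illustrative assumptions, not the final annotation scheme:

```python
# Sketch of sentence-level relation extraction as classification over a
# marked entity pair. Marker tokens and labels are illustrative placeholders.
RELATION_LABELS = ["causes", "affects", "no_relation"]

def mark_entities(sentence: str, head: str, tail: str) -> str:
    """Wrap the two entity mentions in marker tokens, a common input
    format for BERT-style RE classifiers."""
    marked = sentence.replace(head, f"[E1] {head} [/E1]", 1)
    return marked.replace(tail, f"[E2] {tail} [/E2]", 1)

example = mark_entities(
    "Atlantic cyclones have been well documented as causing high surge levels.",
    "Atlantic cyclones",
    "high surge levels",
)
# A classifier maps `example` to one label from RELATION_LABELS.
```

A BERT-like model would consume the marked sentence and predict one label from the set for the pair.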
Section 2 gives a short overview of the related work on pretrained language models, relation extraction datasets and relation annotation. Section 3 elaborates on data collection, preprocessing and a preliminary analysis of the data. The final Sections 4 and 5 cover the results, discussion and conclusions, respectively.
2. Related Work
Recent research efforts [8, 9, 10, 11, 12] report using pretrained models for text classification
and sequence labelling tasks. One of the prominent ones is BERT (Bidirectional Encoder
Representations from Transformers), an encoder-only transformer model trained on masked
language modelling (MLM) task [13]. Although it is shown that encoder-decoder architecture
models such as BART [14] and T5 [15] provide comparable and sometimes better results [16],
they require the training of a larger number of parameters, which ultimately requires a larger
amount of data and computational resources.
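The MLM objective mentioned above can be illustrated with a simplified, stdlib-only sketch; the real BERT recipe additionally leaves some selected tokens unchanged or swaps them for random tokens, which is omitted here:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide roughly `mask_prob` of the tokens and keep the originals as
    prediction targets. Simplified: the real BERT recipe also leaves some
    selected tokens unchanged or replaces them with random tokens."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_prob * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets

tokens = "global warming affects sea levels and human societies".split()
masked, targets = mask_tokens(tokens)
```

The model is then trained to predict the entries of `targets` from the masked sequence.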
Lee et al. [8] perform additional training of the original BERT𝐵𝐴𝑆𝐸 deep neural model [13]
for the biomedical domain - BioBERT. They report that no new WordPiece vocabulary is needed, ensuring the compatibility of the two pretrained models (BioBERT and BERT). BioBERT
achieves new SOTA results on benchmarks for relation extraction and named entity recognition.
ClinicalBERT model [11] follows the same principle and further trains the BERT and BioBERT
models on a large multicenter dataset.
Another line of research, by Beltagy et al. [9], trains a new model, SciBERT, from scratch. It is also based on the BERT architecture [13] and uses scientific papers as the training data.
For SciBERT they construct a new vocabulary SciVocab. An overall improvement of 0.61 F1-
score on the downstream tasks using SciVocab compared to using the original BERT vocabulary
is achieved. Additionally, several SOTA results are reported, surpassing also the BioBERT
results on the ChemProt [17] benchmark by a fairly large margin. A similar strategy is applied
in Chalkidis et al. [12], where a family of LegalBERT models is trained to support legal NLP
research, computer-assisted law and legal technology applications.
Webersinke et al. [10] train ClimateBERT for the climate-change domain, starting from a RoBERTa model [18] compressed with the distillation process [19]. The model is trained on climate-related news articles and posts on social media.
In our research we will extend our previous work [20] by additionally training two models: SciBERT, further trained for the climate-change domain using scientific papers; and ClimateBERT, whose parametrized domain knowledge we extend with a carefully curated high-quality dataset. This addresses their respective drawbacks: the out-of-climate-change-domain vocabulary of SciBERT, and the media-collected information of ClimateBERT, which we complement with scientifically obtained facts. To this end, in this paper, we propose the construction of a new dataset for the climate-change domain obtained from scientific papers published in high-quality journals.
For joint entity and relation extraction downstream tasks [21] the model is trained to perform
both tasks simultaneously while benefiting from the use of interrelated signals. Relation
extraction can be set as a supervised task and requires a huge amount of labelled (i.e. annotated)
training data. To speed up the process, many researchers are turning to the idea of distant
supervision¹ [22]. This includes datasets such as FewRel [23] and T-REx [24] for RE at sentence
level and datasets such as DocRED [25] and Wiki20m [26] for RE on larger text sections.
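The distant supervision idea can be sketched in a few lines; the knowledge-base triple and the sentences below are invented for illustration:

```python
# Stdlib-only sketch of distant supervision: if both entities of a known
# knowledge-base triple co-occur in a sentence, label that sentence with the
# triple's relation. KB content and sentences are invented for illustration.
KB = {("CO2", "global temperature"): "increases"}

def distant_label(sentences, kb):
    labeled = []
    for sent in sentences:
        for (head, tail), rel in kb.items():
            if head in sent and tail in sent:
                labeled.append((sent, head, rel, tail))  # noisy training example
    return labeled

sentences = [
    "Rising CO2 concentrations drive an increase in global temperature.",
    "Forests store large amounts of carbon.",
]
examples = distant_label(sentences, KB)  # only the first sentence matches
```

The resulting labels are noisy, since co-occurrence does not guarantee that the sentence actually expresses the relation.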
Recently, the use of Large Language Models (LLMs) for the annotation of relations and entities
has been reported [27], either to augment and speed up the annotation process for human
annotators [28, 29] or to completely replace human efforts [30]. Besides annotation, LLMs are
considered as synthetic data generators [31, 32] or for assessing the LLM-annotation quality
[33]. In our research, we plan to engage LLMs for the relation annotation subtask, leveraging off-the-shelf pretrained LLMs to speed up the process, as opposed to training specialised in-house LLMs and using them directly for RE.
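One way to engage an off-the-shelf LLM for relation annotation is plain prompt construction; the wording and the label set in this sketch are placeholders, not the scheme we will ultimately use:

```python
def build_annotation_prompt(sentence, entity1, entity2, relation_labels):
    """Assemble a relation-annotation prompt for an off-the-shelf LLM.
    Wording and label set are illustrative, not the final scheme."""
    return (
        "Classify the relation between the two entities below.\n"
        f"Allowed labels: {', '.join(relation_labels)}.\n"
        f"Sentence: {sentence}\n"
        f"Entity 1: {entity1}\n"
        f"Entity 2: {entity2}\n"
        "Answer with exactly one label."
    )

prompt = build_annotation_prompt(
    "ENSO is another important factor for winter temperature in China.",
    "ENSO",
    "winter temperature in China",
    ["affects", "causes", "no_relation"],
)
```

Constraining the answer to a fixed label set keeps the LLM output machine-checkable, which matters when annotations are produced at scale.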
3. Dataset Preparation
Adapting one of the BERT models to the RE task in the climate-change domain requires the construction of an appropriate dataset from scientific, high-quality sources. To this end we selected the highest-ranked scientific journals on climate change according to the Scimago Journal & Country Rank (SJR)² and the ScienceWatch Rank³, together with open-access MDPI journals that are associated with the topic of climate change and offer a substantial quantity of available papers in a consistent format for parsing. Table 2 (Appendix A) lists information on 194,673 retrieved research papers from the selected journals, of which 77.35% (150,583) are available in HTML format, while the remaining 22.65% (44,090) are only available in PDF format.
¹ Distant supervision assumes that the presence of a given entity pair in a given text implies a relation between them such that it is found in a Knowledge Graph/Base.
The PDF documents were first processed with the pdfminer.six⁴ library [34] for extracting information from PDF documents. They were converted to HTML format, retaining the available information for each parsed element, including position, font and font size. This information was obtained with the layout analysis algorithm⁵, which hierarchically groups characters into words and lines, lines into boxes and finally into textboxes, based on the position of each character. Hence, we developed a parser fine-tuned to each journal's formatting style and position information, enabling correct and complete text extraction. For navigation through HTML files, we used the BeautifulSoup⁶ library [35].
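A crude, stdlib-only illustration of why position-based layout analysis needs per-journal tuning: reading order depends on how text boxes are grouped into columns, and a wrong column threshold (or an intervening figure) scrambles the text. This is a toy stand-in, not the pdfminer.six algorithm:

```python
# Toy stand-in for position-based reading order (not the pdfminer.six
# algorithm): assign each text box to a column by its x coordinate, then read
# each column top-to-bottom. Boxes are (x, y_top, text) with y growing
# downward; `column_split` is an illustrative per-journal threshold.
def reading_order(boxes, column_split=250):
    key = lambda b: (0 if b[0] < column_split else 1, b[1])
    return [b[2] for b in sorted(boxes, key=key)]

boxes = [
    (300, 50, "right column, top"),
    (20, 400, "left column, bottom"),
    (20, 50, "left column, top"),
]
order = reading_order(boxes)
# A threshold tuned for one journal's two-column layout will misorder
# another journal's pages, which is why a per-journal parser was needed.
```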
As already mentioned, a specific parser was needed for each journal. Next, we draw a random sample of 100 papers per journal to evaluate the parsing procedure. Based on this sample, we create a parser that successfully extracts the content of the papers in 100% of the cases, ranging from pure content to metadata such as authors, affiliations, references and DOI information. The parsing procedure allows extracting the data to their full extent. This is manually validated on a random sample of 10 papers per journal by comparing the texts from PDF/HTML with the data stored in Pandas dataframes⁷. Table 3 (Appendix C) highlights some of the most
common problems encountered during PDF and HTML parsing. Still, despite many problems,
we obtained a well-documented, comprehensive dataset, which is appropriate for further model
training. Table 1 reports the comparison of the total training data used for each of the neural models (BERT, SciBERT and ClimateBERT). Our dataset contains ∼35% of the tokens used for training SciBERT, and surpasses the number of tokens for ClimateBERT by a factor of six. The average number of sentences per paper in our dataset is ∼160% of the average reported for SciBERT. These numbers are encouraging, suggesting that we have collected sufficient high-quality text for training a BERT-based model.
To further explore the dataset content we report statistics using a readily available part-of-speech (POS) tagger and a named entity recognition (NER) model from the flair⁸ framework [36]. First, we take a random sample of 10,000 research papers to perform the analysis. Then we tokenize the texts into sentences and perform POS tagging⁹ and NER. In each POS-tagged sentence, we determine noun and verb phrases. Non-traditionally, we define heuristic noun and verb phrases as sequences of words with specific POS tags, as listed:
• Noun phrase: Cardinal number (CD), Adjective (JJ), Determiner (DT), Noun (NN), Foreign word (FW), Possessive ending (POS), Hyphen (HYPH), Symbol (SYM);
² https://www.scimagojr.com/journalrank.php?category=2306
³ http://archive.sciencewatch.com/ana/st/climate/journals/
⁴ https://github.com/pdfminer/pdfminer.six/tree/master
⁵ https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#id1
⁶ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
⁷ https://pandas.pydata.org/
⁸ https://github.com/flairNLP/flair
⁹ The full list of POS tags for the model used can be found here: https://huggingface.co/flair/pos-english.
Table 1
Training data comparison: data used for model training, number of tokens (CS), and average number of sentences (A#S) per paper, if applicable.

Model       | Data used                                                                                                    | CS     | A#S
BERT        | BooksCorpus (800M words) and English Wikipedia (2,500M words)                                                | 3.30B  | /
SciBERT     | Random sample of 1.14M papers from Semantic Scholar                                                          | 3.17B  | 154
ClimateBERT | Climate-related news articles, climate-related paper abstracts and corporate climate and sustainability reports | 0.22Bᵃ | /
OUR         | ∼200,000 climate-related research papers                                                                     | 1.25Bᵇ | 242ᶜ

ᵃ,ᵇ,ᶜ Calculation is reported in Appendix B
• Verb phrase: Verb (VB), ”to” (TO), Adverb (RB), Modal (MD).
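The heuristic chunking defined by the two tag lists above can be sketched as extracting maximal runs of tokens whose tags fall in the given set; the tag sets below extend the listed base tags with inflected variants (NNS, VBD, ...) for illustration:

```python
# Heuristic phrase chunking: a phrase is a maximal run of tokens whose POS
# tags fall in the given tag set. The tag sets extend the base tags listed
# above with inflected variants for illustration.
NOUN_TAGS = {"CD", "JJ", "DT", "NN", "NNS", "NNP", "FW", "POS", "HYPH", "SYM"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "TO", "RB", "MD"}

def chunk_phrases(tagged, tag_set):
    """tagged: list of (token, pos_tag) pairs; returns maximal runs."""
    phrases, current = [], []
    for token, pos in tagged:
        if pos in tag_set:
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tagged = [("rising", "VBG"), ("sea", "NN"), ("levels", "NNS"),
          ("threaten", "VBP"), ("coastal", "JJ"), ("cities", "NNS")]
noun_phrases = chunk_phrases(tagged, NOUN_TAGS)  # ["sea levels", "coastal cities"]
verb_phrases = chunk_phrases(tagged, VERB_TAGS)  # ["rising", "threaten"]
```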
This modification, despite being imperfect, allows for analysis of the most frequent verb and noun phrases, providing insights into possible types of relations between entities, possible named entities and entity types (e.g. person, organization, location, etc.). With this approximation, we further estimated the number of total and unique triples. Figure 1 shows the total number of verb phrases, noun phrases, entities (tagged by the NER model) and possible triples occurring in the sample of 10,000 papers. The sample consists of 2,406,799 sentences, from which we extracted a total of 15,238,265 noun phrases and 1,790,745 entities. The ratio of noun phrases to extracted entities (∼8:1) indicates the need for a NER model better fitted to the climate-change domain vocabulary. Table 4 (Appendix D) lists the top noun phrases consisting of 1, 2 and 3 words, respectively. Table 5 (Appendix E) lists the top entities for three entity types: Location Name (LOC), Organization Name (ORG) and Other Name (MISC). The number of entity types will be addressed in future work, employing more recent methods such as GLiNER [37]. Since the list contains many acronyms and abbreviations, the expansion and disambiguation problem needs to be addressed as well.
Similarly, we analyze the occurrence of verb phrases: a total of 5,934,949 verb phrases forming 486,632 unique expressions. Although this is promising, the number of unique expressions needs to be reduced to a feasible set enabling the training of a classifier to extract relations in downstream tasks. Moreover, this is an indication that many climate-change-specific relations are present, which also needs to be addressed during downstream training. Table 6 (Appendix F) reports the 30 most frequently occurring verb phrases by number of words (1, 2 and 3, respectively). We observe a high similarity between many unique verb phrases, such as ”is shown”, ”shows”, ”are shown” and ”has been shown”, indicating the obvious next step of data quality improvement by deduplication.
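The deduplication step suggested above could start with a crude normalisation that strips auxiliaries and naive inflection suffixes; this is a rough stand-in for proper lemmatisation, not a final method:

```python
# Crude verb-phrase normalisation as a first deduplication step: drop
# auxiliaries and strip naive inflection suffixes (including a trailing "n"
# to catch participles like "shown"). A rough stand-in for lemmatisation.
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being",
               "has", "have", "had"}

def normalise_verb_phrase(phrase):
    content = [w for w in phrase.lower().split() if w not in AUXILIARIES]
    if not content:
        return phrase.lower()
    head = content[-1]
    for suffix in ("ed", "n", "s"):
        if head.endswith(suffix) and len(head) > len(suffix) + 2:
            head = head[: -len(suffix)]
            break
    return head

variants = ["is shown", "shows", "are shown", "has been shown"]
normalised = {normalise_verb_phrase(v) for v in variants}  # all collapse to "show"
```

Such suffix-stripping over-merges in general (e.g. nouns ending in "n"), so a lemmatiser-backed variant would be preferable in practice.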
4. Relation Annotation
To effectively train and evaluate supervised relation extraction models, annotated data are needed [24]. To this end, we plan to engage the advanced LLM possibilities in the context of

Figure 1: Counts of noun phrases, entities, triples, and verb phrases: occurrence of noun phrases (1a), named entities (1b), possible triples (2a) and verb phrases (2b) with the count of unique expressions (dotted) in the 10,000 papers sample.

automatic or enhanced annotation of relation triples. With POS tagging and NER on the sample of 10,000 papers, we have established the foundation for possible triple detection. We anticipate that a relation exists if there is a verb between two entities, where entities are either approximated by the noun phrases we have heuristically recognised or are named entities recognised by the flair model. Moreover, we hypothesize that this will allow guided annotation by providing better context to LLM-enabled annotation. In the remainder of this section, we preview some examples of possible entities and relations in the climate-change domain¹⁰, which remains an open question to be addressed in the future:
• ’For example, Atlantic cyclones have been well documented as causing high surge levels
and heavy precipitation.’ - (Atlantic cyclones, cause, high surge levels)
• ’El Niño–Southern Oscillation (ENSO) is another important factor for
winter temperature in China.’ - (ENSO, affects, winter temperature in China)
• ’The concentration map captured a significantly high hazard of groundwater arsenic in the north and northeast India, particularly in Assam and West Bengal, ... .’ - (West Bengal, high hazard of, groundwater arsenic)

¹⁰ Underlined words are suggested entities in the sentence, where the bold parts are recognized by the flair NER model. Each sentence has a suggested triple in the form: (entity1, relation, entity2).
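The verb-between-entities anticipation described above can be sketched over precomputed token spans (entity spans from NER or noun-phrase chunking, verb spans from POS tagging):

```python
# Candidate triple detection: propose (entity1, verb phrase, entity2) whenever
# a verb phrase lies between two entity spans. Spans are (start, end) token
# offsets, assumed precomputed by NER and POS chunking.
def candidate_triples(tokens, entity_spans, verb_spans):
    triples = []
    for i, (s1, e1) in enumerate(entity_spans):
        for s2, e2 in entity_spans[i + 1:]:
            for vs, ve in verb_spans:
                if e1 <= vs and ve <= s2:  # verb strictly between the pair
                    triples.append((" ".join(tokens[s1:e1]),
                                    " ".join(tokens[vs:ve]),
                                    " ".join(tokens[s2:e2])))
    return triples

tokens = "Atlantic cyclones cause high surge levels".split()
triples = candidate_triples(tokens, [(0, 2), (3, 6)], [(2, 3)])
# -> [("Atlantic cyclones", "cause", "high surge levels")]
```

Candidates produced this way over-generate; the planned LLM-enabled annotation would then accept, reject or relabel them.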
5. Discussion and Conclusion
In this paper, we report on the first steps towards creating a dataset suitable for training
the BERT-like model that will subsequently be used for downstream climate-change relation
extraction tasks. We have collected and analyzed a set of 200,000 carefully selected scientific
papers as the high-quality content of the climate-change domain. We discuss technical details
and common pitfalls in parsing PDF and HTML documents as the first steps needed to obtain
a sufficient quantity of domain-specific data to train a BERT-based model. Next, we report
preliminary statistics of the dataset to ensure its appropriateness for downstream relation
extraction. During preliminary analysis, we identified a high number of possible different
relations, indicating that further distilling of relations and relation types should be implemented.
Moreover, our preliminary findings suggest that the new NER model tailored for the vocabulary
of the climate-change domain is required.
With these preliminary results, we open several research directions. First, the collected
dataset will be used for additional training of the SciBERT and ClimateBERT models involving
different configurations of masked language modelling (MLM) principles. Second, to reduce the
abundance of different but similar domain-specific relations we will need to develop a method
for fine-tuning annotated relations for training sentence-level relation extraction (RE) model.
This will involve the disambiguation of related relations and relation types and LLM-enabled
annotation. Finally, the main goal of this research is the construction and curation of a knowledge graph for the climate-change content captured in a high-quality journal. In future work, we plan to address KG construction-related challenges, relying on existing literature, such as the work of Dessí et al. [38] and Chessa et al. [39].
Acknowledgments
This work has been partially supported by the University of Rijeka under project number
uniri-drustv-18-20. Croatian Science Foundation supports AP under the project DOK-2021-02.
References
[1] H.-G. et al, Impacts of 1.5°C global warming on natural and human systems, in: Global
Warming of 1.5°C. An IPCC Special Report on the impacts of global warming of 1.5°C
above pre-industrial levels and related global greenhouse gas emission pathways, in the
context of strengthening the global response to the threat of climate change, sustainable
development, and efforts to eradicate poverty, Cambridge University Press, Cambridge,
UK and New York, NY, USA, 2018, pp. 175–312. doi:10.1017/9781009157940.005.
[2] C. S. Areni, Motivated reasoning and climate change: Comparing news sources, politi-
cization, intensification, and qualification in denier versus believer subreddit comments,
Applied Cognitive Psychology 38 (2024). doi:10.1002/acp.4167.
[3] J. Farrell, K. McConnell, R. Brulle, Evidence-based strategies to combat scientific misinfor-
mation, Nature Climate Change 9 (2019) 191–195. doi:10.1038/s41558-018-0368-6.
[4] P. Andre, T. Boneva, F. Chopra, A. Falk, Globally representative evidence on the actual
and perceived support for climate action, Nature Climate Change (2024). doi:10.1038/s41558-024-01925-3.
[5] R. Debnath, D. Ebanks, K. Mohaddes, T. Roulet, R. M. Alvarez, Do fossil fuel firms reframe
online climate and sustainability communication? a data-driven analysis, npj Climate
Action 2 (2023) 47. doi:10.1038/s44168-023-00086-x.
[6] N. Oreskes, E. M. Conway, Merchants of Doubt: How a Handful of Scientists Obscured the
Truth on Issues From Tobacco Smoke to Global Warming, Bloomsbury Press, 2010.
[7] S. Pawar, G. K. Palshikar, P. Bhattacharyya, Relation extraction : A survey, 2017.
arXiv:1712.05191 .
[8] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical
language representation model for biomedical text mining, Bioinformatics 36 (2019)
1234–1240. doi:10.1093/bioinformatics/btz682 .
[9] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text,
in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Process-
ing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019,
pp. 3615–3620. doi:10.18653/v1/D19-1371.
[10] N. Webersinke, M. Kraus, J. Bingler, M. Leippold, Climatebert: A pretrained language
model for climate-related text, SSRN (2022). URL: https://ssrn.com/abstract=4229146.
doi:10.2139/ssrn.4229146 .
[11] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. B. A. McDermott,
Publicly available clinical bert embeddings, 2019. arXiv:1904.03323 .
[12] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Legal-bert:
The muppets straight out of law school, 2020. arXiv:2010.02559 .
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.),
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp.
4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[14] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettle-
moyer, BART: Denoising sequence-to-sequence pre-training for natural language gen-
eration, translation, and comprehension, in: Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, Association for Computational Lin-
guistics, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703.
doi:10.18653/v1/2020.acl-main.703.
[15] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
arXiv:1910.10683 .
[16] L. N. Phan, J. T. Anibal, H. Tran, S. Chanana, E. Bahadroglu, A. Peltekian, G. Altan-
Bonnet, Scifive: a text-to-text transformer model for biomedical literature, 2021.
arXiv:2106.03598 .
[17] J. V. Kringelum, S. K. Kjaerulff, S. Brunak, O. Lund, T. I. Oprea, O. Taboureau, Chemprot-
3.0: a global chemical biology diseases mapping, Database (Oxford) 2016 (2016) bav123.
doi:10.1093/database/bav123 .
[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
anov, Roberta: A robustly optimized bert pretraining approach, 2019. arXiv:1907.11692 .
[19] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).
[20] A. Poleksić, S. Martinčić-Ipšić, Effects of pretraining corpora on scientific relation
extraction using bert and scibert, in: Joint Workshop Proceedings of 5th (Sem4Tra)
and 2nd NLP4KGC: Natural Language Processing for Knowledge Graph Construction
co-located with the 19th International Conference on Semantic Systems (SEMANTiCS
2023), volume Vol-3510 of CEUR Workshop Proceedings, Leipzig, Germany, 2023. URL:
https://ceur-ws.org/Vol-3510/paper_nlp_3.pdf.
[21] X. Zhao, Y. Deng, M. Yang, L. Wang, R. Zhang, H. Cheng, W. Lam, Y. Shen, R. Xu, A
comprehensive survey on deep learning for relation extraction: Recent advances and new
frontiers, 2023. arXiv:2306.02051 .
[22] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without
labeled data, in: K.-Y. Su, J. Su, J. Wiebe, H. Li (Eds.), Proceedings of the Joint Conference
of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP, Association for Computational Linguistics,
Suntec, Singapore, 2009, pp. 1003–1011. URL: https://aclanthology.org/P09-1113.
[23] X. Han, H. Zhu, P. Yu, Z. Wang, Y. Yao, Z. Liu, M. Sun, FewRel: A large-scale super-
vised few-shot relation classification dataset with state-of-the-art evaluation, in: Pro-
ceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 4803–4809. URL:
https://aclanthology.org/D18-1514. doi:10.18653/v1/D18-1514.
[24] H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, E. Simperl, T-REx: A
large scale alignment of natural language with knowledge base triples, in: N. Calzolari,
K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani,
H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the LREC
2018, European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL:
https://aclanthology.org/L18-1544.
[25] Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, M. Sun, DocRED: A
large-scale document-level relation extraction dataset, in: Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, Association for Computational
Linguistics, Florence, Italy, 2019, pp. 764–777. URL: https://aclanthology.org/P19-1074.
doi:10.18653/v1/P19-1074.
[26] X. Han, T. Gao, Y. Lin, H. Peng, Y. Yang, C. Xiao, Z. Liu, P. Li, J. Zhou, M. Sun, More
data, more relations, more context and more openness: A review and outlook for relation
extraction, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the
Association for Computational Linguistics and the 10th International Joint Conference on
Natural Language Processing, Association for Computational Linguistics, Suzhou, China,
2020, pp. 745–758. URL: https://aclanthology.org/2020.aacl-main.75.
[27] Z. Tan, A. Beigi, S. Wang, R. Guo, A. Bhattacharjee, B. Jiang, M. Karami, J. Li, L. Cheng,
H. Liu, Large language models for data annotation: A survey, 2024. arXiv:2402.13446 .
[28] A. Goel, A. Gueta, O. Gilon, C. Liu, S. Erell, L. H. Nguyen, X. Hao, B. Jaber, S. Reddy,
R. Kartha, J. Steiner, I. Laish, A. Feder, Llms accelerate annotation for medical information
extraction, 2023. arXiv:2312.02296 .
[29] J. Li, Z. Jia, Z. Zheng, Semi-automatic data enhancement for document-level relation
extraction with distant supervision from large language models, in: H. Bouamor, J. Pino,
K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural
Language Processing, Association for Computational Linguistics, Singapore, 2023, pp.
5495–5505. doi:10.18653/v1/2023.emnlp-main.334.
[30] R. Zhang, Y. Li, Y. Ma, M. Zhou, L. Zou, LLMaAA: Making large language models as
active annotators, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association
for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics,
Singapore, 2023, pp. 13088–13103. doi:10.18653/v1/2023.findings-emnlp.872.
[31] R. Tang, X. Han, X. Jiang, X. Hu, Does synthetic data generation of llms help clinical text
mining?, 2023. arXiv:2303.04360 .
[32] Q. Wang, K. Zhou, Q. Qiao, Y. Li, Q. Li, Improving unsupervised relation extraction by
augmenting diverse sentence pairs, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of
the 2023 Conference on Empirical Methods in Natural Language Processing, Association for
Computational Linguistics, Singapore, 2023, pp. 12136–12147. URL: https://aclanthology.org/2023.emnlp-main.745. doi:10.18653/v1/2023.emnlp-main.745.
[33] H. Khorashadizadeh, N. Mihindukulasooriya, S. Tiwari, J. Groppe, S. Groppe, Exploring
in-context learning capabilities of foundation models for generating knowledge graphs
from text, 2023. arXiv:2305.08804 .
[34] Y. Shinyama, P. Guglielmetti, P. Marsman, pdfminer.six, 2018. URL: https://pdfminersix.
readthedocs.io/.
[35] L. Richardson, Beautiful soup documentation, 2007. URL: https://www.crummy.com/
software/BeautifulSoup/bs4/doc/.
[36] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An easy-to-use
framework for state-of-the-art NLP, in: NAACL 2019, 2019 Annual Conference of the North
American Chapter of the Association for Computational Linguistics (Demonstrations),
2019, pp. 54–59.
[37] U. Zaratiana, N. Tomeh, P. Holat, T. Charnois, Gliner: Generalist model for named entity
recognition using bidirectional transformer, 2023. arXiv:2311.08526 .
[38] D. Dessí, F. Osborne, D. Reforgiato Recupero, D. Buscaldi, E. Motta, Scicero: A deep learning
and nlp approach for generating scientific knowledge graphs in the computer science
domain, Knowledge-Based Systems 258 (2022) 109945. URL: https://www.sciencedirect.com/science/article/pii/S0950705122010383. doi:10.1016/j.knosys.2022.109945.
[39] A. Chessa, G. Fenu, E. Motta, F. Osborne, D. Reforgiato Recupero, A. Salatino, L. Secchi,
Data-driven methodology for knowledge graph generation within the tourism domain,
IEEE Access 11 (2023) 67567–67599. doi:10.1109/ACCESS.2023.3292153 .
A. Data statistics
Table 2
Number of papers: The number of collected research papers in the climate-change domain according to the journal/source.
International Journal of Climatology: 3,825
Ecological Applications: 4,469
Ecosystem Health and Sustainability: 831
Energy Policy: 1,023
Journal of Climate: 15,325
Climate Dynamics: 3,943
Global Change Biology: 7,103
Journal of Geophysical Research: Atmospheres: 14,512
NPJ Climate and Atmospheric Science: 355
NPJ Ocean Sustainability: 12
NPJ Climate Action: 39
Nature Climate Change: 387
Nature Geoscience: 560
PNAS: 88,534
MDPI Water: 21,768
MDPI Air: 18
MDPI Atmosphere: 8,705
MDPI Climate: 1,232
MDPI Earth: 184
MDPI Ecologies: 115
MDPI Energies: 8,236
MDPI Hydrology: 988
MDPI Forests: 10,674
MDPI Fuels: 104
MDPI Environments: 1,012
MDPI Meteorology: 57
MDPI Sustainable Chemistry: 116
MDPI Recycling: 420
MDPI Oceans: 126
Total: 194,673
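As a quick sanity check, the per-journal counts reported in Table 2 can be summed and compared against the stated total. A minimal Python sketch (the dictionary simply transcribes the table):

```python
# Per-journal paper counts transcribed from Table 2 (climate-change corpus sources).
paper_counts = {
    "International Journal of Climatology": 3_825,
    "Ecological Applications": 4_469,
    "Ecosystem Health and Sustainability": 831,
    "Energy Policy": 1_023,
    "Journal of Climate": 15_325,
    "Climate Dynamics": 3_943,
    "Global Change Biology": 7_103,
    "Journal of Geophysical Research: Atmospheres": 14_512,
    "NPJ Climate and Atmospheric Science": 355,
    "NPJ Ocean Sustainability": 12,
    "NPJ Climate Action": 39,
    "Nature Climate Change": 387,
    "Nature Geoscience": 560,
    "PNAS": 88_534,
    "MDPI Water": 21_768,
    "MDPI Air": 18,
    "MDPI Atmosphere": 8_705,
    "MDPI Climate": 1_232,
    "MDPI Earth": 184,
    "MDPI Ecologies": 115,
    "MDPI Energies": 8_236,
    "MDPI Hydrology": 988,
    "MDPI Forests": 10_674,
    "MDPI Fuels": 104,
    "MDPI Environments": 1_012,
    "MDPI Meteorology": 57,
    "MDPI Sustainable Chemistry": 116,
    "MDPI Recycling": 420,
    "MDPI Oceans": 126,
}

# The per-journal counts add up to the reported total of 194,673 papers.
total = sum(paper_counts.values())
print(f"{total:,}")  # 194,673
```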
B. Training data comparison calculations
• a: Calculated from the reported average number of words [10].
• b: Approximation from a tokenizer trained on a 10,000-paper sample, following The Tokenization Pipeline (https://huggingface.co/docs/tokenizers/python/latest/pipeline.html).
• c: Approximation from SegtokSentenceSplitter (https://github.com/flairNLP/flair/blob/master/flair/splitter.py).
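The approximations in (b) and (c) can be reproduced in spirit with a self-contained sketch. Here a whitespace tokenizer and a regex-based sentence splitter stand in for the trained subword tokenizer and flair's SegtokSentenceSplitter; both are deliberate simplifications, so the resulting counts will differ somewhat from those produced by the tools named above:

```python
import re

def approx_token_and_sentence_counts(text: str) -> tuple[int, int]:
    """Rough token/sentence counts for corpus-size estimates.

    Whitespace tokenization understates subword-token counts, and the
    regex below is a crude stand-in for SegtokSentenceSplitter.
    """
    tokens = text.split()
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return len(tokens), len(sentences)

n_tokens, n_sentences = approx_token_and_sentence_counts(
    "Climate change is real. It affects ecosystems worldwide."
)
# n_tokens == 8, n_sentences == 2
```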
C. Common extraction problems
Table 3
Most frequent problems with text extraction: Each problem is followed by an explanation or an example; in the examples, the second line shows what is actually extracted.

Problem: Text data missing due to unexpected font size/style.
  Original: "Where 2 and two always makes up five."
  Extracted: "Where and always makes up five."
  Original: "... original BERT_BASE model with ..."
  Extracted: "... original BERT model with ..."

Problem: Wrong ordering of paragraphs.
  Layout algorithm heuristics draw wrong conclusions based on distance, e.g. the bottom-right paragraph is "closer" to the top-right paragraph than to the top-left paragraph due to a figure/table/graph.

Problem: Page numbering or similar information interrupts paragraph content.
  Original: "For navigation through HTML files, we used BeautifulSoup library."
  Extracted: "For navigation through HTML files, we PAGE 5 AUTHOR ET AL. used BeautifulSoup library."

Problem: Wrong word ordering due to justification.
  Original: "Nature climate change"
  Extracted: "Nature change climate"

Problem: Wrong symbol extraction (ligatures).
  Original: "... far-reaching effects on global ecosystems ..."
  Extracted: "... far-reaching e�ects on global ecosystems ..."

Problem: First line of paragraph missing.
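Some of these artefacts can be repaired in post-processing. For the ligature problem in particular, Unicode compatibility normalization (NFKC) expands ligature codepoints such as ﬀ (U+FB00) back into plain letters; note that it cannot recover characters already lost to the replacement character �. A minimal sketch:

```python
import unicodedata

def fix_ligatures(text: str) -> str:
    # NFKC compatibility normalization expands ligature codepoints
    # (ﬀ, ﬁ, ﬂ, ...) into their constituent ASCII letters.
    return unicodedata.normalize("NFKC", text)

broken = "far-reaching e\ufb00ects on global ecosystems"  # contains the "ﬀ" ligature
fixed = fix_ligatures(broken)
# fixed == "far-reaching effects on global ecosystems"
```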
D. Most common noun phrases
Table 4
Most common noun phrases: Top 30 noun phrases by the number of words (1, 2 and 3) with the
corresponding counts (#).
Noun phrase (1) # Noun phrase (2) # Noun phrase (3) #
the 205,685 this study 28,914 the other hand 5,417
a 78,714 si appendix 23,542 the study area 4,049
this 51,495 the results 20,506 the present study 3,789
1 37,595 the number 16,034 37 ° c 2,564
2 29,195 the model 15,671 4 ° c 2,540
data 24,962 the presence 14,187 the same time 2,441
that 23,472 table 1 13,144 an important role 2,336
those 22,015 this work 11,452 the united states 2,065
such 21,800 the data 10,638 the time series 2,056
addition 20,902 the authors 10,144 p < 0.001 1,982
i.e. 20,685 the effect 9,741 the total number 1,847
3 19,837 these results 9,237 a wide range 1,745
one 19,375 this paper 9,146 the national academy 1,646
consistent 18,599 table 2 8,146 the north atlantic 1,629
c 18,403 the case 7,776 the spatial distribution 1,603
p 18,101 the effects 7,701 p < 0.05 1,553
t 17,330 climate change 7,632 a large number 1,376
cells 16,998 an increase 7,187 the standard deviation 1,277
results 16,290 the absence 6,923 the northern hemisphere 1,249
similar 16,193 the use 6,823 the indian ocean 1,095
4 15,813 the difference 6,800 the study period 1,091
changes 15,666 fig. 1 6,747 p < 0.01 1,070
e.g. 15,456 the study 6,624 25 ° c 1,011
precipitation 15,163 a result 6,581 the north pacific 985
5 15,128 figure 1 6,571 the current study 979
example 14,729 the surface 6,556 wang et al 977
contrast 14,082 the impact 6,316 30 ° c 939
water 13,820 the analysis 6,084 the plasma membrane 905
b 13,182 figure 2 5,886 the boundary layer 902
time 12,893 the region 5,876 20 ° c 897
E. Most common entities
Table 5
Most common entities: Top 30 entities for three entity types: Location (LOC) name, Miscellaneous
(MISC) name and Organization (ORG) name with counts (#).
LOC # MISC # ORG #
China 11686 DNA 6599 ENSO 6616
Pacific 9480 Arctic 4182 PNAS 3289
Europe 4213 SI Appendix 3563 EC 2822
United States 4145 F 2786 SST 2775
Atlantic 3653 European 2684 TC 2358
USA 3459 Equation 2603 NAO 2011
North Atlantic 3341 C 2289 ATP 1939
Indian Ocean 2767 Western 1932 N. Institutes of Health 1934
North America 2747 Asian 1930 El Niño 1617
Africa 2665 Chinese 1704 IPCC 1487
CA 2581 SST 1654 NCEP 1457
Australia 2472 Indian 1338 MDPI 1440
Japan 2345 Mediterranean 1308 EU 1396
US 2255 CMIP5 1301 MJO 1326
Germany 2223 Arabidopsis 1246 N. Sci. Fdn. 1236
North Pacific 2098 MJO 1238 SLP 1227
Asia 1987 UTC 1193 WRF 1184
India 1895 Gaussian 1156 NCAR 1181
Canada 1821 African 1148 NOAA 1128
Northern Hemisphere 1810 Bayesian 1089 NIH 1074
South America 1692 North American 973 TP 1040
U.S. 1663 RNA 925 PBL 1010
California 1409 CT 912 PCR 1008
Beijing 1382 GCM 907 Univ. of California 979
Greenland 1290 III 893 The N. Acad. of Sci. 963
MA 1229 ROS 870 ITCZ 941
UK 1228 BC 820 PCA 907
Southern Hemisphere 1212 Eurasian 814 ∇ 905
Eurasia 1208 DEM 768 RMSE 896
Southern Ocean 1196 PDO 766 WNP 872
Abbreviations: N. - National, Sci. - Science, Fdn. - Foundation, Univ. - University, Acad. - Academy, ∇ - National Natural Science Foundation of China
F. Most common verb phrases
Table 6
Most common verb phrases: Top 30 occurring verb phrases by number of words (1, 2 and 3) with
counts (#).
Verb phrase (1) # Verb phrase (2) # Verb phrase (3) #
is 267,521 as well 20,097 can be seen 3,294
are 108,928 is not 9,363 should be addressed 2,764
using 61,697 may be 8,727 can be found 2,396
was 60,454 was supported 8,307 can be used 1,672
however 58,755 are shown 7,242 should be noted 1,609
were 40,883 to determine 7,061 may be addressed 1,585
respectively 33,872 not shown 6,555 have been deposited 1,463
compared 29,998 was used 6,541 can be obtained 1,139
based 29,822 is also 5,826 appears to be 1,097
used 29,393 were used 5,801 was performed using 1,024
shows 27,179 was observed 5,446 can be explained 970
has 24,828 to be 5,444 may not be 968
thus 24,575 is shown 5,364 can be expressed 887
observed 24,346 were obtained 4,836 can be observed 877
including 23,660 was performed 4,819 has been reported 874
therefore 23,036 to identify 4,339 has been shown 802
showed 22,703 to assess 4,210 did not affect 790
’s 22,429 is more 4,167 can be calculated 790
show 21,582 were performed 4,122 have been identified 779
only 21,147 are not 4,053 can be attributed 740
have 20,796 to test 3,991 were performed using 732
increased 20,556 would be 3,923 can be considered 721
found 20,306 have shown 3,780 have been reported 674
more 19,132 were collected 3,737 to better understand 671
following 18,966 can be 3,698 seems to be 665
to 18,603 could be 3,630 did not show 634
see 17,771 to obtain 3,561 has been observed 606
most 17,398 is based 3,551 should be considered 594
associated 16,963 will be 3,540 has been used 568
had 16,785 that is 3,495 has been suggested 566