
June

1613-0073

10.1007/s10115-019-01351-4

Hierarchy for Named Entity Type Discovery in the Climate Change Domain

Andrija Poleksić

andrija.poleksic@uniri.hr

Sanda Martinčić-Ipšić


Workshop

Faculty of Informatics and Digital Technologies, University of Rijeka, Radmile Matejčić 2, 51000 Rijeka, Croatia

Keywords: Named Entity Recognition, Information Extraction, Climate Change, Wikidata, Knowledge Graphs, Community Detection

2014

5 2025 1254 1257

Named Entity Recognition (NER) is a fundamental task in information extraction, yet general-purpose NER categories often fail to capture the specificity required for specialized domains such as climate change research. This paper presents a methodology for the automatic construction of a domain-specific NER type set with minimal supervision, leveraging a schema-based bottom-up approach to knowledge graph construction. The process begins with the identification of 655 core climate change-related terms, sourced from authoritative domain-specific resources. These terms are then semi-automatically aligned with Wikidata using SPARQL queries to take advantage of its hierarchical structure. A neighbourhood graph is constructed based on the instance of (P31) and subclass of (P279) properties, forming the basis for community detection via the weighted Louvain algorithm. The resulting 59 communities are manually analyzed to derive a final set of 21 NER types, including Energy Source, Natural Disaster, Meteorological Phenomenon, and Chemical. Validation against existing ontologies and a terminological knowledge base (SWEET, ENVO, and EcoLexicon) reveals that the SWEET ontology provides the highest coverage, containing 57.25% of core terms and 65.38% of the proposed NER types. The findings demonstrate that integrating knowledge graphs, NLP-based information extraction, and community detection provides an effective approach for domain-specific NER schema construction.


1. Introduction

CEUR

[...] the knowledge graph itself [11, 12]. When building a KG, it is desirable to define classes or types of (named) entities and relations. For example, cumulonimbus and stratocumulus could be grouped under the class clouds; further, clouds and the entity El Niño Southern Oscillation could be defined as elements of the class meteorological phenomenon [METP]. With regard to the newly defined class, it is possible to set specific restrictions for individual relations, e.g. for the relation "causes" the restriction ([METP], causes, [METP]) can be set.
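To illustrate how such a type restriction could be enforced, the following is a minimal sketch, not the authors' implementation. The entity-to-type assignments, the `ORG` label and the function name `is_valid_triple` are assumptions made for the example.

```python
# Hypothetical entity-to-type assignments, following the example in the text.
ENTITY_TYPES = {
    "cumulonimbus": "METP",
    "stratocumulus": "METP",
    "El Niño Southern Oscillation": "METP",
    "University of Rijeka": "ORG",  # assumed label, for contrast only
}

# Allowed (subject type, relation, object type) patterns.
RESTRICTIONS = {("METP", "causes", "METP")}


def is_valid_triple(subject, relation, obj):
    """Return True if the triple satisfies one of the type restrictions."""
    s_type = ENTITY_TYPES.get(subject)
    o_type = ENTITY_TYPES.get(obj)
    return (s_type, relation, o_type) in RESTRICTIONS
```

A triple whose subject or object has an unknown or mismatching type is rejected, which is how a schema restriction narrows the candidate triples.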

Named entity recognition (NER) is an information extraction (IE) component that plays a fundamental role in the automated analysis of scientific literature [13, 14]. Traditionally framed as a sequence labeling task, NER aims to assign predefined entity types - such as location, organization, and person - to text spans. However, such coarse-grained categories are often insufficient to capture the nuances of specialized domains such as climate change research. To address this issue, this work focuses on refining NER for the automatic construction of KGs from textual data in the climate change domain. We utilize existing resources (i.e. climate change terminology dictionaries) to develop a domain-specific set of NER types that are consistent with the Wikidata type terminology [15]. Our approach grounds the derived entity types in a corpus of scientific publications in the climate change domain curated in [16] to ensure consistency with real-world climate change research discourse.

Specifically, the contributions are:
• An NER type discovery methodology for a selected domain (e.g. climate change) with minimal supervision;
• A derived set of NER types for the climate change domain;
• An alignment of the derived entity types with Wikidata, supported by coverage in existing climate change domain ontologies.

The paper is structured as follows. Section 2 discusses the principles of KG construction with a focus on the construction of domain-specific KGs and the associated problems. Section 3 covers related work on the use of existing resources (dictionaries and KGs) for various information extraction tasks, with an emphasis on NER. Section 4 discusses existing NLP resources in the climate change domain that can be utilized. Section 5 follows with the entity type discovery methodology, in particular the creation of a core entity set for climate change and the use of the Wikidata hierarchy for (named) entity type discovery. In Section 6 the results are presented. We conclude with Section 7 and discuss the limitations and future work in Section 8.

2. Knowledge Graph Construction

The creation of general, comprehensive, encyclopedic knowledge graphs is a long-term and continuous process that requires a large amount of resources, and traditionally relies on scientific research results and projects based on community collaboration. Examples of such knowledge graphs are DBpedia [17] (2007), YAGO [18] (2007), BabelNet [19] (2012), and Wikidata [15] (2014), currently the largest knowledge graph with 114,097,305 nodes and 24,190 active users¹.

In the work of Abu-Salih [9], the creation of a knowledge graph is divided into a schema-based, a schema-free and a hybrid approach, of which the first approach is applicable for the aims of this research. In addition, the schema-based approach can be realized based on two strategies: bottom-up and top-down [10, 20]. The top-down approach implies the initial construction of an ontology/schema or the use of an existing schema and the extraction of knowledge based on a given schema. An example of this approach is the YAGO knowledge graph with strictly defined, non-redundant types of entities and relations and logical constraints on them. In the bottom-up approach, the focus of creation is on the content itself, i.e. the data. Potential entities and relations are first extracted, and the initial knowledge graph schema or ontology is created based on the extracted data. Tamašauskaitė and Groth [10], in a systematic review of 57 scientific papers on the process of creating knowledge graphs, find that 70% of the papers describe a bottom-up approach, an approach that corresponds to the current data-centric trend that we follow in our research as well.

¹ https://www.wikidata.org/wiki/Wikidata:Statistics

So far, only encyclopedic (i.e. cross-domain) knowledge graphs (e.g. Wikidata, DBpedia and YAGO) have been mentioned, but there are also increasingly popular domain-specific knowledge graphs such as KnowLife [4], PaintKG [21] and CS-KG [5] in the fields of health, art and computer science respectively. The creation of knowledge graphs for a selected domain encounters domain-specific challenges in addition to the general problems of building knowledge graphs:
• Complexity of domain terminology - a specific domain usually has a specialized vocabulary and technical terms that are not correctly represented in multi-domain (general) knowledge bases;
• The need for expert domain knowledge - the evaluation and validation of knowledge graphs requires a domain expert, and expertise is also required when creating the schema/ontology of the knowledge graph itself;
• Limitations of existing models for information extraction - specific domains have their specific entities and relations, which general models fail to extract (i.e. they have not learned the domain-specific relations and entities and are not capable of distinguishing nuanced meanings of domain phrases);
• Lack of domain ontology - usually, in a specific domain, there is no clearly defined ontology, which makes it difficult to structure and organize the knowledge graph schema. Without an established domain ontology, it becomes difficult to define relations between entities while ensuring consistency and enabling coherent integration of new information.

To overcome these challenges, automating the construction of domain knowledge graphs, in particular by developing NLP (natural language processing) methods for information extraction, plays a central role.

3. Related Work

The automation of knowledge graph construction is based on unsupervised and/or semi-supervised information extraction procedures, reducing the need for time-consuming and expensive manual data labeling. When building a domain knowledge graph, it is necessary to utilize existing (digital) resources to automate the process and reduce the amount of manual labeling.

Thus, Cai et al. [22] use an existing, more general (coarse), medical domain knowledge graph to create a specific (fine) knowledge graph for the oncology domain. The authors address three types of triples: overlapping triples, where both the coarse and fine domain KGs contain certain triples; triples of new relations but overlapping entities, where the fine domain KG includes both entities but lacks the relation between them; and triples of new entities, where at least one entity does not exist in the coarse KG. To tackle coarse-to-fine KG domain adaptation, they propose an end-to-end KG domain adaptation (KGDA) framework using distant supervision. This framework enables the construction of a KG from fully unlabeled raw text data under the guidance of an existing KG. While this system provides promising results, it relies on the assumption that both KGs have the same types of entities and relations.

Wang et al. [6] use a dictionary and classification of terminology from the geology and mineral resources domain and create a directed graph based on the frequency of bigrams and the order of words in the sentence.

Yuan et al. [7] argue that most existing knowledge graph construction methods are based on large knowledge graphs or existing extensive ontologies/taxonomies, and therefore use the available UMLS thesaurus [23], based on which they recognize domain entities. High-frequency pairs of entities in sentences become potential facts (i.e. triples: entity - relation - entity) for which latent groups (clusters) of relation types are obtained using contextualized embeddings. The clusters of potential relation types are then manually labeled. This significantly reduces annotation cost without loss of quality (instead of labeling each instance of relations, the entire group or all instances of a type are labeled simultaneously).

Frei and Kramer [24] integrate Wikipedia² and Wikidata to systematically extract text data and annotation information for Named Entity Recognition (NER). Their approach utilizes the graph relations (properties) of Wikidata to derive NER types. In particular, they use properties such as P2176 (drug or therapy used for treatment) to identify entities - e.g. diseases with known treatments - and assign them the NER type TREATABLE_HEALTH_ISSUE. This method shows how structured knowledge graphs can be effectively used to generate domain-specific NER categories and improve the annotation of entities in specialized corpora.

Lippolis et al. [25] introduce two approaches for entity alignment between ArtGraph and Wikidata. The first method, Wikidata Entity Search (WES), uses simple SPARQL queries to establish entity correspondences. The second approach, pArtLink, leverages the generative capabilities of large language models in conjunction with established entity-linking techniques such as GENRE [26] and Wikimapper³ to increase alignment accuracy. ArtGraph, a domain-specific knowledge graph created from WikiArt and DBpedia, encapsulates structured representations of concepts related to works of art.

Nie et al. [13] present the Know-Adapter framework for few-shot NER. The authors emphasize the benefits of incorporating explicit knowledge from external sources, such as knowledge graphs, while addressing the heterogeneity between knowledge graph entity types and NER types. Specifically, for a given mention in a sentence, they build a retriever to find its closest match in Wikidata. They then construct a 3-hop subgraph around the matched entity by traversing Wikidata properties (relations). This approach creates a structured mapping from multiple Wikidata entities that differ in specificity to a single NER type and utilizes the Wikidata hierarchy to improve entity type classification. In contrast to their approach, which expands entity types to improve few-shot entity classification, our research focuses on the compression and standardization of entity types by refining a broad and diverse set of entities into a finite set of well-defined NER types. Specifically, we aim to create a structured and domain-relevant taxonomy for climate change research that ensures consistency and usability in automated knowledge graph construction.

Inspired by these lines of research, we use existing resources such as dictionaries [6, 7], which presumably contain domain entities of different granularity, and combine them with a more general knowledge graph (Wikidata) [22, 25] to construct a hierarchy [13] and produce a final set of NER types for the climate change research domain.

4. Existing Resources

As discussed in Section 3, knowledge-intensive research benefits from available resources. In this sense, this section looks at existing sources that have been used directly or as a reference point in this research, especially existing domain dictionaries, terminologies and ontologies.

The Full Weather Glossary⁴ from the National Oceanic and Atmospheric Administration (NOAA) - National Weather Service (NWS) contains a total of 355 terms with definitions. There is also an extension of this glossary with more than 2000 terms, phrases and abbreviations used by the NWS⁵. The Glossary of Meteorology⁶ from the American Meteorological Society (AMS) is the authoritative source for definitions of meteorological terms. From the AMS and NWS glossaries we have extracted a total of 9,511 climate-change-related terms and corresponding definitions.

Webersinke et al. [27] expand the vocabulary when pretraining their models: they add a list of 255 terms⁷ (tokens) with the highest frequency in their climate-change-related pretraining corpus to the original DistilRoBERTa [28] vocabulary. We add these 255 terms to our dictionary of climate-change-related terms.

² https://www.wikipedia.org/
³ https://github.com/jcklie/wikimapper
⁴ https://www.weather.gov/otx/Full_Weather_Glossary
⁵ https://forecast.weather.gov/glossary.php?
⁶ https://glossary.ametsoc.org/wiki/Welcome
⁷ https://huggingface.co/climatebert/distilroberta-base-climate-f

Reimerink et al. [8] construct a new multilingual terminological knowledge base (TKB) for environmental science, EcoLexicon⁸. The construction of EcoLexicon began in 2003 with a core list of 794 environmental terms in Spanish and English. For each term, definitions were elaborated, reflecting the level of generality or specificity of the concept as well as its relations with other concepts within the same knowledge domain. The original list of terms was enriched by the addition of new terms as well as by its transformation into a conceptual network. Currently, EcoLexicon contains 4,654 concepts of environmental science and 24,968 terms in eight languages (English, Spanish, German, French, Dutch, Modern Greek, Russian and Arabic) [29]. The EcoLexicon data includes concepts, terms, and semantic relations organized within a frame-like structure called the Environmental Event.

The Environment Ontology (ENVO)⁹ is a community-driven ontology that supports the representation of environments beyond the biological and biomedical domains [30, 31]. ENVO consists of classes (terms) that refer to the main types of environments and can facilitate the retrieval and integration of a wide range of biological data. The authors follow the principles of the Open Biomedical and Biological Ontologies (OBO) Foundry and align their ontology with the Basic Formal Ontology (BFO) [32]. ENVO consists of 7,030 classes (terms), such as ENVO's biome, environmental feature, and environmental material hierarchies – the ontology's most developed branches and of the greatest interest to annotators. Recently, when adapting to BFO, some of the hierarchies were revised and made obsolete, such as environmental features.

Semantic Web for Earth and Environmental Terminology (SWEET)¹⁰ [33] is a highly modular ontology suite with 10,239 concepts (classes)¹¹ in 200 separate ontologies covering Earth system science. SWEET is a mid-level ontology and consists of nine top-level concepts that can be used as a foundation for deriving domain-specific ontologies by extending these top-level SWEET components.

In [16] we describe our climate research corpus, consisting of research papers from renowned journals on climate change, which we use in this work. We conducted an exploratory pre-study in which we applied a readily available NER model and a POS tagger from flair¹² to a sample of 10,000 research papers (∼5% of the corpus). With the insights gained from this preliminary experiment, we decided to experiment with LLM-assisted annotation, in particular using Phi-3-mini-4k-instruct¹³ deployed locally for a sentence-level triple extraction task.

5. Entity Discovery

5.1. Core Entity Set

Building upon authoritative sources, including the Full Weather Glossary, the Glossary of Meteorology, Wikipedia glossaries and term expansions in ClimateBERT (dictionary), as well as our prior research [16], which includes NER results (NER), exploratory LLM-based annotations (Phi3) and extracted keywords (keywords), we systematically construct a core entity set for the climate change domain. The selection is based on a majority overlap criterion that requires an exact match in at least three out of four sources. In the initial experiments, we included POS tagging results (POS), treating noun phrases as candidate entity terms. However, this approach resulted in a noisy set of instances that did not contribute to the expansion of the core set, so POS-derived votes are excluded. In the refinement steps, we experimented with different overlap ratios and case sensitivity. Ultimately, with a majority (three out of four) of votes, we settled on a case-sensitive overlap strategy that balances corpus-driven entity selection (NER, Phi3 and keywords) with the integration of terminologies from authoritative sources (dictionary).
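The majority overlap criterion can be sketched as a simple voting scheme. This is an illustrative sketch with toy term sets; the function name `core_terms` and the example terms are assumptions, and only the four source names and the three-out-of-four, case-sensitive rule come from the text.

```python
from collections import Counter


def core_terms(sources, min_votes=3):
    """Keep terms that occur (exact, case-sensitive match) in at least
    `min_votes` of the given sources."""
    votes = Counter()
    for terms in sources.values():
        votes.update(set(terms))  # each source contributes one vote per term
    return {term for term, count in votes.items() if count >= min_votes}


# Toy term sets for the four sources named in the text.
sources = {
    "dictionary": {"El Niño", "jet stream", "albedo"},
    "NER": {"El Niño", "jet stream", "Paris"},
    "Phi3": {"El Niño", "albedo", "jet stream"},
    "keywords": {"El Niño", "albedo", "el niño"},
}
selected = core_terms(sources)
```

Because matching is case-sensitive, the lower-case variant "el niño" counts as a separate term and does not reach the threshold.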

This process results in a set of 818 core terms, which subsequently undergo cleaning and deduplication. After removing duplicates, 766 unique terms remain. These terms are then validated against the entire corpus [16] by computing their occurrence frequency. Terms that occur fewer than 10 times are excluded from further analysis. This process corresponds to entity detection in phase one of building a knowledge graph, i.e. the discovery step proposed in [34].

⁸ https://ecolexicon.ugr.es/en/index.htm
⁹ https://sites.google.com/site/environmentontology/
¹⁰ https://github.com/ESIPFed/sweet
¹¹ https://bioportal.bioontology.org/ontologies/SWEET
¹² https://github.com/flairNLP/flair
¹³ https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
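The frequency-based validation can be sketched as follows. How occurrences are counted is not specified in the text, so the case-sensitive whole-phrase matching used here is an assumption, as are the function names and the toy documents; only the threshold of 10 comes from the text.

```python
import re


def term_frequency(term, documents):
    """Count case-sensitive whole-phrase occurrences of `term` in all documents."""
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")
    return sum(len(pattern.findall(doc)) for doc in documents)


def filter_by_frequency(terms, documents, min_count=10):
    """Drop terms that occur fewer than `min_count` times in the corpus."""
    return {t for t in terms if term_frequency(t, documents) >= min_count}


# Toy corpus: "jet stream" occurs 10 times, "dry line" only once.
documents = ["jet stream " * 10, "a dry line appears once"]
kept = filter_by_frequency({"jet stream", "dry line"}, documents)
```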

Next, inspired by [25], we perform an automatic alignment of the core terms with Wikidata using three SPARQL queries: an exact match, a case-invariant match, and a substring-based ("contains") query (see Appendix A). This automated process yields preliminary results, which are then manually curated. During curation, the results are categorized into four distinct groups: (1) Out of scope: 4 terms; (2) Requires disambiguation: 144 terms; (3) Manually corrected (fixed item): 255 terms; (4) Good match: 363 terms. The simple automatic comparison thus matches 47.39% of the terms with Wikidata. The subsequent manual alignment corrects an additional 33.29%, bringing the total number of aligned terms to 618 (80.68%). For the ambiguous group, we align relevant climate-change-related terms from Wikidata that are similar to the ambiguous entries and add 37 more terms to the set. As a result, we obtain a final set of 655 core terms aligned with Wikidata items. An example of the alignment is given in Table 1, with some terms that have an inherent domain-specific contextualization. For instance, the term Barber, which is conventionally associated with an occupational role, is instead categorized within the meteorological domain as a specific type of wind.
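The three matching modes could be expressed as SPARQL query templates along the following lines. The actual queries are those given in Appendix A; this sketch only illustrates the idea, and the function `build_query`, its parameters and the restriction to English `rdfs:label` values are assumptions. The label-scanning modes are likely too slow for the public Wikidata endpoint without further restrictions.

```python
def build_query(term, mode="exact", lang="en", limit=10):
    """Build a SPARQL query string that looks up Wikidata items by label."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    if mode == "exact":
        pattern = f'?item rdfs:label "{escaped}"@{lang} .'
    elif mode == "case_invariant":
        pattern = (
            "?item rdfs:label ?label . "
            f'FILTER(LANG(?label) = "{lang}" && '
            f'LCASE(STR(?label)) = LCASE("{escaped}"))'
        )
    elif mode == "contains":
        pattern = (
            "?item rdfs:label ?label . "
            f'FILTER(LANG(?label) = "{lang}" && '
            f'CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))'
        )
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"SELECT ?item WHERE {{ {pattern} }} LIMIT {limit}"


exact_query = build_query("Barber")
contains_query = build_query("sea breeze", mode="contains")
```

The reported shares follow directly from the group sizes: 363/766 ≈ 47.39%, 255/766 ≈ 33.29% and 618/766 ≈ 80.68%.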

5.2. Wikidata Subgraph

Wikidata incorporates several hierarchical (vertical) relations, referred to as properties, such as instance of (P31) and subclass of (P279). Using the core terms aligned with Wikidata items and these two relations, we construct a neighbourhood graph. In this graph, for each core term, we identify (h, d)-hop neighbours in each direction, where h, d ∈ ℕ, with h representing height and d representing depth. Height refers to the number of hops in the abstraction direction (towards the top), while depth refers to the number of hops in the concretization direction (towards the bottom). Specifically, for each core term, we recursively search for items that are instances of or subclasses of the given term. Conversely, we also search for items that the given term is an instance of or a subclass of, based on the P31 and P279 relations. This process enables us to capture the hierarchical structure and the relationships between the terms within the graph.
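The neighbourhood search can be sketched as a bounded breadth-first traversal in both directions. The edge list below is a toy excerpt: the node names follow the wind example in this section, but which edges are P31 and which are P279 is an assumption, as are the function and parameter names.

```python
from collections import defaultdict

# Toy (child, property, parent) edges; property assignments are assumed.
EDGES = [
    ("mistral", "P31", "katabatic wind"),
    ("katabatic wind", "P279", "fall wind"),
    ("fall wind", "P279", "wind"),
    ("sea breeze", "P279", "wind"),
    ("wind", "P279", "meteorological phenomenon"),
]


def neighbourhood(term, edges, height, depth):
    """Return ({node: hops up}, {node: hops down}) around `term`."""
    up, down = defaultdict(set), defaultdict(set)
    for child, _prop, parent in edges:
        up[child].add(parent)    # abstraction direction
        down[parent].add(child)  # concretization direction

    def walk(adjacency, max_hops):
        seen, frontier = {}, {term}
        for hop in range(1, max_hops + 1):
            reached = {n for node in frontier for n in adjacency[node]}
            frontier = reached - set(seen) - {term}
            for node in frontier:
                seen[node] = hop  # first (shortest) hop count wins
        return seen

    return walk(up, height), walk(down, depth)


above, below = neighbourhood("mistral", EDGES, height=2, depth=1)
```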

Figure 1 illustrates a neighbourhood graph for five terms - mistral, jet stream, sea breeze, westerlies and katabatic wind - with height = 2 and depth = 1. In this graph, the instance of (P31) relations are represented by solid lines, while the subclass of (P279) relations are shown with dashed lines. In this case, the concretization direction is not relevant, as the starting terms (i.e. at level 0) are already sufficiently specific. However, moving in the direction of abstraction (i.e. towards the top) reveals a wealth of valuable instances. In particular, the level 2 instance wind serves as a direct abstraction for two starting terms (sea breeze and westerlies), while indirectly encompassing the remaining three terms (jet stream via thermal wind, katabatic wind via fall wind and air current, and mistral via katabatic wind). The item wind effectively encapsulates the meaning of all starting terms in this context, suggesting that it could serve as a representative entity type. A further abstraction step leads to meteorological phenomenon, which can also be a viable choice. In this way, we proceed to identify potential Named Entity Recognition (NER) types for the identified core entity set (i.e. the 655 detected core terms) by utilising the hierarchical structure of the Wikidata graph, which guides the discovery of relevant entity categories. Note that Figure 1 is a simplification of the original structure that would be created from the five terms used; the full graph is shown in Appendix E.

The hierarchical structure of the Wikidata subgraph is rich and valuable. Still, it contains a large number of nodes and edges, making it difficult to manually navigate and identify an optimal representative node (i.e. a Wikidata item) for NER classification. To overcome this challenge, we utilized Graphia¹⁴, an open-source visual analytics application designed to facilitate the interpretation of large and complex datasets. By leveraging Graphia's graph analysis and transformation capabilities, we refine the subgraph to improve its interpretability. To achieve this, we apply the following preprocessing steps:
• Removal of leaf nodes, which are not candidates for NER types;
• Filtering based on node height: all nodes with a height of ≥ 4 are removed, since such a height value indicates a term that is too abstract, e.g. metaclass (Q19478619);
• Removal of nodes with in-degree ≤ 1, since such terms do not contribute to the abstraction.
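The three preprocessing steps can be sketched outside Graphia as a single filtering pass. This is an illustration under assumptions: edges point from the more specific item to the more abstract one, so in-degree counts the items abstracting to a node, and all criteria are evaluated once on the original graph. The function name `prune`, the toy edges and the toy height values are hypothetical.

```python
from collections import Counter


def prune(edges, heights, max_height=4, min_in_degree=2):
    """Keep nodes with in-degree >= min_in_degree and height < max_height.

    Leaf nodes have in-degree 0 here, so they are removed by the same test.
    """
    in_degree = Counter(parent for _child, parent in edges)
    nodes = {node for edge in edges for node in edge}
    keep = {
        node
        for node in nodes
        if in_degree[node] >= min_in_degree and heights.get(node, 0) < max_height
    }
    kept_edges = [(c, p) for c, p in edges if c in keep and p in keep]
    return keep, kept_edges


# Toy (child, parent) edges and toy average heights.
edges = [
    ("sea breeze", "wind"),
    ("westerlies", "wind"),
    ("wind", "meteorological phenomenon"),
    ("cloud", "meteorological phenomenon"),
    ("meteorological phenomenon", "metaclass"),
    ("second-order class", "metaclass"),
]
heights = {"wind": 1.5, "meteorological phenomenon": 2.5, "metaclass": 4.0}
kept_nodes, kept_edges = prune(edges, heights)
```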

The height of a node is determined by its position relative to the initial core terms. Specifically, for each term, we compute its outgoing h-hop neighbourhood using the instance of (P31) and the subclass of (P279) relations, as well as its incoming d-hop neighbourhood. Each term that appears in the neighbourhood is assigned a value based on the number of hops from the initial term. These assigned values are then averaged across occurrences to obtain a measure of overall height, which quantifies the level of abstraction of a given term (see Appendix C).
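The averaging of hop values can be sketched as below. The exact definition is the one in Appendix C; this sketch assumes that only hops in the abstraction direction are averaged, and the function name and toy hop counts are hypothetical.

```python
from collections import defaultdict


def average_heights(hops_per_term):
    """Average, for each node, its hop distance over all core terms reaching it."""
    values = defaultdict(list)
    for hops in hops_per_term.values():
        for node, hop in hops.items():
            values[node].append(hop)
    return {node: sum(v) / len(v) for node, v in values.items()}


# Toy upward hop counts for two core terms.
hops_per_term = {
    "mistral": {"katabatic wind": 1, "fall wind": 2, "wind": 3},
    "sea breeze": {"wind": 1},
}
node_heights = average_heights(hops_per_term)
```

Here wind is reached after three hops from mistral and one hop from sea breeze, giving an average height of 2.0.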

After these preprocessing steps, we run the weighted Louvain algorithm [35] with the granularity parameter set to 1, using edge weights to reflect relation importance. We argue that the instance of (P31) relation should be considered more significant than the subclass of (P279) relation and therefore assign them weights of 1.0 and 0.5, respectively. This weighting ensures that the communities formed by the Louvain algorithm better reflect meaningful entity groupings for NER classification. In this way, we obtain 59 components (i.e. communities) that are potential NER types for the climate change research domain. After manual inspection of each community, we identify a central node (i.e. a node with high in-degree centrality), with many connected terms abstracting to it. We also favor nodes with a lower height value whenever possible, as this provides a balance between over-abstraction and over-specificity. This ensures that the selected node serves as a well-generalized yet meaningful representative term within its community, making it a suitable candidate for NER type determination. Examples with the five highest in-degree values in three communities are given in Table 2.
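Louvain greedily maximizes weighted modularity, so the effect of the 1.0/0.5 weighting can be illustrated by computing that objective for a fixed partition. The sketch below is not the Louvain algorithm itself and not Graphia's implementation: it treats the graph as undirected, and equating the granularity parameter with the `resolution` factor in the modularity formula is an assumption.

```python
from collections import defaultdict

# Edge weights from the text: instance of = 1.0, subclass of = 0.5.
WEIGHTS = {"P31": 1.0, "P279": 0.5}


def modularity(edges, community_of, resolution=1.0):
    """Weighted modularity Q = sum_c [ L_c / m - resolution * (d_c / 2m)^2 ]."""
    m = sum(WEIGHTS[prop] for _u, prop, _v in edges)  # total edge weight
    internal = defaultdict(float)  # L_c: weight of edges inside community c
    degree = defaultdict(float)    # d_c: summed weighted degree of community c
    for u, prop, v in edges:
        w = WEIGHTS[prop]
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(
        internal[c] / m - resolution * (degree[c] / (2 * m)) ** 2 for c in degree
    )


# Two P31 pairs joined by a weaker P279 edge.
edges = [("a", "P31", "b"), ("c", "P31", "d"), ("b", "P279", "c")]
split = {"a": 0, "b": 0, "c": 1, "d": 1}
merged = {"a": 0, "b": 0, "c": 0, "d": 0}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

With these weights the split along the weaker subclass of edge scores higher than a single merged community, which is the behaviour the weighting is meant to encourage.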

After acquiring the 59 community (cluster) representatives, we conduct a manual inspection to refine the selection. First, we merge similar classes, such as mathematical expression and mathematical concept. Additionally, we eliminate community representatives that are either overly abstract or unrelated to the field, including metaclass, telecommunication network and second-order class (see the middle row of Table 2). Finally, we review and remove the majority of communities containing only a single instance, as they do not contribute to the overall classification structure. After this step, we retain 26 representative terms as potential NER types (classes). In Section 6, we ensure alignment with existing domain-specific classifications by manually comparing the extracted terms with established ontologies and terminological knowledge bases, including EcoLexicon, ENVO, and SWEET (see Section 4). Further, we compute the number of terms occurring in each domain-related KG, and we validate the NER types by counting the number of instances under each category. Finally, we present the Louvain clustering results with community statistics.

6. Results

As mentioned in Section 5.1, we calculated the frequency of occurrence for 766 unique terms (including the final 655 core entity terms) in the entire corpus. The top 10 most frequently occurring terms are water, model, Time, temperature, analysis, precipitation, climate, low, soil and level. The bottom 10 are Advanced Weather Interactive Processing System, dry line, red beds, pseudoboehmite, Tramontana, geomagnetism, North Greenland Ice Core Project, Advanced Baseline Imager, small hail and pressure jump. The full list is reported in Table 5 (Appendix B).

Further, we perform a case-insensitive match of the identified 655 core terms against other ontologies. In particular, we search for each core term in two available ontologies, SWEET and ENVO, excluding EcoLexicon as it is not accessible via an API and cannot be used locally. For the SWEET ontology, we find a match for 375 core terms (57.25%), and for ENVO we find a match for 117 (17.86%). Of the 117 terms that match in ENVO, 105 (89.74%) are also in the SWEET ontology. This limited alignment indicates that the SWEET ontology is a better candidate for future development, as in [22], where a coarse domain knowledge graph (i.e. SWEET) could be used to construct a more specific fine domain KG (i.e. a KG for the climate change research domain).
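The coverage computation amounts to a case-insensitive set lookup. The sketch below uses toy labels; the function name `coverage` and the example terms are assumptions, while the reported percentages are recomputed from the counts given in the text.

```python
def coverage(core_terms, ontology_labels):
    """Return the matched terms and the share of core terms they represent."""
    labels = {label.casefold() for label in ontology_labels}
    matched = {term for term in core_terms if term.casefold() in labels}
    return matched, len(matched) / len(core_terms)


# Toy example: two of three core terms are found, ignoring case.
matched, share = coverage(
    {"El Niño", "albedo", "Barber"},
    {"el niño", "Albedo", "glacier"},
)

# Percentages reported in the text, recomputed from the raw counts.
sweet_share = round(100 * 375 / 655, 2)
envo_share = round(100 * 117 / 655, 2)
envo_in_sweet = round(100 * 105 / 117, 2)
```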

As elaborated in Section 5.2, we apply the Louvain algorithm for community detection, yielding a total of 59 communities. For each identified community, we designate a representative node as a potential NER type. The community size distribution is as follows: four large communities contain more than 20 nodes, 19 medium-sized communities have between 10 and 20 nodes, and 34 small communities consist of fewer than 10 nodes. Notably, half of the smallest communities are singleton nodes, which are omitted from further processing. Details are listed in Table 6 (Appendix D). Next, we compare the selected 26 communities (i.e. their representative terms) with SWEET, ENVO and EcoLexicon. The comparison results are shown together with the final selected class names (i.e. NER types) in Table 3. This process was carried out by manual examination of two ontologies (SWEET and ENVO) as well as a terminological knowledge base (EcoLexicon). SWEET and EcoLexicon have the best coverage of the 26 representative terms (17 out of 26). Based on the occurrence of representative terms in other knowledge bases, we retain terms that occur at least once, with the exception of Natural Phenomenon, which we retain because we believe it is important for the climate change domain. We also merge several similar classes; in particular, geographic region, geographic location and geographic entity are merged into a single class Location. In this way, we create a final set of 21 NER types with the following classes: Ecosystem, Energy Source, Natural Disaster, Meteorological Phenomenon, Quantity, Astronomical Object, Body of Water, Disease, Location, Measurement Unit, Physical Phenomenon, Chemical, Time Period, Organization, Natural Phenomenon, Field of Study, Mathematical Expression, Measuring Device, Geographical Feature, System and Satellite.

For each NER type, we calculate the number of core entity terms that have a path in the Wikidata subgraph (Section 5.2) to the Wikidata item corresponding to that NER type. The results are presented in Table 4. Note that we allow each term to have paths to multiple representative Wikidata items (NER types). In this way, we also gain insight into possibly redundant classes. The top five class pairs in terms of overlap are: Geographical Feature - Location (77), Field of Study - Quantity (71), Meteorological Phenomenon - Natural Phenomenon (65), Natural Phenomenon - Physical Phenomenon (45) and Field of Study - Physical Phenomenon (37). We can also observe which terms belong to the largest number of classes. The top five are typhoon and tropical cyclone, each with six classes (types), followed by upwelling, cyclone and polar vortex, each with five.
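The pairwise overlap between NER types can be counted from a term-to-types mapping. This is a minimal sketch; the type assignments below are toy values chosen for illustration, not the paper's data, and the function name `pair_overlap` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_overlap(types_of_term):
    """Count, for every pair of NER types, the terms that reach both."""
    counts = Counter()
    for types in types_of_term.values():
        for pair in combinations(sorted(types), 2):
            counts[pair] += 1
    return counts


# Toy term-to-type assignments (hypothetical).
types_of_term = {
    "typhoon": {"Natural Disaster", "Meteorological Phenomenon", "Natural Phenomenon"},
    "polar vortex": {"Meteorological Phenomenon", "Natural Phenomenon"},
    "fjord": {"Geographical Feature", "Location"},
}
overlap = pair_overlap(types_of_term)
most_types = max(types_of_term, key=lambda term: len(types_of_term[term]))
```

Sorting each type set before forming pairs ensures that a pair is always counted under the same key regardless of the order in which types were collected.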

7. Conclusion

This paper proposes a methodology for the discovery of Named Entity Recognition (NER) types tailored to the climate change domain with minimal supervision, leveraging a schema-based bottom-up approach to knowledge graph construction. We use existing resources such as dictionaries [6, 7], which presumably contain domain entities of different granularity, and combine them with a more general knowledge graph (Wikidata) [22, 25] to construct a hierarchy [13] and produce a final set of NER types for the climate change research domain. This process begins with the identification of 655 core climate-change-related terms, sourced from authoritative domain-specific resources. These terms are then semi-automatically aligned with Wikidata to take advantage of its hierarchical structure. The weighted Louvain algorithm is applied for community detection on a neighbourhood graph constructed from the instance of (P31) and subclass of (P279) Wikidata properties. The resulting 59 communities are manually analyzed to derive a final set of 21 NER types in the climate change domain, including Ecosystem, Energy Source, Natural Disaster, Meteorological Phenomenon, and Chemical.

Validation against existing ontologies and a terminological knowledge base (SWEET, ENVO, and EcoLexicon) reveals that the SWEET ontology provides the highest coverage, containing 57.25% of core terms. Similarly, SWEET also demonstrates strong alignment with the candidate NER types, covering 17 out of 26 types (65.38%). The final set of 21 NER types for the climate change research domain includes: Ecosystem, Energy Source, Natural Disaster, Meteorological Phenomenon, Quantity, Astronomical Object, Body of Water, Disease, Location, Measurement Unit, Physical Phenomenon, Chemical, Time Period, Organization, Natural Phenomenon, Field of Study, Mathematical Expression, Measuring Device, Geographical Feature, System, and Satellite. Finally, we report the occurrence frequency of core entities in the climate change research corpus. The cutoff threshold of 10 occurrences is an indicator that the corpus will be well suited for downstream training of a domain NER model. The findings demonstrate that refining a broad and diverse set of entities into a finite set of well-defined NER types can contribute to alignment with existing climate ontologies and subsequently to automated climate change knowledge graph construction.

8. Limitations and Future Work

As described in Section 5.2, we construct a neighbourhood graph based on two Wikidata properties: instance of (P31) and subclass of (P279). This construction assumes that Wikidata is complete, i.e. if information on these two relations is not available in the Wikidata knowledge graph, the affected terms remain unused and thus potentially impact the overall quality of the results. Some exemplary terms from our core entity set that have neither P31 nor P279 properties are absolute humidity, Action for climate empowerment, Shortwave radiation and pressure jump. This problem can be tackled in two ways: firstly, by manually adding the missing Wikidata hierarchical properties (relations), thereby contributing to a valuable community-maintained resource, and secondly, by exploring other hierarchical relations such as part of (P361), has part (Q65964571), facet of (P1269) and broader concept (P4900). Incorporating these alternative properties could enhance the representation of hierarchical structures for a given domain.

Additionally, the results are potentially sensitive to parameter choices, such as the granularity parameter (set to 1) and the weighting of the instance of (1.0) and subclass of (0.5) relations in the weighted Louvain algorithm. Exploring alternative granularity values or different weighting schemes may lead to different community detection results and consequently to different NER types. The introduction of additional hierarchical relations further amplifies this sensitivity.
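To make this sensitivity concrete, the following minimal sketch computes the weighted modularity that Louvain-style methods optimise, using the relation weights reported above. It is not the full Louvain algorithm and not our implementation: the function name, the toy edges and the toy partitions are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Relation weights as reported in the text; everything else is illustrative.
RELATION_WEIGHTS = {"P31": 1.0, "P279": 0.5}


def weighted_modularity(edges, communities, resolution=1.0):
    """Modularity of a partition of an undirected, weighted graph.

    edges: iterable of (node_u, node_v, relation) triples
    communities: dict mapping each node to a community label
    resolution: granularity parameter (1.0 gives standard modularity)
    """
    strength = defaultdict(float)  # weighted degree of each node
    internal = defaultdict(float)  # edge weight inside each community
    total = 0.0                    # total edge weight of the graph
    for u, v, relation in edges:
        w = RELATION_WEIGHTS[relation]
        total += w
        strength[u] += w
        strength[v] += w
        if communities[u] == communities[v]:
            internal[communities[u]] += w

    community_strength = defaultdict(float)
    for node, s in strength.items():
        community_strength[communities[node]] += s

    return sum(
        internal[c] / total
        - resolution * (community_strength[c] / (2 * total)) ** 2
        for c in community_strength
    )


# Hypothetical toy graph: one instance-of edge and one subclass-of edge.
toy_edges = [("a", "b", "P31"), ("c", "d", "P279")]
split = {"a": 0, "b": 0, "c": 1, "d": 1}
merged = {"a": 0, "b": 0, "c": 0, "d": 0}
q_split = weighted_modularity(toy_edges, split)
q_merged = weighted_modularity(toy_edges, merged)
```

Changing the values in `RELATION_WEIGHTS` or the `resolution` argument changes the score of each candidate partition, which is why different settings can yield different communities.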

Finally, for future work, we plan to integrate the GLiNER model [36] with our generated NER types. This integration will facilitate the labeling of a larger corpus within the climate change research domain, further refining entity classification and improving automated knowledge extraction.

Acknowledgments

The Croatian Science Foundation supports AP under the project DOK-2021-02.

Declaration on Generative AI

During the preparation of this work, the authors used InstaText to improve grammar, check spelling and reword. After using this tool, the authors have reviewed and edited the content as needed and take full responsibility for the content of the publication.

open data, in: K. Aberer, K.-S. Choi, N. Noy, D. Allemang, K.-I. Lee, L. Nixon, J. Golbeck, P. Mika, D. Maynard, R. Mizoguchi, G. Schreiber, P. Cudré-Mauroux (Eds.), The Semantic Web, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 722–735.

[18] F. M. Suchanek, G. Kasneci, G. Weikum, Yago: A core of semantic knowledge, in: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, ACM, New York, NY, USA, 2007, pp. 697–706. URL: http://doi.acm.org/10.1145/1242572.1242667. doi:10.1145/1242572.1242667.

[19] R. Navigli, S. P. Ponzetto, Babelnet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network, Artif. Intell. 193 (2012) 217–250. URL: https://api.semanticscholar.org/CorpusID:6063065.

[20] Z. Zhao, S.-K. Han, I.-M. So, Architecture of knowledge graph construction techniques, 2018. URL: https://api.semanticscholar.org/CorpusID:207900787.

[21] H. Wu, S. Y. Liu, W. Zheng, Y. Yang, H. Gao, Paintkg: the painting knowledge graph using bilstm-crf, in: 2020 International Conference on Information Science and Education (ICISE-IE), 2020, pp. 412–417. doi:10.1109/ICISE51755.2020.00094.

[22] H. Cai, W. Liao, Z. Liu, X. Huang, Y. Zhang, S. Ding, S. Li, Q. Li, T. Liu, X. Li, Coarse-to-fine knowledge graph domain adaptation based on distantly-supervised iterative training, 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2022) 1294–1299. URL: https://api.semanticscholar.org/CorpusID:253383970.

[23] A. Siu, Fast entity recognition in biomedical text, 2013. URL: https://api.semanticscholar.org/CorpusID:39345437.

[24] J. Frei, F. Kramer, Creating ontology-annotated corpora from Wikipedia for medical named-entity recognition, in: D. Demner-Fushman, S. Ananiadou, M. Miwa, K. Roberts, J. Tsujii (Eds.), Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 570–579. URL: https://aclanthology.org/2024.bionlp-1.47/. doi:10.18653/v1/2024.bionlp-1.47.

[25] A. Lippolis, A. Klironomos, D. Milon-Flores, H. Zheng, A. Jouglar, E. Norouzi, A. Hogan, Enhancing entity alignment between wikidata and artgraph using llms, in: Semantic Web and Ontology Design for Cultural Heritage 2023, volume 3540 of CEUR Workshop Proceedings, CEUR-WS, 2023.

[26] N. De Cao, G. Izacard, S. Riedel, F. Petroni, Autoregressive entity retrieval, in: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, OpenReview.net, 2021. URL: https://openreview.net/forum?id=5k8F6UU39V.

[27] N. Webersinke, M. Kraus, J. A. Bingler, M. Leippold, Climatebert: A pretrained language model for climate-related text, 2022. arXiv:2110.12010.

[28] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).

[29] P. Faber, P. León-Araúz, R. Resi, P. ten Hacken, From specialized knowledge frames to linguistically based ontologies, Applied Ontology 19 (2024) 23–45. URL: https://doi.org/10.3233/AO-230033. doi:10.3233/AO-230033.

[30] P. L. Buttigieg, N. Morrison, B. Smith, C. J. Mungall, S. E. Lewis, the ENVO Consortium, The environment ontology: contextualising biological and biomedical entities, Journal of Biomedical Semantics 4 (2013) 43. URL: https://doi.org/10.1186/2041-1480-4-43. doi:10.1186/2041-1480-4-43.

[31] P. L. Buttigieg, E. Pafilis, S. E. Lewis, M. P. Schildhauer, R. L. Walls, C. J. Mungall, The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation, Journal of Biomedical Semantics 7 (2016) 57. URL: https://doi.org/10.1186/s13326-016-0097-6. doi:10.1186/s13326-016-0097-6.

[32] J. N. Otte, J. Beverley, A. Ruttenberg, Bfo: Basic formal ontology, Applied ontology 17 (2022) 17–43. doi:10.3233/ao-220262.

[33] R. G. Raskin, M. J. Pan, Knowledge representation in the semantic web for earth and environmental terminology (sweet), Computers & Geosciences 31 (2005) 1119–1125. URL: https://www.sciencedirect.com/science/article/pii/S0098300405001020. doi:10.1016/j.cageo.2004.12.004. Application of XML in the Geosciences.

[34] G. Weikum, X. L. Dong, S. Razniewski, F. Suchanek, Machine knowledge: Creation and curation of comprehensive knowledge bases, Found. Trends Databases 10 (2021) 108–490. URL: https://doi.org/10.1561/1900000064. doi:10.1561/1900000064.

[35] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (2008) P10008. URL: http://dx.doi.org/10.1088/1742-5468/2008/10/P10008. doi:10.1088/1742-5468/2008/10/p10008.

[36] U. Zaratiana, N. Tomeh, P. Holat, T. Charnois, GLiNER: Generalist model for named entity recognition using bidirectional transformer, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 5364–5376. URL: https://aclanthology.org/2024.naacl-long.300. doi:10.18653/v1/2024.naacl-long.300.

A. SPARQL queries

Inspired by the Wikidata Entity Search (WES) approach from [25], we construct three Wikidata SPARQL queries for the automatic alignment of Wikidata items to our dictionary terms. For this task, we use the SPARQLWrapper library, which serves as a Python interface to SPARQL endpoints. The three queries (exact match, case-invariant match and substring-based ("contains") match) are listed below.

Listing 1: Exact Match: Exact match SPARQL query used for automatic alignment.

    SELECT ?item ?itemLabel ?itemDescription
      (GROUP_CONCAT(DISTINCT ?itemType; separator=", ") AS ?itemTypes)
      (GROUP_CONCAT(DISTINCT ?itemSubclass; separator=", ") AS ?itemSubclasses)
    WHERE {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                        wikibase:api "EntitySearch" ;
                        mwapi:search "{input_text}" ;
                        mwapi:language "en" .
        ?item wikibase:apiOutputItem mwapi:item .

Listing 2: Case-Invariant Match: Case-invariant match SPARQL query used for automatic alignment.

      }
      OPTIONAL { ?item wdt:P31 ?itemType . ?itemType rdfs:label ?itemTypeLabel . FILTER(lang(?itemTypeLabel) = "en") }
      OPTIONAL { ?item wdt:P279 ?itemSubclass . ?itemSubclass rdfs:label ?itemSubclassLabel . FILTER(lang(?itemSubclassLabel) = "en") }
      OPTIONAL { ?item schema:description ?itemDescription . FILTER(lang(?itemDescription) = "en") }
      OPTIONAL { ?item rdfs:label ?itemLabel FILTER(lang(?itemLabel) = "en") }
      FILTER(regex(?itemLabel, "^{input_text}$", "i"))
    }
    GROUP BY ?item ?itemLabel ?itemDescription
    LIMIT 10

Listing 3: Substring-Based ("contains") Match: Substring-based ("contains") match SPARQL query used for automatic alignment.

    SELECT ?item ?itemLabel ?itemDescription
      (GROUP_CONCAT(DISTINCT ?itemType; separator=", ") AS ?itemTypes)
      (GROUP_CONCAT(DISTINCT ?itemSubclass; separator=", ") AS ?itemSubclasses)
    WHERE {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                        wikibase:api "EntitySearch" ;
                        mwapi:search "{input_text}" ;
                        mwapi:language "en" .
        ?item wikibase:apiOutputItem mwapi:item .
      }
      OPTIONAL { ?item wdt:P31 ?itemType . }        # Retrieve entity type (instance of)
      OPTIONAL { ?item wdt:P279 ?itemSubclass . }   # Retrieve subclass of
      OPTIONAL { ?item schema:description ?itemDescription . FILTER(lang(?itemDescription) = "en") }  # Retrieve description
      OPTIONAL { ?item rdfs:label ?itemLabel FILTER(lang(?itemLabel) = "en") }  # Retrieve labels
      FILTER(CONTAINS(LCASE(?itemLabel), LCASE("{input_text}")))  # Ensure the label contains the input term
    }
    GROUP BY ?item ?itemLabel ?itemDescription
    LIMIT 10
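The three matching rules can also be illustrated locally, without contacting the Wikidata endpoint. The sketch below is a hedged Python emulation of the label filters only: the function names are hypothetical, plain string equality for the exact match is an assumption, and the call to `re.escape` is an added safeguard that the SPARQL listings do not contain.

```python
import re


def label_matches(label, term, mode):
    """Return True if a candidate label matches a dictionary term under `mode`."""
    if mode == "exact":
        # Assumption: exact match means plain string equality.
        return label == term
    if mode == "case_invariant":
        # Mirrors regex(?itemLabel, "^{input_text}$", "i");
        # re.escape is an added safeguard for special characters.
        pattern = re.escape(term)
        return re.fullmatch(pattern, label, flags=re.IGNORECASE) is not None
    if mode == "contains":
        # Mirrors CONTAINS(LCASE(?itemLabel), LCASE("{input_text}")).
        return term.lower() in label.lower()
    raise ValueError(f"unknown mode: {mode}")


def matching_modes(label, term):
    """List every matching rule under which `label` matches `term`."""
    modes = ("exact", "case_invariant", "contains")
    return [mode for mode in modes if label_matches(label, term, mode)]


same = matching_modes("El Niño", "El Niño")
case_only = matching_modes("Stratocumulus", "stratocumulus")
substring_only = matching_modes("stratocumulus cloud", "Stratocumulus")
no_match = matching_modes("wind", "tundra")
```

Each rule is strictly more permissive than the previous one, so a label that matches exactly also matches under the case-invariant and substring rules.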

B. Core Entity Terms

# 222,831 216,572 212,304 208,916 206,552 195,902 188,300 187,837 183,995 178,900 178,387 173,728 165,731 161,853 161,770 158,667 Q1312 155,658 Q1063 153,963 Q1298969 152,585 Q1997 148,561 Q66486514 147,941 Q7561 147,532 Q12453 140,638 Q11663 140,578 Q36933 136,774 Q14524818 136,475 Q175199 135,933 Q568 135,068 Q173582 131,432 Q2325497 130,248 Q29051774 130,137 Q129129 128,726 Q627 128,278 Q13471665 126,526 Q58734 122,509 Q116844065 121,632 Q21893647 120,674 Q42967 118,045 Q123304503 117,212 Q41719 115,864 Q7925 112,331 Q11402 112,030 Q180600 110,184 Q567555 109,919 Q871279 109,532 Q160329 109,012 Q159190 105,901 Q2633778 105,434 Q34027 105,131

Term

APS Stratocumulus sea breeze precipitable water MOC Nevada AMV internet accretion p300 deuterium brightness temperature cloud amount sublimation LLJ European Centre for Medium-Range Weather Forecasts trade winds North Atlantic Oscillation Alkanes PMC tornado storm surge specific heat plankton planetary boundary layer adenovirus desertification Kuroshio CFC Cretaceous power spectrum glia desiccation response time GOME carbon monoxide Hadley Circulation coalescence treeline Gulf Stream monsoon climate photochemistry CGCM nitric oxide Newton cyclogenesis drainage area SPCZ AGL # 5,950 5,873 5,765 5,742 5,718 5,690 5,624 5,560 5,395 5,367 5,284 5,139 5,121 5,091 5,069 5,037 Q160603 5,035 Q1137345 5,027 Q41581 5,018 Q7209090 5,008 Q8081 4,944

Q121742 4,941 Q487756 4,935

Q25367 4,897 Q1757268 4,886

Q193447 4,769 Q183481 4,737

Q53842 4,725 Q23748224 4,630 Q44626 4,629 Q1331626 4,611 Q177105 4,581 Q903071 4,551 Q578372 4,467 Q1425042 4,426 Q2025 4,317 Q338589 4,315 Q2071902 4,292 Q207762 4,233 Q130905 4,221 Q122933063 4,157 Q188651 4,152 Q650994 4,145 Q207843 4,127 Q12438 4,098 Q245472 3,963 Q166620 3,933 Q5977788 3,920 Q323170 3,904 El Niño Divergence front vortex Streamflow climatology

Q7939 Q85900110 Q189796

Q732722 Q29425295

Q52139

Term | QID | #
nitrogen oxides | Q424418 | 3,101
critical point | Q111059 | 3,084
cold pool | Q104862831 | 3,066
Firn | Q828861 | 3,054
Headwaters | Q7376362 | 2,989
LIS | Q128405384 | 2,913
nitrous oxide | Q905750 | 2,871
avalanche | Q7935 | 2,838
tsunami | Q8070 | 2,836
swell | Q185411 | 2,831
World Meteorological Organization | Q170424 | 2,827
phase change | Q185357 | 2,804
Berg | Q8502 | 2,786
sprite | Q904961 | 2,778
Pliocene | Q76259 | 2,768
AOGCM | Q650994 | 2,749
Pacific Decadal Oscillation | Q2033747 | 2,729

Q504027 Q55919789 Q185789 Q994122 Q1540250 Q83216 Q190132 Q1966999 Q3373825 Q33741 Q25261 Q25276 Q677070 Q37129 Q85248618 Q355730 Q79602 Q214700 Q11388 Q922395 dry season 27,064

Term

continental shelf SPC aegypti ice shelf Deconvolution STP Q46966 Q1183700 Q102145 # 2,708 2,655 2,645 2,619 2,595 2,589

Term | QID | #
SSI | Q81382741 | 2,587
Arctic Oscillation | Q674041 | 2,465
SEVIRI | Q117778573 | 2,465
ocean acidification | Q855711 | 2,455
filopodia | Q14859810 | 2,396
Jacobian | Q506041 | 2,287
ONI | Q117235275 | 2,264
Paris Agreement | Q21707860 | 2,224
arid climate | Q190946 | 2,185
GMS | Q2246672 | 2,175
greenhouse effect | Q41560 | 2,175
stratopause | Q205397 | 2,147
TOGA | Q3540622 | 2,134
hydrologic cycle | Q81041 | 2,129
glomeruli | Q909882 | 2,118
NLDN | Q28458090 | 2,100
climate simulation | Q117829810 | 2,090
global radiation | Q1531731 | 2,090
zonal flow | Q219838 | 2,087
photosynthetically active radiation | Q900892 | 2,060
tropical climate | Q135712 | 2,028
inversion layer | Q25615856 | 2,026
low-level jet | Q11850562 | 2,008
synoptic scale | Q1233837 | 1,977
thermohaline circulation | Q463223 | 1,964
ODS | Q16607840 | 1,947
QuikSCAT | Q1734511 | 1,937
Meteosat | Q1429889 | 1,925
Indian Ocean Dipole | Q1574518 | 1,901
laminar flow | Q189452 | 1,878
AABW | Q3913650 | 1,815
continental climate | Q185005 | 1,807
levoglucosan | Q6535767 | 1,789
ozone hole | Q183140 | 1,789
carbon tax | Q288401 | 1,773
foehn | Q12314 | 1,753
melting point | Q15318 | 1,730
nitrogen dioxide | Q207895 | 1,717
ceilometer | Q1027486 | 1,659
convective available potential energy | Q1129355 | 1,591
xenon | Q1106 | 1,586

Q11998503 Q7150 Q1463606 Q4087 Q674041 Q2414143 Q12432978 Q3342203 Q786970 Q167336 Q215915 Q8024052 Q130768 Q62932

Q1584373 Q133900

Q209190 Q185113 Q37477 Q214881 Q42250 Q167980 Q642867

Q2357982 Q38178665

Q7561 Q1757268

Q140588 Q165939

Q2001116 Q4652675 Q179918 Q171293 Q62008854 Q67657988 # 26,844 26,827 26,773 26,492 26,132 25,778 25,384 24,941 24,806 24,776 24,750 24,281 24,038 23,632 23,347 23,267 23,079 22,305 21,961 21,840 21,576 21,408 21,373 21,099 20,878 20,620 20,588 20,533 20,465 sensible heat Q1480581 peat Q184624 CAT Q1101409 Landsat Q849791 influenza Q2840 GPS Q18822 entrainment Q15733549 turbidity Q898574 rainy season Q3117517 PAR Q900892 air mass Q216823 surge Q287381 thermocline Q849599 wet season Q3117517 subsidence Q2091656 hurricane Q34439356 soil temperature Q889769 carbon dioxide dissolution meteorology GOES ablation AMO VOC specific humidity agarose Isoprene zebrafish Holocene radiosonde anticyclone

Sahel

kinetic energy MCS frost hydroxyl water table Cumulus pandemic

Radiance termination Hf visibility

Q1997 Q3133701 Q25261 Q976688 Q322177 Q756835 Q910267 Q2253551 Q390697

Q271943 Q169444

Q25445 Q852817 Q177414 Q66065 Q46276 Q660968 Q4590598 Q104116 Q3342272 Q14189 Q12184 Q1411145 Q23582432 Q15115271 Q654068 # 17,852 17,815 17,445 17,300 17,278 17,214 17,019 16,872 16,787 16,778 16,681 16,675 16,651 16,640 16,550 16,499 16,487 16,480 16,426 16,303 16,188 16,031 15,972 15,801 15,773 15,693 15,396 15,010 15,000 14,764 14,745 14,724 14,589 14,479 14,406 14,254 14,093 14,089 13,943 13,843 13,821 13,809 Q7846140

Q211433 Q1507383 Q182311 Q11120024 Q654 Q421968 Q1063457 Q38178435 Q1048298 Q898732 Q740686 Q846837 Q1468684 Q692163 Q47209908 Q5977788 900 898 882 871 863 853 850 848 845 836 828 818 796 776 766 752 739 Q5133390 737 Q175973 722 Q122574051 706 Q177414 685 Q420436 670 Q756307 670 Q2336098 659 Q80205 646

California Current Q281655

anvil cloud Q1358304 ensemble forecasting Q3433888 heat index Q2141844 Agulhas Current Q398548 Antarctic Circumpolar Cur- Q55828 rent carbon capture and storage Q41491 North Atlantic Current Q211798 hypothermia Q1036696 supercooling Q213659 magnetosphere Q6915 North Atlantic Deep Water Q921070 Atlantic Niño Q4816419 coupled general circulation Q650994 model speleothems time-series analysis planetary scale Mistral Q154507 Q11850042 Q124101881 Q193742 623 621 618 606 601 598 596 593 587 582 560 557 546 524

Q16305538 Q14946524 Q1197111

Q190052 carbon seques- Q15305550 tration AGCM Q650994 ACE Q30717004 return period SAR Lf insolation tundra cloudiness adiabatic radon mantle tilt Skewness CERES gyre CCS NWP half-life

Q2627230 Q740686 Q17156810 Q216973 Q43262 Q830457 Q182453 Q1133 Q101949 Q179745 Q9051521 Q1102659 Q1250263 Q41491 Q837552 Q47270 12,152 11,921 11,745 11,704 11,642 11,108 11,003 10,894 10,639 10,591 10,432 10,417 10,313 10,252 10,221 10,196 10,041 9,972 9,943 9,937 9,856 9,263 9,252 9,179 9,156 9,127 8,881 409 395 394 366 360 348 345

Term

AATSR mass balance model downburst frost heave Northern Annular Mode Maunder Minimum katabatic wind mesoscale convective system Q660968 Antarctic Oscillation Q3288815 sudden stratospheric warming Q1583422 bombykol Q425845 gamma radiation olaparib Q11523 Q7083106 global dimming Q211627 Advanced Microwave Sounding Unit Q4686237 Nimbostratus Oceanic Niño Index

Term | QID | #
cut-off low | Q60967643 | 316
plate tectonics | Q7950 | 302
fibrillin-1 | Q17927651 | 299
Global Ozone Monitoring Experiment | Q1425042 | 296
Upper Atmosphere Research Satellite | Q534401 | 287
Loop Current | Q377116 | 275
National Lightning Detection Network | Q28458090 | 253
CYGNSS | Q5198802 | 250
Equatorial Undercurrent | Q1190478 | 248
Tropical Rainfall Measurement Mission | Q2001116 | 240
mesocyclone | Q2002856 | 227
dendroclimatology | Q2294113 | 215
South Equatorial Current | Q1072306 | 202
Benguela Current | Q59676 | 200
ketoconazole | Q407883 | 171
synoptic meteorology | Q130221760 | 157
pollen analysis | Q2737544 | 153
Jason-1 | Q1970012 | 150
COP26 | Q7888355 | 141
Universal Thermal Climate Index | Q30347503 | 137
glaciology | Q52120 | 126
iridescence | Q957208 | 123
turbidity current | Q1756774 | 120

Term biosphere Acetone Cal

Aqua black carbon hydrological cy- Q81041 cle mass spectrome- Q1327691 ter hail Q37602 Terra Q584697 harmonics Q1148098 SOI Q1550887 jet stream

Q202325 Q114591 Q453122 pseudoboehmite Q2115715 red beds Q2065586 dry line Q2742789 Advanced Weather Interac- Q4686330 tive Processing System TOTAL: 36,516,003 85 67 63 49 12

C. Node Depth and Node Height

Building upon the examples provided in this work, we consider five initial Wikidata terms: mistral, katabatic wind, jet stream, sea breeze, and westerlies. We perform a recursive search with a maximum height of 2 (two hops upward along instance of (P31) and subclass of (P279)) and a maximum depth of 1 (one hop downward along these relations).

For example, starting from jet stream, we identify air current as a one-hop neighbour. In turn, wind is a one-hop neighbour of air current, reaching the two-hop limit. Conversely, in the opposite direction (where jet stream is the object of P31 or P279 relations), we find jet streak as a direct neighbour. This procedure is applied to all starting terms, producing the following exemplary results:

a) (-1) jet streak -> (0) jet stream -> (1) air current -> (2) wind
b) (-1) ______ -> (0) mistral -> (1) katabatic wind -> (2) fall wind
c) (-1) mistral -> (0) katabatic wind -> (1) fall wind -> (2) air current
d) (-1) Sundowner -> (0) sea breeze -> (1) wind -> (2) meteorological phenomena
e) (-1) Shrieking Sixties -> (0) westerlies -> (1) west wind -> (2) wind
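The bounded search for the jet stream example can be sketched as follows. This is an illustrative, iterative variant over a hard-coded toy edge list, not the code used in this work: the names `TOY_EDGES` and `neighbourhood` are hypothetical, and the P31 and P279 relations are collapsed into generic (child, parent) pairs.

```python
from collections import defaultdict

# Toy (child, parent) pairs: each child is an instance or subclass of its parent.
TOY_EDGES = [
    ("jet streak", "jet stream"),
    ("jet stream", "air current"),
    ("air current", "wind"),
    ("wind", "meteorological phenomena"),
]


def neighbourhood(start, edges, max_height=2, max_depth=1):
    """Map every node reachable within the hop limits to its signed level."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
        children[parent].add(child)

    levels = {start: 0}

    def walk(adjacency, sign, limit):
        # Expand hop by hop, keeping the first level at which a node is seen.
        frontier = [start]
        for hop in range(1, limit + 1):
            next_frontier = []
            for node in frontier:
                for neighbour in sorted(adjacency[node]):
                    if neighbour not in levels:
                        levels[neighbour] = sign * hop
                        next_frontier.append(neighbour)
            frontier = next_frontier

    walk(parents, +1, max_height)   # upward, towards more general classes
    walk(children, -1, max_depth)   # downward, towards more specific items
    return levels


jet_stream_levels = neighbourhood("jet stream", TOY_EDGES)
```

With the default limits, meteorological phenomena is not reached from jet stream because it lies three hops upward.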

From this limited set of terms, we can compute each node’s overall height as the average of all depths (or heights) at which it appears. For example, consider the node katabatic wind, which appears as a starting term at height 0 (example c) and as a one-hop neighbour at height 1 (example b). Its overall height is thus calculated as (0 + 1) / 2 = 0.5.
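The averaging step can be reproduced over the five exemplary chains with a short sketch. The names `CHAINS` and `average_heights` are hypothetical, and the empty slot at level -1 of the mistral chain is encoded as `None`.

```python
from collections import defaultdict

# The five exemplary chains, each listed from level -1 up to level 2.
CHAINS = [
    ["jet streak", "jet stream", "air current", "wind"],
    [None, "mistral", "katabatic wind", "fall wind"],
    ["mistral", "katabatic wind", "fall wind", "air current"],
    ["Sundowner", "sea breeze", "wind", "meteorological phenomena"],
    ["Shrieking Sixties", "westerlies", "west wind", "wind"],
]


def average_heights(chains, first_level=-1):
    """Average the levels at which each node appears across all chains."""
    observed = defaultdict(list)
    for chain in chains:
        for level, node in enumerate(chain, start=first_level):
            if node is not None:  # skip empty slots
                observed[node].append(level)
    return {node: sum(levels) / len(levels) for node, levels in observed.items()}


heights = average_heights(CHAINS)
```

Here `heights["katabatic wind"]` evaluates to 0.5, matching the calculation above, while wind, which appears at levels 2, 1 and 2, receives an average of 5/3.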

D. Louvain Algorithm Results

E. Neighbourhood Graph

[1] Abbass, M. Z. Qasim, Song, Murshed, Mahmood, I. Younis, A review of the global climate change impacts, adaptation, and sustainable mitigation measures, Environmental Science and Pollution Research 29 (2022) 42539–42559. URL: https://doi.org/10.1007/s11356-022-19718-6. doi:10.1007/s11356-022-19718-6.

[2] Saleem, Anwar, Nawaz, Fahad, Saud, T. U. Rahman, M. N. R. Khan, T. Nawaz, Securing a sustainable future: the climate change threat to agriculture, food security, and sustainable development goals, Journal of Umm Al-Qura University for Applied Sciences (2024). URL: https://doi.org/10.1007/s43994-024-00177-3. doi:10.1007/s43994-024-00177-3.

[3] K. L. Ebi, Vanos, J. W. Baldwin, J. E. Bell, D. M. Hondula, N. A. Errett, Hayes, C. E. Reid, Saha, Spector, Berry, Extreme weather and climate change: Population health and health system implications, Annual Review of Public Health 42 (2021) 293–315. doi:10.1146/annurev-publhealth-012420-105026, epub 2021 Jan 6.

Term | QID
Albedo | Q101038
surface temperature | Q56297886
scattering | Q210028
Probe | Q96093522
oscillation | Q170475
p53 | Q14818098
autumn | Q1314
MJO | Q1170041
Nitrate | Q49916468
Stratosphere | Q108376
NAO | Q1137345
boundary layer | Q752193
advection | Q379788