<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/s10115-019-01351-4</article-id>
      <title-group>
        <article-title>Hierarchy for Named Entity Type Discovery in the Climate Change Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrija Poleksić</string-name>
          <email>andrija.poleksic@uniri.hr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanda Martinčić-Ipšić</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics and Digital Technologies (University of Rijeka)</institution>
          ,
          <addr-line>Radmile Matejčić 2, Rijeka, 51000</addr-line>
          ,
          <country country="HR">Croatia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Named Entity Recognition, Information Extraction</institution>
          ,
          <addr-line>Climate Change, Wikidata, Knowledge Graphs, Community</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>5</volume>
      <issue>2025</issue>
      <fpage>1254</fpage>
      <lpage>1257</lpage>
      <abstract>
        <p>Named Entity Recognition (NER) is a fundamental task in information extraction, yet general-purpose NER categories often fail to capture the specificity required for specialized domains such as climate change research. This paper presents a methodology for the automatic construction of a domain-specific NER type set with minimal supervision, leveraging a schema-based bottom-up approach to knowledge graph construction. The process begins with the identification of 655 core climate change-related terms, sourced from authoritative domain-specific resources. These terms are then semi-automatically aligned with Wikidata using SPARQL queries to take advantage of its hierarchical structure. A neighbourhood graph is constructed based on the instance of (P31) and subclass of (P279) properties, forming the basis for community detection via the weighted Louvain algorithm. The resulting 59 communities are manually analyzed to derive a final set of 21 NER types, including Energy Source, Natural Disaster, Meteorological Phenomenon, and Chemical. Validation against existing ontologies and a terminological knowledge base (SWEET, ENVO, and EcoLexicon) reveals that the SWEET ontology provides the highest coverage, containing 57.25% of core terms and 65.38% of the proposed NER types. The findings demonstrate that integrating knowledge graphs, NLP-based information extraction, and community detection provides an effective approach for domain-specific NER schema construction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
the knowledge graph itself [11, 12]. When building a KG, it is desirable to define classes or types of
(named) entities and relations. For example, cumulonimbus and stratocumulus could be grouped under
the class clouds; further, clouds and the entity El Niño Southern Oscillation could be defined as elements
of the class meteorological phenomenon - [METP]. With regard to the newly defined class, it is possible
to set specific restrictions for individual relations, e.g. for the relation "causes" a restriction ([METP],
causes, [METP]) can be set.</p>
      <p>Named entity recognition (NER) is an information extraction (IE) component that plays a fundamental
role in the automated analysis of scientific literature [13, 14]. Traditionally framed as a sequence labeling
task, NER aims to assign predefined entity types - such as location, organization, and person - to text
spans. However, such coarse-grained categories are often insufficient to capture the domain-specific
nuances required for specialized domains such as climate change research. To address this issue, this
work focuses on refining NER for the automatic construction of KGs from textual data in the climate
change domain. We utilize existing resources (i.e. climate change terminology dictionaries) to develop
a domain-specific set of NER types that are consistent with the Wikidata type terminology [15]. Our
approach grounds the derived entity types in a corpus of scientific publications in the climate change
domain curated by [16] to ensure consistency with real-world climate change research discourse.</p>
      <p>Specifically, the contributions are:
• An NER type discovery methodology for a selected domain (e.g. climate change) with minimal
supervision;
• A derived set of NER types for the climate change domain;
• An alignment of the derived entity types with Wikidata, supported by coverage in existing climate
change domain ontologies.</p>
      <p>The paper is structured as follows. Section 2 discusses the principles of KG construction with a focus
on the construction of domain-specific KGs and the associated problems. Section 3 covers related work discussing
the use of existing resources (dictionaries and KGs) for various information extraction tasks with an
emphasis on NER. Section 4 discusses existing NLP resources in the climate change domain that can be
utilized. Section 5 follows with the entity type discovery methodology, in particular the creation of a core
entity set for climate change and the use of the Wikidata hierarchy for (named) entity type discovery.
In Section 6 the results are presented. We conclude with Section 7 and discuss the limitations and
future work in Section 8.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Knowledge Graph Construction</title>
      <p>The creation of general, comprehensive, encyclopedic knowledge graphs is a long-term and continuous
process that requires a large amount of resources and traditionally relies on scientific research
results and projects based on community collaboration. Examples of such knowledge graphs are
DBpedia [17] (2007), YAGO [18] (2007), BabelNet [19] (2012), and Wikidata [15] (2014), currently the
largest knowledge graph with 114,097,305 nodes and 24,190 active users<sup>1</sup>.</p>
      <p>In the work of Abu-Salih [9], the creation of a knowledge graph is divided into a schema-based,
a schema-free and a hybrid approach, of which the first approach is applicable for the aims of this
research. In addition, the schema-based approach can be realized based on two strategies: bottom-up
and top-down [10, 20]. The top-down approach implies the initial construction of an ontology/schema
or the use of an existing schema and the extraction of knowledge based on a given schema. An example
of this approach is the YAGO knowledge graph with strictly defined, non-redundant types of entities
and relations and logical constraints on them. In the bottom-up approach, the focus of creation is on the
content itself, i.e. the data. Potential entities and relations are first extracted, and the initial knowledge
graph schema or ontology is created based on the extracted data. Tamašauskaitė and Groth [10], in a
systematic review of 57 scientific papers on the process of creating knowledge graphs, find that 70% of
the papers describe a bottom-up approach, an approach that corresponds to the current data-centric
trend that we follow in our research as well.
1 https://www.wikidata.org/wiki/Wikidata:Statistics</p>
      <p>So far, only encyclopedic (i.e. cross-domain) knowledge graphs (e.g. Wikidata, DBpedia and YAGO)
have been mentioned, but there are also increasingly popular domain-specific knowledge graphs such as
KnowLife [4], PaintKG [21] and CS-KG [5] in the fields of health, art and computer science respectively.
The creation of knowledge graphs for the selected domain encounters domain-specific challenges in
addition to the general problems of building knowledge graphs:
• Complexity of domain terminology - a specific domain usually has a specialized vocabulary
and technical terms that are not correctly represented in multi-domain (general) knowledge bases;
• The need for expert domain knowledge - for the evaluation and validation of knowledge
graphs, it is necessary to ensure a domain expert evaluation, and expertise is also required when
creating the schema/ontology of the knowledge graph itself;
• Limitations of existing models for information extraction - specific domains have their
specific entities and relations, which general models fail to extract (i.e. they have not learned the
domain-specific relations and entities and are not capable of distinguishing nuanced meanings of
domain phrases);
• Lack of domain ontology - usually, in a specific domain, there is no clearly defined ontology,
which makes it difficult to structure and organize the knowledge graph schema. Without an
established domain ontology, it becomes difficult to define relations between entities while ensuring
consistency and enabling coherent integration of new information.</p>
      <p>To overcome these challenges, automating the construction of domain knowledge graphs, in particular by
developing NLP (natural language processing) methods for information extraction, plays a central role.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Related Work</title>
      <p>The automation of knowledge graph construction is based on unsupervised and/or semi-supervised
information extraction procedures, reducing the need for time-consuming and expensive manual data
labeling. When building a domain knowledge graph, it is necessary to utilize existing (digital) resources
to automate the process and reduce the amount of manual labeling.</p>
      <p>Thus, Cai et al. [22] use an existing, more general (coarse), medical domain knowledge graph to
create a specific (fine) knowledge graph for the oncology domain. The authors address three types of
triples: overlapping triples, where both the coarse and fine domain KGs contain certain triples; triples
of new relations but overlapping entities, where the fine domain KG includes both entities but lacks
the relation between them; and triples of new entities, where at least one entity does not exist in the
coarse KG. To tackle coarse-to-fine KG domain adaptation, they propose an end-to-end KG domain
adaptation (KGDA) framework using distant supervision. This framework enables the construction
of a KG from fully unlabeled raw text data under the guidance of an existing KG. While this system
provides promising results, it relies on the assumption that both KGs have the same types of entities
and relations.</p>
      <p>Wang et al. [6] use a dictionary and classification of terminology from the geology and mineral
resources domain and create a directed graph based on the frequency of bigrams and the order of words
in the sentence.</p>
      <p>Yuan et al. [7] argue that most existing knowledge graph construction methods are based on large
knowledge graphs or existing extensive ontologies/taxonomies, and therefore use the available UMLS
thesaurus [23], based on which they recognize domain entities. High-frequency pairs of entities in
sentences become potential facts (i.e. triples: entity - relation - entity) for which latent groups (clusters)
of relation types are obtained using contextualized embeddings. The clusters of potential relation types
are then manually labeled. This significantly reduces annotation cost without loss of quality (instead of
labeling each instance of relations, the entire group or all instances of a type are labeled simultaneously).</p>
      <p>Frei and Kramer [24] integrate Wikipedia<sup>2</sup> and Wikidata to systematically extract text data and
annotation information for Named Entity Recognition (NER). Their approach utilizes the graph relations
(properties) of Wikidata to derive NER types. In particular, they use properties such as P2176 (drug or
therapy used for treatment) to identify entities - e.g. diseases with known treatments - and assign them
the NER type TREATABLE_HEALTH_ISSUE. This method shows how structured knowledge graphs can
be effectively used to generate domain-specific NER categories and improve the annotation of entities
in specialized corpora.</p>
      <p>Lippolis et al. [25] introduce two approaches for entity alignment between ArtGraph and Wikidata.
The first method, Wikidata Entity Search (WES), uses simple SPARQL queries to establish entity
correspondences. The second approach, pArtLink, leverages the generative capabilities of large language
models in conjunction with established entity-linking techniques such as GENRE [26] and Wikimapper<sup>3</sup>
to increase alignment accuracy. ArtGraph, a domain-specific knowledge graph created from WikiArt
and DBpedia, encapsulates structured representations of concepts related to works of art.</p>
      <p>Nie et al. [13] present the Know-Adapter framework for few-shot NER. The authors emphasize the
benefits of incorporating explicit knowledge from external sources, such as knowledge graphs, while
addressing the heterogeneity between knowledge graph entity types and NER types. Specifically, for a
given mention in a sentence, they build a retriever to find its closest match in Wikidata. They then
construct a 3-hop subgraph around the matched entity by traversing Wikidata properties (relations).
This approach creates a structured mapping from multiple Wikidata entities that differ in specificity to a
single NER type and utilizes the Wikidata hierarchy to improve entity type classification. In contrast to
their approach, which expands entity types to improve few-shot entity classification, our research
focuses on the compression and standardization of entity types by refining a broad and diverse set
of entities into a finite set of well-defined NER types. Specifically, we aim to create a structured and
domain-relevant taxonomy of climate change research that ensures consistency and usability in
automated knowledge graph construction.</p>
      <p>Inspired by these lines of research, we use existing resources such as dictionaries [6, 7], which
presumably contain domain entities of different granularity, and combine them with a more general
knowledge graph (Wikidata) [22, 25] to construct a hierarchy [13] and produce a final set of NER types
for the climate change research domain.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Existing Resources</title>
      <p>As discussed in Section 3, knowledge-intensive research benefits from available resources. In this sense,
this section looks at existing sources that have been used directly or as a reference point in this research,
especially existing domain dictionaries, terminologies and ontologies.</p>
      <p>The Full Weather Glossary<sup>4</sup> from the National Oceanic and Atmospheric Administration (NOAA) - National
Weather Service (NWS) contains a total of 355 terms with definitions. There is also an extension of
this glossary with more than 2000 terms, phrases and abbreviations used by the NWS<sup>5</sup>. The Glossary of
Meteorology<sup>6</sup> from the American Meteorological Society (AMS) is the authoritative source for definitions of
meteorological terms. From the AMS and NWS glossaries we have extracted a total of 9,511
climate-change related terms and corresponding definitions.</p>
      <p>Webersinke et al. [27] expand the vocabulary when pretraining their models: they add a list of 255
terms<sup>7</sup> (tokens) with the highest frequency in their climate-change related pretraining corpus to the
original DistilRoBERTa [28] vocabulary. We add these 255 terms to our dictionary of climate-change
related terms.
2 https://www.wikipedia.org/
3 https://github.com/jcklie/wikimapper
4 https://www.weather.gov/otx/Full_Weather_Glossary
5 https://forecast.weather.gov/glossary.php?
6 https://glossary.ametsoc.org/wiki/Welcome
7 https://huggingface.co/climatebert/distilroberta-base-climate-f</p>
      <p>Reimerink et al. [8] construct a new multilingual terminological knowledge base (TKB) for
environmental science - EcoLexicon<sup>8</sup>. The construction of EcoLexicon began in 2003 with a core list of
794 environmental terms in Spanish and English. For each term, definitions were elaborated, reflecting
the level of generality or specificity of the concept as well as its relations with other concepts within the
same knowledge domain. The original list of terms was enriched by the addition of new terms as well
as by its transformation into a conceptual network. Currently, EcoLexicon contains 4,654 concepts of
environmental science and 24,968 terms in eight languages (English, Spanish, German, French, Dutch,
Modern Greek, Russian and Arabic) [29]. The EcoLexicon data includes concepts, terms, and semantic
relations organized within a frame-like structure called the Environmental Event.</p>
      <p>The Environment Ontology (ENVO)<sup>9</sup> is a community-driven ontology that supports the representation
of environments beyond the biological and biomedical domains [30, 31]. ENVO consists of classes
(terms) that refer to the main types of environments and can facilitate the retrieval and integration of a
wide range of biological data. The authors follow the principles of the Open Biomedical and Biological
Ontologies (OBO) Foundry and align their ontology with the Basic Formal Ontology (BFO) [32]. ENVO
consists of 7,030 classes (terms), such as ENVO’s biome, environmental feature, and environmental
material hierarchies – the ontology’s most developed branches and of the greatest interest to annotators.
Recently, when adapting to BFO, some of the hierarchies were revised and made obsolete, such as
environmental features.</p>
      <p>Semantic Web for Earth and Environmental Terminology (SWEET)<sup>10</sup> [33] is a highly modular ontology
suite with 10,239<sup>11</sup> concepts (classes) in 200 separate ontologies covering Earth system science. SWEET
is a mid-level ontology and consists of nine top-level concepts that can be used as a foundation for
deriving domain-specific ontologies that start from extending these top-level SWEET components.</p>
      <p>In [16] we elaborate upon our climate research corpus, consisting of research papers from renowned
journals on climate change, that we use in this work. There, we reported an exploratory prestudy in which we
applied a readily available NER model and a POS tagger from flair<sup>12</sup> to a sample of 10,000 research
papers (∼5% of the corpus). With the insights gained from this preliminary experiment, we
decided to experiment with LLM-assisted annotation, in particular using Phi-3-mini-4k-instruct<sup>13</sup>
deployed locally for a sentence-level triple extraction task.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Entity Discovery</title>
      <sec id="sec-6-1">
        <title>5.1. Core Entity Set</title>
        <p>Building upon authoritative sources, including the Full Weather Glossary, the Glossary of Meteorology,
Wikipedia glossaries and term expansions in ClimateBERT (dictionary), as well as our prior research [16],
which includes NER results (NER), exploratory LLM-based annotations (Phi3) and extracted keywords
(keywords), we systematically construct a core entity set for the climate change domain. This selection
process is based on a majority overlap criterion that requires an exact match of at least three out of
four sources. In the initial experiments, we included POS tagging results (POS), treating noun phrases
as candidate entity terms. However, this approach resulted in a noisy set of instances that did not
contribute to the expansion of the core set; therefore, POS-derived votes are excluded. In refinement
steps, we experimented with different overlap ratios and case sensitivity. Ultimately, with a majority
(three out of four) of votes, we settled on a case-sensitive overlap strategy that balances corpus-driven
entity selection (NER, Phi3 and keywords) with the integration of terminologies from authoritative
sources (dictionary).</p>
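        <p>The majority overlap criterion can be sketched as a simple vote count. The snippet below is a minimal illustration, not the authors' implementation: the function name core_terms, the parameter min_votes and the toy term lists are assumptions introduced here, while the four source names and the three-out-of-four, case-sensitive exact-match rule follow the description above.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def core_terms(sources, min_votes=3):
    """Return terms found (exact, case-sensitive match) in at least min_votes sources."""
    votes = Counter()
    for terms in sources.values():
        # Each source votes at most once per term, so duplicates are collapsed first.
        votes.update(set(terms))
    return {term for term, count in votes.items() if count >= min_votes}


# Toy inputs: the source names follow the text, the terms are invented for illustration.
sources = {
    "dictionary": ["jet stream", "El Niño", "albedo"],
    "NER": ["jet stream", "El Niño", "Rijeka"],
    "Phi3": ["jet stream", "el niño", "albedo"],
    "keywords": ["jet stream", "albedo", "El Niño"],
}
selected = core_terms(sources)
```
]]></preformat>
        <p>Because the match is case-sensitive, the lower-case variant "el niño" in the toy Phi3 list counts as a separate term and does not add a vote to "El Niño".</p>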
        <p>This process results in a set of 818 core terms, which subsequently undergo cleaning and deduplication.
After removing duplicates, 766 unique terms remain. These terms are then validated against the entire
corpus [16] by computing the occurrence frequency. Terms that occur fewer than 10 times are excluded
from further analysis. This process corresponds to entity detection in phase one of building a knowledge
graph, i.e. the discovery step proposed in [34].
8 https://ecolexicon.ugr.es/en/index.htm
9 https://sites.google.com/site/environmentontology/
10 https://github.com/ESIPFed/sweet
11 https://bioportal.bioontology.org/ontologies/SWEET
12 https://github.com/flairNLP/flair
13 https://huggingface.co/microsoft/Phi-3-mini-4k-instruct</p>
        <p>Next, inspired by [25], we perform an automatic alignment of the core terms with Wikidata using
three SPARQL queries: exact match, case-invariant match, and a substring-based ("contains") query
(see Appendix A). This automated process yields preliminary results, which are then manually curated.
During curation, the results are categorized into four distinct groups: (1) Out of scope: 4 terms; (2)
Requires disambiguation: 144 terms; (3) Manually corrected (fixed item): 255 terms; (4) Good match: 363
terms. We successfully matched 47.39% of the terms with Wikidata using a simple automatic comparison.
The subsequent manual alignment corrects an additional 33.29%, bringing the total number of aligned
terms to 618 (80.68%). For the ambiguous group, we align relevant climate-change related terms from
Wikidata that are similar to the ambiguous entries and add 37 more terms to the set. As a result, we
obtain a final set of 655 core terms aligned with Wikidata items. An example of the alignment is given in
Table 1, with some terms that have an inherent domain-specific contextualization. For instance, the
term Barber, which is conventionally associated with an occupational role, is instead categorized within
the meteorological domain as a specific type of wind.</p>
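        <p>To make the three matching strategies concrete, the following sketch builds one query string per strategy for a given term. It is an illustration under stated assumptions and not the set of queries from Appendix A: the helper name build_alignment_queries, the restriction to English labels and the LIMIT value are hypothetical choices, and the unrestricted label filters would probably need further narrowing before they run within the time limits of a public endpoint.</p>
        <preformat><![CDATA[
```python
def build_alignment_queries(term, lang="en", limit=10):
    """Return three illustrative SPARQL query strings for one core term."""
    # Escape backslashes and double quotes so the term is a valid string literal.
    literal = term.replace("\\", "\\\\").replace('"', '\\"')
    lowered = literal.lower()
    head = "SELECT ?item WHERE { "
    tail = f" }} LIMIT {limit}"
    # Exact match: the label must equal the term, including letter case.
    exact = f'{head}?item rdfs:label "{literal}"@{lang} .{tail}'
    # Case-invariant match: compare lower-cased label and lower-cased term.
    case_invariant = (
        f'{head}?item rdfs:label ?label . '
        f'FILTER(LANG(?label) = "{lang}" && LCASE(STR(?label)) = "{lowered}"){tail}'
    )
    # Substring match: the lower-cased label only has to contain the term.
    contains = (
        f'{head}?item rdfs:label ?label . '
        f'FILTER(LANG(?label) = "{lang}" && CONTAINS(LCASE(STR(?label)), "{lowered}")){tail}'
    )
    return {"exact": exact, "case_invariant": case_invariant, "contains": contains}


queries = build_alignment_queries("Barber")
```
]]></preformat>
        <p>A term such as Barber shows why manual curation follows: all three queries can return the occupation as well as the wind, so the automatic result is only a candidate list.</p>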
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Wikidata Subgraph</title>
        <p>Wikidata incorporates several hierarchical (vertical) relations, referred to as properties, such as instance
of (P31) and subclass of (P279). Using the core terms aligned with Wikidata items and these two
relations, we construct a neighbourhood graph. In this graph, for each core term, we identify its
(h, −d)-hop neighbours in each direction, where h, d ∈ ℕ, with h representing height and d representing depth.
Height refers to the number of hops in the abstraction direction (towards the top), while depth refers to
the number of hops in the concretization direction (towards the bottom). Specifically, for each core term,
we recursively search for items that are instances of or subclasses of the given term. Conversely, we
also search for items that the given term is an instance of or a subclass of, based on the P31 and P279
relations. This process enables us to capture the hierarchical structure and the relationships between
the terms within the graph.</p>
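        <p>The neighbourhood search can be read as a bounded breadth-first traversal, run once upwards and once downwards. The sketch below assumes the P31 and P279 statements are already available locally as a child-to-parents mapping; the function name hop_neighbourhood and the toy edges (a simplified, partly invented version of the wind example) are illustrative only.</p>
        <preformat><![CDATA[
```python
from collections import deque


def hop_neighbourhood(start, edges, max_hops):
    """Breadth-first search returning {node: hop distance} within max_hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in edges.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Toy upward edges (child -> parents via P31 or P279), simplified for illustration.
up = {
    "mistral": ["katabatic wind"],
    "katabatic wind": ["fall wind"],
    "fall wind": ["wind"],
    "sea breeze": ["wind"],
    "wind": ["meteorological phenomenon"],
}
# Downward edges are the same relations reversed.
down = {}
for child, parents in up.items():
    for parent in parents:
        down.setdefault(parent, []).append(child)

above = hop_neighbourhood("sea breeze", up, max_hops=2)   # abstraction direction
below = hop_neighbourhood("wind", down, max_hops=1)       # concretization direction
```
]]></preformat>
        <p>Keeping the hop distance for every visited node is convenient, because the same values can later be reused when estimating how abstract a node is.</p>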
        <p>Figure 1 illustrates a neighbourhood graph for five terms - mistral, jet stream, sea breeze, westerlies
and katabatic wind - with height h = 2 and depth d = 1. In this graph, the instance of (P31) relations
are represented by solid lines, while the subclass of (P279) relations are shown with dashed lines. In
this case, the concretization direction is not relevant, as the starting terms (i.e. at level 0) are already
sufficiently specific. However, moving in the direction of abstraction (i.e. towards the top) reveals a
wealth of valuable instances. In particular, the level 2 instance wind serves as a direct abstraction for
two starting terms (sea breeze and westerlies), while indirectly encompassing the remaining three terms
(jet stream via thermal wind, katabatic wind via fall wind and air current, and mistral via katabatic wind).
Wind effectively encapsulates the meaning of all starting terms in this context, suggesting that it
could serve as a representative entity type. A further step in the abstraction, in the form of
meteorological phenomenon, can also be a viable solution. In this way, we proceed to identify potential Named Entity
Recognition (NER) types for the identified core entity set (i.e. 655 detected core terms) by utilising the
hierarchical structure of the Wikidata graph, which guides the discovery of relevant entity categories.
Note that Figure 1 is a simplification of the original structure that would be created from the five terms
used; a full preview is given in Appendix E.</p>
        <p>The hierarchical structure of the Wikidata subgraph is rich and valuable. Still, it contains a large
number of nodes and edges, making it difficult to manually navigate and identify an optimal
representative node (i.e. a Wikidata item) for NER classification. To overcome this challenge, we utilized
Graphia<sup>14</sup>, an open-source visual analytics application designed to facilitate the interpretation of large
and complex datasets. By leveraging Graphia’s graph analysis and transformation capabilities, we refine
the subgraph to improve its interpretability. To achieve this, we apply the following preprocessing
steps:
• Removal of leaf nodes - they are not candidates for NER types;
• Filtering based on node height - removing all nodes with a height of h ≥ 4, since such a height value indicates
a term that is too abstract, e.g. metaclass (Q19478619);
• Removal of nodes with in-degree ≤ 1 - such terms do not contribute to the abstraction.</p>
        <p>The height of a node is determined by its position relative to the initial core terms. Specifically,
for each term, we compute its outgoing h-hop neighbourhood (towards abstraction) using the instance of (P31) and the
subclass of (P279) relations, as well as its incoming d-hop neighbourhood (towards concretization). Each term that appears
in the neighbourhood is assigned a value based on the number of hops from the initial term. These
assigned values are then averaged across occurrences to obtain a measure of overall height, which
quantifies the level of abstraction of a given term (see Appendix C).</p>
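        <p>One possible reading of this averaging step is sketched below; the exact procedure is the one given in Appendix C, which may differ in detail. The function name average_heights, the cut-off constant and the hop counts are assumptions made for illustration, with the height threshold of 4 taken from the preprocessing steps.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def average_heights(hops_per_term):
    """Average, per node, the hop distances observed from each core term."""
    observed = defaultdict(list)
    for hops in hops_per_term.values():
        for node, distance in hops.items():
            observed[node].append(distance)
    return {node: sum(values) / len(values) for node, values in observed.items()}


# Hypothetical upward hop counts for two core terms.
hops = {
    "sea breeze": {"sea breeze": 0, "wind": 1, "meteorological phenomenon": 2},
    "mistral": {"mistral": 0, "katabatic wind": 1, "fall wind": 2, "wind": 3,
                "meteorological phenomenon": 4},
}
heights = average_heights(hops)

MAX_HEIGHT = 4  # nodes with an average height of 4 or more are treated as too abstract
kept = {node for node, height in heights.items() if height < MAX_HEIGHT}
```
]]></preformat>
        <p>In this toy input, wind is one hop from sea breeze and three hops from mistral, so its averaged height is 2.0, and meteorological phenomenon averages 3.0 and stays below the cut-off.</p>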
        <p>After these preprocessing steps, we run a weighted Louvain algorithm [35] with a granularity
parameter set to 1, using edge weights to reflect relation importance. We argue that the instance of
(P31) relation should be considered more significant than the subclass of (P279) relation and therefore assign
them weights of 1.0 and 0.5, respectively. This weighting ensures that the communities formed by the
Louvain algorithm better reflect meaningful entity groupings for NER classification. In this way, we
obtain 59 components (i.e. communities) that are potential NER types for the climate change research
domain. After manual inspection of each community we identified a central node (i.e. the node that has
a high in-degree centrality), with many connected terms abstracting to it. We also favor nodes with a
lower height value whenever possible, as this provides an optimal balance between over-abstraction
and over-specificity. This ensures that the selected node serves as a well-generalized yet meaningful
representative term within its community, making it a suitable candidate for NER type determination.
Examples with the five highest in-degree values in three communities are in Table 2.</p>
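        <p>Louvain is a greedy procedure that searches for a partition with high modularity, so the effect of the edge weights can be illustrated by computing weighted modularity directly. The sketch below is not the Louvain algorithm itself and not the Graphia implementation: it treats the graph as undirected (an assumption), uses the weights 1.0 and 0.5 stated above, and evaluates a hand-made partition of an invented toy graph.</p>
        <preformat><![CDATA[
```python
def weighted_modularity(edges, community_of):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: list of (u, v, weight); community_of: dict mapping node -> community label.
    """
    total = sum(weight for _, _, weight in edges)  # m, the total edge weight
    internal = {}  # summed weight of edges lying inside each community
    degree = {}    # summed weighted degree of the nodes in each community
    for u, v, weight in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0.0) + weight
        degree[cv] = degree.get(cv, 0.0) + weight
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + weight
    return sum(
        internal.get(c, 0.0) / total - (degree[c] / (2 * total)) ** 2
        for c in degree
    )


P31, P279 = 1.0, 0.5  # instance of weighs more than subclass of
edges = [
    ("sea breeze", "wind", P279),
    ("mistral", "wind", P31),
    ("cumulonimbus", "cloud", P279),
    ("stratocumulus", "cloud", P279),
    ("wind", "meteorological phenomenon", P279),
    ("cloud", "meteorological phenomenon", P279),
]
split = {
    "sea breeze": "A", "mistral": "A", "wind": "A", "meteorological phenomenon": "A",
    "cumulonimbus": "B", "stratocumulus": "B", "cloud": "B",
}
merged = {node: "A" for node in split}

q_split = weighted_modularity(edges, split)
q_merged = weighted_modularity(edges, merged)
```
]]></preformat>
        <p>Placing every node in a single community always gives a modularity of zero, so any partition that Louvain accepts has to score above that baseline; heavier P31 edges make it more costly to separate an instance from its class.</p>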
        <p>After acquiring 59 community or cluster representatives, we conducted a manual inspection to refine
the selection. First, we merge similar classes, such as mathematical expression and mathematical concept.
Additionally, we eliminate community representatives that are either overly abstract or unrelated to the
field, including metaclass, telecommunication network and second-order class (see the central row of Table
2). Finally, we review and remove the majority of communities containing only a single instance, as
they do not contribute to the overall classification structure. After this step, we retain 26 representative
terms as potential NER types (classes). In the results (Section 6), we ensure alignment with existing
domain-specific classifications by manually comparing the extracted terms with established ontologies
and terminological knowledge bases, including EcoLexicon, ENVO, and SWEET (see Section 4). Further,
we compute the number of terms occurring in each domain-related KG, and we validate the NER types
by counting the number of instances under each category. Finally, we preview Louvain clustering
results with community statistics.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Results</title>
      <p>As mentioned in Section 5.1, we calculated the frequency of occurrence for 766 unique terms (including
the final 655 core entity terms) in the entire corpus. The top 10 most frequently occurring terms
are water, model, Time, temperature, analysis, precipitation, climate, low, soil and level. The bottom 10
are Advanced Weather Interactive Processing System, dry line, red beds, pseudoboehmite, Tramontana,
geomagnetism, North Greenland Ice Core Project, Advanced Baseline Imager, small hail and pressure jump.
The full list is reported in Table 5 (Appendix B).</p>
      <p>Further, we perform a case-insensitive match of the identified 655 core terms to other ontologies. In
particular, we search for the core terms in two available ontologies, SWEET and ENVO, excluding
EcoLexicon as it is not accessible via an API and cannot be used locally. For the SWEET ontology,
we find a match for 375 core terms (57.25%), and for ENVO we find a match for 117 (17.86%). Of
the 117 terms that match in ENVO, 105 (89.74%) are also in the SWEET ontology. Because SWEET covers
far more core terms, including most of the ENVO matches, it is the better candidate for future development, as in [22], where a
coarse domain knowledge graph (i.e. SWEET) could be used to construct a more specific fine domain
KG (i.e. a KG for the climate change research domain).</p>
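      <p>The coverage figures follow from a case-insensitive set lookup and simple ratios. The sketch below uses a hypothetical helper, coverage, and invented label lists; only the counts 375, 117, 105 and 655 are taken from the text, and the last three lines merely recompute the reported percentages from them.</p>
      <preformat><![CDATA[
```python
def coverage(core_terms, ontology_labels):
    """Case-insensitive exact match of core terms against ontology labels."""
    labels = {label.casefold() for label in ontology_labels}
    matched = {term for term in core_terms if term.casefold() in labels}
    return matched, 100.0 * len(matched) / len(core_terms)


# Invented example data, not taken from SWEET or ENVO.
core = ["Jet stream", "albedo", "small hail", "Tramontana"]
ontology = ["jet stream", "Albedo", "permafrost"]
matched, percent = coverage(core, ontology)

# Recomputing the reported percentages from the reported counts.
sweet_pct = round(100 * 375 / 655, 2)    # core terms found in SWEET
envo_pct = round(100 * 117 / 655, 2)     # core terms found in ENVO
overlap_pct = round(100 * 105 / 117, 2)  # ENVO matches also found in SWEET
```
]]></preformat>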
      <p>As elaborated in Section 5.2, we apply the Louvain algorithm for community detection, yielding
a total of 59 communities. For each identified community, we designate a representative node as a
potential NER type. The community size distribution is as follows: four large communities contain more
than 20 nodes, 19 medium-sized communities have between 10 and 20 nodes, and 34 small communities
consist of fewer than 10 nodes. Notably, half of the smallest communities are singleton nodes, which
are omitted from further processing. Details are listed in Table 6 (Appendix D). Next, we compare the
selected 26 communities (i.e. their representative terms) with SWEET, ENVO and EcoLexicon. The
comparison results are shown together with the final selected class names (i.e. NER types) in Table 3.
This process was carried out by manual examination of two ontologies (SWEET and ENVO) as well
as a terminological knowledge base (EcoLexicon). SWEET and EcoLexicon have the better coverage
of the 26 representative terms (17 out of 26). Based on the occurrence of representative terms in other
knowledge bases, we retain terms that occur at least once, with the exception of Natural Phenomenon,
which we keep regardless because we believe it is important for the climate change domain. We also merge several similar classes; in
particular, geographic region, geographic location and geographic entity are merged into a single class
Location. In this way, we create a final set of 21 NER types with the following classes: Ecosystem,
Energy Source, Natural Disaster, Meteorological Phenomenon, Quantity, Astronomical Object, Body of
Water, Disease, Location, Measurement Unit, Physical Phenomenon, Chemical, Time Period, Organization,
Natural Phenomenon, Field of Study, Mathematical Expression, Measuring Device, Geographical Feature,
System and Satellite.</p>
      <p>For each NER type, we calculate the number of core entity terms that have a path in the Wikidata
subgraph (Section 5.2) to Wikidata items corresponding to that NER type. The results are presented in
Table 4. Note that we allow each term to have paths to multiple representative Wikidata items (NER
types). In this way, we also gain insight into possible redundant classes. The top five class pairs in
terms of overlap are: Geographical Feature - Location (77), Field of Study - Quantity (71), Meteorological
Phenomenon - Natural Phenomenon (65), Natural Phenomenon - Physical Phenomenon (45) and Field of
Study - Physical Phenomenon (37). On the other hand, we can also observe the terms that belong to the largest
number of classes. The top five are typhoon and tropical cyclone, with six classes each, and
upwelling, cyclone and polar vortex, with five classes (types) each.</p>
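      <p>A minimal sketch of how such overlaps could be counted, assuming each term has already been mapped to the set of NER types it reaches. The assignments below are hypothetical toy data, not the values behind Table 4, and the function name is illustrative.</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical term-to-type assignments; the real ones come from paths in the Wikidata subgraph.
term_types = {
    "typhoon": {"Natural Disaster", "Meteorological Phenomenon", "Natural Phenomenon"},
    "upwelling": {"Natural Phenomenon", "Physical Phenomenon"},
    "polar vortex": {"Meteorological Phenomenon", "Natural Phenomenon"},
}


def pair_overlaps(term_types):
    """Count, for every unordered pair of types, the terms assigned to both."""
    counts = Counter()
    for types in term_types.values():
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(types), 2))
    return counts


overlaps = pair_overlaps(term_types)
print(overlaps.most_common(1))

# Terms ranked by the number of types they reach.
ranked = sorted(term_types, key=lambda t: len(term_types[t]), reverse=True)
print(ranked[0])  # typhoon
```

      <p>Pairs with a high count are candidates for merging, while terms that reach many types point to classes that may be too broad.</p>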
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion</title>
      <p>This paper proposes a methodology for discovery of Named Entity Recognition (NER) types tailored to
the climate change domain with minimal supervision, leveraging a schema-based bottom-up approach to
knowledge graph construction. We use existing resources such as dictionaries [6, 7], which presumably
contain domain entities of different granularity, and combine them with a more general knowledge
graph (Wikidata) [22, 25] to construct a hierarchy [13] and produce a final set of NER types for the climate
change research domain. This process begins with the identification of 655 core climate-change related
terms, sourced from authoritative domain-specific resources. These terms are then semi-automatically
aligned with Wikidata to benefit from its hierarchical structure. The weighted Louvain algorithm is
applied for community detection on a neighbourhood graph constructed from the instance of (P31)
and subclass of (P279) Wikidata properties. The resulting 59 communities are manually analyzed to
derive a final set of 21 NER types in the climate change domain, including Ecosystem, Energy Source,
Natural Disaster, Meteorological Phenomenon, and Chemical.</p>
      <p>Validation against existing ontologies and a terminological knowledge base (SWEET, ENVO, and
EcoLexicon) reveals that the SWEET ontology provides the highest coverage, containing 57.25% of
core terms. Similarly, SWEET also demonstrates strong alignment with the candidate NER types,
covering 17 out of 26 types (65.38%). The final set of 21 NER types for the climate change research
domain includes: Ecosystem, Energy Source, Natural Disaster, Meteorological Phenomenon, Quantity,
Astronomical Object, Body of Water, Disease, Location, Measurement Unit, Physical Phenomenon, Chemical,
Time Period, Organization, Natural Phenomenon, Field of Study, Mathematical Expression, Measuring
Device, Geographical Feature, System, and Satellite. Finally, we report the occurrence frequency of core
entities in the climate change research corpus. The cutoff threshold of 10 is an indicator that the corpus
will be well suited for downstream training of a domain NER model. The findings demonstrate that
refining a broad and diverse set of entities into a finite set of well-defined NER types can contribute to
alignment with existing climate ontologies and subsequently to automated climate change knowledge
graph construction.</p>
    </sec>
    <sec id="sec-9">
      <title>8. Limitations and Future Work</title>
      <p>As described in Section 5.2, we construct a neighbourhood graph based on two Wikidata properties
- instance of (P31) and subclass of (P279). This construction is based on the assumption of Wikidata
completeness, i.e. if information on these two relations is not available in the Wikidata knowledge graph,
terms remain unused and thus potentially impact the overall quality of the results. Some exemplary
terms from our core entity set that have neither P31 nor P279 properties are absolute humidity, Action for
climate empowerment, Shortwave radiation and pressure jump. This problem can be tackled in two ways:
firstly, by manually adding the missing Wikidata hierarchical properties (relations), thereby contributing
to a valuable community-maintained resource, and secondly, by exploring other hierarchical relations
such as part of (P361), has part (Q65964571), facet of (P1269) and broader concept (P4900). Incorporating
these alternative properties could enhance the representation of hierarchical structures for a given
domain.</p>
      <p>Additionally, the results are potentially sensitive to parameter choices, such as the granularity
parameter (set to 1) and the weighting of the instance of (1.0) and subclass of (0.5) relations in the
weighted Louvain algorithm. Exploring alternative granularity values or different weighting schemes
may lead to different community detection results and consequently to different NER types. The
introduction of additional hierarchical relations further amplifies this sensitivity.</p>
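      <p>This sensitivity can be illustrated on weighted modularity, the objective that Louvain-style methods optimize. The sketch below assumes an undirected graph and treats the granularity parameter as the usual resolution factor. The edges, the property assigned to each edge and the partition are hypothetical toy data, and the function is an illustration rather than the implementation used in this work.</p>

```python
from collections import defaultdict

# Edge weights stated in the text: instance of (P31) = 1.0, subclass of (P279) = 0.5.
WEIGHTS = {"P31": 1.0, "P279": 0.5}


def modularity(edges, community_of, weights=WEIGHTS, resolution=1.0):
    """Weighted modularity of a partition of an undirected graph.

    edges: iterable of (node_a, node_b, property_id) triples.
    community_of: dict mapping each node to a community label.
    """
    total = 0.0                  # sum of all edge weights
    inside = defaultdict(float)  # weight of edges with both ends in one community
    degree = defaultdict(float)  # summed weighted degree per community
    for a, b, prop in edges:
        w = weights[prop]
        total += w
        degree[community_of[a]] += w
        degree[community_of[b]] += w
        if community_of[a] == community_of[b]:
            inside[community_of[a]] += w
    return sum(
        inside[c] / total - resolution * (degree[c] / (2 * total)) ** 2
        for c in degree
    )


# Hypothetical toy graph; the property chosen for each edge is an assumption.
edges = [
    ("mistral", "katabatic wind", "P31"),
    ("katabatic wind", "fall wind", "P279"),
    ("jet streak", "jet stream", "P279"),
    ("jet stream", "air current", "P279"),
    ("fall wind", "air current", "P279"),
]
partition = {
    "mistral": "A", "katabatic wind": "A", "fall wind": "A",
    "jet streak": "B", "jet stream": "B", "air current": "B",
}
print(round(modularity(edges, partition, resolution=1.0), 3))  # 0.319
print(round(modularity(edges, partition, resolution=2.0), 3))  # -0.194
```

      <p>On this toy graph the same partition scores about 0.319 at resolution 1 and about -0.194 at resolution 2, so the partition preferred by the optimizer can shift when either the resolution or the relation weights change.</p>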
      <p>Finally, for future work, we plan to integrate the GLiNER model [36] with our generated NER types.
This integration will facilitate the labeling of a larger corpus within the climate change research domain,
further refining entity classification and improving automated knowledge extraction.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>Croatian Science Foundation supports AP under the project DOK-2021-02.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used InstaText to improve grammar, check spelling
and reword. After using this tool, the authors have reviewed and edited the content as needed and take
full responsibility for the content of the publication.</p>
      <p>open data, in: K. Aberer, K.-S. Choi, N. Noy, D. Allemang, K.-I. Lee, L. Nixon, J. Golbeck, P. Mika,
D. Maynard, R. Mizoguchi, G. Schreiber, P. Cudré-Mauroux (Eds.), The Semantic Web, Springer
Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 722–735.
[18] F. M. Suchanek, G. Kasneci, G. Weikum, Yago: A core of semantic knowledge, in: Proceedings of
the 16th International Conference on World Wide Web, WWW ’07, ACM, New York, NY, USA, 2007,
pp. 697–706. URL: http://doi.acm.org/10.1145/1242572.1242667. doi:10.1145/1242572.1242667.
[19] R. Navigli, S. P. Ponzetto, Babelnet: The automatic construction, evaluation and application of
a wide-coverage multilingual semantic network, Artif. Intell. 193 (2012) 217–250. URL:
https://api.semanticscholar.org/CorpusID:6063065.
[20] Z. Zhao, S.-K. Han, I.-M. So, Architecture of knowledge graph construction techniques, 2018. URL:
https://api.semanticscholar.org/CorpusID:207900787.
[21] H. Wu, S. Y. Liu, W. Zheng, Y. Yang, H. Gao, Paintkg: the painting knowledge graph using
bilstm-crf, in: 2020 International Conference on Information Science and Education (ICISE-IE),
2020, pp. 412–417. doi:10.1109/ICISE51755.2020.00094.
[22] H. Cai, W. Liao, Z. Liu, X. Huang, Y. Zhang, S. Ding, S. Li, Q. Li, T. Liu, X. Li, Coarse-to-fine
knowledge graph domain adaptation based on distantly-supervised iterative training, 2023 IEEE
International Conference on Bioinformatics and Biomedicine (BIBM) (2022) 1294–1299. URL:
https://api.semanticscholar.org/CorpusID:253383970.
[23] A. Siu, Fast entity recognition in biomedical text, 2013. URL: https://api.semanticscholar.org/CorpusID:39345437.
[24] J. Frei, F. Kramer, Creating ontology-annotated corpora from Wikipedia for medical
named-entity recognition, in: D. Demner-Fushman, S. Ananiadou, M. Miwa, K. Roberts, J. Tsujii (Eds.),
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, Association for
Computational Linguistics, Bangkok, Thailand, 2024, pp. 570–579. URL:
https://aclanthology.org/2024.bionlp-1.47/. doi:10.18653/v1/2024.bionlp-1.47.
[25] A. Lippolis, A. Klironomos, D. Milon-Flores, H. Zheng, A. Jouglar, E. Norouzi, A. Hogan, Enhancing
entity alignment between wikidata and artgraph using llms, in: Semantic Web and Ontology
Design for Cultural Heritage 2023, volume 3540 of CEUR Workshop Proceedings, CEUR-WS, 2023.
[26] N. De Cao, G. Izacard, S. Riedel, F. Petroni, Autoregressive entity retrieval, in: 9th
International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021,
OpenReview.net, 2021. URL: https://openreview.net/forum?id=5k8F6UU39V.
[27] N. Webersinke, M. Kraus, J. A. Bingler, M. Leippold, Climatebert: A pretrained language model for
climate-related text, 2022. arXiv:2110.12010.
[28] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, ArXiv abs/1910.01108 (2019).
[29] P. Faber, P. León-Araúz, R. Resi, P. ten Hacken, From specialized knowledge frames to linguistically
based ontologies, Applied Ontology 19 (2024) 23–45. URL: https://doi.org/10.3233/AO-230033.
doi:10.3233/AO-230033.
[30] P. L. Buttigieg, N. Morrison, B. Smith, C. J. Mungall, S. E. Lewis, the ENVO Consortium, The
environment ontology: contextualising biological and biomedical entities, Journal of Biomedical Semantics
4 (2013) 43. URL: https://doi.org/10.1186/2041-1480-4-43. doi:10.1186/2041-1480-4-43.
[31] P. L. Buttigieg, E. Pafilis, S. E. Lewis, M. P. Schildhauer, R. L. Walls, C. J. Mungall, The environment
ontology in 2016: bridging domains with increased scope, semantic density, and interoperation,
Journal of Biomedical Semantics 7 (2016) 57. URL: https://doi.org/10.1186/s13326-016-0097-6.
doi:10.1186/s13326-016-0097-6.
[32] J. N. Otte, J. Beverley, A. Ruttenberg, Bfo: Basic formal ontology, Applied ontology 17 (2022)
17–43. doi:10.3233/ao-220262.
[33] R. G. Raskin, M. J. Pan, Knowledge representation in the semantic web for earth and
environmental terminology (sweet), Computers &amp; Geosciences 31 (2005) 1119–1125. URL:
https://www.sciencedirect.com/science/article/pii/S0098300405001020.
doi:10.1016/j.cageo.2004.12.004. Application of XML in the Geosciences.
[34] G. Weikum, X. L. Dong, S. Razniewski, F. Suchanek, Machine knowledge: Creation and curation
of comprehensive knowledge bases, Found. Trends Databases 10 (2021) 108–490. URL:
https://doi.org/10.1561/1900000064. doi:10.1561/1900000064.
[35] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large
networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (2008) P10008. URL:
http://dx.doi.org/10.1088/1742-5468/2008/10/P10008. doi:10.1088/1742-5468/2008/10/p10008.
[36] U. Zaratiana, N. Tomeh, P. Holat, T. Charnois, GLiNER: Generalist model for named entity
recognition using bidirectional transformer, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the
2024 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies (Volume 1: Long Papers), Association for Computational
Linguistics, Mexico City, Mexico, 2024, pp. 5364–5376. URL: https://aclanthology.org/2024.naacl-long.300.
doi:10.18653/v1/2024.naacl-long.300.</p>
    </sec>
    <sec id="sec-12">
      <title>A. SPARQL queries</title>
      <p>Inspired by the Wikidata Entity Search (WES) approach from [25], we construct three Wikidata SPARQL
queries for automatic alignment of Wikidata items to our dictionary terms. For this task, we use the
SPARQLWrapper library, which serves as a Python interface to a SPARQL endpoint. The three queries
(exact match, case-invariant match and substring-based ("contains") match) are listed below.</p>
      <p>Listing 1: Exact Match: Exact match SPARQL query used for automatic alignment.</p>
      <preformat>SELECT ?item ?itemLabel ?itemDescription
       (GROUP_CONCAT(DISTINCT ?itemType; separator=", ") AS ?itemTypes)
       (GROUP_CONCAT(DISTINCT ?itemSubclass; separator=", ") AS ?itemSubclasses)
WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                    wikibase:api "EntitySearch" ;
                    mwapi:search "{input_text}" ;
                    mwapi:language "en" .
    ?item wikibase:apiOutputItem mwapi:item .</preformat>
      <p>Listing 2: Case-Invariant Match: Case-invariant match SPARQL query used for automatic alignment.</p>
      <preformat>  }
  OPTIONAL { ?item wdt:P31 ?itemType . ?itemType rdfs:label ?itemTypeLabel . FILTER(lang(?itemTypeLabel) = "en") }
  OPTIONAL { ?item wdt:P279 ?itemSubclass . ?itemSubclass rdfs:label ?itemSubclassLabel . FILTER(lang(?itemSubclassLabel) = "en") }
  OPTIONAL { ?item schema:description ?itemDescription . FILTER(lang(?itemDescription) = "en") }
  OPTIONAL { ?item rdfs:label ?itemLabel FILTER(lang(?itemLabel) = "en") }
  FILTER(regex(?itemLabel, "^{input_text}$", "i"))
}
GROUP BY ?item ?itemLabel ?itemDescription
LIMIT 10</preformat>
      <p>Listing 3: Substring-Based ("contains") Match: Substring-based ("contains") match SPARQL
query used for automatic alignment.</p>
      <preformat>SELECT ?item ?itemLabel ?itemDescription
       (GROUP_CONCAT(DISTINCT ?itemType; separator=", ") AS ?itemTypes)
       (GROUP_CONCAT(DISTINCT ?itemSubclass; separator=", ") AS ?itemSubclasses)
WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                    wikibase:api "EntitySearch" ;
                    mwapi:search "{input_text}" ;
                    mwapi:language "en" .
    ?item wikibase:apiOutputItem mwapi:item .
  }
  OPTIONAL { ?item wdt:P31 ?itemType . }  # Retrieve entity type (instance of)
  OPTIONAL { ?item wdt:P279 ?itemSubclass . }  # Retrieve subclass of
  OPTIONAL { ?item schema:description ?itemDescription . FILTER(lang(?itemDescription) = "en") }  # Retrieve description
  OPTIONAL { ?item rdfs:label ?itemLabel FILTER(lang(?itemLabel) = "en") }  # Retrieve labels
  FILTER(CONTAINS(LCASE(?itemLabel), LCASE("{input_text}")))  # Ensure the label contains the input term
}
GROUP BY ?item ?itemLabel ?itemDescription
LIMIT 10</preformat>
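      <p>The {input_text} placeholder in the listings has to be filled in for every dictionary term before a query is sent. The following is a minimal sketch of that step, not the code used in this work: the function names, the abbreviated template and the endpoint URL are assumptions, and SPARQLWrapper is imported lazily because it is a third-party package that needs network access when the query is actually run.</p>

```python
# Abbreviated template for illustration; the full query text is the substring-based one listed above.
CONTAINS_TEMPLATE = """
SELECT ?item ?itemLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                    wikibase:api "EntitySearch" ;
                    mwapi:search "{input_text}" ;
                    mwapi:language "en" .
    ?item wikibase:apiOutputItem mwapi:item .
  }
  OPTIONAL { ?item rdfs:label ?itemLabel FILTER(lang(?itemLabel) = "en") }
  FILTER(CONTAINS(LCASE(?itemLabel), LCASE("{input_text}")))
}
LIMIT 10
"""


def build_query(template, term):
    """Insert a dictionary term into every {input_text} placeholder.

    str.replace is used instead of str.format because SPARQL itself uses braces.
    Backslashes and double quotes are escaped so the term stays a valid string literal.
    """
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return template.replace("{input_text}", escaped)


def run_query(query, endpoint="https://query.wikidata.org/sparql"):
    """Send the query with SPARQLWrapper and return the result bindings."""
    from SPARQLWrapper import SPARQLWrapper, JSON  # third-party, imported lazily

    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]


query = build_query(CONTAINS_TEMPLATE, "sea breeze")
print(query)
```

      <p>Calling run_query(query) would then return candidate items for manual or automatic selection. For the case-invariant query, a term containing regular expression metacharacters would additionally need regex escaping, which this sketch does not cover.</p>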
    </sec>
    <sec id="sec-13">
      <title>B. Core Entity Terms</title>
      <p>#
222,831
216,572
212,304
208,916
206,552
195,902
188,300
187,837
183,995
178,900
178,387
173,728
165,731
161,853
161,770
158,667
Q1312 155,658
Q1063 153,963
Q1298969 152,585
Q1997 148,561
Q66486514 147,941
Q7561 147,532
Q12453 140,638
Q11663 140,578
Q36933 136,774
Q14524818 136,475
Q175199 135,933
Q568 135,068
Q173582 131,432
Q2325497 130,248
Q29051774 130,137
Q129129 128,726
Q627 128,278
Q13471665 126,526
Q58734 122,509
Q116844065 121,632
Q21893647 120,674
Q42967 118,045
Q123304503 117,212
Q41719 115,864
Q7925 112,331
Q11402 112,030
Q180600 110,184
Q567555 109,919
Q871279 109,532
Q160329 109,012
Q159190 105,901
Q2633778 105,434
Q34027 105,131</p>
      <sec id="sec-13-1">
        <title>Term</title>
        <p>APS
Stratocumulus
sea breeze
precipitable water
MOC
Nevada
AMV
internet
accretion
p300
deuterium
brightness temperature
cloud amount
sublimation
LLJ
European Centre for Medium-Range Weather Forecasts
trade winds
North Atlantic Oscillation
Alkanes
PMC
tornado
storm surge
specific heat
plankton
planetary boundary layer
adenovirus
desertification
Kuroshio
CFC
Cretaceous
power spectrum
glia
desiccation
response time
GOME
carbon monoxide
Hadley Circulation
coalescence
treeline
Gulf Stream
monsoon climate
photochemistry
CGCM
nitric oxide
Newton
cyclogenesis
drainage area
SPCZ
AGL
#
5,950
5,873
5,765
5,742
5,718
5,690
5,624
5,560
5,395
5,367
5,284
5,139
5,121
5,091
5,069
5,037
Q160603 5,035
Q1137345 5,027
Q41581 5,018
Q7209090 5,008
Q8081 4,944</p>
        <p>Q121742 4,941
Q487756 4,935</p>
        <p>Q25367 4,897
Q1757268 4,886</p>
        <p>Q193447 4,769
Q183481 4,737</p>
        <p>Q53842 4,725
Q23748224 4,630
Q44626 4,629
Q1331626 4,611
Q177105 4,581
Q903071 4,551
Q578372 4,467
Q1425042 4,426
Q2025 4,317
Q338589 4,315
Q2071902 4,292
Q207762 4,233
Q130905 4,221
Q122933063 4,157
Q188651 4,152
Q650994 4,145
Q207843 4,127
Q12438 4,098
Q245472 3,963
Q166620 3,933
Q5977788 3,920
Q323170 3,904
El Niño
Divergence
front
vortex
Streamflow
climatology</p>
        <p>Q7939
Q85900110
Q189796</p>
        <p>Q732722
Q29425295</p>
        <p>Q52139
nitrogen oxides Q424418 3,101
critical point Q111059 3,084
cold pool Q104862831 3,066
Firn Q828861 3,054
Headwaters Q7376362 2,989
LIS Q128405384 2,913
nitrous oxide Q905750 2,871
avalanche Q7935 2,838
tsunami Q8070 2,836
swell Q185411 2,831
World Meteorological Organization Q170424 2,827
phase change Q185357 2,804
Berg Q8502 2,786
sprite Q904961 2,778
Pliocene Q76259 2,768
AOGCM Q650994 2,749
Pacific Decadal Oscillation Q2033747 2,729</p>
        <p>Q504027
Q55919789
Q185789
Q994122
Q1540250
Q83216
Q190132
Q1966999
Q3373825
Q33741
Q25261
Q25276
Q677070
Q37129
Q85248618
Q355730
Q79602
Q214700
Q11388
Q922395
dry season
27,064</p>
      </sec>
      <sec id="sec-13-2">
        <title>Term</title>
        <p>continental shelf
SPC
aegypti
ice shelf
Deconvolution
STP
Q46966
Q1183700
Q102145
#
2,708
2,655
2,645
2,619
2,595
2,589
SSI Q81382741 2,587
Arctic Oscillation Q674041 2,465
SEVIRI Q117778573 2,465
ocean acidification Q855711 2,455
filopodia Q14859810 2,396
Jacobian Q506041 2,287
ONI Q117235275 2,264
Paris Agreement Q21707860 2,224
arid climate Q190946 2,185
GMS Q2246672 2,175
greenhouse effect Q41560 2,175
stratopause Q205397 2,147
TOGA Q3540622 2,134
hydrologic cycle Q81041 2,129
glomeruli Q909882 2,118
NLDN Q28458090 2,100
climate simulation Q117829810 2,090
global radiation Q1531731 2,090
zonal flow Q219838 2,087
photosynthetically active radiation Q900892 2,060
tropical climate Q135712 2,028
inversion layer Q25615856 2,026
low-level jet Q11850562 2,008
synoptic scale Q1233837 1,977
thermohaline circulation Q463223 1,964
ODS Q16607840 1,947
QuikSCAT Q1734511 1,937
Meteosat Q1429889 1,925
Indian Ocean Dipole Q1574518 1,901
laminar flow Q189452 1,878
AABW Q3913650 1,815
continental climate Q185005 1,807
levoglucosan Q6535767 1,789
ozone hole Q183140 1,789
carbon tax Q288401 1,773
foehn Q12314 1,753
melting point Q15318 1,730
nitrogen dioxide Q207895 1,717
ceilometer Q1027486 1,659
convective available potential energy Q1129355 1,591
xenon Q1106 1,586
Q11998503
Q7150
Q1463606
Q4087
Q674041
Q2414143
Q12432978
Q3342203
Q786970
Q167336
Q215915
Q8024052
Q130768
Q62932</p>
        <p>Q1584373
Q133900</p>
        <p>Q209190
Q185113
Q37477
Q214881
Q42250
Q167980
Q642867</p>
        <p>Q2357982
Q38178665</p>
        <p>Q7561
Q1757268</p>
        <p>Q140588
Q165939</p>
        <p>Q2001116
Q4652675
Q179918
Q171293
Q62008854
Q67657988
#
26,844
26,827
26,773
26,492
26,132
25,778
25,384
24,941
24,806
24,776
24,750
24,281
24,038
23,632
23,347
23,267
23,079
22,305
21,961
21,840
21,576
21,408
21,373
21,099
20,878
20,620
20,588
20,533
20,465
sensible heat Q1480581
peat Q184624
CAT Q1101409
Landsat Q849791
influenza Q2840
GPS Q18822
entrainment Q15733549
turbidity Q898574
rainy season Q3117517
PAR Q900892
air mass Q216823
surge Q287381
thermocline Q849599
wet season Q3117517
subsidence Q2091656
hurricane Q34439356
soil temperature Q889769
carbon dioxide
dissolution
meteorology
GOES
ablation
AMO
VOC
specific humidity
agarose
Isoprene
zebrafish
Holocene
radiosonde
anticyclone</p>
      </sec>
      <sec id="sec-13-3">
        <title>Sahel</title>
        <p>kinetic energy
MCS
frost
hydroxyl
water table
Cumulus
pandemic</p>
      </sec>
      <sec id="sec-13-4">
        <title>Radiance termination Hf visibility</title>
        <p>Q1997
Q3133701
Q25261
Q976688
Q322177
Q756835
Q910267
Q2253551
Q390697</p>
        <p>Q271943
Q169444</p>
        <p>Q25445
Q852817
Q177414
Q66065
Q46276
Q660968
Q4590598
Q104116
Q3342272
Q14189
Q12184
Q1411145
Q23582432
Q15115271
Q654068
#
17,852
17,815
17,445
17,300
17,278
17,214
17,019
16,872
16,787
16,778
16,681
16,675
16,651
16,640
16,550
16,499
16,487
16,480
16,426
16,303
16,188
16,031
15,972
15,801
15,773
15,693
15,396
15,010
15,000
14,764
14,745
14,724
14,589
14,479
14,406
14,254
14,093
14,089
13,943
13,843
13,821
13,809
Q7846140</p>
        <p>Q211433
Q1507383
Q182311
Q11120024
Q654
Q421968
Q1063457
Q38178435
Q1048298
Q898732
Q740686
Q846837
Q1468684
Q692163
Q47209908
Q5977788
900
898
882
871
863
853
850
848
845
836
828
818
796
776
766
752
739
Q5133390 737
Q175973 722
Q122574051 706
Q177414 685
Q420436 670
Q756307 670
Q2336098 659
Q80205 646</p>
      </sec>
      <sec id="sec-13-5">
        <title>California Current Q281655</title>
        <p>anvil cloud Q1358304
ensemble forecasting Q3433888
heat index Q2141844
Agulhas Current Q398548
Antarctic Circumpolar Current Q55828
carbon capture and storage Q41491
North Atlantic Current Q211798
hypothermia Q1036696
supercooling Q213659
magnetosphere Q6915
North Atlantic Deep Water Q921070
Atlantic Niño Q4816419
coupled general circulation model Q650994
speleothems
time-series analysis
planetary scale
Mistral
Q154507
Q11850042
Q124101881
Q193742
623
621
618
606
601
598
596
593
587
582
560
557
546
524</p>
        <p>Q16305538
Q14946524
Q1197111</p>
        <p>Q190052
carbon sequestration Q15305550
AGCM Q650994
ACE Q30717004
return period
SAR
Lf
insolation
tundra
cloudiness
adiabatic
radon
mantle
tilt
Skewness
CERES
gyre
CCS
NWP
half-life</p>
        <p>Q2627230
Q740686
Q17156810
Q216973
Q43262
Q830457
Q182453
Q1133
Q101949
Q179745
Q9051521
Q1102659
Q1250263
Q41491
Q837552
Q47270
12,152
11,921
11,745
11,704
11,642
11,108
11,003
10,894
10,639
10,591
10,432
10,417
10,313
10,252
10,221
10,196
10,041
9,972
9,943
9,937
9,856
9,263
9,252
9,179
9,156
9,127
8,881
409
395
394
366
360
348
345</p>
      </sec>
      <sec id="sec-13-6">
        <title>Term</title>
        <p>AATSR
mass balance model
downburst
frost heave
Northern Annular Mode
Maunder Minimum
katabatic wind
mesoscale convective system Q660968
Antarctic Oscillation Q3288815
sudden stratospheric warming Q1583422
bombykol Q425845
gamma radiation
olaparib
Q11523
Q7083106
global dimming Q211627
Advanced Microwave Sounding Unit Q4686237
Nimbostratus
Oceanic Niño Index
cut-off low Q60967643 316
plate tectonics Q7950 302
fibrillin-1 Q17927651 299
Global Ozone Monitoring Experiment Q1425042 296
Upper Atmosphere Research Satellite Q534401 287
Loop Current Q377116 275
National Lightning Detection Network Q28458090 253
CYGNSS Q5198802 250
Equatorial Undercurrent Q1190478 248
Tropical Rainfall Measurement Mission Q2001116 240
mesocyclone Q2002856 227
dendroclimatology Q2294113 215
South Equatorial Current Q1072306 202
Benguela Current Q59676 200
ketoconazole Q407883 171
synoptic meteorology Q130221760 157
pollen analysis Q2737544 153
Jason-1 Q1970012 150
COP26 Q7888355 141
Universal Thermal Climate Index Q30347503 137
glaciology Q52120 126
iridescence Q957208 123
turbidity current Q1756774 120</p>
      </sec>
      <sec id="sec-13-7">
        <title>Term biosphere Acetone Cal</title>
        <p>Aqua
black carbon
hydrological cycle Q81041
mass spectrometer Q1327691
hail Q37602
Terra Q584697
harmonics Q1148098
SOI Q1550887
jet stream</p>
        <p>Q202325
Q114591
Q453122
pseudoboehmite Q2115715
red beds Q2065586
dry line Q2742789
Advanced Weather Interactive Processing System Q4686330
TOTAL: 36,516,003
85
67
63
49
12</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>C. Node Depth and Node Height</title>
      <p>Building upon the examples provided in this work, we consider five initial Wikidata terms: mistral,
katabatic wind, jet stream, sea breeze, and westerlies. We perform a recursive search with a maximum
height of 2 (two hops upward along instance of (P31) and subclass of (P279)) and a maximum depth
of 1 (one hop downward along these relations).</p>
      <p>
        For example, starting from jet stream, we identify air current as a one-hop neighbour. In turn, wind is
a one-hop neighbour of air current, reaching the two-hop limit. Conversely, in the opposite direction
(where jet stream is the object of P31 or P279 relations), we find jet streak as a direct neighbour. This
procedure is applied to all starting terms, producing the following exemplary results:
a) (-1) jet streak -&gt; (0) jet stream -&gt; (1) air current -&gt; (2) wind
b) (-1) ______ -&gt; (0) mistral -&gt; (1) katabatic wind -&gt; (2) fall wind
c) (-1) mistral -&gt; (0) katabatic wind -&gt; (1) fall wind -&gt; (2) air current
d) (-1) Sundowner -&gt; (0) sea breeze -&gt; (1) wind -&gt; (2) meteorological phenomena
e) (-1) Shrieking Sixties -&gt; (0) westerlies -&gt; (1) west wind -&gt; (2) wind
      </p>
      <p>From this limited set of terms, we can compute each node’s overall height as the average of all heights
(or depths) at which it appears. For example, consider the node katabatic wind, which appears as a
starting term at height 0 (example c) and as a one-hop neighbour at height 1 (example b). Its overall
height is thus calculated as (0 + 1) / 2 = 0.5.</p>
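      <p>The averaging can be sketched in a few lines. The chains are the exemplary results a) to e) listed above; the function name and the list-based data layout are hypothetical choices made for this illustration.</p>

```python
from collections import defaultdict

# Chains a) to e); positions run from -1 (one hop down) to 2 (two hops up).
# None marks the empty (-1) slot of chain b).
chains = [
    ["jet streak", "jet stream", "air current", "wind"],
    [None, "mistral", "katabatic wind", "fall wind"],
    ["mistral", "katabatic wind", "fall wind", "air current"],
    ["Sundowner", "sea breeze", "wind", "meteorological phenomena"],
    ["Shrieking Sixties", "westerlies", "west wind", "wind"],
]


def overall_heights(chains, first_offset=-1):
    """Average, per node, of every height at which the node appears."""
    seen = defaultdict(list)
    for chain in chains:
        for offset, node in enumerate(chain, start=first_offset):
            if node is not None:
                seen[node].append(offset)
    return {node: sum(h) / len(h) for node, h in seen.items()}


heights = overall_heights(chains)
print(heights["katabatic wind"])  # 0.5
print(heights["air current"])     # 1.5
```

      <p>Nodes that only occur as upper neighbours, such as wind, end up with larger averages and are therefore more general candidates.</p>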
    </sec>
    <sec id="sec-15">
      <title>D. Louvain Algorithm Results</title>
    </sec>
    <sec id="sec-16">
      <title>E. Neighbourhood Graph</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Abbass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Qasim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Murshed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mahmood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Younis</surname>
          </string-name>
          ,
          <article-title>A review of the global climate change impacts, adaptation, and sustainable mitigation measures</article-title>
          ,
          <source>Environmental Science and Pollution Research</source>
          <volume>29</volume>
          (
          <year>2022</year>
          )
          <fpage>42539</fpage>
          -
          <lpage>42559</lpage>
          . URL: https://doi.org/10.1007/s11356-022-19718-6. doi:10.1007/s11356-022-19718-6.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nawaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fahad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saud</surname>
          </string-name>
          , T. U. Rahman,
          <string-name>
            <given-names>M. N. R.</given-names>
            <surname>Khan</surname>
          </string-name>
          , T. Nawaz,
          <article-title>Securing a sustainable future: the climate change threat to agriculture, food security, and sustainable development goals</article-title>
          ,
          <source>Journal of Umm Al-Qura University for Applied Sciences</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1007/s43994-024-00177-3. doi:10.1007/s43994-024-00177-3.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Ebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Hondula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Errett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Spector</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Berry</surname>
          </string-name>
          ,
          <article-title>Extreme weather and climate change: Population health and health system implications</article-title>
          ,
          <source>Annual Review of Public Health</source>
          <volume>42</volume>
          (
          <year>2021</year>
          )
          <fpage>293</fpage>
          -
          <lpage>315</lpage>
          . doi:10.1146/annurev-publhealth-012420-105026, epub 2021 Jan 6.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>Albedo Q101038 surface temperature Q56297886 scattering Q210028 Probe Q96093522 oscillation Q170475 p53 Q14818098 autumn Q1314 MJO Q1170041 Nitrate Q49916468 Stratosphere Q108376 NAO Q1137345 boundary layer Q752193 advection Q379788</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>