<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Jun</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Library for Text Mapping to the Sustainable Development Goals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ioanna Mandilara</string-name>
          <email>ioannamand@netmode.ntua.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleni Fotopoulou</string-name>
          <email>efotopoulou@netmode.ntua.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christina Maria Androna</string-name>
          <email>andronaxm@netmode.ntua.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasios Zafeiropoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papavassiliou</string-name>
          <email>papavass@mail.ntua.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Communication and Computer Systems, National Technical University of Athens</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Knowledge Graph, Text classification, Sustainable Development Goals</institution>
          ,
          <addr-line>Natural Language Processing</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>2023</issue>
      <abstract>
        <p>Over the last few years, there has been a significant increase in the release of massive amounts of data related to the Sustainable Development Goals (SDGs). This has been driven by the recognition that data is critical for monitoring progress towards the SDGs, identifying areas that require attention, and informing policy decisions. Given that a significant percentage of the provided information is made available in documents, the availability of software libraries that can enable scientists to easily extract information regarding the importance given to the SDGs in the documents' text is considered crucial. Towards this direction, we have developed an open-source Python software library that provides the mapping between text and the SDGs, taking advantage of novel machine learning techniques. The provided software library is modular and permits the dynamic selection of trained machine learning models for analysis purposes. The outcomes from the usage of the software library are fed for data population of a knowledge graph that is targeted to the tracking of information around the SDGs. The overall approach is open and can be easily adopted by scientists and policy makers to support participatory modeling processes, as well as participatory decision making and action planning for the development of solutions for climate-resilient regions.</p>
      </abstract>
      <kwd-group>
        <kwd>Sustainable</kwd>
        <kwd>extraction</kwd>
        <kwd>co-located with Extended Semantic Web Conference (ESWC)</kwd>
        <kwd>Hersonissos</kwd>
        <kwd>Greece</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Sustainable Development Goals (SDGs) are a set of 17 interconnected global goals that aim
to provide a universal framework for countries, organizations, and individuals to work together
towards achieving a sustainable future for all [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They aim to address the most pressing global
challenges, including poverty, inequality, climate change, environmental degradation, and social
injustice.
      </p>
      <p>A lot of information exists around the SDGs, but much of it is scattered across multiple data
silos and made available in different formats, making it difficult to gain a complete picture
of progress. While quantitative data such as statistics and numerical indicators (e.g., in time
series databases) are essential for measuring progress, qualitative data contained within text
documents (e.g., international reports, policies, recommendations) is equally important for
providing context and understanding the factors that contribute to progress or hinder it. To
fully understand the progress being made towards the SDGs, it is therefore necessary to analyze
and classify text documents based on their relevance to the SDGs.</p>
      <p>To achieve this, novel machine learning techniques can be adopted, such as Natural Language
Processing (NLP) techniques. NLP techniques can help to overcome some of the challenges
associated with manual analysis of text documents. For example, identifying the relevance
of a particular policy or report to the SDGs can be a time-consuming and resource-intensive
process, requiring a significant amount of manual effort. NLP techniques can automate this
process, making it faster and more accurate, while also reducing the potential for human bias.
By taking advantage of NLP techniques, it becomes possible to gain a more comprehensive and
nuanced understanding of the factors that contribute to progress towards the SDGs, enabling
policymakers and other stakeholders to make informed decisions and take targeted actions.</p>
      <p>
        In the work presented in this manuscript, we provide details for the development of
SDGDetector, an open-source software library that takes as input text and provides as output the
association of the text description with the various SDGs. SDGDetector is based on two NLP
techniques. It combines a traditional machine learning technique based on keywords detection
and a deep learning technique that uses a transformer-based model. The produced mappings
are fed as input for data population of the SustainGraph [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an open-source Knowledge
Graph (KG) that aims to track the progress achieved towards the targets defined for the various
SDGs. In this way, we consider the fusion of information produced by the text analysis process
with information that is made available in the SustainGraph. The latter includes time-series
data for the various SDG indicators at the global, regional, and local levels; time-series data for social,
economic or environmental indicators; and data associated with the implementation of case
studies focusing on the development of climate-resilient regions across Europe [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Analysis
pipelines can then be developed over the information made available in the SustainGraph.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Text Analysis and Classification</title>
      <sec id="sec-2-1">
        <title>2.1. Policies Overview</title>
        <p>
          Overall, the policies around the SDGs are diverse and varied, reflecting the complex and
interconnected nature of the challenges they seek to address and the need for the provision
of automated text analysis tools to scientists and policy makers. The Paris Agreement is a
legally binding international treaty on climate change adopted in 2015 by 196 parties [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Its
goal is to limit global warming to well below 2 degrees Celsius above pre-industrial levels, with
an aspiration to limit it to 1.5 degrees Celsius. To achieve this, countries pledge to nationally
determine and communicate their own climate actions, known as Nationally Determined
Contributions (NDCs), and to regularly report on their progress [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Such progress is in multiple
cases associated with the SDGs, based on the 2030 Agenda for Sustainable Development that
sets out 17 goals, 169 targets, and 232 indicators, covering a range of economic, social, and
environmental issues [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. At the European Union (EU) level, the European Green Deal (EGD) is
a comprehensive plan to transform the region into a sustainable, carbon-neutral economy by
2050. The EGD identifies several priority areas, while multiple documents are produced per
year for specifying the action plan per priority area.
        </p>
        <p>
          From an economic development perspective, the Country Specific Recommendations (CSRs)
issued by the European Commission to individual EU member states aim to address a wide range
of policy areas, including climate change. The CSRs relevant to climate change typically focus
on increasing renewable energy sources, improving energy efficiency, promoting sustainable
transport, and reducing greenhouse gas emissions in various sectors. Furthermore, the EU
taxonomy is a classification system that defines environmentally sustainable economic activities
and sets out criteria for determining whether an economic activity contributes to environmental
objectives [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The EU taxonomy is closely linked to the EGD, as it aims to support the transition
to a sustainable, low-carbon economy by providing a common language for investors, companies,
and policymakers to identify and promote sustainable investments.
        </p>
        <p>
          In this work, we focus on the analysis of documents from the EGD and
the CSRs using SDGDetector. The produced mappings are introduced into the
SustainGraph [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], where further information is made available and can be jointly analyzed. Such
information regards the status of the SDG indicators at the national and regional levels, the mapping
of the NDCs with the SDGs, as well as the classification of activities of main stakeholders in the
case studies of the ARSINOE H2020 project [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] according to the EU taxonomy.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Overview of Natural Language Processing Tools Focused on the SDGs</title>
        <p>
          Various NLP mechanisms are made available for examining the association between text
documents and the SDGs. In [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], NLP methods are applied in combination with network analysis
techniques to measure overlaps in international policy discourse around the SDGs. The
produced results identify a strong discursive divide between environmental goals and all other
SDGs, as well as the appearance of unexpected interdependencies between SDGs in different
areas. In [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a deep-learning natural language processing model in Japanese is applied based
on bidirectional encoder representations from transformers (BERT) to support the mapping of
text documents with the SDGs.
        </p>
        <p>
          A set of NLP tools has also been made available to map text to the SDGs, such as SDG-meter
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], text2sdg R package [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], OSDG-ai [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], LinkedSDG [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], SDG-tracker [13], SDGMapper [14],
and SDG Pathfinder [15]. SDG-meter is proposed as an open-source online tool able to indicate
the SDGs linked to an input text, taking advantage of a multi-label classification of texts using
BERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and allowing users to compare the accuracy of different mapping algorithms. The
open-source text2sdg R package [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] detects SDGs in text data using different existing or
custom-made query systems, while the OSDG-ai [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and LinkedSDG [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] tools take advantage
of ontologies and keyword matching techniques. SDG-tracker [13], SDGMapper [14], and SDG
Pathfinder [15] are online platforms or tools managed by different organizations that offer
SDG mapping services.
        </p>
        <p>Given the existence of such tools, the main motivation for the development of SDGDetector
stems from the need to provide an open-source software library that is easily accessible to
and extensible by software developers. SDGDetector also provides easy parameterization
capabilities and options for selecting the methods (keyword extraction and text classification)
to be applied in the text analysis process. SDGDetector is developed in Python and can be
easily integrated into Python-based workflows that are used in data population processes of
knowledge graphs.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Approach for mapping text to SDGs</title>
      <sec id="sec-3-1">
        <title>3.1. Methodology</title>
        <p>We propose two different NLP techniques to interlink text with the SDGs. The main idea is to
combine a traditional machine learning technique, which uses keywords to find the relevance
of texts with the SDGs, and a deep learning technique, namely transfer learning, which is based
on a transformer-based model. In the first case, the linkage between the texts and the SDGs can
be made by computing the cosine similarity scores between the text’s keywords and the SDG’s
keywords. In the second case, the linkage can be expressed as the probability that the texts are
related to the SDGs by using a transformer-based classifier. In the upcoming subsections we
provide details for both techniques, while both of them are made openly available in a GitLab
repository [16].</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Text mapping to SDGs based on Keywords Extraction</title>
          <p>To support text mapping to the SDGs based on keywords extraction, we have adopted a taxonomy
of keywords per SDG, as it is provided by the Committee on the Environment, Climate Change,
and Sustainability of the University of Toronto [17] for the SDGs from 1 to 16, as well as the
taxonomy provided by the Monash University and the Sustainable Development Solutions
Network (SDSN) for Australia, New Zealand, and the Pacific for SDG 17 [18]. Based on
these taxonomies, we match the keywords extracted by the applied
process against the already classified keywords per SDG. The overall process is depicted in Figure 1.</p>
          <p>The keywords extraction process consists of the following steps:</p>
          <p>Data cleaning: The data quality of the text is improved by removing digits and special characters
from the given text.</p>
          <p>Candidate keywords extraction: The candidate n-gram words and/or phrases in the text
are extracted by using the bag-of-n-grams representation [19]. This representation maps a text
document as an unordered collection of its n-grams and is able to eliminate stop words and
tokenize plain texts.</p>
          <p>Embeddings: The candidate keywords/key phrases and the entire document are converted
into numerical data using embeddings. For this purpose, sentence transformers are used based
on the pre-trained all-mpnet-base-v2 sentence transformer model [20], which has shown high
performance for sentence embeddings and semantic search and is recommended by the
sentence-transformers Python library.</p>
          <p>Representative keywords identification: The most representative keywords/key phrases
of the text are extracted based on a cosine similarity score. The Maximal Marginal Relevance
(MMR) algorithm [21] is applied to select keywords/key phrases based on cosine similarity.</p>
          <p>MMR attempts to decrease redundancy while increasing the diversity of the outputs. The
keywords or key phrases that most closely match the document are chosen first. Then, we
iteratively choose new candidates that are both similar to the document and not too similar to the
previously selected keywords/key phrases.</p>
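          <p>The MMR selection step described above can be illustrated with a minimal, dependency-free sketch. It assumes one common formulation of the MMR score, (1 - diversity) * relevance - diversity * redundancy; the function names mmr and cosine, the toy two-dimensional vectors, and the diversity values are hypothetical and are not taken from the SDGDetector code base.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, top_n, diversity):
    """Return indices of candidates picked by Maximal Marginal Relevance.

    Assumed score: (1 - diversity) * relevance - diversity * redundancy.
    """
    relevance = [cosine(c, doc_vec) for c in cand_vecs]
    # Start with the candidate that is closest to the document.
    selected = [max(range(len(cand_vecs)), key=lambda i: relevance[i])]
    while len(selected) < min(top_n, len(cand_vecs)):
        best, best_score = None, -math.inf
        for i in range(len(cand_vecs)):
            if i in selected:
                continue
            # Redundancy: highest similarity to any already selected candidate.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


# Toy embeddings (hypothetical): candidate 1 nearly duplicates candidate 0.
doc = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
picked_low = mmr(doc, candidates, top_n=2, diversity=0.0)
picked_high = mmr(doc, candidates, top_n=2, diversity=0.7)
```
]]></preformat>
          <p>With a diversity of 0 the near-duplicate candidate is picked second, whereas a higher diversity favors the less redundant candidate, which is the trade-off the MMR step is meant to control.</p>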
          <p>By having extracted the representative keywords from the text, the cosine similarity matrix
is computed between their embeddings and the embeddings of the classified keywords per SDG.
The cosine similarity matrix is an n × m matrix, where n is the number of the top representative
keywords and m is the number of the SDG’s keywords. The higher the value of the average
cosine similarity between two terms, the greater their relevance.</p>
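          <p>As an illustration of the n × m matrix, the following sketch computes the cosine similarity between every text keyword embedding and every SDG keyword embedding and then averages the entries. The text does not state how the average is taken, so averaging over all matrix entries is an assumption here; the helper names and the toy vectors are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(text_kw_vecs, sdg_kw_vecs):
    """Build the n x m matrix: one row per text keyword, one column per SDG keyword."""
    return [[cosine(t, s) for s in sdg_kw_vecs] for t in text_kw_vecs]


def average_similarity(matrix):
    """Mean of all entries (assumption: the paper does not specify the averaging)."""
    values = [x for row in matrix for x in row]
    return sum(values) / len(values)


# Toy embeddings (hypothetical): n = 2 text keywords, m = 3 keywords of one SDG.
text_keywords = [[1.0, 0.0], [0.0, 1.0]]
sdg_keywords = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(text_keywords, sdg_keywords)
avg = average_similarity(matrix)
```
]]></preformat>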
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Multi-label Classification using pre-trained Transformer-based Models</title>
          <p>A deep learning technique has been developed to find the similarities between the texts and
the SDGs, considering the tackled problem as a multi-label classification problem. The overall
classification process is depicted in Figure 2.</p>
          <p>The training and validation datasets for the model are based on the OSDG Community
Dataset [22]. This dataset is made up of paragraph-length text samples obtained from publicly
available publications such as reports, policy documents, and publication abstracts. It consists of
37575 text excerpts and includes the related SDG, the number of volunteers who voted against
the proposed SDG label (labels_negative), the number of volunteers who voted in favor of the
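proposed SDG label (labels_positive), and the agreement score based on the formula:
<disp-formula id="eq-1"><label>(1)</label><tex-math><![CDATA[\mathrm{agreement} = \frac{\left|\mathrm{labels\_positive} - \mathrm{labels\_negative}\right|}{\mathrm{labels\_positive} + \mathrm{labels\_negative}}]]></tex-math></disp-formula></p>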
          <p>The text classification process consists of the following steps:</p>
          <p>Data preparation: Only data with an accepted SDG label (labels_positive &gt; labels_negative)
and an agreement score greater than 0.55 were chosen for the current study. The decision
to only include samples with an agreement over 0.55 aims to ensure reliable and consistent
labeling of text data by minimizing subjective interpretation among volunteers. In this manner,
22523 out of 37575 samples were retained. Figure 3 presents the number of text excerpts used
for the SDGs from 1 to 16 after the filtering process.</p>
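          <p>The filtering rule of the data preparation step can be sketched as follows. The sketch assumes that the agreement score is the absolute difference between positive and negative votes divided by their sum, and that each sample is a dictionary carrying the labels_positive and labels_negative counts; the sample records and the helper names are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def agreement(labels_positive, labels_negative):
    """Agreement score: |positive - negative| / (positive + negative)."""
    total = labels_positive + labels_negative
    return abs(labels_positive - labels_negative) / total if total else 0.0


def keep_sample(sample, min_agreement=0.55):
    """Keep a sample if its label is accepted and the agreement exceeds the threshold."""
    pos = sample["labels_positive"]
    neg = sample["labels_negative"]
    return pos > neg and agreement(pos, neg) > min_agreement


# Hypothetical records, not taken from the OSDG Community Dataset.
samples = [
    {"text": "a", "sdg": 7, "labels_positive": 8, "labels_negative": 1},   # agreement 7/9, kept
    {"text": "b", "sdg": 7, "labels_positive": 5, "labels_negative": 3},   # agreement 0.25, dropped
    {"text": "c", "sdg": 13, "labels_positive": 1, "labels_negative": 6},  # label rejected, dropped
]
kept = [s for s in samples if keep_sample(s)]
```
]]></preformat>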
          <p>Data preprocessing: The data preprocessing phase involves splitting the dataset into
training, testing, and validation sets and tokenizing each set to ensure it meets the
expected format for the pre-trained models. We first split the dataset into training and
testing sets, using 80% and 20% of the samples, respectively. We then further
divided the training set into training and validation sets, with 80% and 20% of the training
samples, respectively, which corresponds to 64%, 16%, and 20% of all samples for training,
validation, and testing. The validation set is reserved for hyperparameter optimization and
performance evaluation. To handle any dataset imbalance, we applied a stratified split
that maintained the same class distribution percentages in each split.</p>
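          <p>The two-stage stratified split can be illustrated with a small standard-library sketch. It is not the authors' implementation; in practice a utility such as scikit-learn's train_test_split with its stratify argument is a common choice. The function name stratified_split, the seeds, and the toy label list are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split positions 0..len(labels)-1 into (train, test), class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Toy imbalanced label list (hypothetical): 100 samples of class 1, 50 of class 2.
labels = [1] * 100 + [2] * 50

# Stage 1: 80% training+validation, 20% testing.
train_val, test = stratified_split(labels, 0.2)

# Stage 2: split the 80% again into 80% training and 20% validation.
train_val_labels = [labels[i] for i in train_val]
train_pos, val_pos = stratified_split(train_val_labels, 0.2, seed=1)
train = [train_val[i] for i in train_pos]
val = [train_val[i] for i in val_pos]
```
]]></preformat>
          <p>For the toy data this gives 96, 24, and 30 samples for training, validation, and testing (64%, 16%, and 20%), and each subset keeps the 2:1 class ratio.</p>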
          <p>Moreover, we have used data augmentation techniques to generate synthetic training data for
classes containing fewer than 1000 texts. This process involved drawing a random sample from
each minority class and sequentially introducing it into four augmenters in a data augmentation
pipeline. The pipeline starts by injecting new words at random positions (insertion) based
on the BERT model’s contextual word embeddings and then replacing different
words by their contextual embeddings (substitution). Next, we replaced the text’s words using
WordNet synonyms (synonym replacement) and added additional text (sentence augmentation)
based on the XLNet model’s contextual word embeddings.</p>
          <p>To prepare the textual data for the deep learning model, we used appropriate tokenization
based on the model being trained, which converted the unstructured text strings into a numerical
data structure.</p>
          <p>Model training: At this phase, two different fine-tuning techniques are used:
• Train the entire architecture: In this case, we train the model by unfreezing all the
layers of its architecture, i.e. the pre-trained weights of the model are updated based on
the new dataset, namely the OSDG Community Dataset.
• Train some layers while freezing others: In this case, we train the model partially by
freezing its initial layers and training only the last ones with the new dataset.</p>
          <p>We experiment with five different transformer-based language models by using one or both
of the fine-tuning techniques. Namely, we consider the base version of BERT (Bidirectional
Encoder Representations from Transformers) [23], which is trained on large amounts of text
data and can be fine-tuned for a variety of NLP tasks; the base version of RoBERTa (Robustly
Optimized BERT Pretraining Approach) [24], a modified version of BERT that uses a
larger amount of training data and a longer pre-training process to achieve better performance;
the XLNet model [25], which
uses a generalized autoregressive pre-training method and is designed to improve on the limitations of
BERT and other transformer-based models; the GPT-2 (Generative Pre-trained Transformer
2) model, which, like BERT, uses the transformer architecture, but its decoder part instead of
the encoder part; and the GPT-Neo (Generative Pre-trained Transformer-Neo) model, which is a
community-driven effort to replicate the success of GPT models with a focus on open source
and on democratizing access to large-scale NLP models.</p>
          <p>Different NLP tasks can be accomplished using the aforementioned models. For the
classification task, we add an extra layer of untrained linear neurons on top of these models. During
training, these neurons are updated to map the samples to one of the 16 classes. Afterwards,
optimization of the hyperparameters is crucial for both fine-tuning techniques. Optimal values
are selected for the batch size (number of training samples per step), the max sequence length (maximum
length of the sequence in tokens), the optimizer learning rate, and the number of epochs.</p>
          <p>Model evaluation: The final stage of the classification process is the evaluation, which
is based on different metrics to understand the fine-tuned model’s performance, as well as its
strengths and weaknesses. The classification report, evaluation plots, and confusion matrix of
the testing set are commonly used for this purpose.</p>
          <p>The classification report is used to measure the quality of predictions. The report shows the
main classification metrics (precision, recall, and f1-score) on a per-class basis and in total. For
our classification problem, the f1-score is considered more suitable than accuracy,
since the dataset is imbalanced. The confusion matrix is a 16 x 16 matrix that compares the
actual and predicted values, providing a tabular way of visualizing the model performance. The
evaluation plots show the loss/f1-score of the training and the validation samples during
training.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Combinatory evaluation index</title>
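          <p>Based on the outcomes provided by the aforementioned techniques, we consider the evaluation
of an index that combines both of them. The index is based on the formula:
<disp-formula id="eq-2"><label>(2)</label><tex-math><![CDATA[\mathrm{index} = \begin{cases} 0.7 \cdot \mathrm{probability} + 0.3 \cdot \mathrm{avg\_cosine\_similarity}, & \text{for SDGs 1-16} \\ 0.5 \cdot \mathrm{avg\_cosine\_similarity}, & \text{for SDG 17} \end{cases}]]></tex-math></disp-formula>
where probability is the output of the text classification technique and avg_cosine_similarity is
the average cosine similarity of the keywords matching technique.
The index has two branches since there is no available text classification dataset for
SDG 17. Thus, for this case, the evaluation is based exclusively on the applied keywords
matching technique. The probability coefficient has a higher value than the corresponding
coefficient of the average cosine similarity, as the models used in the text classification technique
are able to better capture the complex relationships between words in a sentence.</p>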
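          <p>A minimal sketch of the two-branch index follows. It assumes the weights that appear in Equation (2), namely 0.7 for the classification probability and 0.3 for the average cosine similarity for SDGs 1 to 16, and 0.5 for the average cosine similarity for SDG 17; the function name combined_index and the input values are hypothetical and do not reflect the SDGDetector API.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def combined_index(sdg, avg_cosine_similarity, probability=None):
    """Combine both techniques into one score (weights assumed from Equation (2))."""
    if not 1 <= sdg <= 17:
        raise ValueError("sdg must be between 1 and 17")
    if sdg == 17:
        # No classification dataset exists for SDG 17: keywords matching only.
        return 0.5 * avg_cosine_similarity
    if probability is None:
        raise ValueError("probability is required for SDGs 1-16")
    return 0.7 * probability + 0.3 * avg_cosine_similarity


# Hypothetical inputs for illustration.
index_sdg7 = combined_index(7, avg_cosine_similarity=0.4, probability=0.9)
index_sdg17 = combined_index(17, avg_cosine_similarity=0.4)
```
]]></preformat>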
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. SDGDetector Software Library</title>
        <p>In this section, we present SDGDetector, an open-source Python library that we have developed
to streamline the process of mapping textual data to the Sustainable Development Goals
(SDGs). SDGDetector is made openly available in a GitLab repository [26]. The library consists
of three primary classes that provide a set of tools for automated SDG classification,
as detailed in Table 1.</p>
        <p>The first class incorporates the method described in Section 3.1.2 and returns the probability of
each text belonging to the first 16 SDGs. This class is flexible enough to work with a user’s
fine-tuned model or our pre-trained XLNet or RoBERTa model, which reach an f1-score of 0.90.</p>
        <p>The class SDG_classifier_using_keywords_extraction incorporates the method described in
Section 3.1.1. This class offers two methods: find_top_keywords and predict.
The find_top_keywords method extracts the top n keywords from the input texts using the
embeddings generated by the sentence transformer model. The library features the Mpnet-base,
MiniLM: 6 Layer Version, and DistilBert-base sentence transformer models. The first two
models yield the highest quality of embeddings: the MiniLM: 6 Layer Version is 5
times faster than the Mpnet-base, while the latter offers the best quality of embeddings. The
user can select the model to be used, as well as the diversity and range of keywords. The predict
method returns the average cosine similarity with the SDGs.</p>
        <p>The class SDG_classifier combines the previous methods and returns the combinatory evaluation
index (Section 3.1.3) for the given texts.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Results</title>
      <p>In this section, we provide a set of results based on the application of the developed approach and
software library for the classification of texts coming from the Country Specific Recommendations
and European Green Deal Policies, and the data population of SustainGraph based on the provided
mappings.</p>
      <sec id="sec-4-1">
        <title>4.1. Keywords Extraction and Similarity Score</title>
        <p>Regarding the traditional machine learning method, the top 10 and top 5 representative keywords were
produced for each section of the European Green Deal strategies and of the Country Specific Recommendations,
respectively. For the sentence embeddings, the all-mpnet-base-v2 model was
used, and the diversity parameter of the MMR algorithm was set to 0.3.</p>
        <p>The cosine similarity matrix was computed between the top 10 keywords of each section of
the European Green Deal (EGD) strategies and the SDG’s keywords. Similarity scores of less
than 30% were not taken into consideration, similarity scores in the range of 30% to 50% were
considered as medium, while similarity scores of more than 50% were considered as high. By
taking the percentage of high and medium similarity scores for each EGD strategy and SDG
respectively, we were able to get an overview of the association between them, as depicted
in Figure 4. Depending on the EGD strategy, the most relevant SDG is identified with high
score (e.g., for the Solar Energy EGD strategy, SDG #7 (Affordable and Clean Energy) is
dominant). The smaller association values (yellow bars) are shown for SDGs #1 (No Poverty),
#10 (Reduced Inequalities), and #16 (Promote just, peaceful, and inclusive societies).</p>
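        <p>The thresholding of similarity scores can be sketched as below. The sketch assumes that a score of exactly 50% counts as medium and that the percentages are computed over all scores of a given strategy and SDG pair, since the text does not specify the denominator; the function names and the score values are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def band(score):
    """Map a cosine similarity score to None (ignored), 'medium', or 'high'."""
    if score < 0.30:
        return None
    if score <= 0.50:
        return "medium"
    return "high"


def band_shares(scores):
    """Percentage of scores in each band (assumed denominator: all scores)."""
    counts = {"medium": 0, "high": 0}
    for score in scores:
        label = band(score)
        if label is not None:
            counts[label] += 1
    return {label: 100.0 * n / len(scores) for label, n in counts.items()}


# Hypothetical similarity scores between one strategy's keywords and one SDG's keywords.
scores = [0.10, 0.35, 0.45, 0.62, 0.80, 0.20, 0.55, 0.30]
shares = band_shares(scores)
```
]]></preformat>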
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Transformer-Based model Fine-Tuning and Inference</title>
        <p>Next, we conducted several experiments with the transformer-based models using
the OSDG Community Dataset, as mentioned in Section 3.1.2. The experiments were carried
out using the GPU hardware accelerator of the Kaggle notebook platform. Additionally, we used
Hugging Face’s Transformers library as the source for all the transformer-based models,
implemented in PyTorch.</p>
        <p>Throughout our experiments, we used the Adam optimizer with weight decay fix [27] and
a linear scheduler for 10% of the total steps. For each model’s training, we selected the best
fine-tuning learning rate (among 5e-5, 3e-5, and 2e-5), as suggested in [23]. In the case of the
BERT, RoBERTa, and XLNet models, the batch size and maximum length were set to 32 and 512,
respectively. These models were trained for 3, 4, and 5 epochs. Since GPT models have higher
memory requirements than the other models, we experimented with various combinations of
maximum length and batch size to prevent memory issues.</p>
        <p>As for the loss function, we optimized the BERT and RoBERTa models using the Cross-Entropy
loss function, while the XLNet model was optimized with the BCE-with-logits loss function.
In addition, we investigated the second fine-tuning approach for the BERT, RoBERTa, and XLNet
models, which involves the partial training of the model as discussed in Section 3.1.2, alongside
the first fine-tuning technique, i.e. the training of the complete model, for the GPT models. In
more detail, these 3 models consist of 12 layers with an added single linear layer on top, acting
as the classifier. In our experiments, the pre-trained model parameters in the 1st to 11th layers
were frozen, while the 12th layer and the classifier layer were set as trainable.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Model training and evaluation</title>
          <p>The BERT, RoBERTa, and XLNet models showed their best performance after 3 fine-tuning
epochs with a learning rate of 5e-5, as demonstrated in Figure 5, while lower performance was observed
for the GPT-2 and GPT-Neo models. The three models (BERT, RoBERTa, XLNet) achieved
similar f1-scores of approximately 0.90. The BERT model yielded the highest f1-score of 0.91,
while the RoBERTa model had a slightly lower f1-score of 0.90, but with less divergence between
the evaluation and training curves. The XLNet model also achieved an f1-score of 0.90 with a
smaller divergence between the training and evaluation curves than the BERT model. Therefore,
we concluded that the RoBERTa model outperforms the BERT model and achieves comparable
outcomes to the XLNet model. Figure 6 presents the confusion matrices, which illustrate the
robust performance of the models, where all models yielded an f1-score of 0.90 over the testing
set.</p>
          <p>Furthermore, the data augmentation technique, described in Section 3.1.2, was used for the
BERT, RoBERTa, and XLNet models, but all the models achieved training/validation/testing
f1-scores of around 0.90, similar to the results obtained without it. Thus, we concluded that
these models could handle imbalanced datasets.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Model inference</title>
          <p>Following the training of the models, we fed the recommendations of each country and the
sections of each European Green Deal (EGD) Strategy into our fine-tuned XLNet model to produce
predictions for the first 16 SDGs. The sigmoid function is applied to each raw output value to obtain a probability per Goal. We
kept only probabilities greater than or equal to 10%, as lower probabilities indicate
weak connections between the texts and the SDGs. Figure 7 displays the average probability
per Goal for each EGD Strategy, with the majority of the strategies being linked to Goals
#7 (Affordable and Clean Energy) and #13 (Climate Action). Notably, Goals #1 (No Poverty),
#10 (Reduced Inequalities) and #16 (Promote just, peaceful, and inclusive societies) are not
represented in the EGD Strategies. As mentioned previously, the same SDGs are associated
with only medium similarity scores with the EGD Strategies. The two approaches thus arrive at
comparable conclusions and reinforce one another. Combining them may
result in a more robust measure of association.</p>
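          <p>The post-processing step described above can be sketched as follows. This is a minimal illustration rather than the SDGDetector implementation: the function names sigmoid and linked_goals, the ordering of the raw outputs by Goal number, and the example values are all assumptions made for the sketch.</p>
          <preformat>
```python
import math


def sigmoid(x: float) -> float:
    """Map a raw model output (logit) to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def linked_goals(logits, threshold=0.10):
    """Return {goal_number: probability} for Goals at or above the threshold.

    Assumes `logits` holds one raw output per Goal, ordered from Goal #1 upward.
    """
    probabilities = [sigmoid(value) for value in logits]
    return {
        goal: probability
        for goal, probability in enumerate(probabilities, start=1)
        if probability >= threshold
    }


# Hypothetical raw outputs for Goals #1 to #4 (illustrative values only).
example_logits = [-4.0, 2.0, -2.0, -2.5]
kept = linked_goals(example_logits)
print(sorted(kept))
```
          </preformat>
          <p>Because the sigmoid is applied to each output independently, the resulting probabilities need not sum to one, so a text can be linked to several Goals at once.</p>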
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4.3. Evaluation based on the composite index</title>
      <p>Next, we compared the effectiveness of the two NLP approaches for a text classification
problem related to the SDGs. With SDGDetector, we can combine these two
methodologies to discover the relationship of the EGD strategies and the CSRs with the SDGs.
The method predict of the class SDG_classifier returns the composite index, as explained in Section
3.1.3. The parameters used in the predict method are shown in Table 2, where the number of
keywords extracted was chosen based on the text’s length.</p>
      <p>The plots presented in Figure 8 illustrate the composite index values for each EGD Strategy with respect
to the SDGs. The results show that Goal #7 (Affordable and Clean Energy) has the strongest
connection with the Renovation Wave Strategy, Hydrogen Strategy, Solar Energy Strategy,
Offshore Renewable Energy Strategy, Methane Strategy, and Energy
System Integration Strategy. In contrast, Goal #13 (Climate Action) is mostly related to the
Climate Adaptation Strategy and Sustainable Finance Strategy. The Forest and Biodiversity
Strategy is highly associated with Goal #15 (Life on Land).</p>
      <p>To validate the outcomes provided by SDGDetector against existing alternative tools, we
compared them with the outcomes of the SDGMapper [14] tool, which is
developed by the European Commission under the KnowSDGs web platform. The SDGMapper
tool expresses the linkage between a document and an SDG as the ratio of the keywords detected for each goal
to the total number of keywords detected. By uploading the EGD Strategies into this tool, we
can compare our findings with those of a tool provided by the European Commission.</p>
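      <p>The keyword-ratio measure described above can be illustrated with a short sketch. It only mirrors the ratio as stated in the text and is not the SDGMapper implementation; the function name keyword_ratios, the input format of (goal, keyword) pairs, and the example detections are assumptions.</p>
      <preformat>
```python
from collections import Counter


def keyword_ratios(detections):
    """Return {goal: share of all detected keywords that belong to that goal}.

    `detections` is assumed to be a list of (goal_number, keyword) pairs,
    one pair per keyword occurrence found in the document.
    """
    counts = Counter(goal for goal, _keyword in detections)
    total = sum(counts.values())
    if total == 0:
        return {}  # no keywords detected, so no linkage can be expressed
    return {goal: count / total for goal, count in counts.items()}


# Hypothetical detections for a single document (illustrative only).
detections = [
    (7, "renewable energy"),
    (7, "energy efficiency"),
    (7, "solar"),
    (13, "climate change"),
]
ratios = keyword_ratios(detections)
print(ratios)
```
      </preformat>
      <p>With this definition the ratios of a document always sum to one, so they describe the relative emphasis among the Goals rather than an absolute strength of association.</p>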
    </sec>
    <sec id="sec-6">
      <title>5. SustainGraph Enrichment</title>
      <p>
        SustainGraph is a Knowledge Graph (KG) that tracks information related to the progress towards
the achievement of targets defined in the SDGs at national and regional levels, as well as further
social, economic and environmental indicators that may prove useful to interdisciplinary
scientists in their modeling and analysis processes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The objective is to serve as an open and
comprehensive knowledge source for information related to the SDGs, utilizing graph databases
and NLP techniques for data analysis and population. It is developed as a labeled property
graph based on the Neo4j technology and is made available in a GitLab repository [28].
      </p>
      <p>The overall structure of SustainGraph is depicted in Figure 10. On the left side of the figure,
various policy frameworks are listed, where information coming from policy documents,
strategies and directives is introduced. The SDGDetector library is used to develop
data population pipelines that automate the population of SustainGraph with data coming
from the strategies defined in the EGD and the CSRs per country. Such information can be
combined with data coming from the tracking of the SDG targets and indicators, data coming
from third-party sources, as well as data coming from the implementation of case studies. A
closer view of the conceptualization of SustainGraph regarding the EGD
and CSR entities is provided in Figure 11. For both EGD and CSR documents, we keep
track of the year in which they were issued and their association with the SDGs, while for the CSRs
we add information regarding the country (geoarea) for which they were issued.</p>
      <p>The association is described by the weight and the description properties. The weight is the
exact association value, as provided by the composite index of SDGDetector, while the description is
produced by a rule-based labeling approach. Relationships with weights less than 10%
suggest poor linkages between policies and the SDGs and are classified as very low, whereas
those with values in the range [10, 30) indicate a stronger but still insufficient linkage and are
classified as low. Additionally, relationships with weights in the range [30, 60) are classified
as medium, and relationships with weights in the range [60, 100) are classified as high.</p>
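      <p>The rule-based labeling can be sketched as a small function. This is an illustrative sketch, not the code used in the SustainGraph pipelines: the name association_label is hypothetical, weights are assumed to be percentages between 0 and 100, and a weight of exactly 100, which the ranges above leave unspecified, is assumed to be labeled high.</p>
      <preformat>
```python
def association_label(weight: float) -> str:
    """Map an association weight (in percent) to its description label."""
    if weight > 100 or 0 > weight:
        raise ValueError("weight is expected to lie between 0 and 100")
    if weight >= 60:
        return "high"  # [60, 100), plus exactly 100 by assumption
    if weight >= 30:
        return "medium"  # [30, 60)
    if weight >= 10:
        return "low"  # [10, 30)
    return "very low"  # below 10


# Hypothetical weights (illustrative only).
for example_weight in (4.2, 10, 45.0, 72.5):
    print(example_weight, association_label(example_weight))
```
      </preformat>
      <p>Checking the thresholds from the highest downward keeps each range half-open, so a boundary value such as 30 falls into the higher class, as the intervals in the text require.</p>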
      <p>Based on the data population of SustainGraph, indicative visualisations are produced.
The pie chart in Figure 12 illustrates the association of the CSRs with the SDGs in the years 2011
and 2022, which correspond to the first year in which the CSRs were issued and the year in which their
latest version was made available. The chart shows, per Goal, the percentage of CSRs whose
"ASSOCIATED_WITH" relationship with the SDGs carries a high or medium weight.
It suggests that Europe’s focus in 2011 was primarily on the areas of economic
growth and educational quality (Goals 8 and 4), while in 2022 the majority of recommendations
targeted the areas of clean energy and climate action (Goals 7 and 13).</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions and Future Work</title>
      <p>Motivated by the need for open-source tools that analyze the relevance
of documents to the SDGs, we presented SDGDetector, an open-source software library in
Python that supports such an analysis. SDGDetector is modular, easily adoptable and extensible
by software developers, and it permits the selection of various models for the analysis. A
composite index is produced for the analysis results, combining input coming from a traditional
machine learning technique based on keyword matching and a deep learning technique that
takes advantage of transformer-based models. The produced outcome is used for data population
of an open-source Knowledge Graph for tracking the progress towards the achievement of
the SDGs. A set of evaluation results is made available based on the analysis of documents
coming from the CSRs and strategies from the EGD, showcasing the suitability of the proposed
approach for identifying the association between a text and the SDGs. Based on
the presented work, future research and development areas are identified. These include the
support of participatory analysis processes based on the heterogeneous data made available in
SustainGraph, and the performance evaluation with models like GPT, provided that access to a
larger infrastructure with GPU (graphics processing unit) support becomes available.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research work has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 101037424.</p>
      <p>[12] A. Joshi, L. G. Morales, S. Klarman, A. Stellato, A. Helton, S. Lovell, A. Haczek, A knowledge organization system for the United Nations sustainable development goals, in: R. Verborgh, K. Hose, H. Paulheim, P.-A. Champin, M. Maleshkova, O. Corcho, P. Ristoski, M. Alam (Eds.), The Semantic Web, Springer International Publishing, Cham, 2021, pp. 548–564.</p>
      <p>[13] Ritchie, Roser, Mispy, Ortiz-Ospina, Measuring progress towards the Sustainable Development Goals. (2023). Available at https://sdg-tracker.org/.</p>
      <p>[14] European Commission, SDG Mapper. (2023). Available at https://knowsdgs.jrc.ec.europa.eu/sdgmapper.</p>
      <p>[15] Organization for Economic Cooperation and Development, SDG Pathfinder. (2023). Available at https://sdg-pathfinder.org/.</p>
      <p>[16] I. Mandilara, E. Fotopoulou, C. M. Androna, A. Zafeiropoulos, SustainNLP Gitlab Repository. (2023). Available at https://gitlab.com/netmode/sdg-text2kg.</p>
      <p>[17] Committee on the Environment, Climate Change, and Sustainability, University of Toronto, Sustainable Development Goals (SDGs) Keywords. (2023). Available at https://sustainability.utoronto.ca/inventories/sustainable-development-goals-sdgs-keywords/.</p>
      <p>[18] SDSN Australia, New Zealand and Pacific area, Sustainable Development Goals (SDGs) Keywords. (2023). Available at https://ap-unsdsn.org/regional-initiatives/universities-sdgs/.</p>
      <p>[19] K. Juluru, H.-H. Shih, K. Murthy, P. Elnajjar, Bag-of-words technique in natural language processing: A primer for radiologists, RadioGraphics 41 (2021) 210025. doi:10.1148/rg.2021210025.</p>
      <p>[20] K. Song, X. Tan, T. Qin, J. Lu, T. Liu, MPNet: Masked and permuted pre-training for language understanding, CoRR abs/2004.09297 (2020). URL: https://arxiv.org/abs/2004.09297. arXiv:2004.09297.</p>
      <p>[21] K. Bennani-Smires, C. Musat, A. Hossmann, M. Baeriswyl, M. Jaggi, Simple unsupervised keyphrase extraction using sentence embeddings, in: Proceedings of the 22nd Conference on Computational Natural Language Learning, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 221–229. URL: https://aclanthology.org/K18-1022. doi:10.18653/v1/K18-1022.</p>
      <p>[22] OSDG, UNDP IICPSD SDG AI Lab, PPMI, OSDG community dataset (OSDG-CD), 2022. URL: https://doi.org/10.5281/zenodo.7136826. doi:10.5281/zenodo.7136826.</p>
      <p>[23] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, ArXiv abs/1810.04805 (2019).</p>
      <p>[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.</p>
      <p>[25] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, CoRR abs/1906.08237 (2019). URL: http://arxiv.org/abs/1906.08237. arXiv:1906.08237.</p>
      <p>[26] I. Mandilara, E. Fotopoulou, C. M. Androna, A. Zafeiropoulos, SDGDetector Gitlab Repository. (2023). Available at https://gitlab.com/netmode/sdg-detector.</p>
      <p>[27] I. Loshchilov, F. Hutter, Fixing weight decay regularization in Adam, CoRR abs/1711.05101 (2017). URL: http://arxiv.org/abs/1711.05101. arXiv:1711.05101.</p>
      <p>[28] I. Mandilara, E. Fotopoulou, C. M. Androna, A. Zafeiropoulos, SustainGraph Gitlab Repository. (2023). Available at https://gitlab.com/netmode/sustaingraph.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kjaerulff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Donnelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Muggah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Realini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kieselbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Waller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moloney-Kitts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilligan</surname>
          </string-name>
          ,
          <article-title>Transforming our world: Implementing the 2030 agenda through sustainable development goal indicators</article-title>
          ,
          <source>Journal of Public Health Policy</source>
          <volume>37</volume>
          (
          <year>2016</year>
          )
          <fpage>13</fpage>
          -
          <lpage>31</lpage>
          . doi:10.1057/s41271-016-0002-7.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fotopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mandilara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zafeiropoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Laspidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Adamos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koundouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papavassiliou</surname>
          </string-name>
          ,
          <article-title>Sustaingraph: A knowledge graph for tracking the progress and the interlinking among the sustainable development goals' targets</article-title>
          ,
          <source>Frontiers in Environmental Science</source>
          <volume>10</volume>
          (
          <year>2022</year>
          ). URL: https://www.frontiersin.org/articles/10.3389/fenvs.2022.1003599. doi:10.3389/fenvs.2022.1003599.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] ARSINOE project,
          <article-title>ARSINOE H2020 project: Climate Resilient Regions Through Systemic Solutions and Innovations</article-title>
          . (
          <year>2023</year>
          ). Available at https://arsinoe-project.eu/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Horowitz</surname>
          </string-name>
          ,
          <article-title>Paris agreement</article-title>
          ,
          <source>International Legal Materials</source>
          <volume>55</volume>
          (
          <year>2016</year>
          )
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          . doi:10.1017/S0020782900004253.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <collab>United Nations</collab>
          , Climate Action, United Nations,
          <article-title>All About the NDCs</article-title>
          . (
          <year>2023</year>
          ). Available at https://www.un.org/en/climatechange/all-about-ndcs.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dusík</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bond</surname>
          </string-name>
          ,
          <article-title>Environmental assessments and sustainable finance frameworks: will the eu taxonomy change the mindset over the contribution of eia to sustainable development?</article-title>
          ,
          <source>Impact Assessment and Project Appraisal</source>
          <volume>40</volume>
          (
          <year>2022</year>
          )
          <fpage>90</fpage>
          -
          <lpage>98</lpage>
          . URL: https://doi.org/10.1080/14615517.2022.2027609. doi:10.1080/14615517.2022.2027609.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vacca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mantegazza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Capua</surname>
          </string-name>
          ,
          <article-title>Natural language processing and network analysis provide novel insights on policy and scientific discourse around Sustainable Development Goals</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          22427. URL: https://www.nature.com/articles/s41598-021-01801-6. doi:10.1038/s41598-021-01801-6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Matsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Suzuki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kitai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Haga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Masuhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kawakubo</surname>
          </string-name>
          ,
          <article-title>A natural language processing model for supporting sustainable development goals: translating semantics, visualizing nexus, and connecting stakeholders</article-title>
          ,
          <source>Sustainability Science</source>
          <volume>17</volume>
          (
          <year>2022</year>
          )
          <fpage>969</fpage>
          -
          <lpage>985</lpage>
          . URL: https://doi.org/10.1007/s11625-022-01093-3. doi:10.1007/s11625-022-01093-3.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Guisiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chiky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>de Mello</surname>
          </string-name>
          ,
          <article-title>SDG-Meter: a deep learning based tool for automatic text classification of the Sustainable Development Goals</article-title>
          , in:
          <source>ACIIDS: 14th Asian Conference on Intelligent Information and Database Systems</source>
          , Ho Chi Minh, Vietnam,
          <year>2022</year>
          . URL: https://hal.science/hal-03738404.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Meier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. U.</given-names>
            <surname>Wulff</surname>
          </string-name>
          ,
          <article-title>text2sdg: An R package to monitor sustainable development goals from text</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2110.05856. doi:10.48550/ARXIV.2110.05856.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pukelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bautista-Puig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Statulevičiūtė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stančiauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dikmener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Akylbekova</surname>
          </string-name>
          ,
          <article-title>OSDG 2.0: a multilingual tool for classifying text data by UN sustainable development goals (SDGs)</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2211.11252. doi:10.48550/ARXIV.2211.11252.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Klarman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Helton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lovell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haczek</surname>
          </string-name>
          , A knowledge
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>