<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Automatic GRI-SDG Annotation and LLM-Based Filtering for Sustainability Reports</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seyed Alireza Mousavian Anaraki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <addr-line>Via del Politecnico 1, 00133, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Sustainability reports are often aligned with frameworks such as the Global Reporting Initiative (GRI) and the Sustainable Development Goals (SDGs), but large-scale, paragraph-level annotation remains a challenge. This paper introduces a fully automated pipeline that generates weak supervision by linking report paragraphs to GRI and SDG categories using structured content indices, official GRI-SDG mappings, and semantic similarity scoring. To mitigate the noise inherent in automatic annotation, we employ an instruction-tuned large language model (LLaMA 3.1) to filter assigned labels based on paragraph relevance. We evaluate the quality of our annotations through downstream SDG classification tasks on the OSDG Community Dataset, showing that LLM-based filtering aligns closely with human consensus and significantly improves model performance. Our results demonstrate that combining pruned, automatically annotated data with human-labeled examples leads to more accurate and robust SDG classification, supporting scalable, interpretable sustainability analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Sustainability Reporting</kwd>
        <kwd>Sustainable Development Goals</kwd>
        <kwd>Global Reporting Initiative</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>salary and remuneration of women to men.”</p>
      <p>This example illustrates how individual report paragraphs can be meaningfully aligned with both the SDG and GRI frameworks; however, performing this mapping at scale is non-trivial. The full task involves 17 SDGs and 33 GRI standard codes (each with multiple disclosures), yielding hundreds of potential (GRI, SDG) combinations and significant ambiguity in narrative text. Addressing this challenge requires a systematic approach that can constrain the search space while preserving semantic relevance.</p>
      <p>
        Our method bridges the gap between structured sustainability frameworks and unstructured report narratives, enabling large-scale and systematic annotation of disclosures. Concretely, we restrict the annotation search space by focusing on report pages linked to GRI standards in the content index, and further constrain possible annotations using established mappings between GRI codes and SDGs. This substantially reduces ambiguity and the combinatorial complexity inherent in considering all possible code pairs. To assign labels at the paragraph level, we compute semantic similarity between each paragraph and the textual definitions of GRI disclosures and SDG targets, using pre-trained sentence encoders [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]. This allows us to rank and select the most plausible (GRI, SDG) annotation pairs, resulting in a high-confidence, automatically annotated dataset.
      </p>
      <p>
        Despite these constraints, unsupervised annotation methods—especially those based on bootstrapping and semantic similarity—can introduce noisy or weakly aligned labels. To address this, we propose a pruning strategy that further refines annotation quality. Specifically, we employ an instruction-tuned large language model (LLM), such as LLaMA 3.1 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], to assess the contextual fit of each paragraph-label pair. The model is prompted to answer, in a binary fashion, whether the proposed annotation is relevant to the given paragraph. This step filters out misaligned pairs and improves the reliability of the final dataset for downstream sustainability analysis. While our implementation uses LLaMA 3.1, the approach is compatible with other instruction-tuned LLMs.
      </p>
      <p>
        Directly assessing the quality of unsupervised annotations is inherently challenging due to the lack of ground-truth labels at scale. To address this, we adopt an indirect evaluation strategy: we train a supervised classifier on our pruned automatically annotated dataset and assess its performance on a well-established benchmark, the OSDG Community Dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our working hypothesis is that if the inclusion of pruned automatically annotated data leads to improved classification performance on the OSDG benchmark, then these data contribute useful information (although our method generates both SDG and GRI labels, we focus on SDG evaluation in this work; joint assessment of SDG and GRI annotations is left for future research). Preliminary results confirm that supplementing human-annotated data with pruned automatically annotated examples consistently improves classification accuracy, particularly for challenging or ambiguous texts.
      </p>
      <p>We further evaluate the effectiveness of our pruning strategy through two complementary analyses. First, we leverage the structure of the OSDG Community Dataset, in which each text is associated not only with an SDG label but also with an agreement score, reflecting the proportion of annotators who endorsed the assigned label. By applying our LLM-based filtering method to OSDG, we examine the correlation between human consensus and the LLM’s filtering decisions. Intuitively, a reliable pruning system should tend to retain annotations with high human agreement and filter more aggressively when annotator consensus is low, as these instances are more likely to be ambiguous or noisy. Our results show a clear alignment: paragraphs with high agreement scores are more frequently retained, while those with lower consensus are more likely to be discarded. Inspired by this analysis, we also examine the pruning behavior on automatically annotated data. We find a consistent trend: as the semantic similarity between a paragraph and its paired GRI-SDG labels increases, a larger proportion of annotations is retained. This suggests that LLaMA’s filtering decisions are guided by semantic alignment, reinforcing the effectiveness of our similarity-based scoring approach for assessing label relevance.</p>
      <p>Second, we directly compare downstream performance when training models on data with and without LLM-based filtering. Across all configurations, we observe that pruning improves overall classification accuracy. These findings suggest that the pruning step not only aligns with human judgments but also consistently enhances the utility of the resulting training data for sustainability text classification.</p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews the relevant literature. Section 3 introduces our automatic annotation and pruning methodology. Section 4 outlines the experimental setup and presents our evaluation results. Finally, Section 5 concludes the paper and discusses directions for future research.</p>
    </sec>
    <sec id="sec-related-work">
      <title>2. Related Work</title>
      <p>
        Sustainability Reporting Frameworks. Sustainability reporting is increasingly guided by global frameworks such as the United Nations Sustainable Development Goals (SDGs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the Global Reporting Initiative (GRI, https://www.globalreporting.org/standards/), and Environmental, Social, and Governance (ESG) principles [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The 2030 Agenda outlines 17 SDGs and 169 targets addressing major global development challenges [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], while the GRI, established in 1997, offers a structured framework for reporting economic, environmental, and social impacts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It provides standardized disclosures—both required and recommended—that help organizations systematically communicate their contributions. To support SDG integration, the Action Platform Reporting on the SDGs (https://www.globalreporting.org/reporting-support/goals-and-targets-database/), in collaboration with GRI, offers a database that maps SDG targets to specific GRI codes and disclosures, enabling companies to identify relevant reporting items and align strategic goals with operational metrics.
      </p>
      <p>
        Large Language Models in Sustainability Reporting. Large Language Models (LLMs) have become powerful tools in natural language processing, offering innovative solutions to longstanding challenges in sustainability reporting. Their high accuracy and adaptability make them well-suited for extracting structured data, performing textual analysis, and identifying misleading green claims [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        LLMs are typically categorized into three main types based on their neural architecture: encoder-only, decoder-only, and encoder-decoder models [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Encoder-only models, such as BERT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], focus on encoding the input text into rich contextual representations using self-attention mechanisms. These models are especially effective for classification and interpretive tasks like sentiment analysis and named entity recognition, and they dominate sustainability NLP applications due to their high performance on classification tasks. They have been widely used for aligning corporate texts with SDGs [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
        ], GRI [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and ESG [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21, 22</xref>
        ]. Models like BERT, RoBERTa, SBERT, MiniLM, and DistilBERT are frequently fine-tuned to extract structured insights and detect misleading green claims using ClimateBERT [23] and MacBERT [24]. For example, ESG-KIBERT [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] employs an encoder-only architecture specifically designed for industry-specific ESG evaluation, demonstrating how domain adaptation can improve the performance of deep language models in sustainability contexts.
      </p>
      <p>
        Decoder-only models, such as LLaMA [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], operate auto-regressively by predicting one token at a time conditioned on prior outputs. This makes them suitable for generative tasks such as text completion, summarization, and dialogue generation. Recent studies underscore the growing role of decoder-only models in sustainability reporting, particularly through their integration with retrieval-augmented generation (RAG) techniques [25], as demonstrated in ESG applications by Bronzini et al. [26] and Zou et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Additionally, Jain et al. [27] highlighted the effectiveness of GPT-3.5 in addressing ESG-related prompts and identifying nuanced sustainability issues.
      </p>
      <p>Encoder-decoder models like BART [28] combine text understanding and generation, making them well-suited for complex tasks such as summarization. Though less commonly used, they have proven effective in sustainability reporting—e.g., BART was used for SDG multi-label categorization [29].</p>
      <p>Following the trends outlined above, our approach assigns task-specific roles to decoder-only and encoder-only LLMs based on their architectural strengths. We use LLaMA 3.1—an instruction-tuned decoder-only model—to filter noisy or weakly aligned GRI-SDG annotations through generative prompting, guided by an embedding-based similarity scoring process. Specifically, we use a pre-trained MPNet model to compute alignment scores between each paragraph and its associated GRI-SDG label descriptions, allowing us to generate more semantically grounded annotations by prioritizing label pairs with the highest similarity. For downstream classification, we fine-tune a BERT-based encoder model for multi-label SDG prediction, capitalizing on its effectiveness in structured, discriminative tasks. This design reflects a practical alignment between model capabilities and task requirements in the context of sustainability reporting. Moreover, by improving the quality of both human and automatically annotated data, our approach contributes to more reliable alignment with established reporting standards such as the SDGs and GRI, thereby supporting more transparent and accountable sustainability disclosures.</p>
    </sec>
    <sec id="sec-method">
      <title>3. Automatic Paragraph Annotation via Structured Indices, Semantic Similarity, and LLM Filtering</title>
      <p>We present a multi-step pipeline for automatically annotating paragraphs from sustainability reports with both GRI (Global Reporting Initiative) and SDG (Sustainable Development Goals) labels. The process leverages document structure, official mappings, and semantic similarity, with a final human-like filter based on a large language model.</p>
      <p>Paragraph Segmentation and Preprocessing. Each report is parsed with a layout-aware tool (e.g., PyMuPDF, https://github.com/pymupdf/PyMuPDF), extracting all text blocks and filtering out headers, footers, and fragments. Only blocks of at least 20 words are retained as candidate paragraphs.</p>
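      <p>As a minimal sketch of the preprocessing step just described (not the authors’ code): the heuristics and helper names below are illustrative assumptions, and the commented lines show roughly how block texts could be obtained with PyMuPDF.</p>

```python
def keep_paragraph(text: str, min_words: int = 20) -> bool:
    """Heuristic filter for candidate paragraphs (illustrative only):
    drop page numbers and short fragments such as headers and footers,
    keeping only blocks of at least `min_words` words."""
    text = " ".join(text.split())   # normalize whitespace
    if text.isdigit():              # bare page number
        return False
    return len(text.split()) >= min_words


# With PyMuPDF, block texts could be obtained roughly as:
#   import fitz  # PyMuPDF
#   blocks = [b[4] for page in fitz.open("report.pdf")
#             for b in page.get_text("blocks")]
def extract_candidates(blocks: list[str]) -> list[str]:
    """Normalize and filter raw text blocks into candidate paragraphs."""
    return [" ".join(b.split()) for b in blocks if keep_paragraph(b)]
```

      <p>The 20-word cutoff mirrors the threshold stated above; a real implementation would likely add layout-based cues (font size, position on page) to the purely textual heuristics shown here.</p>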
      <sec id="sec-1-1">
        <title>5https://github.com/pymupdf/PyMuPDF</title>
        <p>• The candidate set as all GRI codes explicitly
linked to  via the content index.
• The alternative set as all remaining GRI codes
not mentioned in the index for  , but potentially
relevant based on semantic content.</p>
      </sec>
      <sec id="sec-1-2">
        <title>This produces two filtered sets of candidate triples: those based on content-indexed GRI codes, and those based on alternative codes. For the running example, the triples derived from the content index are:</title>
        <p>For example, a typical extracted paragraph might be: Quality Education)—and (ii) it guarantees that
down“In 2023, CompanyX reduced its greenhouse gas emissions stream semantic similarity scoring is only performed
by 15% by switching to renewable energy sources. The between a paragraph and label pairs with a recognized
organization remains committed to transparent reporting conceptual connection, thus improving interpretability
of its climate targets and actions.” and actionability for sustainability analysis.
Given a paragraph , we use its associated GRI
Generating Candidate and Alternative Labels. codes—those directly referenced in the content index
Most reports include a GRI content index, a table authored (candidate set) and all other codes not mentioned
(alterby the company that indicates, for each GRI disclosure native set)—to generate all valid triples (, , ), where
code (e.g., GRI 305: Emissions, GRI 302: Energy),  ∈ ℳ(). For example, as above:
the specific pages where the disclosure is addressed.</p>
        <p>For each paragraph  occurring on page  , we define:
• GRI 305 maps to SDG 13 (Climate Action),
• GRI 302 maps to both SDG 13 and SDG 7
(Affordable and Clean Energy).</p>
        <p>
          Continuing the example, suppose the GRI content in- • (paragraph, GRI 305, SDG 13),
dex indicates that the pages containing the paragraph • (paragraph, GRI 302, SDG 13),
above refer to GRI 305 (Emissions) and GRI 302 (En- • (paragraph, GRI 302, SDG 7).
ergy). These two codes are included in the candidate set
for the paragraph, as they are explicitly claimed by the At this stage, all generated triples are semantically
plaureport on that page. All remaining GRI codes—among sible and ready for embedding-based similarity scoring.
the approximately 33 topical standards defined in the
GRI framework—are considered part of the alternative Semantic Similarity Ranking. Even after filtering
set. These alternatives are not mentioned in the content out irrelevant combinations via the oficial GRI→SDG
index for this page, but may still be semantically relevant mapping, each paragraph remains associated with a large
to the paragraph based on its content. Note that, due number of possible label pairs. We therefore rank all
to the broad and multi-faceted nature of sustainability remaining (paragraph, GRI, SDG) triples based on how
topics, the content index is not expected to capture all semantically aligned they are with the paragraph content.
relevant GRI standards for each page. It typically high- To quantify alignment, we use a pre-trained sentence
lights the main disclosures, while secondary or nuanced encoder (MPNet [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]) to compute cosine similarities in
themes may be omitted. By considering both the candi- embedding space. For each triple, we consider the textual
date set (directly indexed codes) and the alternative set description of the SDG target and all available disclosure
(other potentially relevant codes), our approach accounts requirements associated with the GRI code. We define
for both explicit priorities and additional associations the similarity score  (, , ) as:
present in the narrative.
        </p>
        <sec id="sec-1-2-1">
          <title>Expansion to SDG Pairs via Oficial Mapping. Each</title>
          <p>GRI code captures a specific disclosure standard (e.g.,
energy consumption, gender pay equality), while each SDG where e is the embedding of the paragraph,  is the
describes a broader societal goal (e.g., SDG 7: Afordable set of disclosure texts for GRI code , and  is the set
and Clean Energy; SDG 5: Gender Equality). To bridge of textual definitions for SDG  (typically the goal and
these conceptual levels in a principled way, we use the its targets). This formulation favors pairs for which both
oficial mapping 6 ℳ, which links each GRI code only to components—GRI and SDG—are independently relevant
semantically relevant SDG targets. to the paragraph: if either component is weakly aligned,</p>
          <p>This mapping is essential for two reasons: (i) it the product score will be low. This reflects the intuition
avoids generating irrelevant or misleading (GRI, SDG) that a good annotation should simultaneously satisfy
pairs—since not every combination is meaningful in prac- both frameworks. For example, suppose a paragraph
tice (e.g., GRI 305: Emissions is unrelated to SDG 4: discusses emissions reduction due to renewable energy
adoption. We obtain:
 (, , ) = max cos(e, e) · m∈ax cos(e, e)
∈
6https://www.globalreporting.org/reporting-support/
goals-and-targets-database/
• cos(paragraph, GRI 305) = 0.92 (strong
match with “Reduction of GHG emissions”),
• cos(paragraph, SDG 13) = 0.88 (climate ac- Permissive Policy: This policy is designed to maximize
tion), recall and accommodate semantic ambiguity—useful for
• cos(paragraph, GRI 302) = 0.69 (energy re- exploratory analysis or downstream expert curation.</p>
          <p>duction consumption),
• cos(paragraph, SDG 7) = 0.54 (clean energy).
1. Find the candidate triple with the highest score</p>
          <p>and set a threshold at half that value.
2. Retain up to two candidate triples whose scores
exceed this threshold (to account for ties or
nearequivalent topics).
3. Always include the best-scoring alternative triple,
regardless of its absolute score, ensuring that
strong semantic signals outside the index are
never discarded a priori.</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>As a result, this policy can return up to three triples (two</title>
        <p>candidates plus one alternative) for a given paragraph,
allowing for richer, multi-label annotation. In summary,
the conservative policy favors precision, whereas the
permissive policy promotes recall and label diversity.
The resulting joint scores are: (GRI 305, SDG 13): 0.92×
0.88 = 0.81, (GRI 302, SDG 13): 0.69 × 0.88 = 0.61,
(GRI 302, SDG 7): 0.69 × 0.54 = 0.37.</p>
        <p>Notably, we compute these scores for both candidate
and alternative triples. While candidate triples originate
from the GRI content index (i.e., the report explicitly
claims these topics are discussed on the page),
alternative triples arise from GRI codes not mentioned in the
index. Though potentially less reliable, alternative labels
may capture omissions or relevant but unindexed
content. Hence, if a triple from the alternative set obtains
a substantially higher semantic score than those in the
candidate set, it may signal that the original index missed
something. In this case, our strategy allows the model
to retain the best alternative triple. While semantic
similarity ofers a useful initial filter, it may miss deeper
context or introduce noise. To address this, we add later
an LLM-based filtering step for more robust alignment.</p>
        <sec id="sec-1-3-1">
          <title>Final Filtering with LLM Relevance Assessment</title>
          <p>While semantic similarity models are powerful for linking
text to structured concepts, they can sometimes
overestimate relevance—especially for vague, generic, or
multitopic paragraphs. For example, a paragraph mentioning
“sustainable growth” could weakly match almost any
Disambiguation Policies: Conservative and Permis- SDG, leading to noisy or spurious labels even after
caresive. After ranking all (paragraph, GRI, SDG) triples by ful mapping and scoring.
joint semantic similarity, the final step is to select which To further improve annotation quality, we add a
fiannotations to retain for each paragraph. This choice nal “human-like” relevance check using a large language
must balance precision (avoiding spurious labels) with model (LLM) such as LLaMA 3.1 Instruct. This step serves
recall (capturing genuine but possibly under-indexed con- two key purposes: i) it filters out weak, contextually
intent). We propose two complementary disambiguation appropriate, or overly broad matches that the
similaritypolicies, which reflect diferent trade-ofs between cover- based method might miss; ii) it simulates expert review
age and selectivity. at scale, bringing richer contextual understanding and
Conservative Policy: This policy is tailored for high- nuanced judgment—skills typically seen in human
annoprecision applications, where false positives are espe- tators—while maintaining automation and consistency.
cially costly. For each paragraph, we: For each retained (paragraph, GRI, SDG) triple, we
con1. Identify the best-scoring candidate triple (i.e., de- struct a structured prompt (shown in Figure 1) presenting
rived from the GRI codes listed in the report’s the paragraph and the oficial descriptions of both labels.
index for the relevant page). The LLM is asked to answer—based solely on the
evi2. Identify the best-scoring alternative triple (i.e., dence given—whether the label pair is truly relevant to
derived from any other valid (GRI, SDG) pair for the paragraph content. Only those triples receiving a
the paragraph). “Yes” are included in the final dataset.
3. If the candidate triple’s score is greater than or For instance, a paragraph describing the company’s
equal to the alternative’s, we retain only the can- general commitment to “sustainable development” might
didate triple—reflecting high confidence in the weakly match several SDGs and GRIs in embedding space,
company’s index. but only a focused LLM assessment can determine if a
spe4. If the alternative triple has a higher score, we cific (GRI, SDG) pair is truly justified by the text. In this
return both the best candidate and the best alter- way, the LLM acts as a high-precision, scalable
expert-innative. This accounts for possible omissions or the-loop filter. This LLM-based filtering step significantly
underreporting in the index, while maintaining reduces false positives, capturing complex connections
interpretability. and subtle mismatches that even strong embedding
models may overlook. In efect, it combines the scale and
In practice, this policy outputs either one or two annota- speed of automated annotation with the contextual depth
tion triples per paragraph.</p>
          <p>You are a sustainability evaluation assistant.
Decide if the following GRI–SDG pair is relevant to
the paragraph.</p>
          <p>Paragraph: “Paragraph content here”
GRI [GRI Code]: GRI Description here
SDG [SDG Name]: SDG Description here
Only reply with one word: Yes or No.</p>
          <p>Format:
Answer: Yes
(or)</p>
          <p>Answer: No
of human reasoning, resulting in a cleaner, more
trustworthy annotated dataset ready for downstream analysis
or model training.</p>
        </sec>
      </sec>
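      <p>The scoring and disambiguation logic described above can be sketched compactly; this is an illustrative draft, not the authors’ implementation. The data layout (lists of (triple, score) pairs) and function names are assumptions, and in practice the embeddings would come from a pre-trained sentence encoder such as MPNet rather than the toy vectors used here.</p>

```python
from math import sqrt


def cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sigma(e_p, disclosure_embs, target_embs):
    """sigma(p, g, s) = max_d cos(e_p, e_d) * max_t cos(e_p, e_t):
    the paragraph must independently match both the GRI disclosure
    texts and the SDG target texts for the product to be high."""
    return (max(cos(e_p, e) for e in disclosure_embs)
            * max(cos(e_p, e) for e in target_embs))


def conservative(candidates, alternatives):
    """candidates/alternatives: lists of (triple, score) pairs.
    Keep the best candidate; add the best alternative only if it wins."""
    best_c = max(candidates, key=lambda ts: ts[1])
    best_a = max(alternatives, key=lambda ts: ts[1])
    return [best_c] if best_c[1] >= best_a[1] else [best_c, best_a]


def permissive(candidates, alternatives):
    """Up to two candidates scoring above half the best candidate score,
    plus the best alternative unconditionally."""
    threshold = max(s for _, s in candidates) / 2
    kept = [c for c in sorted(candidates, key=lambda ts: -ts[1])
            if c[1] > threshold][:2]
    kept.append(max(alternatives, key=lambda ts: ts[1]))
    return kept
```

      <p>For the running example (candidate joint scores 0.81, 0.61, 0.37), the conservative policy returns only the top candidate whenever it outscores every alternative, while the permissive policy keeps the top two candidates plus the best alternative, matching the one-to-two and up-to-three triple counts stated above.</p>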
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Evaluation</title>
      <p>We conduct a comprehensive experimental evaluation
to assess the efectiveness of our automatic annotation
pipeline and its LLM-based filtering component. Our
analysis focuses on two main questions: (i) does LLM
filtering produce label decisions that align with human
consensus? and (ii) how do diferent label selection policies
(conservative vs. permissive) and LLM filtering impact
the quality and utility of the resulting annotated data for
downstream SDG classification?
puted cases) to 1.0 (full agreement among annotators).</p>
      <p>We use the LLaMA 3.1 Instruct model as a post-hoc filter:
for each paragraph-SDG pair in OSDG-CD, we prompt
the model to decide if the label is relevant to the
paragraph, using the same structured format adopted in our
main pipeline. We then analyze the fraction of examples
retained (“Yes” by the LLM) across diferent agreement
intervals.</p>
      <p>Table 1 reports the frequency distribution of samples
across agreement bins, and Figure 2 visualizes the key
re4.1. LLM Filtering and Human Consensus sult: the likelihood of a sample being retained by the LLM
on OSDG-CD iflter increases monotonically with human agreement.
A natural concern when introducing LLM-based filter- In other words, pairs with high human consensus are
ing into any annotation pipeline is whether the model’s almost always preserved by the model, while those with
binary “Yes/No” relevance judgments are in fact consis- low or disputed agreement are more frequently filtered
tent with human annotation practices. While LLMs are out. This positive correlation provides strong evidence
increasingly adopted as automated evaluators or assis- that LLM-based filtering is not arbitrary, but instead
captants, there is limited empirical evidence on how closely tures a notion of relevance that closely mirrors collective
their filtering behavior tracks with actual human agree- human judgment.
ment—particularly in specialized domains such as sus- This result has two important implications. First,
tainability. To address this, we leverage the OSDG Com- it provides empirical support for using LLMs as
scalmunity Dataset (OSDG-CD), a large-scale benchmark in able, “expert-in-the-loop” filters for semantic annotation,
which each paragraph-SDG pair is annotated not only even in cases where manual adjudication would be
prowith the assigned label, but also with an explicit agree- hibitively expensive. Second, it suggests that LLMs can
ment score reflecting the proportion of human annota- help mitigate annotation noise in weakly or ambiguously
tors who supported the label assignment. This agreement labeled data—removing many of the examples that
huscore provides a direct, interpretable measure of human mans themselves would likely judge as borderline or
consensus, ranging from 0.1 (highly ambiguous or dis- unreliable. Overall, this agreement-guided analysis not
only validates our specific use of LLM filtering in the
construction of GRI-SDG training data, but also suggests
a broader role for LLMs as automatic quality controllers
in human-in-the-loop NLP pipelines.
4.2. Assessing Labeling Strategies for</p>
      <p>Automatic Paragraph Annotation
whether LLM-based filtering efectively improves the
utility of automatically annotated data, and how the choice
of annotation policy (conservative vs. permissive)
impacts downstream model performance.</p>
      <p>Experimental Setup. To systematically evaluate our
annotation pipeline, we applied it to a curated corpus of
30 sustainability reports spanning 10 sectors and 3,663 Training Simple Complex
pages. After preprocessing and paragraph segmentation, Conservative 0.762 0.737
we obtained 19,133 candidate paragraphs, of which 10,303 Conservative + LLM 0.783 0.752
were indexed by company-provided GRI content indices PPeerrmmiissssiivvee + LLM 00..768286 00..666905
and thus eligible for annotation. Annotation followed
the multi-step procedure described in Section 3: we
generated (GRI, SDG) label pairs using the official mapping, scored their semantic similarity, and selected final annotations according to either the conservative policy (high precision, at most one or two triples per paragraph) or the permissive policy (higher recall, up to three triples).</p>
      <p>Applying the conservative policy initially yielded 17,216 label pairs, which were reduced to 4,558 after LLM-based relevance filtering. The permissive policy produced a higher initial volume of annotations (30,647 label pairs), which was pruned to 7,425 after filtering with LLaMA 3.1 Instruct. This substantial reduction confirms the impact of the LLM-based step in filtering out weak or noisy annotations, ultimately improving the quality and reliability of the final labeled dataset. For evaluation, we leveraged the OSDG Community Dataset (OSDG-CD), which contains single-label SDG assignments per paragraph, validated by crowdsourced agreement scores. To ensure reliability, we defined two test splits: a Simple set (agreement = 1.0, fully unambiguous) and a Complex set (0.7 ≤ agreement ≤ 1.0). All models were trained in a multi-label setting, but evaluated using only the highest-scoring prediction per paragraph to match the OSDG single-label ground truth. As a baseline, we used a BERT-based classifier (bert-base-cased) with a standard binary cross-entropy loss for multi-label classification over the full label set, treating each label independently during training. The model was trained with an effective batch size of 16 (via gradient accumulation over 4 mini-batches of size 4), using the AdamW optimizer with a learning rate of 2 × 10−5, a weight decay of 0.1, and a linear learning-rate scheduler with a warmup ratio of 0.1, for a total of 5 training epochs. Accuracy is defined as the percentage of paragraphs for which the top predicted label matches the ground truth; since the OSDG test set provides only one true label per paragraph, this top-1 accuracy measure is equivalent to precision, recall, and F1-score, which are therefore omitted.</p>
      <p><bold>Does LLM Filtering Improve Automatically Annotated Training Data?</bold> Our first experiment tests whether LLM-based filtering improves the quality of the automatically annotated training data. Results (Table 2) indicate that both policies benefit from LLM filtering, but to different extents. The conservative policy (high precision, fewer labels) already yields reasonably strong results, but applying LLM filtering further increases accuracy by removing residual false positives. The permissive policy (higher recall, more candidate triples per paragraph) initially introduces substantially more noise, as reflected in its lower baseline accuracy; LLM filtering provides a larger relative improvement, yet even after filtering the permissive setting still lags behind the conservative one in absolute performance. This suggests that, while the LLM can mitigate a large portion of the annotation noise, excessive over-labeling (as in the permissive setting) cannot be fully corrected in post-processing, and some spurious associations may persist. In summary, LLM-based filtering systematically improves the quality of automatically generated labels, especially in the presence of noisy or overly broad candidate assignments. However, the conservative policy remains preferable in settings where downstream precision is paramount.<sup>7</sup></p>
      <p><bold>Does Adding Automatically Annotated Data Benefit Supervised Training?</bold> In a second experiment, we assessed whether supplementing human-annotated data (OSDG-CD) with LLM-pruned automatic annotations yields tangible improvements in SDG classification.</p>
      <p>Table 3: Accuracy on OSDG test sets with and without adding pruned automatic data (Cons.: conservative, Perm.: permissive).
Training | Simple | Complex
OSDG (full) | 0.917 | 0.907
OSDG + Cons. + LLM | 0.921 | 0.910
OSDG + Perm. + LLM | 0.919 | 0.909</p>
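<p>The optimization schedule described above (warmup ratio 0.1 followed by linear decay) can be sketched as follows. This is a minimal pure-Python illustration of the learning-rate curve, assuming the common decay-to-zero variant of a linear scheduler; the total step count is hypothetical and the paper's actual training code is not shown here.</p>

```python
# Sketch of a linear learning-rate schedule with warmup, using the
# hyperparameters from the text (base LR 2e-5, warmup ratio 0.1).
# TOTAL_STEPS is a hypothetical stand-in; in practice it depends on the
# dataset size, the effective batch size of 16, and the 5 epochs.
BASE_LR = 2e-5
TOTAL_STEPS = 100
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)  # warmup ratio of 0.1

def lr_at(step):
    """Learning rate after `step` optimizer steps."""
    if step < WARMUP_STEPS:
        # linear warmup from 0 up to BASE_LR
        return BASE_LR * step / WARMUP_STEPS
    # linear decay from BASE_LR down to 0 over the remaining steps
    remaining = TOTAL_STEPS - step
    return BASE_LR * max(0.0, remaining / (TOTAL_STEPS - WARMUP_STEPS))

print(lr_at(5))    # halfway through warmup
print(lr_at(10))   # warmup complete, at the base learning rate
print(lr_at(100))  # end of training
```

<p>With gradient accumulation over 4 mini-batches of size 4, one such scheduler step would correspond to one optimizer update, i.e., one effective batch of 16 examples.</p>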
      <sec id="sec-2-1">
<title>Footnote 7</title>
        <p>Note that the test set requires a single SDG per paragraph, so we evaluate our classifier by selecting only the top prediction. This
may not capture all relevant SDGs, especially for complex cases,
but gives a reasonable first estimate of performance.</p>
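<p>The top-1 evaluation protocol described in this footnote can be sketched as follows; the scores and labels below are hypothetical illustrations, not the paper's actual code or data.</p>

```python
# Minimal sketch of top-1 evaluation: the multi-label classifier emits a
# score per SDG, but the OSDG ground truth has one gold label per
# paragraph, so only the highest-scoring prediction is counted.

def top1_accuracy(score_rows, gold_labels):
    """score_rows: list of {sdg_label: score} dicts; gold_labels: list of str."""
    hits = 0
    for scores, gold in zip(score_rows, gold_labels):
        top_label = max(scores, key=scores.get)  # top-1 prediction
        hits += (top_label == gold)
    return hits / len(gold_labels)

# Hypothetical predictions for two paragraphs.
rows = [
    {"SDG-3": 0.91, "SDG-4": 0.40, "SDG-13": 0.08},
    {"SDG-3": 0.22, "SDG-7": 0.85, "SDG-13": 0.81},
]
gold = ["SDG-3", "SDG-13"]
print(top1_accuracy(rows, gold))  # 0.5: the second top label is SDG-7, not SDG-13
```

<p>Because each test paragraph has exactly one gold label, this top-1 accuracy coincides with precision, recall, and F1 under the single-prediction protocol, as noted in the text.</p>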
<p>Results in Table 3 show that, for both policies, adding pruned automatic annotations to the OSDG training set consistently increases accuracy on both the Simple and Complex test splits. While the gains are modest, they are robust across settings, confirming that our pipeline produces a useful complementary signal even in the presence of expert-labeled data. As in the previous experiment, the conservative policy remains more reliable, providing slightly higher accuracy than the permissive policy; the latter, despite contributing more examples, appears to introduce a small amount of residual noise that is not fully eliminated by LLM filtering.</p>
        <p>Table 4: Mean product similarity score for retained vs. discarded samples under conservative and permissive label selection.
Policy | Category | Retained | Discarded
Conservative | Overall | 0.434 | 0.321
Conservative | Alternatives | 0.463 | 0.351
Conservative | Candidates | 0.422 | 0.298
Permissive | Overall | 0.414 | 0.308
Permissive | Alternatives | 0.456 | 0.353
Permissive | Candidates | 0.400 | 0.283</p>
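<p>The aggregation behind Table 4 can be sketched as follows. The records are hypothetical stand-ins for the (label type, LLM decision, product similarity) triples produced by the pipeline; the real table is computed over the full automatically annotated dataset.</p>

```python
# Sketch of the Table 4 aggregation: mean product similarity for triples
# the LLM retained vs. discarded, split by label type, with an "Overall"
# row pooling both types. Records below are illustrative only.
from collections import defaultdict
from statistics import mean

records = [
    # (label_type, retained_by_llm, product_similarity) -- hypothetical
    ("Candidate", True, 0.45), ("Candidate", False, 0.28),
    ("Alternative", True, 0.48), ("Alternative", False, 0.33),
    ("Candidate", True, 0.41),
]

groups = defaultdict(list)
for label_type, kept, sim in records:
    groups[(label_type, kept)].append(sim)
    groups[("Overall", kept)].append(sim)   # pooled row

table = {key: round(mean(vals), 3) for key, vals in groups.items()}
print(table[("Candidate", True)])   # mean of 0.45 and 0.41
print(table[("Overall", False)])    # mean over all discarded triples
```
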
<p>Taken together, these findings support a dual conclusion: (1) the automatic annotation pipeline is effective for scalable SDG data generation, and (2) the interplay between the label selection policy and LLM-based filtering is crucial for balancing coverage and precision. The conservative strategy, enhanced by LLM filtering, delivers high-quality labels that boost supervised learning, while the permissive strategy is valuable for recall-oriented applications but requires careful calibration to avoid excessive noise.</p>
        <p><bold>4.3. Analysis of LLM Retention Decisions on Automatically Annotated Data</bold></p>
        <p>Having established that the LLM-based filter is well aligned with human consensus on the OSDG dataset (Section 4.1), we next analyze how the LLM's binary relevance judgments interact with the underlying semantic similarity scores in our full, automatically annotated dataset. This provides a deeper understanding of whether the LLM filter simply introduces an arbitrary bottleneck, or whether it systematically reinforces semantic quality.</p>
        <p>We consider the product similarity score, i.e., the product of the cosine similarities between a paragraph and its associated GRI and SDG descriptions (see Section 3), as a measure of semantic alignment for each candidate label. For every (paragraph, GRI, SDG) triple, we record whether the LLM filter retained the annotation ("Yes") or discarded it ("No"). Table 4 reports the mean similarity scores for retained and discarded samples, disaggregated by both label type (Candidate, Alternative) and selection policy (Conservative, Permissive).</p>
        <p>As shown, the LLM filter systematically prefers to retain labels with a higher semantic similarity to the paragraph, regardless of whether they are candidate or alternative labels, and across both policies. The effect is particularly pronounced for alternatives, which are kept only when they exhibit a strong semantic match.</p>
        <p>To further examine this relationship, we discretize the similarity scores into bins and calculate, for each bin, the proportion of samples retained by the LLM. Figure 3 presents these retention rates for the conservative (Top-1) policy, separately for candidates, alternatives, and the combined set. To ensure statistical significance, we only report bins containing at least 700 samples. This threshold was chosen empirically, based on the distribution of paragraph counts across prediction score intervals: the total number of samples in the higher-confidence intervals, i.e., those greater than 0.7 ((0.7-0.8], (0.8-0.9], and (0.9-1]), was only 272 (227 + 40 + 5). Given such low sample sizes, reporting metrics for these bins would risk statistical instability and a lack of representativeness. Setting 700 as the minimum cutoff ensures that each bin included in our analysis contains a sufficient number of samples for reliable estimation, balancing coverage across confidence intervals with the statistical reliability of the reported results.</p>
        <p>Figure 3: Proportion of (paragraph, GRI, SDG) triples retained by the LLM filter as a function of the product similarity score, binned by intervals from (0.1-0.2] to (0.6-0.7]. Results are shown for candidate, alternative, and all labels under the conservative (Top-1) policy.</p>
        <p>The figure demonstrates a clear monotonic trend: as the product similarity score increases, the probability of retention by the LLM rises sharply. For scores below 0.3, fewer than 20% of labels are retained, while for scores above 0.6 the retention rate exceeds 60%. This pattern holds for both candidates and alternatives, further supporting the conclusion that the LLM acts as a semantic relevance filter, amplifying the selectivity of the automatic annotation pipeline and systematically favoring labels with strong textual alignment.</p>
        <p>In summary, these results indicate that our LLM-based filtering mechanism is not merely an arbitrary post-processing step but an effective semantic validator: it consistently prioritizes label assignments with robust evidence in the paragraph text.</p>
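<p>The binning analysis underlying Figure 3 can be sketched as follows, with hypothetical samples and a deliberately small minimum-bin-size threshold standing in for the paper's 700-sample cutoff.</p>

```python
# Sketch of the Figure 3 computation: bin (score, kept) triples by
# product similarity and report the per-bin retention rate, skipping
# under-populated bins. The paper uses min_samples = 700; the tiny
# threshold and samples below are for illustration only.
MIN_SAMPLES = 2

def retention_by_bin(samples, edges, min_samples=MIN_SAMPLES):
    """samples: list of (score, kept); edges: ascending bin boundaries."""
    bins = {(lo, hi): [] for lo, hi in zip(edges, edges[1:])}
    for score, kept in samples:
        for (lo, hi), bucket in bins.items():
            if lo < score <= hi:  # half-open intervals (lo, hi], as in the paper
                bucket.append(kept)
                break
    return {
        interval: sum(flags) / len(flags)
        for interval, flags in bins.items()
        if len(flags) >= min_samples  # drop statistically unreliable bins
    }

samples = [(0.15, False), (0.18, False), (0.25, True), (0.28, False),
           (0.65, True), (0.68, True)]
edges = [0.1, 0.2, 0.3, 0.7]
print(retention_by_bin(samples, edges))
```

<p>On these toy samples the retention rate rises with the similarity score, mirroring the monotonic trend reported for the real data.</p>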
      </sec>
    </sec>
    <sec id="sec-3">
<title>5. Conclusion and Future Work</title>
<p>This work presents a fully automated pipeline for large-scale annotation of sustainability reports at the paragraph level, aligning text with both GRI disclosures and SDG targets. Leveraging structured metadata, official GRI-SDG mappings, semantic similarity, and an LLM-based relevance filter (LLaMA), our method offers an interpretable and scalable alternative to manual annotation. The LLM filter proves highly effective in reducing semantic noise and producing annotations that closely match human consensus.</p>
      <p>Our experiments show that LLaMA-based filtering
favors labels with high semantic similarity, aligns with
human judgments on the OSDG benchmark, and
consistently improves downstream SDG classification—even
when combined with expert-labeled data. While
permissive labeling increases coverage, it also adds noise that is
only partly corrected by LLM filtering.</p>
      <p>
        This pipeline lays the foundation for more transparent
and data-driven sustainability analytics. Future research
will focus on several open challenges. First, we aim to
expand the LLM filter to provide natural language
justifications for its decisions, improving explainability and
facilitating expert validation. We also acknowledge that
scalability may become a limitation when applying our
pipeline to thousands of reports, particularly due to the
computational cost of LLM-based filtering; addressing
this bottleneck through optimization or distillation
techniques is a key direction for future work. Second, while
our current evaluation is primarily model-based, we plan
to conduct in-depth human studies, including manual
validation of high-confidence (GRI, SDG) pairs, and direct
comparisons with prior supervised approaches [
        <xref ref-type="bibr" rid="ref16 ref18">16, 18</xref>
        ],
especially regarding the annotation of GRI codes. Third,
we envision extending our framework to cover a wider
array of sustainability and ESG standards, as well as to
support fine-grained analysis of the substance and quality of
sustainability reporting—such as distinguishing between
specific, verifiable disclosures and generic statements,
thus advancing automated detection of greenwashing.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
<p>We thank Armando Calabrese, Roberta Costa and Luigi Tiburzi for their valuable advice, insightful discussions on sustainability reporting, and for generously sharing documents that inspired and enabled this initial experimental study. We acknowledge financial support from the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU.</p>
        <p>Declaration on Generative AI: During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1] United Nations, Transforming Our World: The 2030 Agenda for Sustainable Development, A/RES/70/1, United Nations, Division for Sustainable Development, 2015.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
[2] H. Q. Ngee, A. Ganesh, M. A. N. Azmi, T. Y. Tang, M. Mukred, F. Mohammed, A. A. B. Ahmad, Environmental, social and governance (ESG) scores automation in global reporting initiative (GRI) with natural language processing, in: Proc. 2024 7th Int. Conf. Internet Appl., Protocols, and Services (NETAPPS), 2024, pp. 1-7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[3] Y. Zou, M. Shi, Z. Chen, Z. Deng, Z. Lei, Z. Zeng, S. Yang, H. Tong, L. Xiao, W. Zhou, ESGReveal: An LLM-based approach for extracting structured data from ESG reports, J. Clean. Prod. 489 (2025) 144572.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[4] H. Kang, J. Kim, Analyzing and visualizing text information in corporate sustainability reports using natural language processing methods, Appl. Sci. 12 (2022) 5614.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
[5] W. Moodaley, A. Telukdarie, A conceptual framework for subdomain specific pre-training of large language models for green claim detection, Eur. J. Sustain. Dev. 12 (2023) 319. doi:10.14207/ejsd.2023.v12n4p319.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
[6] L. Pukelis, N. Bautista-Puig, G. Statulevičiūtė, V. Stančiauskas, G. Dikmener, D. Akylbekova, OSDG 2.0: A multilingual tool for classifying text data by UN sustainable development goals (SDGs), arXiv preprint arXiv:2211.11252 (2022). Available at: https://arxiv.org/abs/2211.11252.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
[7] C. Jakob, V. Schmitt, S. Mohtaj, S. Möller, Classifying sustainability reports using companies' self-assessments, in: Future of Information and Communication Conference, Springer, 2024, pp. 547-557.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[8] I. Nechaev, D. S. Hain, Social impacts reflected in CSR reports: Method of extraction and link to firms' innovation capacity, J. Clean. Prod. 429 (2023) 139256.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
[9] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MPNet: Masked and permuted pre-training for language understanding, Adv. Neural Inf. Process. Syst. 33 (2020) 16857-16867.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. 2019 Conf. North Am. Chapter Assoc. Comput. Linguist.: Hum. Lang. Technol. (NAACL-HLT), Vol. 1 (Long and Short Papers), 2019, pp. 4171-4186.
          [21] …, social, and governance communication, Finance Res. Lett. 61 (2024) 104979. doi:10.1016/j.frl.2024.104979.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
[11] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proc. 2019 Conf. Empirical Methods Nat. Lang. Process. and 9th Int. Joint Conf. Nat. Lang. Process. (EMNLP-IJCNLP), 2019, pp. 3982-3992. doi:10.18653/v1/D19-1410.
          [22] A. Gupta, A. Chadha, V. Tewari, A natural language processing model on BERT and YAKE technique for keyword extraction on sustainability reports, IEEE Access (2024). doi:10.1109/ACCESS.2024.3352742.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
[12] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). Available at: https://arxiv.org/abs/2302.13971.
          [23] A. Vinella, M. Capetz, R. Pattichis, C. Chance, R. Ghosh, Leveraging language models to detect greenwashing, arXiv preprint arXiv:2311.01469 (2023). Available at: https://arxiv.org/abs/2311.01469.
          [24] X. Wang, X. Gao, M. Sun, Construction and analysis of corporate greenwashing index: a deep learning approach, EPJ Data Sci. 14 (2025) 1-25.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
[13] T. B. Smith, R. Vacca, L. Mantegazza, I. Capua, Natural language processing and network analysis provide novel insights on policy and scientific discourse around sustainable development goals, Sci. Rep. 11 (2021) 22427. doi:10.1038/s41598-021-01801-6.
          [25] K. Mehul, V. R. Kanagavalli, K. R. Saradha, P. N. Gowtham, M. P. Sachin, U. Surya, R. Godhandaraman, S. Girish, R. Naveen, Gen AI driven FAQ chatbot using advanced RAG architecture for querying annual reports, in: Proc. 2025 Int. Conf. Comput. Commun. Technol. (ICCCT), 2025, pp. 1-6.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
[14] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, X. Huang, Pre-trained models for natural language processing: A survey, Sci. China Technol. Sci. 63 (2020) 1872-1897. doi:10.1007/s11431-020-1647-3.
          [26] M. Bronzini, C. Nicolini, B. Lepri, A. Passerini, J. Staiano, Glitter or gold? Deriving structured insights from sustainability reports via large language models, EPJ Data Sci. 13 (2024) 41. doi:10.48550/arXiv.2310.05628.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). Available at: https://arxiv.org/abs/1810.04805.
          [27] Y. Jain, S. Gupta, S. Yalciner, Y. N. Joglekar, P. Khetan, T. Zhang, Overcoming complexity in ESG investing:
          <article-title>The role of generative ai integration</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Angin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Taşdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Yılmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Demiralp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Atay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Angin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dikmener</surname>
          </string-name>
          ,
          <article-title>A RoBERTa approach for automated processing of sustainability reports</article-title>
          ,
          <source>Sustain.</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>16139</fpage>
          . doi:10.3390/su142316139.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rockinger</surname>
          </string-name>
          ,
          <article-title>Unfolding the transitions in sustainability reporting</article-title>
          ,
          <source>Sustain.</source>
          <volume>16</volume>
          (
          <year>2024</year>
          )
          <fpage>809</fpage>
          . doi:10.3390/su16020809.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-U.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K. W.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <article-title>Assigning multiple labels of sustainable development goals to open educational resources for sustainability education</article-title>
          ,
          <source>Educ. Inf. Technol.</source>
          <volume>29</volume>
          (
          <year>2024</year>
          )
          <fpage>18477</fpage>
          -
          <lpage>18499</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hillebrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pielka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Leonhard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deußer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dilmaghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kliem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Loitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Temath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bell</surname>
          </string-name>
          , et al.,
          <article-title>sustain.AI: a recommender system to analyze sustainability reports</article-title>
          ,
          <source>in: Proc. 19th Int. Conf. Artif. Intell. Law</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>412</fpage>
          -
          <lpage>416</lpage>
          . doi:10.1145/3594536.3595131.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>ESG-KIBERT: A new paradigm in ESG evaluation using NLP and industry-specific customization</article-title>
          ,
          <source>Decis. Support Syst.</source>
          <volume>193</volume>
          (
          <year>2025</year>
          )
          <fpage>114440</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schimanski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bingler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kraus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leippold</surname>
          </string-name>
          ,
          <article-title>Bridging the gap in ESG measurement: Using NLP to quantify environmental,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] … Commun. Technol.
          <source>(ICCCT)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bronzini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nicolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lepri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Staiano</surname>
          </string-name>
          ,
          <article-title>Glitter or gold? Deriving structured insights from sustainability reports via large language models</article-title>
          ,
          <source>EPJ Data Sci.</source>
          <volume>13</volume>
          (
          <year>2024</year>
          )
          <fpage>41</fpage>
          . doi:10.48550/arXiv.2310.05628.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yalciner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Joglekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Khetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Overcoming complexity in ESG investing: The role of generative AI integration in identifying contextual ESG factors</article-title>
          ,
          <source>SSRN</source>
          (
          <year>2023</year>
          ). Available at SSRN 4495647.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , arXiv preprint arXiv:1910.13461 (
          <year>2019</year>
          ). Available at: https://arxiv.org/abs/1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <article-title>Beyond good intentions: Reporting the research landscape of NLP for social good</article-title>
          , arXiv preprint arXiv:2305.05471 (
          <year>2023</year>
          ). Available at: https://arxiv.org/abs/2305.05471.
        </mixed-citation>
      </ref>
    </ref-list>
    <sec>
      <title>Online Resources</title>
      <list list-type="bullet">
        <list-item>
          <p>OSDG Community Dataset</p>
        </list-item>
        <list-item>
          <p>United Nations Sustainable Development Goals (SDGs)</p>
        </list-item>
        <list-item>
          <p>Global Reporting Initiative (GRI)</p>
        </list-item>
        <list-item>
          <p>GRI-SDG Mapping</p>
        </list-item>
      </list>
    </sec>
  </back>
</article>