<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Automatic GRI-SDG Annotation and LLM-Based Filtering for Sustainability Reports</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seyed Alireza Mousavian Anaraki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <addr-line>Via del Politecnico 1, 00133, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Sustainability reports are often aligned with frameworks such as the Global Reporting Initiative (GRI) and the Sustainable Development Goals (SDGs), but large-scale, paragraph-level annotation remains a challenge. This paper introduces a fully automated pipeline that generates weak supervision by linking report paragraphs to GRI and SDG categories using structured content indices, official GRI-SDG mappings, and semantic similarity scoring. To mitigate the noise inherent in automatic annotation, we employ an instruction-tuned large language model (LLaMA 3.1) to filter assigned labels based on paragraph relevance. We evaluate the quality of our annotations through downstream SDG classification tasks on the OSDG Community Dataset, showing that LLM-based filtering aligns closely with human consensus and significantly improves model performance. Our results demonstrate that combining pruned, automatically annotated data with human-labeled examples leads to more accurate and robust SDG classification, supporting scalable, interpretable sustainability analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Sustainability Reporting</kwd>
        <kwd>Sustainable Development Goals</kwd>
        <kwd>Global Reporting Initiative</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>salary and remuneration of women to men.”</p>
      <p>This example illustrates how individual report paragraphs can be meaningfully aligned with both the SDG and GRI frameworks; however, performing this mapping at scale is non-trivial. The full task involves 17 SDGs and 33 GRI standard codes (each with multiple disclosures), yielding hundreds of potential (GRI, SDG) combinations and significant ambiguity in narrative text. Addressing this challenge requires a systematic approach that can constrain the search space while preserving semantic relevance.</p>
      <p>
        Our method bridges the gap between structured sustainability frameworks and unstructured report narratives, enabling large-scale and systematic annotation of disclosures. Concretely, we restrict the annotation search space by focusing on report pages linked to GRI standards in the content index, and further constrain possible annotations using established mappings between GRI codes and SDGs. This substantially reduces ambiguity and the combinatorial complexity inherent in considering all possible code pairs. To assign labels at the paragraph level, we compute semantic similarity between each paragraph and the textual definitions of GRI disclosures and SDG targets, using pre-trained sentence encoders [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]. This allows us to rank and select the most plausible (GRI, SDG) annotation pairs, resulting in a high-confidence, automatically annotated dataset.
      </p>
      <p>
        Despite these constraints, unsupervised annotation methods—especially those based on bootstrapping and semantic similarity—can introduce noisy or weakly aligned labels. To address this, we propose a pruning strategy that further refines annotation quality. Specifically, we employ an instruction-tuned large language model (LLM), such as LLaMA 3.1 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], to assess the contextual fit of each paragraph-label pair. The model is prompted to answer, in a binary fashion, whether the proposed annotation is relevant to the given paragraph. This step filters out misaligned pairs and improves the reliability of the final dataset for downstream sustainability analysis. While our implementation uses LLaMA 3.1, the approach is compatible with other instruction-tuned LLMs.
      </p>
      <p>
        Directly assessing the quality of unsupervised annotations is inherently challenging due to the lack of ground-truth labels at scale. To address this, we adopt an indirect evaluation strategy: we train a supervised classifier on our pruned automatically annotated dataset and assess its performance on a well-established benchmark, the OSDG Community Dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our working hypothesis is that if the inclusion of pruned automatically annotated data leads to improved classification performance on the OSDG benchmark, then these data contribute useful information (although our method generates both SDG and GRI labels, we focus on SDG evaluation in this work; joint assessment of SDG and GRI annotations is left for future research). Preliminary results confirm that supplementing human-annotated data with pruned automatically annotated examples consistently improves classification accuracy, particularly for challenging or ambiguous texts.
      </p>
      <p>We further evaluate the effectiveness of our pruning strategy through two complementary analyses. First, we leverage the structure of the OSDG Community Dataset, in which each text is associated not only with an SDG label but also with an agreement score, reflecting the proportion of annotators who endorsed the assigned label. By applying our LLM-based filtering method to OSDG, we examine the correlation between human consensus and the LLM’s filtering decisions. Intuitively, a reliable pruning system should tend to retain annotations with high human agreement and filter more aggressively when annotator consensus is low, as these instances are more likely to be ambiguous or noisy. Our results show a clear alignment: paragraphs with high agreement scores are more frequently retained, while those with lower consensus are more likely to be discarded. Inspired by this analysis, we also examine the pruning behavior on automatically annotated data. We find a consistent trend: as the semantic similarity between a paragraph and its paired GRI-SDG labels increases, a larger proportion of annotations is retained. This suggests that LLaMA’s filtering decisions are guided by semantic alignment, reinforcing the effectiveness of our similarity-based scoring approach for assessing label relevance.</p>
      <p>Second, we directly compare downstream performance when training models on data with and without LLM-based filtering. Across all configurations, we observe that pruning improves overall classification accuracy. These findings suggest that the pruning step not only aligns with human judgments but also consistently enhances the utility of the resulting training data for sustainability text classification.</p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews the relevant literature. Section 3 introduces our automatic annotation and pruning methodology. Section 4 outlines the experimental setup and presents our evaluation results. Finally, Section 5 concludes the paper and discusses directions for future research.</p>
    </sec>
    <sec id="sec-related-work">
      <title>2. Related Work</title>
      <p>
        Sustainability Reporting Frameworks. Sustainability reporting is increasingly guided by global frameworks such as the United Nations Sustainable Development Goals (SDGs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the Global Reporting Initiative (GRI, https://www.globalreporting.org/standards/), and Environmental, Social, and Governance (ESG) principles [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The 2030 Agenda outlines 17 SDGs and 169 targets addressing major global development challenges [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], while the GRI, established in 1997, offers a structured framework for reporting economic, environmental, and social impacts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It provides standardized disclosures—both required and recommended—that help organizations systematically communicate their contributions. To support SDG integration, the Action Platform Reporting on the SDGs (https://www.globalreporting.org/reporting-support/goals-and-targets-database/), in collaboration with GRI, offers a database that maps SDG targets to specific GRI codes and disclosures, enabling companies to identify relevant reporting items and align strategic goals with operational metrics.
      </p>
      <p>
        Large Language Models in Sustainability Reporting. Large Language Models (LLMs) have become powerful tools in natural language processing, offering innovative solutions to longstanding challenges in sustainability reporting. Their high accuracy and adaptability make them well-suited for extracting structured data, performing textual analysis, and identifying misleading green claims [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        LLMs are typically categorized into three main types based on their neural architecture: encoder-only, decoder-only, and encoder-decoder models [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Encoder-only models, such as BERT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], focus on encoding the input text into rich contextual representations using self-attention mechanisms. These models are especially effective for classification and interpretive tasks like sentiment analysis and named entity recognition, and they dominate sustainability NLP applications due to their high performance on classification tasks. They have been widely used for aligning corporate texts with SDGs [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
        ], GRI [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and ESG [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21, 22</xref>
        ]. Models like BERT, RoBERTa, SBERT, MiniLM, and DistilBERT are frequently fine-tuned to extract structured insights and detect misleading green claims using ClimateBERT [23] and MacBERT [24]. For example, ESG-KIBERT [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] employs an encoder-only architecture specifically designed for industry-specific ESG evaluation, demonstrating how domain adaptation can improve the performance of deep language models in sustainability contexts.
      </p>
      <p>
        Decoder-only models, such as LLaMA [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], operate auto-regressively by predicting one token at a time conditioned on prior outputs. This makes them suitable for generative tasks such as text completion, summarization, and dialogue generation. Recent studies underscore the growing role of decoder-only models in sustainability reporting, particularly through their integration with retrieval-augmented generation (RAG) techniques [25], as demonstrated in ESG applications by Bronzini et al. [26] and Zou et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Additionally, Jain et al. [27] highlighted the effectiveness of GPT-3.5 in addressing ESG-related prompts and identifying nuanced sustainability issues.
      </p>
      <p>Encoder-decoder models like BART [28] combine text understanding and generation, making them well-suited for complex tasks such as summarization. Though less commonly used, they have proven effective in sustainability reporting—e.g., BART was used for SDG multi-label categorization [29].</p>
      <p>Following the trends outlined above, our approach assigns task-specific roles to decoder-only and encoder-only LLMs based on their architectural strengths. We use LLaMA 3.1—an instruction-tuned decoder-only model—to filter noisy or weakly aligned GRI-SDG annotations through generative prompting, guided by an embedding-based similarity scoring process. Specifically, we use a pre-trained MPNet model to compute alignment scores between each paragraph and its associated GRI-SDG label descriptions, allowing us to generate more semantically grounded annotations by prioritizing label pairs with the highest similarity. For downstream classification, we fine-tune a BERT-based encoder model for multi-label SDG prediction, capitalizing on its effectiveness in structured, discriminative tasks. This design reflects a practical alignment between model capabilities and task requirements in the context of sustainability reporting. Moreover, by improving the quality of both human and automatically annotated data, our approach contributes to more reliable alignment with established reporting standards such as the SDGs and GRI, thereby supporting more transparent and accountable sustainability disclosures.</p>
    </sec>
    <sec id="sec-method">
      <title>3. Automatic Paragraph Annotation via Structured Indices, Semantic Similarity, and LLM Filtering</title>
      <p>We present a multi-step pipeline for automatically annotating paragraphs from sustainability reports with both GRI (Global Reporting Initiative) and SDG (Sustainable Development Goals) labels. The process leverages document structure, official mappings, and semantic similarity, with a final human-like filter based on a large language model.</p>
      <p>Paragraph Segmentation and Preprocessing. Each report is parsed with a layout-aware tool (e.g., PyMuPDF, https://github.com/pymupdf/PyMuPDF), extracting all text blocks and filtering out headers, footers, and fragments. Only blocks of at least 20 words are retained as candidate paragraphs.</p>
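      <p>As a minimal sketch of the preprocessing step just described (not the authors’ code): the heuristics and helper names below are illustrative assumptions, and the commented lines show roughly how block texts could be obtained with PyMuPDF.</p>

```python
def keep_paragraph(text: str, min_words: int = 20) -> bool:
    """Heuristic filter for candidate paragraphs (illustrative only):
    drop page numbers and short fragments such as headers and footers,
    keeping only blocks of at least `min_words` words."""
    text = " ".join(text.split())   # normalize whitespace
    if text.isdigit():              # bare page number
        return False
    return len(text.split()) >= min_words


# With PyMuPDF, block texts could be obtained roughly as:
#   import fitz  # PyMuPDF
#   blocks = [b[4] for page in fitz.open("report.pdf")
#             for b in page.get_text("blocks")]
def extract_candidates(blocks: list[str]) -> list[str]:
    """Normalize and filter raw text blocks into candidate paragraphs."""
    return [" ".join(b.split()) for b in blocks if keep_paragraph(b)]
```

      <p>The 20-word cutoff mirrors the threshold stated above; a real implementation would likely add layout-based cues (font size, position on page) to the purely textual heuristics shown here.</p>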
      <sec id="sec-1-1">
        <title>5https://github.com/pymupdf/PyMuPDF</title>
        <p>• The candidate set as all GRI codes explicitly
linked to  via the content index.
• The alternative set as all remaining GRI codes
not mentioned in the index for  , but potentially
relevant based on semantic content.</p>
      </sec>
      <sec id="sec-1-2">
        <title>This produces two filtered sets of candidate triples: those based on content-indexed GRI codes, and those based on alternative codes. For the running example, the triples derived from the content index are:</title>
        <p>For example, a typical extracted paragraph might be: Quality Education)—and (ii) it guarantees that
down“In 2023, CompanyX reduced its greenhouse gas emissions stream semantic similarity scoring is only performed
by 15% by switching to renewable energy sources. The between a paragraph and label pairs with a recognized
organization remains committed to transparent reporting conceptual connection, thus improving interpretability
of its climate targets and actions.” and actionability for sustainability analysis.
Given a paragraph , we use its associated GRI
Generating Candidate and Alternative Labels. codes—those directly referenced in the content index
Most reports include a GRI content index, a table authored (candidate set) and all other codes not mentioned
(alterby the company that indicates, for each GRI disclosure native set)—to generate all valid triples (, , ), where
code (e.g., GRI 305: Emissions, GRI 302: Energy),  ∈ ℳ(). For example, as above:
the specific pages where the disclosure is addressed.</p>
        <p>For each paragraph  occurring on page  , we define:
• GRI 305 maps to SDG 13 (Climate Action),
• GRI 302 maps to both SDG 13 and SDG 7
(Affordable and Clean Energy).</p>
        <p>
          Continuing the example, suppose the GRI content in- • (paragraph, GRI 305, SDG 13),
dex indicates that the pages containing the paragraph • (paragraph, GRI 302, SDG 13),
above refer to GRI 305 (Emissions) and GRI 302 (En- • (paragraph, GRI 302, SDG 7).
ergy). These two codes are included in the candidate set
for the paragraph, as they are explicitly claimed by the At this stage, all generated triples are semantically
plaureport on that page. All remaining GRI codes—among sible and ready for embedding-based similarity scoring.
the approximately 33 topical standards defined in the
GRI framework—are considered part of the alternative Semantic Similarity Ranking. Even after filtering
set. These alternatives are not mentioned in the content out irrelevant combinations via the oficial GRI→SDG
index for this page, but may still be semantically relevant mapping, each paragraph remains associated with a large
to the paragraph based on its content. Note that, due number of possible label pairs. We therefore rank all
to the broad and multi-faceted nature of sustainability remaining (paragraph, GRI, SDG) triples based on how
topics, the content index is not expected to capture all semantically aligned they are with the paragraph content.
relevant GRI standards for each page. It typically high- To quantify alignment, we use a pre-trained sentence
lights the main disclosures, while secondary or nuanced encoder (MPNet [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]) to compute cosine similarities in
themes may be omitted. By considering both the candi- embedding space. For each triple, we consider the textual
date set (directly indexed codes) and the alternative set description of the SDG target and all available disclosure
(other potentially relevant codes), our approach accounts requirements associated with the GRI code. We define
for both explicit priorities and additional associations the similarity score  (, , ) as:
present in the narrative.
        </p>
        <sec id="sec-1-2-1">
          <title>Expansion to SDG Pairs via Oficial Mapping. Each</title>
          <p>GRI code captures a specific disclosure standard (e.g.,
energy consumption, gender pay equality), while each SDG where e is the embedding of the paragraph,  is the
describes a broader societal goal (e.g., SDG 7: Afordable set of disclosure texts for GRI code , and  is the set
and Clean Energy; SDG 5: Gender Equality). To bridge of textual definitions for SDG  (typically the goal and
these conceptual levels in a principled way, we use the its targets). This formulation favors pairs for which both
oficial mapping 6 ℳ, which links each GRI code only to components—GRI and SDG—are independently relevant
semantically relevant SDG targets. to the paragraph: if either component is weakly aligned,</p>
          <p>This mapping is essential for two reasons: (i) it the product score will be low. This reflects the intuition
avoids generating irrelevant or misleading (GRI, SDG) that a good annotation should simultaneously satisfy
pairs—since not every combination is meaningful in prac- both frameworks. For example, suppose a paragraph
tice (e.g., GRI 305: Emissions is unrelated to SDG 4: discusses emissions reduction due to renewable energy
adoption. We obtain:
 (, , ) = max cos(e, e) · m∈ax cos(e, e)
∈
6https://www.globalreporting.org/reporting-support/
goals-and-targets-database/
• cos(paragraph, GRI 305) = 0.92 (strong
match with “Reduction of GHG emissions”),
• cos(paragraph, SDG 13) = 0.88 (climate ac- Permissive Policy: This policy is designed to maximize
tion), recall and accommodate semantic ambiguity—useful for
• cos(paragraph, GRI 302) = 0.69 (energy re- exploratory analysis or downstream expert curation.</p>
          <p>duction consumption),
• cos(paragraph, SDG 7) = 0.54 (clean energy).
1. Find the candidate triple with the highest score</p>
          <p>and set a threshold at half that value.
2. Retain up to two candidate triples whose scores
exceed this threshold (to account for ties or
nearequivalent topics).
3. Always include the best-scoring alternative triple,
regardless of its absolute score, ensuring that
strong semantic signals outside the index are
never discarded a priori.</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>As a result, this policy can return up to three triples (two</title>
        <p>candidates plus one alternative) for a given paragraph,
allowing for richer, multi-label annotation. In summary,
the conservative policy favors precision, whereas the
permissive policy promotes recall and label diversity.
The resulting joint scores are: (GRI 305, SDG 13): 0.92×
0.88 = 0.81, (GRI 302, SDG 13): 0.69 × 0.88 = 0.61,
(GRI 302, SDG 7): 0.69 × 0.54 = 0.37.</p>
        <p>Notably, we compute these scores for both candidate
and alternative triples. While candidate triples originate
from the GRI content index (i.e., the report explicitly
claims these topics are discussed on the page),
alternative triples arise from GRI codes not mentioned in the
index. Though potentially less reliable, alternative labels
may capture omissions or relevant but unindexed
content. Hence, if a triple from the alternative set obtains
a substantially higher semantic score than those in the
candidate set, it may signal that the original index missed
something. In this case, our strategy allows the model
to retain the best alternative triple. While semantic
similarity ofers a useful initial filter, it may miss deeper
context or introduce noise. To address this, we add later
an LLM-based filtering step for more robust alignment.</p>
        <sec id="sec-1-3-1">
          <title>Final Filtering with LLM Relevance Assessment</title>
          <p>While semantic similarity models are powerful for linking
text to structured concepts, they can sometimes
overestimate relevance—especially for vague, generic, or
multitopic paragraphs. For example, a paragraph mentioning
“sustainable growth” could weakly match almost any
Disambiguation Policies: Conservative and Permis- SDG, leading to noisy or spurious labels even after
caresive. After ranking all (paragraph, GRI, SDG) triples by ful mapping and scoring.
joint semantic similarity, the final step is to select which To further improve annotation quality, we add a
fiannotations to retain for each paragraph. This choice nal “human-like” relevance check using a large language
must balance precision (avoiding spurious labels) with model (LLM) such as LLaMA 3.1 Instruct. This step serves
recall (capturing genuine but possibly under-indexed con- two key purposes: i) it filters out weak, contextually
intent). We propose two complementary disambiguation appropriate, or overly broad matches that the
similaritypolicies, which reflect diferent trade-ofs between cover- based method might miss; ii) it simulates expert review
age and selectivity. at scale, bringing richer contextual understanding and
Conservative Policy: This policy is tailored for high- nuanced judgment—skills typically seen in human
annoprecision applications, where false positives are espe- tators—while maintaining automation and consistency.
cially costly. For each paragraph, we: For each retained (paragraph, GRI, SDG) triple, we
con1. Identify the best-scoring candidate triple (i.e., de- struct a structured prompt (shown in Figure 1) presenting
rived from the GRI codes listed in the report’s the paragraph and the oficial descriptions of both labels.
index for the relevant page). The LLM is asked to answer—based solely on the
evi2. Identify the best-scoring alternative triple (i.e., dence given—whether the label pair is truly relevant to
derived from any other valid (GRI, SDG) pair for the paragraph content. Only those triples receiving a
the paragraph). “Yes” are included in the final dataset.
3. If the candidate triple’s score is greater than or For instance, a paragraph describing the company’s
equal to the alternative’s, we retain only the can- general commitment to “sustainable development” might
didate triple—reflecting high confidence in the weakly match several SDGs and GRIs in embedding space,
company’s index. but only a focused LLM assessment can determine if a
spe4. If the alternative triple has a higher score, we cific (GRI, SDG) pair is truly justified by the text. In this
return both the best candidate and the best alter- way, the LLM acts as a high-precision, scalable
expert-innative. This accounts for possible omissions or the-loop filter. This LLM-based filtering step significantly
underreporting in the index, while maintaining reduces false positives, capturing complex connections
interpretability. and subtle mismatches that even strong embedding
models may overlook. In efect, it combines the scale and
In practice, this policy outputs either one or two annota- speed of automated annotation with the contextual depth
tion triples per paragraph.</p>
          <p>You are a sustainability evaluation assistant.
Decide if the following GRI–SDG pair is relevant to
the paragraph.</p>
          <p>Paragraph: “Paragraph content here”
GRI [GRI Code]: GRI Description here
SDG [SDG Name]: SDG Description here
Only reply with one word: Yes or No.</p>
          <p>Format:
Answer: Yes
(or)</p>
          <p>Answer: No
of human reasoning, resulting in a cleaner, more
trustworthy annotated dataset ready for downstream analysis
or model training.</p>
        </sec>
      </sec>
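      <p>The scoring and disambiguation logic described above can be sketched compactly; this is an illustrative draft, not the authors’ implementation. The data layout (lists of (triple, score) pairs) and function names are assumptions, and in practice the embeddings would come from a pre-trained sentence encoder such as MPNet rather than the toy vectors used here.</p>

```python
from math import sqrt


def cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sigma(e_p, disclosure_embs, target_embs):
    """sigma(p, g, s) = max_d cos(e_p, e_d) * max_t cos(e_p, e_t):
    the paragraph must independently match both the GRI disclosure
    texts and the SDG target texts for the product to be high."""
    return (max(cos(e_p, e) for e in disclosure_embs)
            * max(cos(e_p, e) for e in target_embs))


def conservative(candidates, alternatives):
    """candidates/alternatives: lists of (triple, score) pairs.
    Keep the best candidate; add the best alternative only if it wins."""
    best_c = max(candidates, key=lambda ts: ts[1])
    best_a = max(alternatives, key=lambda ts: ts[1])
    return [best_c] if best_c[1] >= best_a[1] else [best_c, best_a]


def permissive(candidates, alternatives):
    """Up to two candidates scoring above half the best candidate score,
    plus the best alternative unconditionally."""
    threshold = max(s for _, s in candidates) / 2
    kept = [c for c in sorted(candidates, key=lambda ts: -ts[1])
            if c[1] > threshold][:2]
    kept.append(max(alternatives, key=lambda ts: ts[1]))
    return kept
```

      <p>For the running example (candidate joint scores 0.81, 0.61, 0.37), the conservative policy returns only the top candidate whenever it outscores every alternative, while the permissive policy keeps the top two candidates plus the best alternative, matching the one-to-two and up-to-three triple counts stated above.</p>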
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Evaluation</title>
      <p>We conduct a comprehensive experimental evaluation
to assess the efectiveness of our automatic annotation
pipeline and its LLM-based filtering component. Our
analysis focuses on two main questions: (i) does LLM
filtering produce label decisions that align with human
consensus? and (ii) how do diferent label selection policies
(conservative vs. permissive) and LLM filtering impact
the quality and utility of the resulting annotated data for
downstream SDG classification?
puted cases) to 1.0 (full agreement among annotators).</p>
      <p>We use the LLaMA 3.1 Instruct model as a post-hoc filter:
for each paragraph-SDG pair in OSDG-CD, we prompt
the model to decide if the label is relevant to the
paragraph, using the same structured format adopted in our
main pipeline. We then analyze the fraction of examples
retained (“Yes” by the LLM) across diferent agreement
intervals.</p>
      <p>Table 1 reports the frequency distribution of samples
across agreement bins, and Figure 2 visualizes the key
re4.1. LLM Filtering and Human Consensus sult: the likelihood of a sample being retained by the LLM
on OSDG-CD iflter increases monotonically with human agreement.
A natural concern when introducing LLM-based filter- In other words, pairs with high human consensus are
ing into any annotation pipeline is whether the model’s almost always preserved by the model, while those with
binary “Yes/No” relevance judgments are in fact consis- low or disputed agreement are more frequently filtered
tent with human annotation practices. While LLMs are out. This positive correlation provides strong evidence
increasingly adopted as automated evaluators or assis- that LLM-based filtering is not arbitrary, but instead
captants, there is limited empirical evidence on how closely tures a notion of relevance that closely mirrors collective
their filtering behavior tracks with actual human agree- human judgment.
ment—particularly in specialized domains such as sus- This result has two important implications. First,
tainability. To address this, we leverage the OSDG Com- it provides empirical support for using LLMs as
scalmunity Dataset (OSDG-CD), a large-scale benchmark in able, “expert-in-the-loop” filters for semantic annotation,
which each paragraph-SDG pair is annotated not only even in cases where manual adjudication would be
prowith the assigned label, but also with an explicit agree- hibitively expensive. Second, it suggests that LLMs can
ment score reflecting the proportion of human annota- help mitigate annotation noise in weakly or ambiguously
tors who supported the label assignment. This agreement labeled data—removing many of the examples that
huscore provides a direct, interpretable measure of human mans themselves would likely judge as borderline or
consensus, ranging from 0.1 (highly ambiguous or dis- unreliable. Overall, this agreement-guided analysis not
only validates our specific use of LLM filtering in the
construction of GRI-SDG training data, but also suggests
a broader role for LLMs as automatic quality controllers
in human-in-the-loop NLP pipelines.
4.2. Assessing Labeling Strategies for</p>
      <p>Automatic Paragraph Annotation
whether LLM-based filtering efectively improves the
utility of automatically annotated data, and how the choice
of annotation policy (conservative vs. permissive)
impacts downstream model performance.</p>
      <p>Experimental Setup. To systematically evaluate our
annotation pipeline, we applied it to a curated corpus of
30 sustainability reports spanning 10 sectors and 3,663 Training Simple Complex
pages. After preprocessing and paragraph segmentation, Conservative 0.762 0.737
we obtained 19,133 candidate paragraphs, of which 10,303 Conservative + LLM 0.783 0.752
were indexed by company-provided GRI content indices PPeerrmmiissssiivvee + LLM 00..768286 00..666905
and thus eligible for annotation. Annotation followed
the multi-step procedure described in Section 3: we
generated (GRI, SDG) label pairs using the official mapping, scored their semantic similarity, and selected final annotations according to either the conservative policy (high precision, at most one or two triples per paragraph) or the permissive policy (higher recall, up to three triples).</p>
      <p>Applying the conservative policy initially yielded 17,216 label pairs, which were reduced to 4,558 after LLM-based relevance filtering. The permissive policy produced a higher initial volume of annotations (30,647 label pairs), which was pruned to 7,425 after filtering with LLaMA 3.1 Instruct. This substantial reduction confirms the impact of the LLM-based step in filtering out weak or noisy annotations, ultimately improving the quality and reliability of the final labeled dataset. For evaluation, we leveraged the OSDG Community Dataset (OSDG-CD), which contains single-label SDG assignments per paragraph, validated by crowdsourced agreement scores. To ensure reliability, we defined two test splits: a Simple set (agreement = 1.0, fully unambiguous) and a Complex set (0.7 ≤ agreement ≤ 1.0). All models were trained in a multi-label setting, but evaluated using only the highest-scoring prediction per paragraph to match the OSDG single-label ground truth. As a baseline, we used a BERT-based classifier (bert-base-cased) with a standard binary cross-entropy loss for multi-label classification over the full label set, treating each label independently during training. The model was trained with an effective batch size of 16 (via gradient accumulation over 4 mini-batches of size 4), using the AdamW optimizer with a learning rate of 2 × 10−5, a weight decay of 0.1, and a linear learning-rate scheduler with a warmup ratio of 0.1, for a total of 5 training epochs. Accuracy is defined as the percentage of paragraphs for which the top predicted label matches the ground truth; since the OSDG test set provides only one true label per paragraph, this top-1 accuracy measure is equivalent to precision, recall, and F1-score, which are therefore omitted.</p>
      <p><bold>Does LLM Filtering Improve Automatically Annotated Training Data?</bold> Our first experiment tests whether LLM-based filtering improves the quality of the automatically annotated training data. Results (Table 2) indicate that both policies benefit from LLM filtering, but to different extents. The conservative policy (high precision, fewer labels) already yields reasonably strong results, but applying LLM filtering further increases accuracy by removing residual false positives. The permissive policy (higher recall, more candidate triples per paragraph) initially introduces substantially more noise, as reflected in its lower baseline accuracy; LLM filtering provides a larger relative improvement, yet even after filtering the permissive setting still lags behind the conservative one in absolute performance. This suggests that, while the LLM can mitigate a large portion of the annotation noise, excessive over-labeling (as in the permissive setting) cannot be fully corrected in post-processing, and some spurious associations may persist. In summary, LLM-based filtering systematically improves the quality of automatically generated labels, especially in the presence of noisy or overly broad candidate assignments. However, the conservative policy remains preferable in settings where downstream precision is paramount.<sup>7</sup></p>
      <p><bold>Does Adding Automatically Annotated Data Benefit Supervised Training?</bold> In a second experiment, we assessed whether supplementing human-annotated data (OSDG-CD) with LLM-pruned automatic annotations yields tangible improvements in SDG classification.</p>
      <p>Table 3: Accuracy on OSDG test sets with and without adding pruned automatic data (Cons.: conservative, Perm.: permissive).
Training | Simple | Complex
OSDG (full) | 0.917 | 0.907
OSDG + Cons. + LLM | 0.921 | 0.910
OSDG + Perm. + LLM | 0.919 | 0.909</p>
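<p>The optimization schedule described above (warmup ratio 0.1 followed by linear decay) can be sketched as follows. This is a minimal pure-Python illustration of the learning-rate curve, assuming the common decay-to-zero variant of a linear scheduler; the total step count is hypothetical and the paper's actual training code is not shown here.</p>

```python
# Sketch of a linear learning-rate schedule with warmup, using the
# hyperparameters from the text (base LR 2e-5, warmup ratio 0.1).
# TOTAL_STEPS is a hypothetical stand-in; in practice it depends on the
# dataset size, the effective batch size of 16, and the 5 epochs.
BASE_LR = 2e-5
TOTAL_STEPS = 100
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)  # warmup ratio of 0.1

def lr_at(step):
    """Learning rate after `step` optimizer steps."""
    if step < WARMUP_STEPS:
        # linear warmup from 0 up to BASE_LR
        return BASE_LR * step / WARMUP_STEPS
    # linear decay from BASE_LR down to 0 over the remaining steps
    remaining = TOTAL_STEPS - step
    return BASE_LR * max(0.0, remaining / (TOTAL_STEPS - WARMUP_STEPS))

print(lr_at(5))    # halfway through warmup
print(lr_at(10))   # warmup complete, at the base learning rate
print(lr_at(100))  # end of training
```

<p>With gradient accumulation over 4 mini-batches of size 4, one such scheduler step would correspond to one optimizer update, i.e., one effective batch of 16 examples.</p>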
      <sec id="sec-2-1">
<title>Footnote 7</title>
        <p>Note that the test set requires a single SDG per paragraph, so we evaluate our classifier by selecting only the top prediction. This
may not capture all relevant SDGs, especially for complex cases,
but gives a reasonable first estimate of performance.</p>
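<p>The top-1 evaluation protocol described in this footnote can be sketched as follows; the scores and labels below are hypothetical illustrations, not the paper's actual code or data.</p>

```python
# Minimal sketch of top-1 evaluation: the multi-label classifier emits a
# score per SDG, but the OSDG ground truth has one gold label per
# paragraph, so only the highest-scoring prediction is counted.

def top1_accuracy(score_rows, gold_labels):
    """score_rows: list of {sdg_label: score} dicts; gold_labels: list of str."""
    hits = 0
    for scores, gold in zip(score_rows, gold_labels):
        top_label = max(scores, key=scores.get)  # top-1 prediction
        hits += (top_label == gold)
    return hits / len(gold_labels)

# Hypothetical predictions for two paragraphs.
rows = [
    {"SDG-3": 0.91, "SDG-4": 0.40, "SDG-13": 0.08},
    {"SDG-3": 0.22, "SDG-7": 0.85, "SDG-13": 0.81},
]
gold = ["SDG-3", "SDG-13"]
print(top1_accuracy(rows, gold))  # 0.5: the second top label is SDG-7, not SDG-13
```

<p>Because each test paragraph has exactly one gold label, this top-1 accuracy coincides with precision, recall, and F1 under the single-prediction protocol, as noted in the text.</p>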
<p>Results in Table 3 show that, for both policies, adding pruned automatic annotations to the OSDG training set consistently increases accuracy on both the Simple and Complex test splits. While the gains are modest, they are robust across settings, confirming that our pipeline produces a useful complementary signal even in the presence of expert-labeled data. As in the previous experiment, the conservative policy remains more reliable, providing slightly higher accuracy than the permissive policy; the latter, despite contributing more examples, appears to introduce a small amount of residual noise that is not fully eliminated by LLM filtering.</p>
        <p>Table 4: Mean product similarity score for retained vs. discarded samples under conservative and permissive label selection.
Policy | Category | Retained | Discarded
Conservative | Overall | 0.434 | 0.321
Conservative | Alternatives | 0.463 | 0.351
Conservative | Candidates | 0.422 | 0.298
Permissive | Overall | 0.414 | 0.308
Permissive | Alternatives | 0.456 | 0.353
Permissive | Candidates | 0.400 | 0.283</p>
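<p>The aggregation behind Table 4 can be sketched as follows. The records are hypothetical stand-ins for the (label type, LLM decision, product similarity) triples produced by the pipeline; the real table is computed over the full automatically annotated dataset.</p>

```python
# Sketch of the Table 4 aggregation: mean product similarity for triples
# the LLM retained vs. discarded, split by label type, with an "Overall"
# row pooling both types. Records below are illustrative only.
from collections import defaultdict
from statistics import mean

records = [
    # (label_type, retained_by_llm, product_similarity) -- hypothetical
    ("Candidate", True, 0.45), ("Candidate", False, 0.28),
    ("Alternative", True, 0.48), ("Alternative", False, 0.33),
    ("Candidate", True, 0.41),
]

groups = defaultdict(list)
for label_type, kept, sim in records:
    groups[(label_type, kept)].append(sim)
    groups[("Overall", kept)].append(sim)   # pooled row

table = {key: round(mean(vals), 3) for key, vals in groups.items()}
print(table[("Candidate", True)])   # mean of 0.45 and 0.41
print(table[("Overall", False)])    # mean over all discarded triples
```
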
<p>Taken together, these findings support a dual conclusion: (1) the automatic annotation pipeline is effective for scalable SDG data generation, and (2) the interplay between the label selection policy and LLM-based filtering is crucial for balancing coverage and precision. The conservative strategy, enhanced by LLM filtering, delivers high-quality labels that boost supervised learning, while the permissive strategy is valuable for recall-oriented applications but requires careful calibration to avoid excessive noise.</p>
        <p><bold>4.3. Analysis of LLM Retention Decisions on Automatically Annotated Data</bold></p>
        <p>Having established that the LLM-based filter is well aligned with human consensus on the OSDG dataset (Section 4.1), we next analyze how the LLM's binary relevance judgments interact with the underlying semantic similarity scores in our full, automatically annotated dataset. This provides a deeper understanding of whether the LLM filter simply introduces an arbitrary bottleneck, or whether it systematically reinforces semantic quality.</p>
        <p>We consider the product similarity score, i.e., the product of the cosine similarities between a paragraph and its associated GRI and SDG descriptions (see Section 3), as a measure of semantic alignment for each candidate label. For every (paragraph, GRI, SDG) triple, we record whether the LLM filter retained the annotation ("Yes") or discarded it ("No"). Table 4 reports the mean similarity scores for retained and discarded samples, disaggregated by both label type (Candidate, Alternative) and selection policy (Conservative, Permissive).</p>
        <p>As shown, the LLM filter systematically prefers to retain labels with a higher semantic similarity to the paragraph, regardless of whether they are candidate or alternative labels, and across both policies. The effect is particularly pronounced for alternatives, which are kept only when they exhibit a strong semantic match.</p>
        <p>To further examine this relationship, we discretize the similarity scores into bins and calculate, for each bin, the proportion of samples retained by the LLM. Figure 3 presents these retention rates for the conservative (Top-1) policy, separately for candidates, alternatives, and the combined set. To ensure statistical significance, we only report bins containing at least 700 samples. This threshold was chosen empirically, based on the distribution of paragraph counts across prediction score intervals: the total number of samples in the higher-confidence intervals, i.e., those greater than 0.7 ((0.7-0.8], (0.8-0.9], and (0.9-1]), was only 272 (227 + 40 + 5). Given such low sample sizes, reporting metrics for these bins would risk statistical instability and a lack of representativeness. Setting 700 as the minimum cutoff ensures that each bin included in our analysis contains a sufficient number of samples for reliable estimation, balancing coverage across confidence intervals with the statistical reliability of the reported results.</p>
        <p>Figure 3: Proportion of (paragraph, GRI, SDG) triples retained by the LLM filter as a function of the product similarity score, binned by intervals from (0.1-0.2] to (0.6-0.7]. Results are shown for candidate, alternative, and all labels under the conservative (Top-1) policy.</p>
        <p>The figure demonstrates a clear monotonic trend: as the product similarity score increases, the probability of retention by the LLM rises sharply. For scores below 0.3, fewer than 20% of labels are retained, while for scores above 0.6 the retention rate exceeds 60%. This pattern holds for both candidates and alternatives, further supporting the conclusion that the LLM acts as a semantic relevance filter, amplifying the selectivity of the automatic annotation pipeline and systematically favoring labels with strong textual alignment.</p>
        <p>In summary, these results indicate that our LLM-based filtering mechanism is not merely an arbitrary post-processing step but an effective semantic validator: it consistently prioritizes label assignments with robust evidence in the paragraph text.</p>
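<p>The binning analysis underlying Figure 3 can be sketched as follows, with hypothetical samples and a deliberately small minimum-bin-size threshold standing in for the paper's 700-sample cutoff.</p>

```python
# Sketch of the Figure 3 computation: bin (score, kept) triples by
# product similarity and report the per-bin retention rate, skipping
# under-populated bins. The paper uses min_samples = 700; the tiny
# threshold and samples below are for illustration only.
MIN_SAMPLES = 2

def retention_by_bin(samples, edges, min_samples=MIN_SAMPLES):
    """samples: list of (score, kept); edges: ascending bin boundaries."""
    bins = {(lo, hi): [] for lo, hi in zip(edges, edges[1:])}
    for score, kept in samples:
        for (lo, hi), bucket in bins.items():
            if lo < score <= hi:  # half-open intervals (lo, hi], as in the paper
                bucket.append(kept)
                break
    return {
        interval: sum(flags) / len(flags)
        for interval, flags in bins.items()
        if len(flags) >= min_samples  # drop statistically unreliable bins
    }

samples = [(0.15, False), (0.18, False), (0.25, True), (0.28, False),
           (0.65, True), (0.68, True)]
edges = [0.1, 0.2, 0.3, 0.7]
print(retention_by_bin(samples, edges))
```

<p>On these toy samples the retention rate rises with the similarity score, mirroring the monotonic trend reported for the real data.</p>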
      </sec>
    </sec>
    <sec id="sec-3">
<title>5. Conclusion and Future Work</title>
<p>This work presents a fully automated pipeline for large-scale annotation of sustainability reports at the paragraph level, aligning text with both GRI disclosures and SDG targets. Leveraging structured metadata, official GRI-SDG mappings, semantic similarity, and an LLM-based relevance filter (LLaMA), our method offers an interpretable and scalable alternative to manual annotation. The LLM filter proves highly effective in reducing semantic noise and producing annotations that closely match human consensus.</p>
      <p>Our experiments show that LLaMA-based filtering
favors labels with high semantic similarity, aligns with
human judgments on the OSDG benchmark, and
consistently improves downstream SDG classification—even
when combined with expert-labeled data. While
permissive labeling increases coverage, it also adds noise that is
only partly corrected by LLM filtering.</p>
      <p>
        This pipeline lays the foundation for more transparent
and data-driven sustainability analytics. Future research
will focus on several open challenges. First, we aim to
expand the LLM filter to provide natural language
justifications for its decisions, improving explainability and
facilitating expert validation. We also acknowledge that
scalability may become a limitation when applying our
pipeline to thousands of reports, particularly due to the
computational cost of LLM-based filtering; addressing
this bottleneck through optimization or distillation
techniques is a key direction for future work. Second, while
our current evaluation is primarily model-based, we plan
to conduct in-depth human studies, including manual
validation of high-confidence (GRI, SDG) pairs, and direct
comparisons with prior supervised approaches [
        <xref ref-type="bibr" rid="ref16 ref18">16, 18</xref>
        ],
especially regarding the annotation of GRI codes. Third,
we envision extending our framework to cover a wider
array of sustainability and ESG standards, as well as to
support fine-grained analysis of the substance and quality of
sustainability reporting—such as distinguishing between
specific, verifiable disclosures and generic statements,
thus advancing automated detection of greenwashing.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
<p>We thank Armando Calabrese, Roberta Costa and Luigi Tiburzi for their valuable advice, insightful discussions on sustainability reporting, and for generously sharing documents that inspired and enabled this initial experimental study. We acknowledge financial support from the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU.</p>
        <p>Declaration on Generative AI: During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1] United Nations, Transforming Our World: The 2030 Agenda for Sustainable Development, A/RES/70/1, United Nations, Division for Sustainable Development, 2015.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
[2] H. Q. Ngee, A. Ganesh, M. A. N. Azmi, T. Y. Tang, M. Mukred, F. Mohammed, A. A. B. Ahmad, Environmental, social and governance (ESG) scores automation in global reporting initiative (GRI) with natural language processing, in: Proc. 2024 7th Int. Conf. Internet Appl., Protocols, and Services (NETAPPS), 2024, pp. 1-7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[3] Y. Zou, M. Shi, Z. Chen, Z. Deng, Z. Lei, Z. Zeng, S. Yang, H. Tong, L. Xiao, W. Zhou, ESGReveal: An LLM-based approach for extracting structured data from ESG reports, J. Clean. Prod. 489 (2025) 144572.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[4] H. Kang, J. Kim, Analyzing and visualizing text information in corporate sustainability reports using natural language processing methods, Appl. Sci. 12 (2022) 5614.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
[5] W. Moodaley, A. Telukdarie, A conceptual framework for subdomain specific pre-training of large language models for green claim detection, Eur. J. Sustain. Dev. 12 (2023) 319. doi:10.14207/ejsd.2023.v12n4p319.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
[6] L. Pukelis, N. Bautista-Puig, G. Statulevičiūtė, V. Stančiauskas, G. Dikmener, D. Akylbekova, OSDG 2.0: A multilingual tool for classifying text data by UN sustainable development goals (SDGs), arXiv preprint arXiv:2211.11252 (2022). Available at: https://arxiv.org/abs/2211.11252.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
[7] C. Jakob, V. Schmitt, S. Mohtaj, S. Möller, Classifying sustainability reports using companies' self-assessments, in: Future of Information and Communication Conference, Springer, 2024, pp. 547-557.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[8] I. Nechaev, D. S. Hain, Social impacts reflected in CSR reports: Method of extraction and link to firms' innovation capacity, J. Clean. Prod. 429 (2023) 139256.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
[9] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MPNet: Masked and permuted pre-training for language understanding, Adv. Neural Inf. Process. Syst. 33 (2020) 16857-16867.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. 2019 Conf. North Am. Chapter Assoc. Comput. Linguist.: Hum. Lang. Technol. (NAACL-HLT), Vol. 1 (Long and Short Papers), 2019, pp. 4171-4186.
          [21] …, social, and governance communication, Finance Res. Lett. 61 (2024) 104979. doi:10.1016/j.frl.2024.104979.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
[11] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proc. 2019 Conf. Empirical Methods Nat. Lang. Process. and 9th Int. Joint Conf. Nat. Lang. Process. (EMNLP-IJCNLP), 2019, pp. 3982-3992. doi:10.18653/v1/D19-1410.
          [22] A. Gupta, A. Chadha, V. Tewari, A natural language processing model on BERT and YAKE technique for keyword extraction on sustainability reports, IEEE Access (2024). doi:10.1109/ACCESS.2024.3352742.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
[12] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). Available at: https://arxiv.org/abs/2302.13971.
          [23] A. Vinella, M. Capetz, R. Pattichis, C. Chance, R. Ghosh, Leveraging language models to detect greenwashing, arXiv preprint arXiv:2311.01469 (2023). Available at: https://arxiv.org/abs/2311.01469.
          [24] X. Wang, X. Gao, M. Sun, Construction and analysis of corporate greenwashing index: a deep learning approach, EPJ Data Sci. 14 (2025) 1-25.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
[13] T. B. Smith, R. Vacca, L. Mantegazza, I. Capua, Natural language processing and network analysis provide novel insights on policy and scientific discourse around sustainable development goals, Sci. Rep. 11 (2021) 22427. doi:10.1038/s41598-021-01801-6.
          [25] K. Mehul, V. R. Kanagavalli, K. R. Saradha, P. N. Gowtham, M. P. Sachin, U. Surya, R. Godhandaraman, S. Girish, R. Naveen, Gen AI driven FAQ chatbot using advanced RAG architecture for querying annual reports, in: Proc. 2025 Int. Conf. Comput. Commun. Technol. (ICCCT), 2025, pp. 1-6.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
[14] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, X. Huang, Pre-trained models for natural language processing: A survey, Sci. China Technol. Sci. 63 (2020) 1872-1897. doi:10.1007/s11431-020-1647-3.
          [26] M. Bronzini, C. Nicolini, B. Lepri, A. Passerini, J. Staiano, Glitter or gold? Deriving structured insights from sustainability reports via large language models, EPJ Data Sci. 13 (2024) 41. doi:10.48550/arXiv.2310.05628.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). Available at: https://arxiv.org/abs/1810.04805.
          [27] Y. Jain, S. Gupta, S. Yalciner, Y. N. Joglekar, P. Khetan, T. Zhang, Overcoming complexity in ESG investing:
          <article-title>The role of generative ai integration</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Angin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Taşdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Yılmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Demiralp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Atay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Angin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dikmener</surname>
          </string-name>
          ,
          <article-title>A RoBERTa approach for automated processing of sustainability reports</article-title>
          ,
          <source>Sustain.</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>16139</fpage>
          . doi:10.3390/su142316139.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rockinger</surname>
          </string-name>
          ,
          <article-title>Unfolding the transitions in sustainability reporting</article-title>
          ,
          <source>Sustain.</source>
          <volume>16</volume>
          (
          <year>2024</year>
          )
          <fpage>809</fpage>
          . doi:10.3390/su16020809.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-U.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K. W.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <article-title>Assigning multiple labels of sustainable development goals to open educational resources for sustainability education</article-title>
          ,
          <source>Educ. Inf. Technol.</source>
          <volume>29</volume>
          (
          <year>2024</year>
          )
          <fpage>18477</fpage>
          -
          <lpage>18499</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hillebrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pielka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Leonhard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deußer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dilmaghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kliem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Loitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Temath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bell</surname>
          </string-name>
          , et al.,
          <article-title>sustain.AI: a recommender system to analyze sustainability reports</article-title>
          ,
          <source>in: Proc. 19th Int. Conf. Artif. Intell. Law</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>412</fpage>
          -
          <lpage>416</lpage>
          . doi:10.1145/3594536.3595131.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>ESG-KIBERT: A new paradigm in ESG evaluation using NLP and industry-specific customization</article-title>
          ,
          <source>Decis. Support Syst.</source>
          <volume>193</volume>
          (
          <year>2025</year>
          )
          <fpage>114440</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schimanski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bingler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kraus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leippold</surname>
          </string-name>
          ,
          <article-title>Bridging the gap in ESG measurement: Using NLP to quantify environmental,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] … Commun. Technol.
          <source>(ICCCT)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bronzini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nicolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lepri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Staiano</surname>
          </string-name>
          ,
          <article-title>Glitter or gold? Deriving structured insights from sustainability reports via large language models</article-title>
          ,
          <source>EPJ Data Sci.</source>
          <volume>13</volume>
          (
          <year>2024</year>
          )
          <fpage>41</fpage>
          . doi:10.48550/arXiv.2310.05628.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yalciner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Joglekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Khetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Overcoming complexity in ESG investing: The role of generative AI integration in identifying contextual ESG factors</article-title>
          ,
          <source>SSRN</source>
          (
          <year>2023</year>
          ). Available at SSRN 4495647.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , arXiv preprint arXiv:1910.13461 (
          <year>2019</year>
          ). Available at: https://arxiv.org/abs/1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <article-title>Beyond good intentions: Reporting the research landscape of NLP for social good</article-title>
          , arXiv preprint arXiv:2305.05471 (
          <year>2023</year>
          ). Available at: https://arxiv.org/abs/2305.05471.
        </mixed-citation>
      </ref>
    </ref-list>
    <sec>
      <title>Online Resources</title>
      <list list-type="bullet">
        <list-item>
          <p>OSDG Community Dataset</p>
        </list-item>
        <list-item>
          <p>United Nations Sustainable Development Goals (SDGs)</p>
        </list-item>
        <list-item>
          <p>Global Reporting Initiative (GRI)</p>
        </list-item>
        <list-item>
          <p>GRI-SDG Mapping</p>
        </list-item>
      </list>
    </sec>
  </back>
</article>