<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sparse Autoencoders Find Partially Interpretable Features in Italian Small Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Sparse Autoencoders (SAEs) have become a popular technique to identify interpretable concepts in Language Models. They have been successfully applied to several models of varying sizes, including both open and commercial ones, and have become one of the main avenues for interpretability research. A number of approaches have been proposed to extract latents from the model, as well as to automatically provide natural language explanations for the concepts they supposedly represent. Despite these advances, little attention has been given to applying SAEs to Italian language models. This may be due to several factors: i) the small number of Italian models; ii) the costs associated with leveraging SAEs, which include the training itself, as well as the necessity to parse and assign an interpretation to a very large number of features. In this work, we present an initial step toward addressing this gap. We train a SAE on the residual stream of the Minerva-1B-base-v1.0 model, for which we release the weights; we leverage an automated interpretability pipeline based on LLMs to evaluate the quality of the latents and to provide explanations for some of them. We show that, although the approach has several limitations, we do find some concepts encoded in the weights of the model.</p>
      </abstract>
      <kwd-group>
<kwd>Mechanistic Interpretability</kwd>
        <kwd>Sparse Autoencoders</kwd>
<kwd>Large Language Models</kwd>
        <kwd>Italian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The rise of Large Language Models (LLMs) has profoundly affected the landscape of Natural Language Processing (NLP). These models have demonstrated remarkable capabilities in many tasks, often achieving near-human performance and saturating benchmarks as soon as they are released. Nevertheless, many questions remain about their internal workings: whether and how they perform some form of reasoning [
        <xref ref-type="bibr" rid="ref23">1</xref>
        ], and to what extent their grasp of concepts through natural language approximates human conceptual understanding.</p>
      <p>The aim of Mechanistic Interpretability (MechInterp) is to address this pressing issue by attempting to reverse-engineer the learned representations and algorithms within their neural networks [
        <xref ref-type="bibr" rid="ref36 ref39">2</xref>
        ]. A promising technique within MechInterp is the use of sparse dictionary learning methods like Sparse Autoencoders (SAEs) [
        <xref ref-type="bibr" rid="ref13">3</xref>
        ]. The idea behind SAEs is similar to that of standard autoencoders. Autoencoders are unsupervised models that learn two functions: an encoding function, which projects the input data from an n-dimensional space into a d-dimensional space, and a decoding function, which should reconstruct the d-dimensional data back into the original n-dimensional one. Autoencoders are typically used for dimensionality reduction, i.e., d &lt;&lt; n. In the case of SAEs, instead, d &gt;&gt; n: the model is trained to project the input space into a much higher-dimensional (and thus sparser) one, and then project it back into the original dimensional space. In our context, SAEs are trained to reconstruct the internal activations of a language model’s residual stream by projecting them into a higher-dimensional latent space, while being constrained to use only a small number of “features” from a learned dictionary. This sparsity constraint encourages the SAE to learn a set of monosemantic features, also referred to as latents, that is, features each corresponding to a single, hopefully more interpretable concept [4]. This is in contrast with a polysemantic representation, which is typical of standard dense neural networks [
        <xref ref-type="bibr" rid="ref3 ref5">5, 6</xref>
        ], in which several concepts are superimposed in the same activation patterns. SAEs allow us to decompose model activations into a set of near-orthogonal, i.e., largely disentangled, features that should be semantically coherent.</p>
      <p>Recent work has demonstrated the effectiveness of SAEs in uncovering meaningful features within both toy models [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ] and large-scale commercial LMs, revealing representations for concepts ranging from concrete objects to abstract ideas [
        <xref ref-type="bibr" rid="ref12 ref15 ref9">8, 9, 10</xref>
        ]. As noted in [
        <xref ref-type="bibr" rid="ref12">9</xref>
        ], several distinctive features have been identified in Claude-3.5-Sonnet – most notably, one corresponding to the “Golden Gate Bridge.” SAEs have also been applied successfully to smaller, English-centric models in the 1 to 10 Billion parameter range [11]. This class of models is becoming more and more relevant, as research on Small Language Models (SLMs) [12] and Baby Language Models (BabyLMs) [13, 14], which mitigate the costs of training and serving LLMs while attempting to retain most of their abilities, is a very active endeavour, particularly in the open-source/open-weights community.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. * Corresponding author. † These authors contributed equally. alessandro.bondielli@unipi.it (A. Bondielli); lucia.passaro@unipi.it (L. Passaro); alessandro.lenci@unipi.it (A. Lenci). 0000-0003-3426-6643 (A. Bondielli); 0000-0003-4934-5344 (L. Passaro); 0000-0001-5790-43086 (A. Lenci). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Two key limitations remain for the applicability of SAEs to achieve interpretability. First, the computational cost of training a SAE. Given their nature, the internal layer of a SAE has to be a number of times larger than the width of the residual stream of the target LM. The number of parameters of a SAE thus scales with a factor of the hidden size of the model, multiplied by the number of hookpoints in the model where activations are collected (e.g., after every transformer block/layer). Thus, the larger the target LM, the bigger and the more computationally expensive the SAE.</p>
      <p>Second, and most importantly, SAEs output a large number of features, which then have to be interpreted in some way. While the literature has not reached a consensus on what the best practice is, a popular method to address this is to leverage another LLM to provide explanations for the features based on examples of which tokens (and respective contexts) they fired on. For example, if a feature fired on 10 tokens, the explainer model is fed with these tokens, their contexts, and the request to find a common property among them. In most works, commercial LLMs with hundreds of billions of parameters are successfully used for this task [
        <xref ref-type="bibr" rid="ref12 ref15">9, 10</xref>
        ]. However, researchers have also shown that smaller and cheaper LMs can be leveraged effectively as well [15].</p>
      <p>The vast majority of efforts regarding the use of SAEs for interpretability has focused on English-centric LMs [
        <xref ref-type="bibr" rid="ref12 ref15">9, 10, 11</xref>
        ]. In addition to this, several efforts have been made in the direction of finding universal features that apply across models and languages [16, 17]. However, models primarily trained on languages other than English have received less attention.</p>
      <p>In this work, we aim to provide an early evaluation of the feasibility of using SAEs to interpret models trained to be natively Italian. In the interest of maintaining a limited computational cost, we chose to use Minerva-1B-base-v1.0 from the Minerva model family [18]. We trained a SAE on the residual stream of every layer of the model using an Italian split of mC4 [19]. Then, we collected feature activations for the Italian dump of Wikipedia [20], and attempted to explain them and score the explanations automatically using an LLM, following [15].</p>
      <p>Our contributions are the following:
• We train and release a Sparse Autoencoder on Minerva-1B-base-v1.0. We make the Autoencoder weights available to the research community via HuggingFace.1
• We collect feature activations from a relatively large collection of Italian data, and provide a quantitative and qualitative evaluation of the explanations using an auto-interpretability pipeline. We show that SAEs are promising for finding concepts in Italian SLMs, but auto-interpretability pipelines show several limitations for Italian.
• We report on the challenges and lessons learned on training and using SAEs, especially in computationally constrained settings.</p>
      <p>This paper is organised as follows: in Section 2 we detail the training procedure of the SAE; Section 3 provides an overview of the auto-interpretability pipeline we employ; in Section 4 we present and discuss the obtained results; finally, Section 5 draws some conclusions and highlights future work.</p>
      <sec id="sec-1-1">
        <title>2. SAE Training</title>
        <p>In the following, we detail the data and procedure used to train the SAE on the Minerva-1B-base-v1.0 SLM. We trained the SAE on the residual stream of the model, with hookpoints on the outputs of each attention block. For our experiments we used the Sparsify library from EleutherAI,2 which is built to roughly follow the training recipe presented in [
        <xref ref-type="bibr" rid="ref15">10</xref>
        ] for a GPT-4 SAE. It trains a k-Sparse Autoencoder [21]. The autoencoder uses a TopK activation function that allows for direct control over the number of active latents. Specifically, it only keeps the k largest latents and assigns zero to the rest. Authors in [
        <xref ref-type="bibr" rid="ref15">10</xref>
        ] argue that this eliminates the need for the L1 penalty, which biases activations toward zero and is only a rough proxy for L0, and supports any activation function. They also show that it outperforms ReLU autoencoders in sparsity-reconstruction tradeoffs and enhances monosemanticity, as small activations are clamped to zero.</p>
        <p>Recipe. A full breakdown of the most relevant parameters selected for training is presented in Table 1. The parameters were chosen following recipes for similarly sized models, e.g. [11]. The expansion factor controls the size of the hidden layer, and is a multiplier over the model’s hidden size. In our case, an expansion factor of 32 yields a hidden layer of 2,048 × 32 = 65,536 latents.</p>
        <p>1 https://huggingface.co/alessandrobondielli/sae-Minerva-1B-32x — the model can be used with the Sparsify and Delphi libraries for interpretability. 2 https://github.com/EleutherAI/sparsify</p>
      </sec>
      <sec id="sec-1-2">
        <title>Data.</title>
        <p>As for the training data, we chose to use mC4 [22]. Specifically, we consider the “tiny” split of the clean_mc4_it dataset [19]. It includes 6 Billion tokens (4 Billion words). The choice of the dataset was made on the basis that it is relatively large, especially for the Italian language, and it includes a variety of different texts. The data was not included in the training set of Minerva-1B-base-v1.0. We chose to use 6 Billion tokens following recent literature on training SAEs for similar-sized models [11].</p>
      </sec>
      <sec id="sec-1-3">
        <title>Setup.</title>
        <p>We trained our model on a single Nvidia A100 with 80 GB VRAM. A full training run required 200 GPU hours, which roughly equates to 8 days. The final model, which we call sae-Minerva-1B-32x, occupies around 40 GB of disk space including hookpoints to all layers. The final model is available on HuggingFace 3 and can be loaded and used with Sparsify.</p>
      </sec>
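<p>To make the TopK mechanism described above concrete, the following minimal numpy sketch shows an SAE forward pass in which only the k largest latents per input are kept. This is an illustration under toy sizes with hypothetical names, not the Sparsify implementation.</p>

```python
# Illustrative TopK sparse autoencoder forward pass (numpy sketch, not the
# Sparsify code). Training (reconstruction loss, weight tying) is not shown.
import numpy as np

def topk(z, k):
    """TopK activation: keep the k largest entries per row, zero the rest."""
    idx = np.argpartition(z, -k, axis=-1)[:, -k:]
    out = np.zeros_like(z)
    rows = np.arange(z.shape[0])[:, None]
    out[rows, idx] = z[rows, idx]
    return out

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode to the overcomplete latent space, sparsify, decode back."""
    latents = topk(x @ W_enc + b_enc, k)   # (batch, d_latent), k nonzeros per row
    recon = latents @ W_dec + b_dec        # (batch, d_model)
    return latents, recon

# Toy sizes for illustration; the paper's SAE uses d_model = 2048 and an
# expansion factor of 32, i.e. 2048 * 32 = 65,536 latents.
rng = np.random.default_rng(0)
d_model, d_latent, k = 8, 32, 4
W_enc = rng.normal(size=(d_model, d_latent))
W_dec = rng.normal(size=(d_latent, d_model))
b_enc, b_dec = np.zeros(d_latent), np.zeros(d_model)

x = rng.normal(size=(5, d_model))
latents, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
assert latents.shape == (5, d_latent) and recon.shape == (5, d_model)
assert ((latents != 0).sum(axis=1) == k).all()
```

<p>Because exactly k latents are nonzero per input, sparsity is controlled directly rather than through an L1 penalty.</p>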
    </sec>
    <sec id="sec-2">
      <title>3. Auto-Interpretation of Features</title>
      <sec id="sec-2-1">
<p>For finding and explaining latents of the SAE models, we use the auto-interpretability pipeline proposed in [15]. It is implemented via the Delphi library from EleutherAI.4 The library includes tools for generating and scoring text explanations for SAE latents.</p>
        <p>The auto-interpretability pipeline has three main steps: collecting activations, generating explanations, and scoring the explanations.</p>
      </sec>
      <sec id="sec-2-2">
<p>In the following we detail our implementation of the pipeline.</p>
        <p>Collecting Activations. As for the text dataset, we chose to use 20 Million tokens from the Italian subset of the November 2023 Wikipedia dump [20] available on HuggingFace.5 The choice of Wikipedia as our test dataset, rather than a sample of the SAE training data (clean_mc4_it), was made with the purpose of increasing the probability of finding concepts specific to the Italian language and culture, which could have been left out of a relatively small sample of a web-based dataset. We created equal-sized batches from the texts, shuffled them, and then collected their token-level activations. We collected the activations at three hookpoints, namely at layers 2, 8 and 14. We did so with the aim of understanding whether there is any difference in the features found near the beginning, middle, or end of the residual stream. In the following we use the hookpoint notation to refer to layers, namely Layer.N.</p>
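<p>As an illustration of the collection step, the sketch below selects the top-activating token positions for one latent from a matrix of token-level activations. The helper name is hypothetical; this is not the Delphi API.</p>

```python
# Illustrative sketch (not Delphi): from a (num_tokens, num_latents) matrix of
# SAE activations, gather the positions where a given latent fires most strongly.
import numpy as np

def top_activating_positions(acts, latent, n=40):
    """Return indices of the n tokens on which `latent` fires most strongly,
    keeping only positions where the latent actually fired (activation > 0)."""
    col = acts[:, latent]
    order = np.argsort(col)[::-1][:n]
    return [int(i) for i in order if col[i] > 0]
```

<p>In our setting, positions with fewer than 40 firing examples would lead to the latent being skipped in the explanation step.</p>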
        <sec id="sec-2-2-1">
<title>Generating Explanations.</title>
          <p>As for the explanation generation step, we followed the same procedure as [15]. We showed the Explainer LLM 40 examples of the activating tokens and their contexts. We used a context length of 32 tokens. The activating token can be in any of the 32 positions, but is highlighted as "« token »". We show an example of explanation generation in Figure 1.</p>
          <p>To limit the computational cost, we attempted to generate explanations only for a sample of 2,000 latents selected from the pool of 65k. Latents with fewer than 40 examples were skipped. We used the number of latents with enough examples at each hookpoint in the residual stream to highlight their differences.</p>
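<p>The highlighting of the activating token described above can be sketched as follows (an illustration, not the Delphi prompt-construction code; the function name is hypothetical):</p>

```python
# Illustrative sketch: format one activating example for the Explainer prompt,
# marking the activating token as "« token »" within its context window.
def format_example(tokens, firing_index):
    out = []
    for i, tok in enumerate(tokens):
        out.append(f"« {tok} »" if i == firing_index else tok)
    return " ".join(out)

print(format_example(["Il", "gatto", "dorme"], 1))
# → Il « gatto » dorme
```
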
<p>The chosen model to generate explanations is Meta-Llama-3.1-8B-Instruct-AWQ-INT4,6 a quantized version of Meta-Llama-3.1-8B-Instruct [23]. We prompted the model both in English and in Italian. For the English prompt, we used the one provided in [15] for the zero-shot experiment. The Italian version is a direct translation of the English prompt. The translation was made semi-automatically: first, the prompts were translated with Gemini-2.5 Pro.7 Then, the translated prompt was manually revised to ensure its quality.8</p>
          <p>3 https://huggingface.co/alessandrobondielli/sae-Minerva-1B-32x 4 https://github.com/EleutherAI/delphi</p>
          <p>Scoring Explanations. The Scorer model was shown the generated explanation together with a set of five examples and was asked to decide, for each example, whether it corresponded to the explanation, and to output a list of decisions. If the output did not match a list of decisions, it was assigned None. The output was then compared with the ground truth provided by the activations. The model used for scoring was the same one used to generate explanations. As for the prompt and its translation into Italian, we followed the same translation procedure. We evaluated the quality of explanations with accuracy. Specifically, we considered a per-sample accuracy (i.e., how many out of the five examples the scorer model got right) and the average accuracy across latents for the same hookpoint.</p>
          <p>We acknowledge that our choice of using a multilingual, relatively small, and quantized LLM for generating and scoring explanations is far from ideal, and it is not an adequate substitute for either human evaluation or more performing LLMs. The choice of a multilingual model rather than an Italian-only one was made due to the current lack of such models with open weights, high performance, and the capability to follow instructions. This choice also led to prompting the model both in English and Italian; this was done to assess its explanation/scoring capabilities both in its “native” language, albeit on data from another language, and in Italian, in order to limit potential biases in the interpretation of results from using only one or the other language. As for the choice of a medium-sized quantized model, this was made in the interest of limiting the computational costs of our experiments, i.e., both in terms of the memory footprint of the model and of the overall GPU hours. Using larger (including non-quantized) models would have drastically increased both the resources needed and the overall time of the experiments. Nonetheless, we argue that our choice represents a lower-cost alternative to using much larger and costlier models, which could prove especially useful to provide some early insights into the quality of the latents found by the SAE, and of the model being interpreted.</p>
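<p>The per-sample accuracy described in this section can be sketched as follows (illustrative, not the Delphi implementation; unparsable Scorer outputs are treated as None and count as incorrect):</p>

```python
# Hedged sketch of the scoring metric: per-latent accuracy over the five scored
# examples; None (unparsable Scorer output) counts as incorrect.
def per_latent_accuracy(decisions, ground_truth):
    """decisions: Scorer outputs (True/False/None); ground_truth: bools."""
    correct = sum(1 for d, g in zip(decisions, ground_truth)
                  if d is not None and d == g)
    return correct / len(ground_truth)

# e.g. one latent scored on five examples, with one unparsable output
acc = per_latent_accuracy([True, None, False, True, True],
                          [True, False, False, False, True])  # 3/5 correct
```

<p>Averaging these per-latent values over all latents at a hookpoint gives the aggregate accuracy reported below.</p>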
<p>Authors in [15] estimate a cost in the order of hundreds or thousands of dollars for explaining and scoring 100k latents with larger or commercial models; our experiments, in contrast, can be easily replicated on a single GPU. In our case, generating and scoring explanations for 2,000 latents at three different hookpoints, in two different languages, took 0.5 GPU hours each on a single Nvidia A100, for a grand total of 3 GPU hours. Given the size of the model used, the experiments could also be replicated on much less performant hardware, provided a trade-off on GPU hours.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
<p>In the following, we present our results. First, we show a quantitative evaluation of the extracted latents, and the performance of the generation and scoring pipeline, both with Italian and English prompts. To explore the results in greater depth, we also perform a qualitative evaluation. We consider the explanations that received the highest scores by the scorer model. We use the results to discuss the feasibility of the proposed approach on Italian SLMs, as well as potential shortcomings.</p>
      <sec id="sec-3-1">
        <title>4.1. Quantitative Evaluation</title>
        <p>The core of our quantitative analysis is based on the results we obtained using the Delphi library, with the configuration presented in Section 3.</p>
        <p>Quality of the Latents. To evaluate the quality of the latents obtained via the SAE encoding, several metrics can be used. Recall that we collected latent activations using 20 Million tokens from the Italian subset of Wikipedia. Note also that here we are not yet using prompts, so we do not distinguish between Italian and English. Table 2 provides several common metrics used to evaluate the quality of the extracted latents at each hookpoint. First, we look at the fraction of alive latents. A latent is considered alive if at least one input token in the dataset made it fire. With the exception of Layer.8, the other two hookpoints have much smaller fractions of alive latents than is typical for SAEs (see for example results reported in</p>
<p>
            <xref ref-type="bibr" rid="ref15">10</xref>
             and [11]). This may be the result of several factors. On the SAE side, we could hypothesize an overcomplete latent space for the evaluation data, i.e. a latent space too broad for encoding the evaluation data. Recall in fact that we used mC4 to train the SAE, and evaluated it on Wikipedia, which may present less variety in terms of texts.</p>
        <p>On the Language Model side, we could hypothesize that the latent space of the analyzed model is very anisotropic at both the earliest and latest layers, while more isotropic near the middle of the stack. This, however, is in direct contrast with works such as [24], and thus requires a more in-depth analysis, which we leave to future work. Another interesting aspect to consider are weak and strong single-token latents, that is, latents that fire on a specific token only. Weak ones are those for which the token in question makes many other latents fire; strong ones are cases where the token preferentially activates the specific latent. We observe that Layer.2 is heavily biased towards single-token latents. This may indicate that the earliest layers still leverage the embedding representation quite strongly. Finally, we see that latents that fired either more than 1% or more than 10% of the time become fewer as we move along the residual stream. These latents may be used to store single-token concepts of words such as function ones.</p>
        <p>Table 2: Latent activity statistics across selected layers.
Metric | Layer.2 | Layer.8 | Layer.14
Fraction of latents alive (%) | 72.02 | 95.16 | 84.65
Latents fired more than 1% of the time (%) | 0.27 | 0.40 | 0.38
Latents fired more than 10% of the time (%) | 0.06 | 0.05 | 0.01
Weak single-token latents (%) | 9.93 | 2.20 | 2.77
Strong single-token latents (%) | 12.40 | 0.55 | 0.47</p>
        <sec id="sec-3-1-1">
          <title>Quality of the Explanations.</title>
          <p>To evaluate the quality of explanations, we consider the results of the explanation generation and scoring pipeline. Specifically, for each latent, we compute the accuracy at distinguishing between sequences that activate and do not activate the latent. Figures 2 and 3 show, respectively, the distribution of accuracy for the scorer model using Italian and English prompts for each hookpoint in the residual stream.</p>
          <p>We observe that, in both cases, there are significant differences both in distribution and averages for the three hookpoints. We also observe that explanations for latents extracted from later layers seem to be easier to score correctly for the scorer model. This may indicate that concepts identified in later layers are, on average, more easily interpretable by an LLM. The accuracy scores obtained using the Italian prompt are generally higher than those for the English one, with average scores ranging from 0.64 to 0.69; the English ones, in contrast, range from 0.55 to 0.62. However, these results in isolation cannot be taken as a direct indication that explanations in Italian are better than English ones. It may as well be the result of poorer and broader explanations provided by the Explainer model.</p>
          <p>We also plot the aggregate confusion matrices over all the predictions of both prompts. The confusion matrices are shown in Figure 4. While the model prompted in Italian seems to fare better in all metrics except for True Positives, we also see that the number of times the model was not able to follow instructions and provide a prediction with the Italian prompt is three times higher than with the English one. This may be a further indication that the Explainer/Scorer model used struggles with Italian.</p>
        </sec>
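<p>The latent-activity statistics reported in Table 2 can be sketched as follows, assuming a (num_tokens, num_latents) activation matrix. This is an illustration of the metric definitions, not the code used in the experiments.</p>

```python
# Illustrative computation of Table 2-style statistics from SAE activations.
import numpy as np

def latent_stats(acts):
    """acts: (num_tokens, num_latents) array of SAE latent activations."""
    fired = acts != 0                  # boolean firing matrix
    fire_rate = fired.mean(axis=0)     # fraction of tokens each latent fires on
    return {
        "alive_pct": 100 * float((fire_rate > 0).mean()),       # fired at least once
        "fired_gt_1pct": 100 * float((fire_rate > 0.01).mean()),
        "fired_gt_10pct": 100 * float((fire_rate > 0.10).mean()),
    }
```
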
      </sec>
<sec id="sec-3-2">
        <title>4.2. Qualitative Evaluation</title>
        <p>To dig deeper into the quality of the explanations, we directly looked at them and provide examples of seemingly good and bad explanations. Specifically, we sampled from the 50 explanations that received the highest scores by the Scorer, both in English and Italian.</p>
        <p>As for the Italian explanations, we immediately observed that a large fraction of them suffer from Degenerate Repetition [25]: the model starts to generate the same token or sequence of tokens over and over. On the contrary, the English ones do not suffer from this issue. However, if we look at the quality of the explanations, aside from repetitions, we observe that at least some of the Italian ones are quite relevant to the examples, and while sometimes slightly missing the mark, they highlight some interesting aspects of the tokens that fire the latent.</p>
        <p>Among these, we can clearly see that Layer.2 is mostly represented by single-token latents: the token “ale” as part of “federale” (federal), in several contexts, or the token “letto”, as both a noun (bed) and a verb (read). Layer.14 latents, on the other hand, appear to represent more abstract concepts. For example, we see latents firing on the final number of a year date, and a very interesting latent firing on the concept of competition (see Fig. ??). Layer.8 explanations are generally more confusing and less interesting. Examples are reported in Figure 5 with the relative explanation, cut to avoid showing repetitions.</p>
        <p>As for the English explanations, on the other hand, we observed that most of them actually miss the mark. In fact, they often provide an explanation related to the contexts, rather than the firing tokens. This may be due to the fact that, while it is specified in the prompt, we use Italian texts as examples but instructions and expected outputs are in English. Nevertheless, we observe an interesting trend: most explanations, at all layers, that actually focus on the firing tokens refer to functional aspects of the text, including punctuation marks, special characters, and functional words. For example, Latent 1818 of Layer.14 is explained as “Prepositions and conjunctions used to connect words or phrases in Italian text, such as "a", "di", "nel", "in", "su", "da", "al", "nei", "all", "sulle", "col" [...]”. This is in contrast with what we observed for the Italian explanations.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Discussion of Key Findings</title>
        <p>In the following, we highlight some of the key aspects that emerged from the experiments.</p>
        <sec id="sec-3-3-1">
          <title>SAEs can find partially interpretable features in Italian Small Language Models.</title>
          <p>First, we observe that using a SAE we are able to extract features that somewhat align to interpretable concepts, despite some limitations that we can mostly attribute to the quality of the training data, both for the original model and the SAE, and to the limitations of the auto-interpretability pipeline (see below). It is possible that leveraging a dataset more attuned to Italian culture would yield better results in finding relevant latents.</p>
          <p>Auto-interpretability is promising, but currently shows limitations for Italian. Auto-interpretability pipelines are definitely a promising approach for simplifying and reducing the costs of finding explanations for the latents of SAEs. Our experiments suggest in fact that this is a low-cost alternative that is nonetheless able to deliver some interesting results. Nevertheless, we observed two main limitations that, we can argue, are actually two sides of the same coin. On the one side, the Explainer model showed some limitations in understanding the task and providing coherent texts for the explanations, while the Scorer model performed quite poorly in the binary classification task. This is especially true in the case of language mixing, i.e. when the model is prompted in its “main” language, i.e. English, but has to work on another language, in this case Italian. On the other side, the size of the model used in our experiments could severely limit its performance.</p>
          <p>Thus, both issues could be solved either by leveraging a stronger Italian-centric model as the Explainer/Scorer, or by using a generally larger and better performing model. However, as for the first solution, there are currently no such models on par with English ones in the 7-15B parameter range, which would allow for reducing the cost. As for the second solution, this would dramatically increase the costs, both computational and monetary.</p>
          <p>Different behaviours in the residual stream. We observed some relevant differences in the quality and types of latents that are properly identified at various points of the residual stream. In general, we observed that latents obtained from earlier in the stream are more relevant to single tokens and grammatical aspects of the language, while latents at later points of the stream show a slight tendency towards more abstract conceptualizations.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>5. Conclusions and Future Works</title>
        <p>In this paper, we have shown that SAEs can partly uncover interpretable concepts in Italian Small Language Models. Specifically, we did so by training a SAE model on the residual stream of the Minerva-1B-base-v1.0 SLM, and then applying an auto-interpretability pipeline to generate explanations for its latents.</p>
        <p>Our findings suggest that SAEs can be used to this end, and that there exists a hierarchical representation within the model, with earlier layers showing more token-centric features and later layers more abstract concepts. As for the auto-interpretability pipeline, while promising for its low cost, it underscored the need for better language-specific tools for Italian.</p>
        <p>Moving forward, we aim to explore several avenues. First, we plan to scale our experiments in two directions: on the one hand, we aim to train SAEs on larger Italian models, e.g. larger variants of Minerva as well as others; on the other hand, we observe that we need to improve the models used for auto-interpretability, in order to obtain more reliable explanations. This could be achieved both by scaling them up substantially, and by tuning Italian-speaking models to the specific tasks of latent explanation and scoring. Second, we plan to leverage SAEs and auto-interpretability to address potential differences in the representations of models pre-trained specifically on Italian data, e.g. Minerva and Velvet [26], and multilingual models that received only fine-tuning in Italian, like the LLaMAntino variants [27] and Cerbero [
          <xref ref-type="bibr" rid="ref32">28</xref>
          ]. Finally, we plan to explore the larger latent space to attempt to uncover features linked specifically to Italian-centric concepts, in addition to properties of the Italian language.</p>
        <p>This work is an early first step in exploring interpretability research using Sparse Autoencoders for non-English-centric Language Models. Albeit limited in scope, we are optimistic that it may provide a relevant foundation for this yet under-explored research area, both in terms of approach and the release of open models for the community.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>This work has been supported by the PNRR MUR project PE0000013-FAIR (Spoke 1), funded by the European Commission under the NextGeneration EU programme, and the EU EIC project EMERGE (Grant No. 101070918).</p>
      </sec>
      <sec id="sec-4-2">
        <title>Limitations</title>
        <p>Our initial effort to interpret Italian SLMs using Sparse
Autoencoders has several limitations. The choice of
the smaller Minerva-1B-base-v1.0 model, driven by
computational constraints, means our findings might not
generalize to larger Italian models. The SAE’s training
data, while substantial for Italian, might not fully capture
all linguistic nuances, potentially affecting the quality
of learned features. Additionally, using different data to
train and evaluate the SAE, while arguably not
problematic in principle, may have introduced some unwanted
biases.</p>
        <p>A key limitation stems from our cost-effective
auto-interpretability pipeline, which relies on a relatively
small, quantized multilingual LLM. This model
struggled with generating coherent Italian explanations, often
repeating itself, and performed poorly in scoring when
mixing languages. This highlights the strong dependence
of explanation quality on the Explainer/Scorer model's
capabilities, and the current lack of robust, affordable,
Italian-specific tools.</p>
        <p>Finally, our analysis was based on a sample of 2000
latents across only three layers, rather than the entire SAE
latent space. While insightful, this limited scope and
the subjective nature of our qualitative assessment mean we
cannot yet claim a comprehensive understanding of the
model's internal workings.</p>
        <p>Guidelines:
You will be given a list of text examples in Italian on which
special words are selected and between delimiters like &lt;&lt; this &gt;&gt;.</p>
        <p>If a sequence of consecutive tokens all are important, the
entire sequence of tokens will be contained between delimiters
&lt;&lt; just like this &gt;&gt;. How important each token is for the
behavior is listed after each example in parentheses.
- Try to produce a concise final description. Simply describe the
text latents that are common in the examples, and what patterns
you found.
- If the examples are uninformative, you don't need to mention them.
- Don't focus on giving examples of important tokens, but try to
summarize the patterns found in the examples.
- Do not mention the marker tokens (&lt;&lt; &gt;&gt;) in your explanation.
- Do not make lists of possible explanations. Keep your
explanations short and concise.
- The last line of your response must be the formatted explanation,
using [EXPLANATION]:
{{prompt}}</p>
        <p>Sei un meticoloso ricercatore di intelligenza artificiale che
conduce un'importante indagine sugli schemi presenti nella
lingua italiana. Il tuo compito e' analizzare il testo e fornire
una spiegazione che racchiuda in modo esauriente i possibili
schemi in esso riscontrati.</p>
        <p>Linee guida:
Ti verra' fornito un elenco di esempi di testo in italiano in cui
parole speciali sono selezionate e inserite tra delimitatori come
&lt;&lt; questo &gt;&gt;. Se una sequenza di token consecutivi e' tutta
importante, l'intera sequenza di token sara' contenuta tra
delimitatori &lt;&lt; proprio come questo &gt;&gt;. L'importanza di ciascun
token per il comportamento e' elencata dopo ogni esempio tra
parentesi.
- Cerca di produrre una descrizione finale concisa. Descrivi
semplicemente gli elementi latenti del testo comuni negli
esempi e gli schemi che hai trovato.
- Se gli esempi non sono informativi, non e' necessario menzionarli.
- Non concentrarti sul fornire esempi di token importanti, ma
cerca di riassumere gli schemi trovati negli esempi.
- Non menzionare i token marcatori (&lt;&lt; &gt;&gt;) nella tua spiegazione.
- Non creare elenchi di possibili spiegazioni. Mantieni le tue
spiegazioni brevi e concise.
- L'ultima riga della tua risposta deve essere la spiegazione
formattata, usando [SPIEGAZIONE]:
{{prompt}}</p>
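        <p>To make the prompt protocol above concrete, the following is a minimal Python sketch of how activating examples could be serialized into the delimiter-and-score format the guidelines describe, and how the final [EXPLANATION] line could be parsed from a model response. The helper names are hypothetical, not part of the paper's released pipeline; the chr(60)/chr(62) indirection merely builds the two-character delimiters without placing literal angle brackets in this XML document.

```python
# Hypothetical sketch of the explainer prompt protocol; names are illustrative.
OPEN = chr(60) * 2   # the two-character opening delimiter
CLOSE = chr(62) * 2  # the two-character closing delimiter

def format_example(tokens, activations, threshold=0.0):
    """Wrap runs of consecutive important tokens in delimiters and append
    per-token importance scores in parentheses, as the guidelines describe."""
    parts, run = [], []
    for tok, act in zip(tokens, activations):
        if act > threshold:
            run.append(tok)
            continue
        if run:
            parts.append(OPEN + " ".join(run) + CLOSE)
            run = []
        parts.append(tok)
    if run:
        parts.append(OPEN + " ".join(run) + CLOSE)
    scores = ", ".join(f"{tok} ({act:.1f})"
                       for tok, act in zip(tokens, activations) if act > threshold)
    return " ".join(parts) + (f" ({scores})" if scores else "")

def parse_explanation(response):
    """Return the text after the last [EXPLANATION]: marker, or None."""
    for line in reversed(response.splitlines()):
        if "[EXPLANATION]:" in line:
            return line.split("[EXPLANATION]:", 1)[1].strip()
    return None
```

A real pipeline would feed several such formatted examples into the prompt template and retry when the marker line is missing; this sketch only covers the serialization and parsing steps.</p>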
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Explainer Prompts</title>
      <sec id="sec-5-1">
        <p>In Figure 6 we provide the prompts fed to the Explainer model, both in English (original from [15]) and Italian (translation).</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using
these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>2023. URL: https://arxiv.org/abs/2309.08600.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>arXiv:2309</source>
          .
          <fpage>08600</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Elhage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hume</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olsson</surname>
          </string-name>
          , N. Schiefer,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Thread</surname>
          </string-name>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherlis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Jermyn</surname>
          </string-name>
          , J. Benton,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>ral networks</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2210.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <year>01892</year>
          . arXiv:
          <fpage>2210</fpage>
          .
          <year>01892</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Anders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Neo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoelscher-Obermaier</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. N.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bricken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Templeton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Batson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>cuits Thread</surname>
          </string-name>
          (
          <year>2023</year>
          ). https://transformer-circuits.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>pub/2023/monosemantic-features/index.html.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Templeton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Conerly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marcus</surname>
          </string-name>
          , J. Lind-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>claude 3 sonnet</article-title>
          , Transformer Circuits Thread
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          (
          <year>2024</year>
          ). URL: https://transformer-circuits.pub/2024/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          , T. D. la Tour,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tillman</surname>
          </string-name>
          , G. Goh, R. Troll,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>and evaluating sparse autoencoders</article-title>
          , arXiv preprint [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shojaee</surname>
          </string-name>
          , I. Mirzadeh,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Horton</surname>
          </string-name>
          , arXiv:
          <fpage>2406</fpage>
          .04093 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Farajtabar</surname>
          </string-name>
          , The illusion of thinking: [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lieberum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajamanoharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Conmy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          2025.
          <article-title>URL: https://ml-site.cdn-apple</article-title>
          .com/papers/ autoencoders everywhere all at once on gemma
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>the-illusion-of-thinking</article-title>
          .
          <source>pdf . 2</source>
          , in: Y.
          <string-name>
            <surname>Belinkov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Jumelet</surname>
            , H. Mo[2]
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Cammarata</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Schubert</surname>
          </string-name>
          , G. Goh, hebbi, A. Mueller, H. Chen (Eds.),
          <source>Proceedings of</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Carter</surname>
          </string-name>
          ,
          <article-title>Zoom in: An introduction the</article-title>
          7th BlackboxNLP Workshop: Analyzing and
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>to circuits, Distill</source>
          <volume>5</volume>
          (
          <year>2020</year>
          )
          <article-title>e24</article-title>
          .
          <article-title>Interpreting Neural Networks for NLP</article-title>
          , Association [3]
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Olshausen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Field</surname>
          </string-name>
          ,
          <article-title>Sparse coding with for Computational Linguistics</article-title>
          , Miami, Florida,
          <string-name>
            <surname>US</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>an overcomplete basis set: A strategy employed by 2024</article-title>
          , pp.
          <fpage>278</fpage>
          -
          <lpage>300</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          v1?,
          <source>Vision research 37</source>
          (
          <year>1997</year>
          )
          <fpage>3311</fpage>
          -
          <lpage>3325</lpage>
          .
          <year>2024</year>
          .blackboxnlp-
          <volume>1</volume>
          .19/. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          . [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Riggs</surname>
          </string-name>
          , R. Huben, blackboxnlp-
          <volume>1</volume>
          .
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Sharkey</surname>
          </string-name>
          , Sparse autoencoders find highly [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>arXiv preprint arXiv:2407.01513</source>
          (
          <year>2024</year>
          ). ternational Conference on Computational Linguis[13]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Williams, tics, Language Resources and Evaluation (LREC-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          , L. Choshen,
          <string-name>
            <surname>COLING</surname>
          </string-name>
          <year>2024</year>
          ),
          <article-title>ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italy,
          <year>2024</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          , Findings of the sec- pp.
          <fpage>9422</fpage>
          -
          <lpage>9433</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <article-title>ond BabyLM challenge: Sample-eficient pretrain- lrec-main</article-title>
          .
          <volume>823</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>ing on developmentally plausible corpora</article-title>
          , in: M. Y. [20]
          <string-name>
            <given-names>W.</given-names>
            <surname>Foundation</surname>
          </string-name>
          , Wikimedia downloads, ???? URL:
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          , [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Makhzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <article-title>K-sparse autoencoders</article-title>
          , arXiv
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>E. G.</surname>
          </string-name>
          Wilcox (Eds.),
          <source>The 2nd BabyLM Challenge preprint arXiv:1312.5663</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>at the 28th Conference on Computational Natural</source>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kale</surname>
          </string-name>
          , R. Al-
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Language</given-names>
            <surname>Learning</surname>
          </string-name>
          , Association for Computational Rfou,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          , mT5:
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Miami, FL, USA,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          . URL:
          <article-title>A massively multilingual pre-trained text-to-text</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          https://aclanthology.org/
          <year>2024</year>
          .conll-babylm.1/. transformer,
          <source>in: Proceedings of the 2021 Con</source>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Capone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Lenci,
          <article-title>ConcreteGPT: A ference of the North American Chapter of the</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>baby GPT-2 based on lexical concreteness and cur- Association for Computational Linguistics: Hu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          , L. Choshen, putational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          , E. G. Wilcox (Eds.),
          <string-name>
            <surname>The</surname>
            <given-names>URL</given-names>
          </string-name>
          : https://aclanthology.org/
          <year>2021</year>
          .naacl-main.
          <volume>41</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>2nd BabyLM Challenge at the 28th Conference on doi:10</source>
          .18653/v1/
          <year>2021</year>
          .naacl-main.
          <volume>41</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Computational Natural Language Learning</surname>
            , Asso- [23]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Grattafiori</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Dubey</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Jauhri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Pandey</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Ka-
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>USA</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>196</lpage>
          . URL: https://aclanthology. ten, A.
          <string-name>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models,</source>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          org/
          <year>2024</year>
          .conll-babylm.
          <volume>16</volume>
          /. arXiv preprint arXiv:
          <volume>2407</volume>
          .21783 (
          <year>2024</year>
          ). [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Paulo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Juang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Belrose</surname>
          </string-name>
          , Auto- [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Razzhigaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mikhalchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Goncharova</surname>
          </string-name>
          , I. Os-
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <article-title>language models</article-title>
          ,
          <source>arXiv preprint arXiv:2410</source>
          .
          <article-title>13928 of learning: Anisotropy and intrinsic dimensions</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          (
          <year>2024</year>
          ).
          <article-title>in transformer-based models</article-title>
          , in: Y. Graham, [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Meek</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Khakzar</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Krueger</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Barez</surname></string-name>
          ,
          <article-title>Sparse autoencoders reveal universal feature spaces across large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2410.06981</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          M. Purver (Eds.),
          <source>Findings of the Association for Computational Linguistics: EACL 2024</source>
          , Association for Computational Linguistics,
          <year>2024</year>
          , pp.
          <fpage>868</fpage>
          -
          <lpage>874</lpage>
          . URL: https://aclanthology.org/2024.findings-eacl.58/.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Lindsey</surname></string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Gurnee</surname></string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Ameisen</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Chen</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Pearce</surname></string-name>
          ,
          <string-name><given-names>N. L.</given-names> <surname>Turner</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Citro</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Abrahams</surname></string-name>
          ,
          <string-name><given-names>T. B.</given-names> <surname>Thompson</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Zimmerman</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Rivoire</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Conerly</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Olah</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Batson</surname></string-name>
          ,
          <article-title>On the biology of a large language model</article-title>
          ,
          <source>Transformer Circuits Thread</source>
          (
          <year>2025</year>
          ). URL: https://transformer-circuits.pub/2025/attribution-graphs/biology.html.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name><given-names>H.</given-names> <surname>Li</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Lan</surname></string-name>
          ,
          <string-name><given-names>Z.</given-names> <surname>Fu</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Cai</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Collier</surname></string-name>
          , in:
          <source>Proceedings of the 37th International Conference on Neural Information Processing Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>72888</fpage>
          -
          <lpage>72903</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Team</surname></string-name>
          ,
          <article-title>Almawave presents Velvet: the sustainable and high-performance Italian AI</article-title>
          ,
          <year>2025</year>
          . URL: https://www.almawave.com.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name><given-names>R.</given-names> <surname>Orlando</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Moroni</surname></string-name>
          ,
          <string-name><given-names>P.-L.</given-names> <surname>Huguet Cabot</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Conia</surname></string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Barba</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Orlandini</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Fiameni</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Navigli</surname></string-name>
          ,
          <article-title>Minerva LLMs: The first family of large language models trained from scratch on Italian data</article-title>
          , in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</source>
          , CEUR Workshop Proceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          . URL: https://aclanthology.org/2024.clicit-1.77/.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name><given-names>P.</given-names> <surname>Basile</surname></string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Musacchio</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Polignano</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Siciliani</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Fiameni</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Semeraro</surname></string-name>
          ,
          <article-title>LLaMAntino: LLaMA 2 models for effective text generation in Italian language</article-title>
          ,
          <year>2023</year>
          . arXiv:2312.09993.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <string-name><given-names>F. A.</given-names> <surname>Galatolo</surname></string-name>
          ,
          <string-name><given-names>M. G.</given-names> <surname>Cimino</surname></string-name>
          ,
          <article-title>Cerbero-7b: A leap forward in language-specific LLMs through enhanced chat corpus generation and evaluation</article-title>
          ,
          <source>arXiv preprint arXiv:2311.15698</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name><given-names>G.</given-names> <surname>Sarti</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Nissim</surname></string-name>
          ,
          <article-title>IT5: Text-to-text pretraining for Italian language understanding and generation</article-title>
          , in: N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>