<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Role of Eye-Tracking Data in Encoder-Based Models: an In-depth Linguistic Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucia Domenichelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Dini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale “A. Zampolli” (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper falls within ongoing research aimed at enhancing the human interpretability of neural language models by incorporating physiological data. Specifically, we leverage eye-tracking data collected during reading to explore how such information can guide model behavior. We train a multilingual encoder model to predict eye-tracking features from the Multilingual Eye-tracking Corpus (MECO) and analyze the resulting shifts in model attention patterns, focusing on how attention redistributes across linguistically informed categories such as part of speech, word position, word length, and distance from the syntactic head after fine-tuning. Moreover, we test how this attention shift impacts the representations of the words involved in the embedding space. The study covers both Italian and English, enabling a cross-linguistic perspective on attention and representation shifts in multilingual encoders grounded in human reading behavior.</p>
      </abstract>
      <kwd-group>
        <kwd>Eye-tracking</kwd>
        <kwd>Neural Attention</kwd>
        <kwd>Multilingual models</kwd>
        <kwd>Embedding space</kwd>
        <kwd>Interpretability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Neural language models (NLMs) now match or even surpass human benchmarks on many NLP tasks, yet the logic behind their predictions remains largely hidden behind billions of parameters. To make these systems more transparent and data-efficient, researchers are increasingly borrowing ideas from cognitive science, grounding both training and evaluation in how people actually learn and process language (e.g. [1, 2, 3]). Among the most informative cognitive signals of human language processing is eye-tracking (ET). Decades of psycholinguistic work show that fixation times, regressions, and skips mirror both early lexical access and later integrative processes underlying text comprehension [4, 5]. Leveraging these signals has already boosted model accuracy on a variety of downstream tasks, ranging from core linguistic tasks [6] to more applied tasks such as sentiment analysis [7], language proficiency assessment [8], and machine reading comprehension [9], while also giving us a new lens on model interpretability. Studies by Sood et al. [10] and Eberle et al. [11] found that transformer attention does not always line up with human gaze, whereas Bensemann et al. [12] and Wang et al. [13] revealed stronger links in specific layers, hinting at a layered correspondence between reading behavior and neural representations.</p>
      <p>Extending this direction, Dini et al. [14] investigate how injecting reading-related information into NLMs through different fine-tuning strategies on ET data affects their attention patterns, as well as their performance on downstream tasks and their representation space. Their findings show that this intermediate process increases the correlation between model attention and human attention and leads to a compression of the embedding space, without generally degrading performance on downstream tasks.</p>
      <p>Building on this foundational framework, this paper aims to further highlight the effects of incorporating information about human reading behavior into an NLM from a linguistically informed perspective. Specifically, we examine how fine-tuning on eye-tracking signals leads to shifts in model attention, and how these shifts affect the structure of word representations. To explore this, we extract a set of linguistic features, capturing progressively more complex language phenomena, from the input text and analyze how attention is redistributed across word classes defined by these features. In parallel, we assess how these attention shifts influence the embedding space, both at a global level and within the local representational geometry of specific word classes. The code for our experiments is publicly available on GitHub.</p>
      <sec id="sec-1-1">
        <title>2. Related work</title>
        <p>Our study intersects two complementary lines of research within NLM interpretability. The first investigates ET data as a diagnostic signal to evaluate the alignment between model behavior and human cognitive processing, particularly through the lens of attention mechanisms. The second focuses on analysing models' attention mechanisms (Section 2.2) and representational space (Section 2.3).</p>
        <sec id="sec-1-1-1">
          <title>2.1. Eye-tracking and NLMs</title>
          <p>In recent years, eye-tracking has emerged as a prominent physiological signal in NLP research due to its affordability and ease of collection compared to methods like fMRI or MEG. Public resources such as the GECO corpus [15], the MECO corpus [16], and the WE-RDD dataset [17] now let researchers probe gaze behaviour at scale across languages and reading paradigms.</p>
          <p>Work with these corpora has split in two directions. The first injects gaze-derived features into neural architectures, typically lifting accuracy on downstream tasks. The second, which motivates our study, treats ET as a diagnostic for a model's internal workings.</p>
          <p>The first systematic comparison came from Sood et al. [18], who matched attention maps from CNNs, LSTMs and Transformers against human fixations. Their findings reveal that while transformers performed the best, they showed the weakest alignment with gaze. Eberle et al. [11] confirmed that even after task-specific fine-tuning, large Transformers stayed distant from human reading patterns. Conversely, Bensemann et al. [12] reported that raw dwell times correlate strongly with the earliest BERT layers, a relation that persists as model size grows. Morger et al. [19] extended the inquiry cross-lingually and found robust correlations, especially for monolingual encoders, between human word-importance rankings and model saliency. Most recently, Wang et al. [20] showed that deeper layers of NLMs once again echo fixation metrics, hinting at a layered, non-monotonic link between model depth and cognitive fidelity.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.2. Model Attention Dynamics</title>
          <p>The role of attention mechanisms in NLMs has been a subject of extensive research and debate. While attention weights are often interpreted as providing insight into model reasoning, a growing body of research has questioned their reliability as faithful explanations of model decisions. Some studies suggest that attention can highlight important input elements, yet others argue that attention distributions can be manipulated without significantly affecting predictions, casting doubt on their explanatory power [21, 22]. In response to these concerns, alternative attribution methods have been proposed, such as attention rollout [23] and gradient-based techniques [24], which aim to better capture the pathways through which information influences predictions. As part of this debate, a parallel line of work has explored whether attention aligns with known linguistic structures, such as syntactic dependencies or PoS categories, offering a complementary perspective on its interpretability.</p>
          <p>The foundational study by Clark et al. [25] showed that certain attention heads in BERT consistently focus on syntactic phenomena, such as attending to an entity's determiners or subjects attending to their verbs. However, fine-tuning on syntactic or semantic tasks had minimal effect on altering self-attention patterns. Vig and Belinkov [26] conducted a comprehensive analysis of attention head interpretability in GPT-2 using both visualization and quantitative measures. Their results indicate a layer-specific linguistic sensitivity, with different types of linguistic information, such as PoS and syntactic dependencies, being more salient in particular layers. They also found stronger alignment with syntactic dependencies in the model's middle layers. Htut et al. [27] directly evaluated the extent to which attention aligns with gold-standard dependency parses. By computing the correspondence between attention distributions and syntactic head-dependent pairs, they showed that BERT's attention does not systematically reflect syntactic dependency structures, particularly in deeper layers.</p>
          <p>Taken together, these studies suggest that while attention mechanisms can exhibit linguistically meaningful behavior in isolated cases, especially in specific layers or individual heads, they do not consistently encode syntactic or morpho-syntactic structure.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.3. Geometry of the embedding space</title>
          <p>Transformer models learn a high-dimensional embedding space in which every token is represented by a dense vector that encodes both meaning and syntax. A consistent finding is that these vectors occupy only a narrow cone of the space, an anisotropic layout sometimes called the representation degradation effect [28, 29, 30]. In NLP, such behaviour is often viewed as harmful because it can hide fine-grained linguistic cues [31, 32, 33]. Yet theory and broader machine-learning evidence show that anisotropy can arise naturally under stochastic gradient descent and may even aid generalization, especially when models project data onto low-dimensional manifolds [34, 35, 36, 37]. In this respect, studying the impact of various fine-tuning objectives and downstream tasks provides important insights into how they shape the geometry of the embedding space [34, 35, 36]. While still relatively limited, a growing body of work has begun to examine the relationship between embedding space properties and linguistic phenomena. For example, Hernandez and Andreas [38] show that linguistic features tend to be encoded in lower-dimensional subspaces in the early layers of both ELMo and BERT, and that relational features (like dependency relations between pairs of words) are encoded less compactly than categorical features like part of speech. More recently, Cheng et al. [39] analyzed representation compression in pre-trained language models from both geometric and information-theoretic perspectives. Their findings reveal a strong correlation between these two views and show that the intrinsic geometric dimension of linguistic data is predictive of its coding length under the language model.</p>
          <p>To the best of our knowledge, no systematic study has examined how eye-tracking fine-tuning affects attention patterns and the resulting embedding representations across different linguistic phenomena. Moreover, cross-linguistic analyses of these changes following cognitively motivated fine-tuning remain scarce.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Dataset</title>
      <p>For our analysis, we leverage two distinct datasets: the Multilingual Eye-tracking Corpus (MECO), used to fine-tune the model on human gaze modeling, and treebanks from the Universal Dependencies (UD) project, used to extract linguistically motivated features and to compute the model attention shifts and representation structure induced by fine-tuning on ET data.</p>
      <sec id="sec-2-1">
        <title>3.1. Eye-tracking data: The MECO Corpus</title>
        <p>MECO [16] is a multilingual collection featuring reading behavior from both native (L1) and second-language speakers across 13 languages. We focus on the L1 subsets for English and Italian, chosen for their typological diversity and data completeness, allowing for a controlled yet cross-linguistic perspective on gaze modeling.</p>
        <p>Each participant in MECO read 12 encyclopedic-style texts covering general knowledge topics. To ensure consistency and limit computational costs, we selected the largest subsets of users who had read the majority of sentences. For Italian, we included 9 participants who read all sentences. For English, since no participant completed the full set, we selected 25 participants who all read the same set of sentences, missing only two in common.</p>
        <p>We used five ET features intended to represent early, late and contextual signals of human reading processes: First Fixation Duration: the duration of the first fixation landing on the word; Gaze Duration: the summed duration of fixations on the word in the first pass, i.e., before the gaze leaves it for the first time; Total Reading Time: the cumulative amount of time spent reading a word, capturing both fixations and potential interruptions (e.g., regressions or pauses); First-run Number of Fixations: the number of fixations on a word during the first pass; Total Number of Fixations: the number of discrete fixations on areas of interest overall.</p>
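        <p>To make these measures concrete, the following is a minimal sketch of how the five word-level features can be derived from a chronological fixation log. The record layout (word index, fixation duration) and the helper name are illustrative assumptions and do not mirror the format of the released MECO files.</p>
        <preformat>
# Sketch (assumed input): fixations = [(word_idx, duration_ms), ...] in
# chronological order for one participant reading one sentence.
def word_level_et_features(fixations, n_words):
    feats = {w: {"ffd": 0.0, "gd": 0.0, "trt": 0.0,
                 "nfix_first": 0, "nfix_total": 0} for w in range(n_words)}
    first_pass_done = set()      # words whose first-pass reading has ended
    prev_word = None
    for word, dur in fixations:
        f = feats[word]
        f["trt"] += dur          # Total Reading Time: every fixation on the word
        f["nfix_total"] += 1     # Total Number of Fixations
        if word not in first_pass_done:
            if f["nfix_first"] == 0:
                f["ffd"] = dur   # First Fixation Duration
            f["gd"] += dur       # Gaze Duration: first-pass fixations only
            f["nfix_first"] += 1 # First-run Number of Fixations
        if prev_word is not None and prev_word != word:
            first_pass_done.add(prev_word)  # the gaze left it: first pass is over
        prev_word = word
    return feats
        </preformat>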
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Universal Dependencies Treebanks</title>
        <p>To analyze how model attention weights and the embedding space shift following fine-tuning on eye-tracking data, we relied on linguistically annotated corpora from the UD treebanks [40]. Specifically, for Italian, we employed the subsection corresponding to the training set of the Italian Stanford Dependency Treebank (ISDT), which contains ≈ 13,000 sentences drawn from a variety of textual genres. For English, we used the training set of the English Web Treebank (EWT) [41], including ≈ 12,000 sentences, also multi-genre. UD corpora were chosen due to their gold-standard syntactic and part-of-speech annotations, which provide a reliable foundation for our fine-grained linguistic analyses. Additionally, the cross-linguistically consistent annotation schema offered by UD enables meaningful comparisons across typologically distinct languages.</p>
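        <p>As an illustration of the feature extraction step, the sketch below reads a CoNLL-U file and derives, for every syntactic word, the four features used in our analyses (length in characters, PoS, sentence position, and signed distance from the syntactic head). The helper is hypothetical and is not the released code of this work.</p>
        <preformat>
def conllu_features(path):
    """Parse a CoNLL-U file (standard 10-column format) into per-word features."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:                      # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if "-" not in cols[0] and "." not in cols[0]:
                    current.append(cols)      # skip multi-word / empty tokens
    if current:
        sentences.append(current)
    feats = []
    for sent in sentences:
        for cols in sent:
            idx, form, upos, head = int(cols[0]), cols[1], cols[3], int(cols[6])
            feats.append({
                "form": form,
                "length": len(form),          # word length in characters
                "upos": upos,                 # part-of-speech category
                "position": idx,              # 1-based position in the sentence
                "head_dist": 0 if head == 0 else head - idx,  # positive: head follows
            })
    return feats
        </preformat>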
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Our Approach</title>
      <p>We propose a linguistically informed framework to investigate the impact of injecting human reading behaviour into a pre-trained NLM, focusing on its effects on attention and word representations. The approach consists of two main stages: first, we fine-tune the model on predicting several ET features; then, we compare the pre-trained and fine-tuned models along three axes: i) correlation between model attention and human attention; ii) attention distribution over input tokens; iii) sentence representations in the embedding space.</p>
      <p>To enable a more fine-grained analysis of how ET fine-tuning affects word representations, we condition our evaluations on the following linguistic features extracted from the UD treebanks: word length in characters, part-of-speech category, position in the sentence, and distance from the syntactic head.</p>
      <p>For our experiments we used XLM-RoBERTa-base, a 12-layer multilingual encoder-based model. In what follows, we outline the methodological choices and implementation details of our experimental setting.</p>
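      <p>For concreteness, one common way to realise the first stage is a token-level regression head on top of the encoder, with masked positions excluded from the loss. The following is a hedged sketch of such a setup; the exact head, pooling, and training loop used in our released code may differ.</p>
      <preformat>
import torch.nn as nn
from transformers import AutoModel

class ETRegressionModel(nn.Module):
    """xlm-roberta-base with a 5-dimensional regression head (one output per
    eye-tracking feature). Illustrative sketch, not the released model."""
    def __init__(self, model_name="xlm-roberta-base", n_features=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, n_features)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        preds = self.head(hidden)              # (batch, seq_len, n_features)
        loss = None
        if labels is not None:
            mask = labels[..., 0] != -100.0    # score only first sub-tokens
            loss = nn.functional.mse_loss(preds[mask], labels[mask])
        return loss, preds
      </preformat>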
      <sec id="sec-3-1">
        <title>4.1. ET injection into the Model</title>
        <p>To inject reading-related information into the model, we leverage the set of eye-tracking features from MECO described in Section 3.1. Unlike most prior work, which typically aggregates eye-tracking data across participants (with few exceptions [42]), we treat each reader individually, conducting experiments separately for each subject. This design choice is motivated by the intrinsic variability observed in reading behavior, even among skilled readers [43, 44, 45], and enables a more accurate modeling of reader-specific dynamics.</p>
        <p>After a hyperparameter tuning phase using 5-fold cross-validation, we fine-tune the model to predict the five word-level eye-tracking features, training a separate model for each individual reader (the fine-tuning is run for 50 epochs, using a learning rate of 5e-05, a weight decay of 0.01, and a warm-up ratio of 0.05). Since the MECO dataset provides annotations at the word level, while the model's tokenizer splits some words into subword units, we follow standard practice [46] and assign eye-tracking features only to the first sub-token of each word, ignoring the rest during training.</p>
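        <p>A minimal sketch of this first-sub-token labelling, assuming a Hugging Face fast tokenizer for xlm-roberta-base: feature vectors are attached to the first piece of every word, and all other positions are masked so they can be ignored by the loss. Variable names are ours, not taken from the released code.</p>
        <preformat>
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def align_et_labels(words, et_values):
    """words: list of word strings; et_values: one 5-dim feature vector per word."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev_word = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev_word:
            labels.append([-100.0] * 5)        # special token or continuation piece
        else:
            labels.append(et_values[word_id])  # first sub-token keeps the ET features
        prev_word = word_id
    enc["labels"] = labels
    return enc
        </preformat>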
        <p>To examine whether the fine-tuned model develops
a more human-like attention pattern, we compute the
correlation between model attention and human
attention before and after fine-tuning. For model
attention, we consider the attention weights received by
each word when computing the representation of the
beginning-of-sentence token (&lt;s&gt;), which is the only
token used during the eye-tracking prediction phase and
serves as a global summary of the sentence. To account
for subword tokenization, we follow the same approach
used during fine-tuning and associate attention scores to
the first sub-token of each word. As a proxy for human
attention, we choose the Total Reading Time feature (see
Section 3.1). For each reader, we thus compute the
correlation between their eye-tracking data and the attention
patterns of both the pre-trained and the fine-tuned model
across all layers, allowing us to assess whether the latter
aligns more closely with human reading behavior.</p>
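        <p>The sketch below mirrors this procedure for a single sentence and a single reader: it collects the attention that the beginning-of-sentence token &lt;s&gt; pays to the first sub-token of each word at every layer and correlates it with Total Reading Time. Averaging over heads and taking absolute Spearman coefficients are assumptions of a reasonable setup, not a verbatim description of the released code.</p>
        <preformat>
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_attentions=True)

def attention_trt_correlation(words, trt):
    """words: list of word strings; trt: Total Reading Time per word."""
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    first_pos, prev = [], None
    for i, wid in enumerate(enc.word_ids()):
        if wid is not None and wid != prev:
            first_pos.append(i)                # first sub-token of each word
        prev = wid
    with torch.no_grad():
        attns = model(**enc).attentions        # one (1, heads, seq, seq) tensor per layer
    corrs = []
    for layer_attn in attns:
        # attention paid by the BOS query (position 0) to every key, averaged over heads
        to_words = layer_attn[0, :, 0, :].mean(dim=0)[first_pos]
        rho, _ = spearmanr(to_words.numpy(), trt)
        corrs.append(abs(rho))                 # strength, not direction
    return corrs
        </preformat>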
        <sec id="sec-3-1-1">
          <title>4.2. Assessing the Role of ET fine-tuning on Word Representations</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Assessing the Role of ET fine-tuning on Word Representations</title>
        <p>To assess how fine-tuning on ET affects the model's internal dynamics for attention and the embedding space, we leverage the linguistic features from the treebanks described in Section 4. Specifically, to compute the attention shifts, for each value of these features we analyse the amount of attention the corresponding words receive before and after fine-tuning. This allows us to characterize shifts in attention distribution across different linguistic phenomena and across all layers of the models. Firstly, we normalize the attention scores for each sentence (excluding BOS and EOS tokens) so that their sum is 1. Attention shifts are then quantified as the percentage change in the average attention received by tokens with a given feature value, relative to before fine-tuning. A positive shift indicates increased attention to these tokens after fine-tuning, while a negative shift reflects a decrease. This allows us to identify which linguistic categories gain or lose prominence after incorporating eye-tracking supervision.</p>
        <p>To analyze the shifts in the embedding space, we rely on two complementary metrics. (i) IsoScore [47] offers a scale-invariant measure of isotropy: lower scores indicate that the embedding variance is concentrated along fewer directions, pointing to a more anisotropic space. (ii) Linear Intrinsic Dimensionality (Linear-ID) [48] estimates the dimensionality of the smallest linear subspace that captures the embeddings, providing a proxy for their geometric complexity. Both metrics were computed on the first sub-token of each word in the UD treebanks. In line with the other analyses, we compare the embedding spaces of the pre-trained and fine-tuned models to assess whether ET fine-tuning leads to more compact or more isotropic representations, as reflected by changes in these metrics.</p>
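        <p>A compact sketch of the attention-shift computation follows. The per-token record layout is a hypothetical convenience, but the quantity computed is the percentage change in mean attention per feature class described above.</p>
        <preformat>
import numpy as np
from collections import defaultdict

def attention_shift_by_class(tokens):
    """tokens: records with 'att_pre' and 'att_post' (attention already normalised
    per sentence, BOS/EOS excluded) and a 'feature' class label (e.g. a PoS tag)."""
    pre, post = defaultdict(list), defaultdict(list)
    for t in tokens:
        pre[t["feature"]].append(t["att_pre"])
        post[t["feature"]].append(t["att_post"])
    return {c: 100.0 * (np.mean(post[c]) - np.mean(pre[c])) / np.mean(pre[c])
            for c in pre}
        </preformat>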
        <p>All reported scores are first computed for each user
individually and then averaged across all users.</p>
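        <p>For the geometric analysis, IsoScore has an implementation released by the authors of [47]; for Linear-ID, the sketch below shows a simple PCA-based proxy (the number of principal components needed to explain 95% of the variance). This proxy is our simplification and stands in for the estimator of [48].</p>
        <preformat>
import numpy as np
from sklearn.decomposition import PCA

def linear_id(embeddings, threshold=0.95):
    """embeddings: (n_tokens, hidden_size) matrix of first-sub-token vectors."""
    pca = PCA().fit(embeddings)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)
        </preformat>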
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Correlation between model and human attention</title>
        <sec id="sec-4-1-1">
          <title>To assess how fine-tuning on ET afects the model’s internal dynamics for attention and embedding space, we leverage the linguistic features from the treebanks described in Section 4.</title>
        <p>As a first evaluation step, we computed the correlation between human attention and model attention, both before and after fine-tuning on eye-tracking data. As we are interested in the strength rather than the direction of the association, we considered the absolute values of the correlation coefficients. For the fine-tuned models, we computed the correlation between the model's attention weights and the Total Reading Time of the specific user on which each model was fine-tuned. For the pre-trained model, which is not fine-tuned to any individual reader, we calculated the correlation between its attention weights and the Total Reading Time of each user independently, and subsequently averaged the resulting coefficients. Figure 1 reports the comparison of Spearman correlation coefficients, averaged across all users.</p>
        <p>Figure 1: Correlation between model attention and human attention (p-value &lt; 0.05).</p>
          <p>In line with results reported in [14, 49], fine-tuning
on ET data consistently leads to stronger correlation
coefficients between model and human attention,
particularly in the deeper layers of the model. This
effect is evident in both Italian and English. The overall
patterns are remarkably similar across the two languages,
although the correlation scores for Italian are slightly
higher on average.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Analysis of the Attention Shifts</title>
        <sec id="sec-4-2-1">
          <title>This section reports the analysis of the attention shifts in</title>
          <p>duced by fine-tuning on ET data. We grouped tokens into
classes for the values of the linguistic features detailed in
Section 4. To enhance readability and interpretability, for
each linguistic feature we visualised only the most
representative values. Rather than applying a strict frequency
threshold, we heuristically excluded rare or degenerate
cases (e.g., for token length, extremely long tokens such
as URLs), retaining typical and frequent values that
better reflect standard linguistic patterns. Each figure also
includes an “AVG” column summarizing the average shift
across all layers, ofering a high-level view of the
attention reallocation patterns.
and sentence interpretation. Additionally, a language- tive data from the used UD treebanks show that early
specific efect is visible in Italian, where coordinating sentence positions largely correspond to syntactically
conjunctions (CCONJ) gain notable attention across sev- central elements—particularly the root, which anchors
eral layers. While similar shifts occur sporadically in the the clause and governs the structure of major
compleEnglish model, they are less consistent and often ofset ments. The observed shift in attention may therefore
by decreases in other layers. reflect the model’s increased sensitivity to syntactic
or</p>
          <p>As regards the attention shifts based on the word’s ganization cues at sentence onset, especially in specific
position within the sentence (Figure 4), we noted that layers. This behavior is also well-documented in
psyfor both languages tokens appearing earlier in the cholinguistic studies and indicative of incremental
parssentence generally receive slightly more attention ing, where early elements guide syntactic and semantic
after fine-tuning , whereas those occurring later receive expectations during sentence comprehension.
less. An exception is observed for the first two tokens, Figure 5 shows the attention shifts for the
headwhich deviate from this trend. Layer-specific behaviors dependent distance parameter. A positive value indicates
also emerge: for instance, layers 2 and 9 tend to increase that the head follows the dependent, while a negative one
attention toward later tokens, while most other layers that the head precedes it. The special value 0 is assigned
show the opposite efect, emphasizing earlier positions. to the root of the sentence. On average, it emerged that
Notably, layer 2 and layer 11 both show sharp increases tokens that are syntactically closer to their head
in attention to the first token, suggesting a potential tend to receive more attention after fine-tuning,
reweighting of sentence-initial information after expo- particularly when the head follows the dependent.
sure to human reading patterns. Interestingly, quantita- This suggests that fine-tuning on ET data encourages</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>This work has been supported by:</title>
        <p>• FAIR - Future AI Research (PE00000013) project under the NRRP MUR program funded by the NextGenerationEU;</p>
        <p>• The project “XAI-CARE” funded by the European Union - Next Generation EU - NRRP M6C2 “Investment 2.1 Enhancement and strengthening of biomedical research in the NHS” (PNRR-MAD-2022-12376692_VADALA’ – CUP F83C22002470001);</p>
        <p>• The project “Human in Neural Language Models” (IsC93_HiNLM), funded by CINECA under the ISCRA initiative;</p>
        <p>• Language Of Dreams: the relationship between sleep mentation, neurophysiology, and neurological disorders - PRIN 2022 (2022BNE97C_SH4_PRIN2022).</p>
        <p>[28] K. Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 55–65. URL: https://aclanthology.org/D19-1006/. doi:10.18653/v1/D19-1006.</p>
        <p>[29] N. Godey, É. Clergerie, B. Sagot, Anisotropy is inherent to self-attention in transformers, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 35–48. URL: https://aclanthology.org/2024.eacl-long.3/.</p>
        <p>[30] J. Gao, D. He, X. Tan, T. Qin, L. Wang, T. Liu, Representation degeneration problem in training natural language generation models, in: International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=SkEYojRqtm.</p>
        <p>[31] X. Cai, J. Huang, Y. Bian, K. Church, Isotropy in the contextual embedding space: Clusters and manifolds, in: International Conference on Learning Representations, 2021.</p>
        <p>[32] Z. Zhang, C. Gao, C. Xu, R. Miao, Q. Yang, J. Shao, Revisiting representation degeneration problem in language modeling, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 518–527.</p>
        <p>[33] T. Mickus, D. Paperno, M. Constant, K. van Deemter, What do you mean, BERT? Assessing BERT as a distributional semantics model, in: A. Ettinger, G. Jarosz, J. Pater (Eds.), Proceedings of the Society for Computation in Linguistics 2020, Association for Computational Linguistics, New York, New York, 2020, pp. 279–290. URL: https://aclanthology.org/2020.scil-1.35/.</p>
        <p>[34] R. Diehl Martinez, Z. Goriely, A. Caines, P. Buttery, L. Beinborn, Mitigating frequency bias and anisotropy in language model pre-training with syntactic smoothing, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 5999–6011. URL: https://aclanthology.org/2024.emnlp-main.344/. doi:10.18653/v1/2024.emnlp-main.344.</p>
        <p>[35] W. Rudman, C. Eickhoff, Stable anisotropic regularization, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=dbQH9AOVd5.</p>
        <p>[36] A. Machina, R. Mercer, Anisotropy is not inherent to transformers, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 4892–4907. URL: https://aclanthology.org/2024.naacl-long.274/. doi:10.18653/v1/2024.naacl-long.274.</p>
        <p>[37] A. Ansuini, A. Laio, J. H. Macke, D. Zoccolan, Intrinsic dimension of data representations in deep neural networks, Advances in Neural Information Processing Systems 32 (2019).</p>
        <p>[38] E. Hernandez, J. Andreas, The low-dimensional linear geometry of contextualized word representations, in: Conference on Computational Natural Language Learning, 2021. URL: https://api.semanticscholar.org/CorpusID:234742544.</p>
        <p>[39] E. Cheng, C. Kervadec, M. Baroni, Bridging information-theoretic and geometric compression in language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2023, pp. 12397–12420.</p>
        <p>[40] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal dependencies, Computational Linguistics 47 (2021) 255–308. URL: https://doi.org/10.1162/coli_a_00402. doi:10.1162/coli_a_00402.</p>
        <p>[41] N. Silveira, T. Dozat, M.-C. de Marneffe, S. Bowman, M. Connor, J. Bauer, C. D. Manning, A gold standard dependency corpus for English, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), 2014.</p>
        <p>[42] S. Brandl, N. Hollenstein, Every word counts: A multilingual analysis of individual human alignment with model attention, in: Y. He, H. Ji, S. Li, Y. Liu, C.-H. Chang (Eds.), Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online only, 2022, pp. 72–77. URL: https://aclanthology.org/2022.aacl-short.10/. doi:10.18653/v1/2022.aacl-short.10.</p>
        <p>[43] A. J. Parker, T. J. Slattery, Spelling ability influences early letter encoding during reading: Evidence from return-sweep eye movements, Quarterly Journal of Experimental Psychology 74 (2021) 135–149. URL: https://doi.org/10.1177/1747021820949150. doi:10.1177/1747021820949150. PMID: 32705948.</p>
        <p>[44] J. Ashby, K. Rayner, C. Clifton, Eye movements of highly skilled and average readers: Differential effects of frequency and predictability, The Quarterly Journal of Experimental Psychology Section A 58 (2005) 1065–1086. doi:10.1080/02724980443000476.</p>
        <p>[45] T. J. Slattery, M. Yates, Word skipping: Effects of word length, predictability, spelling and reading skill, Quarterly Journal of Experimental Psychology 71 (2018) 250–259. doi:10.1080/17470218.2017.1310264.</p>
        <p>[46] N. Hollenstein, F. Pirovano, C. Zhang, L. Jäger, L. Beinborn, Multilingual language models predict human reading behavior, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 106–123. URL: https://aclanthology.org/2021.naacl-main.10/. doi:10.18653/v1/2021.naacl-main.10.</p>
        <p>[47] W. Rudman, N. Gillman, T. Rayne, C. Eickhoff, IsoScore: Measuring the uniformity of embedding space utilization, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Findings of the Association for Computational Linguistics: ACL 2022, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3325–3339. URL: https://aclanthology.org/2022.findings-acl.262/. doi:10.18653/v1/2022.findings-acl.262.</p>
        <p>[48] J. H. Lee, T. Jiralerspong, L. Yu, Y. Bengio, E. Cheng, Geometric signatures of compositionality across a language model’s lifetime (2025). URL: https://arxiv.org/abs/2410.01444. arXiv:2410.01444.</p>
        <p>[49] L. Dini, L. Moroni, D. Brunato, F. Dell’Orletta, In the eyes of a language model: A comprehensive examination through eye-tracking data, Neurocomputing (2025). In press.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Shift in the embeddings space</title>
    </sec>
    <sec id="sec-7">
      <title>A. Shift in the embedding space: Extra features</title>
      <sec id="sec-7-1">
        <p>This Appendix section contains the analysis of Section 5.3 conducted on the remaining linguistic features: word length (Figures A.1 and A.2) and word index in sentence (Figures A.3 and A.4). As in Section 5.3, a clear hierarchy emerges among the new feature classes. For word length, tokens 6–10 characters long retain the highest IsoScore and Linear-ID before collapsing, like all other bins, under fine-tuning.</p>
        <p>Figure A.2: Linear-ID before (left) and after (right) fine-tuning, shown for word length (up to 15 tokens).</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>B. Shift in the embedding space</title>
    </sec>
    <sec id="sec-9">
      <title>B. Shift in the embedding space: English dataset</title>
      <sec id="sec-9-1">
        <p>We report the scores on the English word embeddings. The results are comparable to those on the Italian dataset. Further exploration of parallels and differences will be the focus of future work.</p>
        <p>Figure B.3: Isotropy before (top) and after (bottom) fine-tuning, grouped by syntactic head distance (up to 7 words of distance).</p>
        <p>Figure B.4: Linear-ID before (top) and after (bottom) fine-tuning, grouped by syntactic head distance (up to 7 words of distance).</p>
        <p>Declaration on Generative AI: During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to improve writing style and for formatting assistance. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          (Volume
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2016</year>
          , pp.
          <fpage>579</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          584. [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kanojia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dey</surname>
          </string-name>
          , P. Bhat-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>ceedings of the 20th SIGNLL Conference on Com-</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <year>2016</year>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>166</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <fpage>K16</fpage>
          -
          <lpage>1016</lpage>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>K16</fpage>
          -1016. [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Berzak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. Levy</surname>
          </string-name>
          , Assessing language
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>2018 Conference of the North American Chapter</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Human Language</surname>
            <given-names>Technologies</given-names>
          </string-name>
          , Volume
          <volume>1</volume>
          (Long
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>1986</fpage>
          -
          <lpage>1996</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>URL: https://aclanthology.org/N18-1180/. doi:10.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <fpage>N18</fpage>
          -1180. [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Malmaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Berzak</surname>
          </string-name>
          , Bridging
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 24th Con-</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>152</lpage>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          aclanthology.org/
          <year>2020</year>
          .conll-
          <volume>1</volume>
          .11/. doi:
          <volume>10</volume>
          .18653/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          v1/
          <year>2020</year>
          .conll-
          <volume>1</volume>
          .
          <fpage>11</fpage>
          . [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hollenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Troendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bigiolli</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>language processing signals</article-title>
          , CoRR abs/
          <year>1904</year>
          .02682
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          (
          <year>2019</year>
          ). [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Evanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lakretz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>King</surname>
          </string-name>
          , Language ac-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>low similar learning stages?</article-title>
          , in: Annual Meet- [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tannert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Bulling,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL: https://api.semanticscholar.
          <article-title>org/ with human gaze-guided neural attention</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>CorpusID:259089351</article-title>
          . H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
            , [3]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Yedetore</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Linzen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Frank</surname>
          </string-name>
          , R. T.
          <string-name>
            <surname>McCoy</surname>
          </string-name>
          ,
          <string-name>
            <surname>How H. Lin</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>poor is the stimulus? evaluating hierarchical gen-</article-title>
          <source>Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <article-title>eralization in neural networks trained on child- Inc</article-title>
          .,
          <year>2020</year>
          , pp.
          <fpage>6327</fpage>
          -
          <lpage>6341</lpage>
          . URL: https://proceedings.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>directed speech</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Boyd-Graber, neurips</article-title>
          .cc/paper_files/paper/2020/file/
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 61st An- 460191c72f67e90150a093b4585e7eb4-Paper.pdf.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>nual Meeting of the Association for Computational</source>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Eberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pilot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          , Do
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <article-title>Association transformer models show similar attention pat-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <year>2023</year>
          , pp.
          <fpage>9370</fpage>
          -
          <lpage>9393</lpage>
          . URL: https://aclanthology. san, P. Nakov,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Villavicencio (Eds.), Proceedings
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          org/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>521</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .
          <article-title>of the 60th Annual Meeting of the Association</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <article-title>acl-long.521. for Computational Linguistics</article-title>
          (Volume
          <volume>1</volume>
          : Long [4]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Carpenter</surname>
          </string-name>
          , A theory of reading: Papers), Association for Computational Linguis-
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>from eye fixations to comprehension</article-title>
          ., Psychologi- tics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>4295</fpage>
          -
          <lpage>4309</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>cal review 87</source>
          (
          <year>1980</year>
          )
          <article-title>329</article-title>
          . https://aclanthology.org/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>296</volume>
          . doi:10. [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rayner</surname>
          </string-name>
          , Eye movements in reading and informa-
          <volume>18653</volume>
          /v1/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>296</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>tion processing: 20 years of research</source>
          ., Psychological [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bensemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Prado</surname>
          </string-name>
          , Y. Chen,
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>bulletin 124</source>
          (
          <year>1998</year>
          )
          <article-title>372</article-title>
          . N. Ö. Tan,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Corballis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Riddle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Witbrock</surname>
          </string-name>
          , [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bingel</surname>
          </string-name>
          , F. Keller, A. Søgaard,
          <article-title>Eye gaze and self-attention: How humans and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>eye-tracking data</article-title>
          ,
          <source>in: Proceedings of the 54th An- ings of the Workshop on Cognitive Modeling and</source>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>nual Meeting of the Association for Computational Computational Linguistics (</article-title>
          <year>2022</year>
          ). URL: https://api.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          semanticscholar.org/CorpusID:248780077. [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          , Gaze- spective, in: T.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Hinrichs</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>K</given-names>
          </string-name>
          . Liu,
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <source>put. Appl</source>
          .
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>12461</fpage>
          -
          <lpage>12482</lpage>
          . URL: and
          <article-title>Symbols for Natural Language Processing</article-title>
          and
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          https://doi.org/10.1007/s00521-024-09725-8.
          <string-name>
            <given-names>Knowledge</given-names>
            <surname>Graphs Reasoning (NeusymBridge) @</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <source>doi:10.1007/s00521-024-09725-8</source>
          . LREC-COLING-
          <year>2024</year>
          ,
          <article-title>ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia, [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Domenichelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <article-title>From human reading to NLM understanding:</article-title>
          <source>Eval- neusymbridge-1</source>
          .1/.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <article-title>uating the role of eye-tracking data in encoder-</article-title>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Wallace</surname>
          </string-name>
          , Attention is not Explanation,
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>M. T. Pilehvar</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 63rd An- ings of the 2019 Conference of the North American</source>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association guistics: Human Language Technologies</source>
          , Volume
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>for Computational</surname>
            <given-names>Linguistics</given-names>
          </string-name>
          , Vienna, Austria,
          <volume>1</volume>
          (Long and Short Papers), Association for Com-
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <year>2025</year>
          , pp.
          <fpage>17796</fpage>
          -
          <lpage>17813</lpage>
          . URL: https://aclanthology. putational Linguistics, Minneapolis, Minnesota,
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          org/
          <year>2025</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>870</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2025</year>
          . 2019, pp.
          <fpage>3543</fpage>
          -
          <lpage>3556</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <article-title>acl-long</article-title>
          .
          <volume>870</volume>
          .
          <fpage>N19</fpage>
          -1357/. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N19</fpage>
          -1357. [15]
          <string-name>
            <given-names>U.</given-names>
            <surname>Cop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dirix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Drieghe</surname>
          </string-name>
          , W. Duyck, Pre- [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Serrano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          , Is attention interpretable?,
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <article-title>gual and bilingual sentence reading, Behavior ceedings of the 57th Annual Meeting of the As-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <source>Research Methods</source>
          <volume>49</volume>
          (
          <year>2017</year>
          )
          <fpage>602</fpage>
          -
          <lpage>615</lpage>
          . URL:
          <article-title>https: sociation for Computational Linguistics</article-title>
          , Associa-
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          //api.semanticscholar.org/CorpusID:11567309.
          <article-title>tion for Computational Linguistics</article-title>
          , Florence, Italy, [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Siegelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Acartürk</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-D. Ahn</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>2931</fpage>
          -
          <lpage>2951</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Alexeeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amenta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bertram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bonandrini</surname>
          </string-name>
          ,
          <fpage>P19</fpage>
          -
          <lpage>1282</lpage>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P19</fpage>
          -1282.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Brysbaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chernova</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Expanding</surname>
            hori- [23]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Abnar</surname>
          </string-name>
          , W. Zuidema, Quantifying attention
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <source>ior research methods</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          . 58th Annual Meeting of the Association for Com[17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Raymond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Moldagali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Al</given-names>
            <surname>Madi</surname>
          </string-name>
          ,
          <article-title>A dataset putational Linguistics</article-title>
          , Association for Computa-
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <article-title>of underrepresented languages in eye tracking tional Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>4190</fpage>
          -
          <lpage>4197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          research, in: Proceedings of the 2023 Sympo- URL: https://aclanthology.org/
          <year>2020</year>
          .acl-main.
          <volume>385</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <source>sium on Eye Tracking Research and Applications</source>
          , doi:10.18653/v1/
          <year>2020</year>
          .acl-main.
          <volume>385</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>ETRA '23</source>
          ,
          <string-name>
            <surname>Association</surname>
            for Computing Machinery, [24]
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Chefer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gur</surname>
          </string-name>
          , L. Wolf, Transformer inter-
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          New York, NY, USA,
          <year>2023</year>
          . URL: https://doi.org/ pretability beyond attention visualization, in: Pro-
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          10.1145/3588015.3590128. doi:
          <volume>10</volume>
          .1145/3588015. ceedings of the IEEE/CVF Conference on Computer
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          3590128.
          <article-title>Vision and Pattern Recognition (CVPR</article-title>
          ),
          <year>2021</year>
          , pp. [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tannert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Frassinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bulling</surname>
          </string-name>
          , N. T.
          <volume>782</volume>
          -
          <fpage>791</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          <string-name>
            <surname>Vu</surname>
          </string-name>
          ,
          <article-title>Interpreting attention models with human</article-title>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          <source>of the 24th Conference on Computational Natural D. Hupkes (Eds.)</source>
          ,
          <source>Proceedings of the 2019 ACL</source>
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>25</lpage>
          . URL:
          <article-title>https: ing Neural Networks for NLP, Association for Com-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          //aclanthology.org/
          <year>2020</year>
          .conll-
          <volume>1</volume>
          .2/. doi:
          <volume>10</volume>
          .18653/ putational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          v1/
          <year>2020</year>
          .conll-
          <volume>1</volume>
          .2.
          <fpage>276</fpage>
          -
          <lpage>286</lpage>
          . URL: https://aclanthology.org/W19-4828/. [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Morger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beinborn</surname>
          </string-name>
          , N. Hollenstein, doi:10.18653/v1/
          <fpage>W19</fpage>
          -4828.
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          <article-title>A cross-lingual comparison of human</article-title>
          and model [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          , Analyzing the structure
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Sayeed</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2022 CLASP in: BlackboxNLP@ACL</source>
          ,
          <year>2019</year>
          . URL: https://api.
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>
          <article-title>Conference on (Dis)embodiment, Association for semanticscholar</article-title>
          .org/CorpusID:184486755.
        </mixed-citation>
      </ref>
      <ref id="ref68">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          , Gothenburg, Sweden, [27]
          <string-name>
            <surname>P. M. Htut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Phang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bordia</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          <string-name>
            <surname>Bowman</surname>
          </string-name>
          , Do
        </mixed-citation>
      </ref>
      <ref id="ref69">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>23</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .
          <article-title>attention heads in bert track syntactic dependen-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref70">
        <mixed-citation>
          <source>clasp-1</source>
          .2. cies? (
          <year>2019</year>
          ). URL: https://arxiv.org/abs/
          <year>1911</year>
          .12246. [20]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          , Probing large arXiv:
          <year>1911</year>
          .12246.
        </mixed-citation>
      </ref>
      <ref id="ref71">
        <mixed-citation>
          <article-title>language models from a human behavioral per-</article-title>
          [28]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ethayarajh</surname>
          </string-name>
          , How contextual are contextual-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>