Comparative Evaluation of Computational Models Predicting Eye Fixation Patterns During Reading: Insights from Transformers and Simpler Architectures

Alessandro Lento 1,2, Andrea Nadalini 1, Nadia Khlif 1,3, Vito Pirrelli 1, Claudia Marzi 1 and Marcello Ferro 1,*

1 Consiglio Nazionale delle Ricerche, Istituto di Linguistica Computazionale "A. Zampolli", Pisa, Italy
2 Università Campus Bio-Medico, Roma, Italy
3 University Mohammed First, Oujda, Morocco

Abstract

Eye-tracking records of natural text reading are known to provide significant insights into the cognitive processes underlying word processing and text comprehension, with gaze patterns, such as fixation duration and saccadic movements, being modulated by morphological, lexical, and higher-level structural properties of the text being read. Although some of these effects have been simulated with computational models, it is still not clear how accurately computational modelling can predict complex fixation patterns in connected text reading. State-of-the-art neural architectures have shown promising results, with pre-trained transformer-based classifiers having recently been claimed to outperform other competitors, achieving beyond 95% accuracy. However, transformer-based models have neither been compared with alternative architectures nor adequately evaluated for their sensitivity to the linguistic factors affecting human reading. Here we address these issues by evaluating the performance of a pool of neural networks in classifying English eye-fixation data as a function of both lexical and contextual factors.
We show that i) the accuracy of transformer-based models has largely been overestimated, ii) other, simpler models make comparable or even better predictions, iii) most models are sensitive to some of the major lexical factors accounting for at least 50% of human fixation variance, iv) most models fail to capture some significant context-sensitive interactions, such as those accounting for spillover effects in reading. The work shows the benefits of combining accuracy-based evaluation metrics with non-linear regression modelling of fixed and random effects on both real and simulated eye-tracking data.

Keywords
eye-tracking, eye fixation time prediction, neural network, contextual word embeddings, lexical features

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
alessandro.lento@ilc.cnr.it (A. Lento); andrea.nadalini@ilc.cnr.it (A. Nadalini); nadia.khlif@ilc.cnr.it (N. Khlif); vito.pirrelli@ilc.cnr.it (V. Pirrelli); claudia.marzi@ilc.cnr.it (C. Marzi); marcello.ferro@ilc.cnr.it (M. Ferro)
ORCID: 0000-0002-5581-7451 (V. Pirrelli); 0000-0002-3427-2827 (C. Marzi); 0000-0002-1324-3699 (M. Ferro)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Eye-tracking records of natural text reading are a valuable window on the cognitive processes underlying word processing and text comprehension. By looking at fixation patterns it is possible to estimate the effects that lexical properties (e.g. length, frequency, orthographic similarity [1, 2]), contextual constraints (e.g. predictability [3]) and higher-level structures (e.g. syntactic structure or prosodic contour [4]) can have on human word identification and processing. While psycholinguistic experiments have reliably assessed how such effects modulate reading times, it is not clear to what extent computational models of reading can simulate actual behavioural data such as gaze patterns and fixation durations.

Over the past 30 years, research in this field has made considerable progress, leading to the development of sophisticated computational models accounting for fine-grained aspects of eye movement behaviour during word and sentence reading (e.g. E-Z Reader [5], SWIFT [6]). A significant boost in this area came from large eye-tracking corpora of natural reading (e.g. GECO [7], ZuCo [8], MECO [9]), which allow (deep) learning models to be tested on prediction tasks over eye-tracking metrics. Of late, Hollenstein and colleagues [10] reported that fine-tuned, pre-trained transformer language models can make reliable predictions on a wide range of eye-tracking measurements, covering both early and late stages of lexical processing. The evidence suggests that transformers can inherently encode the relative prominence of language units in a text, in ways that accurately replicate human reading skills and their underlying cognitive mechanisms. Although the accuracy of multilingual transformers is validated on eye-tracking evidence from different languages, the paper neither compares the performance of transformers with that of other neural network classifiers trained on the same task, nor shows what specific knowledge is encoded and put to use by transformers, by looking at the factors affecting their behaviour. In the present paper, we address both issues by assessing the performance of a pool of neural network classifiers on the English batch of Hollenstein et al.'s [10] data.

In what follows, we first describe the English data set and the pool of tested classifiers. Classifiers were selected to include and test either simpler neural architectures than transformers (as is the case with multi-layer perceptrons), or cognitively more plausible processing models (i.e. sequential long short-term memories). Hybrid models, resulting from the combination of different architectures, were also tested. We then move on to discussing the metrics used in [10] for evaluation, to suggest alternative ways to measure accuracy in a fixation prediction task. Finally, we investigate how sensitive each tested architecture is to a few linguistic factors that are known to account for a sizeable amount of variance in human reading gaze patterns. Although some neural networks turn out to be reasonably good at predicting fixation patterns and replicating some robust psycholinguistic effects that are found in human data, it is still unclear whether this ability is due to specific aspects of their architecture, to the type of information they are provided in input, or to their space of trainable parameters. We conclude that, contrary to recent over-enthusiastic reports, predicting eye-fixation patterns of human natural reading is still a big challenge for currently available neural architectures, including transformer-based ones. For this very reason, we contend that the task is key to understanding the inductive bias of these models, as well as to assessing their cognitive plausibility as models of language behaviour.

2. Data and Experiments

All models described in the following paragraphs were trained, validated, and tested on data from the GECO corpus [7]. We used a 5-fold cross-validation with 90% training, 5% validation and 5% test. Experiments were conducted using the PyTorch library [11] in Python, or MatLab [12].

2.1. Dataset

The GECO corpus [7] contains data from 14 English native speakers whose eye movements were recorded while reading Agatha Christie's novel "The Mysterious Affair at Styles" (56,410 tokens). Out of the eight word-level eye-tracking measurements used in [10], we focused on i) first-pass duration (FPD), the time spent fixating a word the first time it is encountered, averaged over subjects (see Fig. 2), and ii) fixation proportion (FPROP), or probability: the number of subjects that fixated a word, divided by the total number of subjects.

Word tokens in the original dataset were encoded with linguistic information including:

i) character length (removing punctuation)
ii) log frequency (source: BNC [13])
iii) part-of-speech tag (source: Stanza [14])
iv) context surprisal/predictability (source: GPT-2 [15, 16, 3])
v) distance from the beginning of the sentence (number of intervening tokens)
vi) distance from the end of the sentence (number of intervening tokens)
vii) presence of heavy punctuation after the token
viii) presence of light punctuation after the token.

2.2. BERT ++

To replicate results from [10], we used BERT [17] with a linear layer on top of it. The linear layer takes BERT contextual word embeddings as input, to predict FPD and FPROP.

After sentence padding and tokenization, irrelevant and special subtokens were masked to enforce a correspondence between each vector in the target sequence and each vector in the output sequence, and to train the loss only on relevant tokens. The Mean Square Error (MSE) loss was used along with the AdamW optimizer (with no weight decay for the biases). The initial learning rate was set to 5·10^-5, and a linear scheduler was used. We used a batch size of 16 sentences and 100 training epochs, with an early stopping criterion (best model on the validation set). The model was trained both with fine-tuning (i.e. by also training BERT internal weights: bert FT + layer) and without fine-tuning (by only training the final layer weights: bert + layer).

Finally, we also used BERT in combination with a sequential LSTM network. This model (bert + LSTM) takes the pre-trained BERT contextual word embeddings (i.e. without fine-tuning) as input, along with the lexical features (i), (ii) and (iv), to predict FPD and FPROP.
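The masking step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes a PyTorch setting in which a binary `mask` flags the relevant subtokens, so that padding and special subtokens do not contribute to the loss.

```python
import torch


def masked_mse(pred: torch.Tensor, target: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """MSE computed only over relevant tokens.

    pred, target: (batch, seq_len) predicted and observed values (e.g. FPD);
    mask: (batch, seq_len), 1.0 for relevant subtokens, 0.0 for padding
    and special subtokens, which are excluded from the average.
    """
    sq_err = (pred - target) ** 2
    return (sq_err * mask).sum() / mask.sum()
```

Dividing by `mask.sum()` rather than by the sequence length keeps the loss comparable across batches with different amounts of padding.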
2.3. LSTM

Reading is inherently sequential. Thus, recurrent neural networks appear to offer a promising approach to modelling a fixation prediction task, and a good alternative to transformers. Using the GECO dataset split into pages rather than sentences, we trained an LSTM with 96 hidden units and a single layer, with a feed-forward network using tanh activation functions on top of it. The model (lstm) takes as input the lexical features (i)-(iv) for the target token, the 4 tokens to its left and the 3 to its right, to predict the FPD and FPROP of the target token. The MSE loss was used along with the AdamW optimizer. The initial learning rate was set to 5·10^-3, with a linear scheduler and a single batch containing the entire training dataset. The model was trained for 3000 epochs with an early stopping criterion (best model on the validation set).

2.4. MLP

A multi-layer perceptron (mlp) was trained using the entire set of lexical features (i)-(viii) as input, with an input context consisting of the two words immediately preceding and following the target word. Several instances of this architecture were tested, but only the results of the best-performing instance (with a single hidden layer of 10 units, sigmoidal activation functions, the Adam optimiser, the MSE loss, a constant learning rate of 0.1, and 1000 training epochs) are reported here.

An identical MLP model (mlp UDT) was eventually trained on a subset of the GECO training data, obtained by sampling target features uniformly. This was done to train the network with an equal number of tokens for each bin of fixation times, and to assess the impact of different distributions of input data on the network's performance on test data.

2.5. Evaluation

We evaluated the performance of all our models using three accuracy metrics based on the absolute error between the predicted value $o_i$ and the target value $t_i$ on the $i$-th token of the GECO dataset:

$$e_i = |o_i - t_i|$$

Loss accuracy (accL) is a measure of the overall similarity between predicted and target values, calculated as the complement to 1 of the Mean Absolute Error (MAE), after fitting the target data $t_i$ in the training set into the $[0, 1]$ range with min-max scaling:

$$accL(set) = 1 - \frac{1}{N_{set}} \sum_{i \in set} \hat{e}_i$$

where $\hat{e}_i = |\hat{o}_i - \hat{t}_i|$, $\hat{t}_i = t_i / \max_{j \in training\,set} \{t_j\}$, and $\hat{o}_i$ is the model prediction for $\hat{t}_i$. Loss accuracy is the metric used in [10].

Threshold accuracy (accT) measures how many times the predicted value is close to the target value within a fixed threshold, and is calculated as follows:

$$accT(set) = 1 - \frac{1}{N_{set}} \sum_{i \in set} \theta[e_i - \epsilon]$$

Sensitivity accuracy (accS) counts how many times the predicted value is close to the target value within a threshold dynamically calculated on the basis of the target value: the higher the target value, the higher the threshold. An offset value is needed to obtain a positive threshold also for zero target values. This is calculated as follows:

$$accS(set) = 1 - \frac{1}{N_{set}} \sum_{i \in set} \theta[e_i - (\alpha \cdot t_i + \epsilon)]$$

where $N_{set}$ is the number of examples in the training/test set, $\theta$ is the Heaviside step function, $\epsilon$ is a threshold and $\alpha$ is a sensitivity coefficient.

As for FPD, which is a duration expressed in seconds, we used $\epsilon = 25\,ms$ and $\alpha = 10\%$ for accS, and $\epsilon = 50\,ms$ for accT. As for FPROP, which is a probability, we used $\epsilon = 0.01$ and $\alpha = 10\%$ for accS, and $\epsilon = 0.1$ for accT.

Finally, the performance of our models was compared against a baseline model (const) that always outputs the overall mean fixation duration (across both subjects and items) in the training data.

3. Results

Models' results for FPD prediction are summarised in Table 1 and plotted in Fig. 1. The accL results reported in [10] for bert FT + layer are essentially replicated. However, being a simple average over all test instances, accL is blind to error magnitude, as well as to the possible presence of prediction biases for specific ranges of fixation values. Note that the const model, which predicts the same average FPD for every token in the test set, scores a flattering 95.68% on accL, vs. 36.97% on accS and 48.10% on accT.

Table 1
Overall FPD prediction accuracy in the GECO dataset. For each model, three accuracy scores are given on test and training data, as described in the text; const is used as a baseline; standard deviations across cross-validation folds are given in parentheses.

                       test                                      training
model            accS           accT           accL            accS           accT           accL
const            36.97 (0.83)   48.10 (1.00)   95.68 (0.05)    37.07 (0.04)   48.06 (0.05)   95.69 (0.00)
bert + layer     55.02 (0.86)   67.82 (0.99)   97.05 (0.05)    58.11 (0.82)   70.74 (0.70)   97.25 (0.05)
mlp UDT          56.41 (0.35)   67.79 (0.79)   96.21 (1.25)    61.21 (0.95)   72.37 (0.57)   96.52 (1.08)
bert + lstm      58.49 (0.91)   70.01 (0.82)   95.38 (0.07)    63.64 (0.48)   75.89 (0.77)   95.90 (0.97)
bert FT + layer  57.80 (1.02)   70.03 (1.13)   97.23 (0.05)    93.18 (0.81)   94.81 (0.71)   98.80 (0.05)
mlp              60.16 (0.85)   73.05 (0.78)   97.39 (0.04)    60.63 (0.37)   73.31 (0.24)   97.40 (0.01)
lstm             60.01 (0.38)   73.18 (0.31)   97.39 (0.03)    61.66 (0.24)   74.27 (0.19)   97.45 (0.01)

Figure 1: Models' predictions (red dots) plotted with target FPD values (black dots), after ordering tokens by increasing FPD. Grey dots represent averaged FPD values plus/minus their standard deviation across participants. Left: training data. Right: test data. From top to bottom: MLP, LSTM, BERT fine-tuned. For each plot, the Spearman-ρ correlation coefficient between predicted and target values is shown along with its significance value (mlp: 0.79 training / 0.78 test; lstm: 0.79 / 0.78; bert FT + layer: 0.98 / 0.78).

Table 2 summarises the accS values of all models, binned into three FPD ranges.

Table 2
Sensitivity accuracy (accS) values for three bins from the FPD distribution: low (FPD below the 5th percentile = 36 ms), medium (FPD ranging from the 5th to the 95th percentile), and high (FPD above the 95th percentile = 280 ms).

model            low      medium   high
const            0.00%    41.08%   0.00%
bert + layer     21.43%   58.98%   23.02%
mlp UDT          52.33%   56.91%   51.49%
bert + lstm      24.19%   62.17%   26.61%
bert FT + layer  32.86%   62.65%   31.65%
mlp              11.77%   64.38%   32.62%
lstm             19.05%   64.26%   29.45%

4. Data analysis

To what extent are neural network models sensitive to some of the factors accounting for gaze patterns in human natural reading? Are language models able to adapt themselves to both lexical properties and in-context features of a reading text, thus exhibiting a human-like performance?

Human reading behaviour is known to be affected by lexical features – e.g. word length and frequency, and morphological complexity – as well as by contextual factors, with a facilitatory effect of contextual redundancy and predictability [18, 19] on reading duration and eye fixations. Accordingly, we modelled human FPDs as a response variable resulting from the interaction of both lexical and contextual predictors: namely, word length, a dichotomous classification of token POS into content versus function words, surprisal of the target word as a measure of how unexpected or unpredictable the word is, and the probability of the word immediately preceding the target word in context (to account for so-called spillover effects). Additionally, we used a Generalised Additive Model (GAM), with token log-frequency as a smooth term, to model possibly non-linear effects of predictors. Model coefficients and effect plots are shown in Appendix C (Figure 3 and Table 4).

GAMs with identical independent variables were run to model the FPDs predicted by all our neural networks, on both training and test data. Inspection of effect plots and model coefficients – as reported in Appendix C – shows a behavioural alignment of all models with human data as far as the modulation of fixation times by lexical features is concerned, in both training and test data.

In contrast, all models fail to capture some contextual effects on test data, such as those observed in a context window of – at least – two adjacent words. To illustrate, efficient syntactic chunking (e.g. of noun, verb and prepositional phrases) has been shown to lead to faster and more accurate human reading (see, for example, [20]). Conversely, most neural networks show no statistically significant effect on fixation duration of the probability of the immediately preceding word in context. This holds either in isolation (probMinus1), for LSTMs and transformer-based models with BERT representations (either fine-tuned or not), or in interaction with the unpredictability of the target word (surprisal:probMinus1). The evidence shows that most neural models cannot replicate, among other things, so-called spillover effects of the left context on the reading time of ensuing words [21].

5. General Discussion

Transformer-based neural networks appear to reasonably predict the fixation probability and first-pass duration of words in human reading of English connected texts. Our present investigation basically supports this conclusion, while providing new evidence on two related questions that naturally arise in this context. How accurate are transformer-based predictions compared with the best predictions of other neural network classifiers trained on the same task? How cognitively plausible are the mechanisms underpinning this performance? Here, we addressed both questions by testing various models on the task of predicting human reading measurements from the GECO corpus, using different evaluation metrics and regressing network predictions on a few linguistic factors that are known to account for human reading behaviour.

Our first observation is that assessing a network's performance by looking at its MAE loss function provides a rather gross evaluation of the effective power of a neural network simulating human reading behaviour. A baseline model assigning each token a constant gaze duration equal to the average of all FPD values attested in GECO achieves a 95.7% loss-based accuracy on both test and training data. That a transformer-based classifier scores 97.2% on the same metric and the same test data cannot be held, as such, as a sign of outstanding performance. In fact, it turns out that the MAE loss function is blind to both the magnitude of a network's error and possible biases in the prediction of very low/high target values. Thus, it provides an inflated estimate of a model's accuracy. We suggest that binary evaluation metrics based on a fixed threshold partially overcome these limitations. Yet, as single-word fixation times typically range from tens to hundreds of milliseconds, the application of a fixed threshold will affect tokens with different fixation times differently. We conclude that a relative threshold based on each word's fixation time is a fairer way to measure prediction accuracy. Clearly, this comes at a cost. When assessed with a relative threshold, the accuracy of a transformer-based architecture on test data drops from 70% down to 57.8%.

It turned out that all other network models tested for the present purposes showed accuracy levels comparable to the accuracy of a transformer-based architecture. Since the former are trained on a more restricted set of lexical and contextual input features than the latter, this seems to suggest that word embeddings are of limited use in the task at hand. Although fine-tuned word embeddings actually appear to score much higher on training data (even using accT and accS), we observe that this is due to data overfitting, as clearly shown by the considerably poorer performance of the fine-tuned model on test data.

An analysis of the psychometric plausibility of the gaze patterns simulated with our neural models reveals that a relatively small set of linguistic factors that are known to account for a sizeable amount of variance in human fixation times can also account for the bulk of variance in models' behaviour. This is relatively unsurprising, as most of these models were trained on input features that encode at least some of these factors. Nonetheless, we believe that the result is interesting for at least two reasons. First, it shows a promising convergence between computational metrics of model accuracy and quantitative models of psychometric assessment. Secondly, it suggests that one can gain non-trivial insights into a model's behaviour by analysing to what extent that behaviour is sensitive to the same linguistic factors human readers are known to be sensitive to. On the one hand, this is a step towards understanding what information a neural model is actually learning and putting to use for the task. On the other hand, this is instrumental in developing better models, as it shows what type of input information is most needed to successfully carry out a task, at least if one is trying to simulate the way the same task is carried out by speakers.

In the end, it may well be the case that a 70% fixed-threshold accuracy in simulating average gaze patterns in human reading is not as disappointing as it might seem. Given the wide variability in human reading behaviour (and even in a single reader when confronted with different texts), a considerable amount of variance in our data may simply be accounted for by by-subject (or by-token) random effects. In some experiments not reported here, we trained our models to predict single-reader behaviour. All architectures fared rather poorly on the task, a result which is in line with similar disappointing results on other output features reported in [10]. Looking back at Figure 1, it can be noted that all models' predictions fall into a $\mu_i \pm \sigma_i$ range, where $\mu_i$ and $\sigma_i$ are, respectively, the by-reader mean and standard deviation of FPD values for token $i$ (see also Table 2). This pattern may suggest that models' predictions are in fact bounded by the standard deviation we observe in human behaviour and cannot reach out of these bounds. Conversely, this evidence may be interpreted as suggesting that more input features are needed to build more accurate classifiers. Further experiments are needed to test the merits of either conjecture.
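The contrast between loss-based and threshold-based evaluation discussed above can be made concrete with a small NumPy sketch of the three metrics, written from the definitions in Section 2.5 (an illustrative reconstruction, not the authors' code; the synthetic gamma-distributed durations stand in for real FPD data):

```python
import numpy as np


def acc_L(pred, target, t_max):
    """Loss accuracy: complement to 1 of the MAE, with predictions and
    targets both scaled by the maximum target value in the training set."""
    return 1.0 - np.mean(np.abs(pred - target)) / t_max


def acc_T(pred, target, eps):
    """Threshold accuracy: share of predictions within a fixed threshold."""
    return np.mean(np.abs(pred - target) <= eps)


def acc_S(pred, target, eps, alpha):
    """Sensitivity accuracy: share of predictions within a target-relative
    threshold alpha * t_i + eps (eps keeps the threshold positive for
    zero targets)."""
    return np.mean(np.abs(pred - target) <= alpha * target + eps)


# A constant predictor (the const baseline) on synthetic FPD-like data:
rng = np.random.default_rng(0)
fpd = rng.gamma(shape=4.0, scale=0.05, size=10_000)  # durations in seconds
const_pred = np.full_like(fpd, fpd.mean())
```

On such data the constant predictor scores deceptively high on accL while scoring far lower on accS, mirroring the const row of Table 1.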
This justifies our exclusive focus on fixation An analysis of the psychometric plausibility of the gaze duration, which is, admittedly, only one behavioural cor- patterns simulated with our neural models reveals that a relate of a complex, inherently multimodal task such as relatively small set of linguistic factors that are known reading. In fact, reading requires the fine coordination to account for a sizeable amount of variance in human of eye movements and articulatory movements for text fixation times can also account for the bulk of variance decoding and comprehension. The eye provides access in models’ behaviour. This is relatively unsurprising, as to the visual stimuli needed for voice articulation to un- most of these models were trained on input features that fold at a relatively constant rate. In turn, articulation can encode at least some of these factors. Nonetheless, we be- feedback oculomotor control for eye movements to be lieve that the result is interesting for at least two reasons. directed when and where processing difficulties arise. First, it shows a promising convergence between com- Incidentally, this is also true of silent reading as shown putational metrics of model accuracy and quantitative by evidence supporting the Implicit Prosody Hypothesis models of psychometric assessment. Secondly, it sug- [22], i.e. the idea that, in silent reading, readers activate References prosodic representations that are similar to those they would produce when reading the text aloud. Hence, a [1] S. Gerth, J. Festman, Reading development, word reader must always rely on a tight control strategy to length and frequency effects: An eye-tracking study ensure that fixation and articulation are optimally coor- with slow and fast readers, Frontiers in Communi- dinated. cation 6 (2021) 743113. A clear limitation of our current work and all exper- [2] S. Schroeder, T. Häikiö, A. Pagán, J. H. Dickins, iments reported here is that we are only focusing on J. Hyönä, S. 
P. Liversedge, Eye movements of chil- one dimension of a complex, multimodal behaviour like dren and adults reading in three different orthogra- reading. Recently, we showed that there is a lot about phies., Journal of Experimental Psychology: Learn- gaze patterns that we can understand by correlating eye ing, Memory, and Cognition 48 (2022) 1518. movements with voice articulation [23]. This informa- [3] L. Salicchi, E. Chersoni, A. Lenci, A study on sur- tion, which cannot be represented in a dataset structured prisal and semantic relatedness for eye-tracking at the word level, may be critical for a model to accurately data prediction, Frontiers in Psychology 14 (2023) learn and mimic the cognitive mechanisms underlying 1112365. natural reading. Likewise, as correctly pointed out by [4] M. Hirotani, L. Frazier, K. Rayner, Punctuation and one of our reviewers, focusing on fixation times while intonation effects on clause and sentence wrap-up: ignoring saccadic movements may seriously detract from Evidence from eye movements, Journal of Memory the explanatory power of any computational model of and Language 54 (2006) 425–443. human reading. In fact, this could be tantamount to tim- [5] E. D. Reichle, K. Rayner, A. Pollatsek, The E-Z ing a bike rider’s speed, while ignoring if she is climbing Reader model of eye-movement control in reading: up a hill or approaching a sharp turn. More realistic Comparisons to other models, Behavioral and Brain models of reading are bound to include more aspects of Sciences 26 (2003) 445–476. reading behaviour in more ecologically valid tasks. In the [6] R. Engbert, A. Nuthmann, E. Richter, R. Kliegl, end, it may well be the case that the task of predicting SWIFT: A Dynamical Model of Saccade Generation gaze patterns of human reading should be conceptual- During Reading., Psychological review 112 (2005) ized differently, by anchoring these patterns not only to 777–813. the syntagmatic dimension of a written text, but also to [7] U. 
Cop, N. Dirix, D. Drieghe, W. Duyck, Present- the time-line of the different movements and multimodal ing GECO: An eyetracking corpus of monolingual processes that unfold during reading. and bilingual sentence reading, Behavior Research Methods 49 (2017) 602–615. [8] N. Hollenstein, J. Rotsztejn, M. Troendle, A. Pedroni, Acknowledgments C. Zhang, N. Langer, ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading, The present study has partly been funded by the Read- Scientific Data 5 (2018) 180291. Ground research grant from the National Research Coun- [9] N. Siegelman, S. Schroeder, C. Acartürk, H.-D. Ahn, cil (CNR), and the ReMind and Braillet PRIN grants, from S. Alexeeva, S. Amenta, R. Bertram, R. Bonan- the Ministry of University and Research (MUR). drini, M. Brysbaert, D. Chernova, S. M. Da Fonseca, Alessandro Lento is a PhD student enrolled in the Na- N. Dirix, W. Duyck, A. Fella, R. Frost, C. A. Gattei, tional PhD in Artificial Intelligence, XXXVII cycle, course A. Kalaitzi, N. Kwon, K. Lõo, M. Marelli, T. C. Pa- on Health and Life sciences, organized by Università padopoulos, A. Protopapas, S. Savo, D. E. Shalom, Campus Bio-Medico in Rome. N. Slioussar, R. Stein, L. Sui, A. Taboh, V. Tønnesen, Nadia Khlif is a PhD student in the Computer Science K. A. Usal, V. Kuperman, Expanding horizons of Research Laboratory, Faculty of Sciences, at the University cross-linguistic research on reading: The Multilin- Mohammed First of Oujda, Morocco. gual Eye-movement Corpus (MECO), Behavior Re- Andrea Nadalini’s work is kindly covered by the search Methods 54 (2022) 2843–2863. “RAISE - Robotics and AI for Socio-economic Empow- [10] N. Hollenstein, F. Pirovano, C. Zhang, L. Jäger, erment” grant (ECS00000035), funded by the European L. 
Beinborn, Multilingual language models predict human reading behavior, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 106–123.

Union – NextGenerationEU and by the Ministry of University and Research (MUR), National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.5.

[11] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, S. Zhang, M. Suo, P. Tillet, X. Zhao, E. Wang, K. Zhou, R. Zou, X. Wang, A. Mathews, W. Wen, G. Chanan, P. Wu, S. Chintala, PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, volume 2 of ASPLOS '24, Association for Computing Machinery, 2024, pp. 929–947.
[12] The MathWorks Inc., MATLAB version 9.7.0.1190202 (R2019b), 2019.
[13] BNC Consortium, The British National Corpus, XML edition, 2007.
[14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020.
[15] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, 2019.
[16] J. A. Michaelov, B. K. Bergen, Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?, arXiv preprint arXiv:2208.14554 (2022).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. ArXiv:1810.04805 [cs], version 2.
[18] K. E. Stanovich, Attentional and automatic context effects in reading, in: Interactive Processes in Reading, Routledge, 2017, pp. 241–267.
[19] G. B. Simpson, R. R. Peterson, M. A. Casteel, C. Burgess, Lexical and sentence context effects in word recognition, Journal of Experimental Psychology: Learning, Memory, and Cognition 15 (1989) 88.
[20] K. Rayner, K. H. Chace, T. J. Slattery, J. Ashby, Eye movements as reflections of comprehension processes in reading, Scientific Studies of Reading 10 (2006) 241–255.
[21] N. J. Smith, R. Levy, The effect of word predictability on reading time is logarithmic, Cognition 128 (2013) 302–319.
[22] M. Breen, Empirical investigations of the role of implicit prosody in sentence processing, Language and Linguistics Compass 8 (2014) 37–50.
[23] A. Nadalini, C. Marzi, M. Ferro, L. Taxitari, A. Lento, D. Crepaldi, V. Pirrelli, Eye-voice and finger-voice spans in adults' oral reading of connected texts. Implications for reading research and assessment, The Mental Lexicon (2024). URL: https://benjamins.com/catalog/ml.00025.nad.
[24] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2023. URL: https://www.R-project.org/.

A. GeCO FPD data

[Figure 2 here. Top panel: four box-plot groups ("participant #1", "participant #2", "participant #10", "average distribution"; x-axis: part 1–4; y-axis: FPD (seconds)). Bottom panel: "per-token behaviour averaged across all participants" (x-axis: token (sorted by FPD), scale ×10^4; y-axis: FPD (seconds)).]

Figure 2: A view of FPD data in the GECO dataset, consisting of eye-tracking patterns of 14 adult participants reading the novel "The Mysterious Affair at Styles" by Agatha Christie. Top panel: distributions of FPD data, with chapters grouped into 4 parts, for participant #1 (with 3 more participants showing a similar distribution), participant #2 (with 8 more participants showing a similar distribution) and participant #10. The rightmost box plot shows the average distribution across all 14 participants. Bottom panel: plot of all 56410 tokens in the dataset, in ascending order of mean FPD (dashed black line). For each token, the standard deviation calculated on the distribution of the FPDs of the 14 participants is shown both above and below the mean value (gray dots).

B. FPROP accuracy

FPROP accuracies (standard deviations in parentheses):

model             test accS        test accT        test accL        training accS    training accT    training accL
const             2.70% (0.37%)    7.17% (0.70%)    51.44% (0.57%)   2.82% (0.02%)    7.37% (0.04%)    51.71% (0.03%)
bert + layer      33.84% (1.28%)   44.86% (0.89%)   86.34% (0.15%)   37.47% (1.24%)   48.84% (1.24%)   87.68% (0.28%)
mlp UDT           36.24% (0.37%)   48.75% (0.83%)   86.90% (0.21%)   43.40% (0.71%)   58.64% (0.61%)   89.49% (0.09%)
bert + lstm       38.00% (0.76%)   48.46% (1.01%)   87.50% (0.43%)   42.78% (0.88%)   54.78% (0.70%)   89.16% (0.12%)
bert FT + layer   36.39% (1.09%)   47.60% (1.23%)   87.00% (0.33%)   75.10% (1.78%)   90.66% (1.85%)   95.28% (0.26%)
mlp               38.96% (1.05%)   51.23% (1.08%)   88.10% (0.19%)   39.45% (0.27%)   51.78% (0.15%)   88.34% (0.02%)
lstm              37.91% (0.85%)   49.95% (0.78%)   87.93% (0.11%)   39.42% (0.46%)   51.63% (0.42%)   88.34% (0.12%)

Figure 3: Effects of surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, and word log-frequency (logFreq) as a smooth term, on human fixation first-pass duration (fixFPD) as a response variable.

Table 3: Accuracy values of neural models predicting the fixation probabilities of the GECO dataset. For each model, three different accuracy metrics are used, as described in the paper.
The "const" model was used as a baseline; highest accuracy scores are highlighted in bold, lowest scores in italic.

C. Data analysis

In this section, the coefficients of the Generalised Additive Models (GAMs) are detailed for each neural model. Statistically non-significant p-values on GAM predictor terms are given in bold-face. GAMs are fitted using the gamm4 package, version 0.2-6, of the R statistical software [24], as they do not assume a linear relation between the fitted variable and its predictors. All plots were created with the ggplot2 package, version 3.5.

Human FPD
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     6.960e-02     7.858e-04     88.568      < 2e-16
surprisal               1.928e-03     5.002e-05     38.539      < 2e-16
probMinus1              -1.395e-02    1.363e-03     -10.233     < 2e-16
Intercept (function)    -2.599e-02    1.143e-03     -22.746     < 2e-16
length (content)        1.562e-02     1.423e-04     109.767     < 2e-16
length (function)       5.499e-03     2.791e-04     19.704      < 2e-16
surprisal:probMinus1    4.692e-04     1.776e-04     2.642       < 0.01
s(logFreq)                                                      < 2e-16
R2: 58.4%

Table 4: GAM coefficients fitting human fixation FPD: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

Figure 4: MLP effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.

MLP FPD (training)
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     7.252e-02     2.729e-04     265.71      < 2e-16
surprisal               9.028e-04     1.734e-05     52.064      < 2e-16
probMinus1              -1.417e-02    4.723e-04     -29.995     < 2e-16
Intercept (function)    -2.312e-02    3.973e-04     -58.2006    < 2e-16
length (content)        1.651e-02     4.935e-05     334.512     < 2e-16
length (function)       4.324e-03     9.698e-05     44.584      < 2e-16
surprisal:probMinus1    1.810e-04     6.166e-05     2.936       < 0.005
s(logFreq)                                                      < 2e-16
R2: 92.2%

MLP FPD (test)
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     7.148e-02     1.183e-03     60.42       < 2e-16
surprisal               7.585e-04     7.619e-05     9.956       < 2e-16
probMinus1              -1.061e-02    2.044e-03     -5.188      < 2.2e-07
Intercept (function)    -1.919e-02    1.658e-03     -11.573     < 2e-16
length (content)        1.677e-02     2.136e-04     78.502      < 2e-16
length (function)       3.399e-03     3.963e-04     8.5774      < 2e-16
surprisal:probMinus1    -1.408e-04    2.480e-04     -0.568      0.57
s(logFreq)                                                      < 2e-16
R2: 92.6%

Table 5: GAM coefficients fitting MLP fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

LSTM FPD (training)
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     7.051e-02     3.259e-04     216.317     < 2e-16
surprisal               7.615e-04     2.069e-05     36.802      < 2e-16
probMinus1              2.120e-03     5.644e-04     3.756       < 0.001
Intercept (function)    -1.600e-02    4.778e-04     -33.492     < 2e-16
length (content)        1.649e-02     5.896e-05     279.739     < 2e-16
length (function)       2.801e-03     1.170e-04     23.945      < 2e-16
surprisal:probMinus1    -3.385e-04    7.325e-05     -4.621      < 0.001
s(logFreq)                                                      < 2e-16
R2: 89.6%

LSTM FPD (test)
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     6.812e-02     1.407e-03     48.431      < 2e-16
surprisal               6.837e-04     9.284e-05     7.364       < 2.3e-13
probMinus1              3.293e-03     2.458e-03     1.340       0.18
Intercept (function)    -1.255e-02    1.936e-03     -6.480      < 1.1e-10
length (content)        0.0152041     0.0004032     37.709      < 2e-16
length (function)       0.0042481     0.0007472     5.6851      < 1.4e-08
surprisal:probMinus1    -0.0001970    0.0004701     -0.419      0.67
s(logFreq)                                                      < 2e-16
R2: 89.9%

Table 6: GAM coefficients fitting LSTM fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).
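As a reading aid for these coefficient tables, the sketch below evaluates the parametric (linear) part of the human-FPD GAM of Table 4 for a content word. It is only an illustration: the smooth term s(logFreq) and the random-effect structure are omitted, treatment coding with content words as the reference level is assumed, and the predictor values (surprisal = 8, probMinus1 = 0.9, length = 7) are made up.

```python
# Parametric part of the GAM fitted to human first-pass durations (Table 4).
# Estimates are copied from Table 4; s(logFreq) is deliberately omitted,
# so this reproduces only the linear portion of the fitted model.

COEF = {
    "intercept":            6.960e-02,   # Intercept (content)
    "surprisal":            1.928e-03,
    "probMinus1":          -1.395e-02,
    "length":               1.562e-02,   # length (content)
    "surprisal:probMinus1": 4.692e-04,
}

def parametric_fpd_content(surprisal: float, prob_minus1: float, length: int) -> float:
    """Linear predictor (in seconds) for a content word, smooth term excluded."""
    return (COEF["intercept"]
            + COEF["surprisal"] * surprisal
            + COEF["probMinus1"] * prob_minus1
            + COEF["length"] * length
            + COEF["surprisal:probMinus1"] * surprisal * prob_minus1)

# Hypothetical 7-letter content word with surprisal 8 and a highly
# predictable preceding token (probMinus1 = 0.9):
fpd = parametric_fpd_content(surprisal=8.0, prob_minus1=0.9, length=7)
# fpd is roughly 0.185 s
```

The same arithmetic applies to the other coefficient tables by swapping in the corresponding estimates; for a function word, the two function-word offsets would be added as well.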
Figure 5: LSTM effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.

Figure 6: Fine-tuned BERT effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.

BERT+fine-tuning FPD (training)
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     6.950e-02     8.572e-04     81.075      < 2e-16
surprisal               2.013e-03     5.446e-05     36.9562     < 2e-16
probMinus1              -1.475e-02    1.483e-03     -9.9416     < 2e-16
Intercept (function)    -2.631e-02    1.248e-03     -21.0852    < 2e-16
length (content)        1.570e-02     1.550e-04     101.307     < 2e-16
length (function)       5.528e-03     3.046e-04     18.148      < 2e-16
surprisal:probMinus1    5.024e-04     1.937e-04     2.594       < 0.01
s(logFreq)                                                      < 2e-16
R2: 57.5%

BERT+fine-tuning FPD (test)
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     0.0714503     0.0022332     31.99       < 2e-16
surprisal               0.0014206     0.0001441     9.859       < 2.3e-13
probMinus1              -0.0017461    0.0038742     -0.451      0.65
Intercept (function)    -0.0239773    0.0031336     -7.652      < 2.7e-14
length (content)        1.707e-02     2.499e-04     68.321      < 2e-16
length (function)       1.579e-03     4.627e-04     3.411       < 0.001
surprisal:probMinus1    -5.244e-04    3.561e-04     -1.473      0.14
s(logFreq)                                                      < 2e-16
R2: 78.4%

Table 7: GAM coefficients fitting BERT+fine-tuning fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

BERT FPD (training)
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     9.626e-02     4.765e-04     202.020     < 2e-16
surprisal               1.319e-03     3.027e-05     43.586      < 2e-16
probMinus1              -4.998e-03    8.245e-04     -6.0616     < 1.3e-09
Intercept (function)    -2.293e-02    6.937e-04     -33.053     < 2e-16
length (content)        1.019e-02     8.616e-05     118.232     < 2e-16
length (function)       2.892e-03     1.693e-04     17.0848     < 2e-16
surprisal:probMinus1    -3.874e-04    1.077e-04     -3.599      < 0.001
s(logFreq)                                                      < 2e-16
R2: 75.6%

BERT FPD (test)
parametric coeff.       estimate      std. error    t value     pr(>|t|)
Intercept (content)     0.0960782     0.0021829     44.014      < 2e-16
surprisal               0.0012786     0.0001409     9.073       < 2.3e-13
probMinus1              -0.0013508    0.0037907     -0.356      0.72
Intercept (function)    -0.0192904    0.0030629     -6.298      < 3.4e-10
length (content)        0.0102735     0.0003941     26.069      < 2e-16
length (function)       0.0027876     0.0007299     3.819       < 0.001
surprisal:probMinus1    -0.0008111    0.0004600     -1.763      0.08
s(logFreq)                                                      < 2e-16
R2: 73.5%

Table 8: GAM coefficients fitting BERT fixation FPD for the training (top) and test (bottom) settings: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

Figure 7: Untuned BERT effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.
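Two of the predictors used throughout these GAMs, surprisal and probMinus1 (the probability of the preceding token), are both derived from a language model's conditional probabilities; in the paper these come from neural language models, not from n-gram counts. The toy sketch below substitutes a hand-written bigram table for the neural model, and assumes that probMinus1 denotes the probability the model assigned to the preceding token in its own left context; both simplifications are ours, not the paper's.

```python
import math

# Toy conditional probabilities P(w_i | w_{i-1}); illustrative values only.
BIGRAM = {
    ("the", "cat"): 0.10,
    ("cat", "sat"): 0.40,
    ("sat", "on"):  0.60,
}

def predictors(tokens, probs):
    """For each token after the first, return (token, surprisal, probMinus1):
    surprisal = -log2 P(w_i | w_{i-1}); probMinus1 = probability assigned to
    the *preceding* token in its own context (None if unavailable)."""
    rows = []
    for i in range(1, len(tokens)):
        surprisal = -math.log2(probs[(tokens[i - 1], tokens[i])])
        prob_minus1 = probs.get((tokens[i - 2], tokens[i - 1])) if i >= 2 else None
        rows.append((tokens[i], surprisal, prob_minus1))
    return rows

rows = predictors(["the", "cat", "sat", "on"], BIGRAM)
# e.g. for "sat": surprisal = -log2(0.40) ≈ 1.32, probMinus1 = P("cat" | "the") = 0.10
```

With per-token values like these in hand, fitting the GAM of Appendix C amounts to regressing each token's FPD on surprisal, probMinus1, word length and log-frequency.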