<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparative Evaluation of Computational Models Predicting Eye Fixation Patterns During Reading: Insights from Transformers and Simpler Architectures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Lento</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Nadalini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nadia Khlif</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Pirrelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Marzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcello Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consiglio Nazionale delle Ricerche, Istituto di Linguistica Computazionale "A. Zampolli"</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università Campus Bio-Medico</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Mohammed First</institution>
          ,
          <addr-line>Oujda</addr-line>
          ,
          <country country="MA">Morocco</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Eye tracking records of natural text reading are known to provide significant insights into the cognitive processes underlying word processing and text comprehension, with gaze patterns, such as fixation duration and saccadic movements, being modulated by morphological, lexical, and higher-level structural properties of the text being read. Although some of these effects have been simulated with computational models, it is still not clear how accurately computational modelling can predict complex fixation patterns in connected text reading. State-of-the-art neural architectures have shown promising results, with pre-trained transformer-based classifiers having recently been claimed to outperform other competitors, achieving beyond 95% accuracy. However, transformer-based models have neither been compared with alternative architectures nor adequately evaluated for their sensitivity to the linguistic factors affecting human reading. Here we address these issues by evaluating the performance of a pool of neural networks in classifying eye-fixation English data as a function of both lexical and contextual factors. We show that i) the accuracy of transformer-based models has largely been overestimated, ii) other simpler models make comparable or even better predictions, iii) most models are sensitive to some of the major lexical factors accounting for at least 50% of human fixation variance, iv) most models fail to capture some significant context-sensitive interactions, such as those accounting for spillover effects in reading. The work shows the benefits of combining accuracy-based evaluation metrics with non-linear regression modelling of fixed and random effects on both real and simulated eye-tracking data.</p>
      </abstract>
      <kwd-group>
        <kwd>eye-tracking</kwd>
        <kwd>eye fixation time prediction</kwd>
        <kwd>neural network</kwd>
        <kwd>contextual word embeddings</kwd>
        <kwd>lexical features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Eye-tracking records of natural text reading are a valuable window on the cognitive processes underlying word processing and text comprehension. By looking at fixation patterns it is possible to estimate the effects that lexical properties (e.g. length, frequencies, orthographic similarity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ,
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), contextual constraints (e.g. predictability [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) and higher-level structures (e.g. syntactic structure or prosodic contour [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) can have on human word identification and processing. While psycholinguistic experiments have reliably assessed how such effects modulate reading times, it is not clear to what extent computational models of reading can simulate actual behavioural data such as gaze patterns and fixation durations.
        Over the past 30 years, research in this field has made considerable progress, leading to the development of sophisticated computational models accounting for fine-grained aspects of eye movement behaviour during word and sentence reading (e.g. EZ-Reader [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], SWIFT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). A significant boost in this area came from large eye-tracking corpora of natural reading (e.g. GECO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], ZUCO [8], MECO [9]), which allow for (deep) learning models to be tested in prediction tasks of eye-tracking metrics. Of late, Hollenstein and colleagues [10] reported that fine-tuned, pre-trained transformer language models can make reliable predictions on a wide range of eye-tracking measurements, covering both early and late stages of lexical processing. The evidence suggests that transformers can inherently encode the relative prominence of language units in a text, in ways that accurately replicate human reading skills and their underlying cognitive mechanisms.
      </p>
      <p>Although the accuracy of multilingual transformers is validated across eye-tracking evidence from different languages, the paper neither compares the performance of transformers with the performance of other neural network classifiers trained on the same task, nor does it show what specific knowledge is encoded and put to use by transformers, by looking at the factors affecting their behaviour. In the present paper, we address both issues by assessing the performance of a pool of neural network classifiers on the English batch of Hollenstein et al.'s [10] data.</p>
      <p>
        In what follows, we first describe the English data set and the pool of tested classifiers. Classifiers were selected to include and test either simpler neural architectures than transformers (as is the case with multi-layer perceptrons), or cognitively more plausible processing models (i.e. sequential long short-term memories). Hybrid models, resulting from the combination of different architectures, were also tested. We then move on to discussing the metrics used in [10] for evaluation, to suggest alternative ways to measure accuracy in a fixation prediction task. Finally, we investigate how sensitive each tested architecture is to a few linguistic factors that are known to account for a sizeable amount of variance in human reading gaze patterns. Although some neural networks turn out to be reasonably good at predicting fixation patterns and replicating some robust psycholinguistic effects that are found in human data, it is still unclear whether this ability is due to specific aspects of their architecture, to the type of information they are provided in input, or to their space of trainable parameters. We conclude that, contrary to recent over-enthusiastic reports, predicting eye-fixation patterns of human natural reading is still a big challenge for currently available neural architectures, including transformer-based ones. For this very reason, we contend that the task is key to understanding the inductive bias of these models, as well as assessing their cognitive plausibility as models of language behaviour.
      </p>
      <p>2. Data and Experiments</p>
      <p>All models described in the following paragraphs were trained, validated, and tested on data from the GECO corpus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We used a 5-fold cross-validation with 95% training, 5% validation and 5% test. Experiments were conducted using the PyTorch library [11] in Python or MatLab [12].</p>
      <p>2.1. Dataset</p>
      <p>The GECO corpus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] contains data from 14 English native speakers whose eye movements were recorded while reading Agatha Christie's novel "The Mysterious Affair at Styles" (56410 tokens). Out of the eight word-level eye-tracking measurements used in [10], we focused on i) first-pass duration (FPD), the time spent fixating a word the first time it is encountered, averaged over subjects (see Fig. 2), and ii) fixation proportion (FPROP) or probability, the number of subjects that fixated a word, divided by the total number of subjects.
      </p>
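      <p>Concretely, both measures are simple per-token aggregates over the 14 participants. The following toy sketch is our own illustration of that aggregation; the array layout and the treatment of skipped words are assumptions, not details reported for GECO.</p>
      <preformat>
# Toy sketch (assumed aggregation; not the GECO preprocessing code).
# Rows: subjects; columns: tokens; 0 marks a word the subject never fixated.
import numpy as np

first_pass = np.array([[210., 0., 180.],    # subject 1
                       [190., 150., 0.],    # subject 2
                       [250., 0., 220.]])   # subject 3

fixated = first_pass > 0
# FPD: first-pass duration averaged over subjects (here, over subjects who fixated the word)
fpd = first_pass.sum(axis=0) / np.maximum(fixated.sum(axis=0), 1)
# FPROP: number of subjects who fixated the word, divided by the total number of subjects
fprop = fixated.mean(axis=0)
print(fpd)    # [216.67 150.   200.  ]
print(fprop)  # [1.     0.333  0.667]
      </preformat>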
      <p>Word tokens in the original dataset were encoded with linguistic information including:
i) character length (removing punctuation)
ii) log frequency (source: BNC [13])
iii) part-of-speech tag (source: Stanza [14])
iv) context surprisal/predictability (source: GPT-2 [
        <xref ref-type="bibr" rid="ref3">15, 16, 3</xref>
        ])
v) distance from the beginning of the sentence (number of intervening tokens)
vi) distance from the end of the sentence (number of intervening tokens)
vii) presence of heavy punctuation after the token
viii) presence of light punctuation after the token.</p>
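      <p>By way of illustration, feature (iv) can be derived from an autoregressive language model. The sketch below is an assumption of how per-word surprisal could be computed with GPT-2 (summing sub-token surprisals within a word); it is not the pipeline used to build the dataset.</p>
      <preformat>
# Sketch: per-word surprisal (in bits) from GPT-2; an assumed recipe for feature (iv),
# not the authors' original script.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(words):
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    ids = enc.input_ids[0]
    with torch.no_grad():
        logprobs = torch.log_softmax(lm(**enc).logits[0], dim=-1)
    surp = [0.0] * len(words)
    for pos in range(1, ids.size(0)):              # the first sub-token has no left context
        word_idx = enc.word_ids(0)[pos]
        if word_idx is not None:
            # -log2 P(sub-token | preceding sub-tokens), accumulated per word
            surp[word_idx] += -logprobs[pos - 1, ids[pos]].item() / math.log(2)
    return surp

print(word_surprisals("The mysterious affair at Styles".split()))
      </preformat>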
      <p>2.2. BERT ++</p>
      <p>To replicate results from [10], we used BERT [17] with a linear layer on top of it. The linear layer gets BERT contextual word embeddings as input, to predict FPD and FPROP. After sentence padding and tokenization, irrelevant and special subtokens were masked to enforce a correspondence between each vector in the target sequence and each vector in the output sequence, and to train the loss only on relevant tokens. Mean Square Error (MSE) loss was used along with the AdamW optimizer (with no weight decay for the biases). The initial learning rate was set to 5 · 10⁻⁵, and a linear scheduler was used. We used a batch size of 16 sentences and 100 training epochs, with an early stopping criterion (best model on the validation set). The model was trained both with fine-tuning (i.e. by also training BERT internal weights: bert FT + layer) and without fine-tuning (by only training the final layer weights: bert + layer).</p>
      <p>Finally, we used BERT also in combination with a sequential LSTM network. This model (bert + LSTM) takes the pre-trained BERT contextual word embeddings (i.e. without fine-tuning) as input, along with the lexical features (i), (ii) and (iv), to predict FPD and FPROP.</p>
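      <p>A minimal PyTorch sketch of this setup is given below; it is a reconstruction from the description above, and the model name, head size and masked-loss details are assumptions (the linear scheduler and early stopping are omitted for brevity).</p>
      <preformat>
# Sketch of "bert + layer" / "bert FT + layer": a linear head regressing FPD and FPROP
# from BERT sub-token embeddings, trained with MSE and AdamW (lr = 5e-5), with special
# and padding sub-tokens masked out of the loss. Unstated details are assumptions.
import torch
from torch import nn
from transformers import AutoModel

class BertFixationRegressor(nn.Module):
    def __init__(self, fine_tune=False, n_targets=2):       # FPD and FPROP
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        if not fine_tune:                                    # "bert + layer": freeze BERT
            for p in self.bert.parameters():
                p.requires_grad = False
        self.head = nn.Linear(self.bert.config.hidden_size, n_targets)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)                             # one prediction per sub-token

model = BertFixationRegressor(fine_tune=True)                # "bert FT + layer"
optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=5e-5)

def masked_mse(pred, target, target_mask):
    # irrelevant/special sub-tokens carry no eye-tracking target: exclude them from the loss
    err = ((pred - target) ** 2).mean(dim=-1)
    return (err * target_mask).sum() / target_mask.sum()
      </preformat>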
      <p>2.3. LSTM</p>
      <p>Reading is inherently sequential. Thus, recurrent neural networks appear to offer a promising approach to modelling a fixation prediction task, and a good alternative to transformers. Using the GECO dataset split into pages rather than sentences, we trained an LSTM with 96 hidden units and a single layer, with a feed-forward network using tanh activation functions on top of it. The model (lstm) takes as input the lexical features (i)-(iv) for the target token and 4 tokens to its left and 3 to its right, to predict FPD and FPROP of the target token. MSE loss was used along with the AdamW optimizer. The initial learning rate was set to 5 · 10⁻³, with a linear scheduler and a batch containing the entire training dataset. The model was trained for 3000 epochs with an early stopping criterion (best model on the validation set).</p>
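      <p>A corresponding sketch of this architecture follows; the per-token feature dimensionality, the size of the feed-forward head and the batching are our assumptions.</p>
      <preformat>
# Sketch of the "lstm" model: features (i)-(iv) for the target token plus 4 left / 3 right
# neighbours, fed to a single-layer LSTM (96 hidden units) topped by a tanh feed-forward
# head. Feature dimensionality and head size are assumptions.
import torch
from torch import nn

N_FEATS = 4          # length, log frequency, POS (numerically encoded), surprisal
WINDOW = 4 + 1 + 3   # 4 tokens to the left, the target, 3 tokens to the right

class LstmFixationModel(nn.Module):
    def __init__(self, hidden=96, n_targets=2):              # FPD and FPROP
        super().__init__()
        self.lstm = nn.LSTM(N_FEATS, hidden, num_layers=1, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, n_targets))

    def forward(self, x):                  # x: (batch, WINDOW, N_FEATS)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # predict FPD/FPROP of the target token

model = LstmFixationModel()
optim = torch.optim.AdamW(model.parameters(), lr=5e-3)   # linear schedule, full-batch training
pred = model(torch.randn(8, WINDOW, N_FEATS))             # toy batch -> (8, 2)
      </preformat>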
      <p>2.4. MLP</p>
      <p>A Multi-Layer Perceptron (mlp) was trained using the entire set of lexical features (i)-(viii) as input, with an input context consisting of the two words immediately preceding and following the target word. Several instances of this architecture were tested, but only the results of the best performing instance (with a single hidden layer of 10 units, sigmoidal activation functions, the Adam optimiser, the MSE loss, a constant learning rate of 0.1, and 1000 training epochs) are reported here.</p>
      <p>An identical MLP model (mlp UDT) was eventually trained on a subset of the GECO training data, obtained by sampling target features uniformly. This was done to train the network with an equal number of tokens for each bin of fixation times, and to assess the impact of different distributions of input data on the network's performance on test data.</p>
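      <p>The corresponding sketch is given below (mlp UDT differs only in how the training tokens are sampled); the per-word feature encoding and batch handling are assumptions.</p>
      <preformat>
# Sketch of the "mlp" model: features (i)-(viii) for the target word and the two words
# before and after it, one hidden layer of 10 sigmoid units, MSE loss, Adam with a
# constant learning rate of 0.1, 1000 epochs. Feature encoding is an assumption.
import torch
from torch import nn

N_FEATS = 8                      # features (i)-(viii) per word
CONTEXT = 5                      # target word plus two preceding and two following words

mlp = nn.Sequential(
    nn.Linear(N_FEATS * CONTEXT, 10),
    nn.Sigmoid(),
    nn.Linear(10, 2),            # FPD and FPROP
)
optim = torch.optim.Adam(mlp.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(32, N_FEATS * CONTEXT)       # a toy batch of flattened feature windows
y = torch.rand(32, 2)
for _ in range(1000):
    optim.zero_grad()
    loss = loss_fn(mlp(x), y)
    loss.backward()
    optim.step()
      </preformat>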
      <sec id="sec-1-1">
        <p>Loss accuracy (accL) is a measure of the overall similarity between predicted and target values, calculated as the complement to 1 of the Mean Absolute Error (MAE), after fitting the target data in the training set into the [0, 1] range with min-max scaling: accL = 1 − (1/N) Σ_{t∈T} |ŷ_t − y_t|, where y_t = x_t / max{x} is the scaled target value and ŷ_t is the model prediction for y_t. Loss accuracy is the metric used in [10].</p>
        <p>Threshold accuracy (accT) measures how many times the predicted value is close to the target value within a fixed threshold, and is calculated as follows: accT(θ) = 1 − (1/N) Σ_{t∈T} H(|ŷ_t − y_t| − θ).</p>
      </sec>
      <sec id="sec-1-2">
        <p>Sensitivity accuracy (accS) counts how many times the predicted value is close to the target value within a threshold dynamically calculated on the basis of the target value: the higher the target value, the higher the threshold. An offset value is needed to obtain a positive threshold also for zero target values. This accuracy is calculated as follows: accS(κ, θ₀) = 1 − (1/N) Σ_{t∈T} H(|ŷ_t − y_t| − (κ · y_t + θ₀)), where N is the number of examples in the training/test set, H is the Heaviside step function, θ and θ₀ are (offset) thresholds, and κ is a sensitivity coefficient. As for FPD, which is a duration expressed in milliseconds, we used θ₀ = 25 and κ = 10% for accS, and θ = 50 for accT. As for FPROP, which is a probability, we used θ₀ = 0.01 and κ = 10% for accS, and θ = 0.1 for accT.</p>
        <p>Finally, the performance of our models was compared against a baseline model (const) that always outputs the overall mean fixation duration (across both subjects and items) in the training data.</p>
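        <p>In compact form, the three metrics can be written as in the following sketch (our notation and variable names; the const baseline is included for comparison).</p>
        <preformat>
# Sketch of the evaluation metrics accL, accT and accS (our notation).
import numpy as np

def acc_loss(y, y_hat):
    """accL: 1 - MAE, with targets and predictions scaled into [0, 1] by max(y)."""
    return 1.0 - np.mean(np.abs(y_hat - y))

def acc_threshold(y, y_hat, theta):
    """accT: share of items whose absolute error stays within a fixed threshold theta."""
    return np.mean(np.abs(y_hat - y) &lt;= theta)

def acc_sensitivity(y, y_hat, kappa, theta0):
    """accS: share of items whose error stays within the relative bound kappa*y + theta0."""
    return np.mean(np.abs(y_hat - y) &lt;= kappa * y + theta0)

fpd = np.array([120., 36., 280., 60.])        # toy FPD targets
const = np.full_like(fpd, fpd.mean())         # the "const" baseline prediction
print(acc_threshold(fpd, const, theta=50),
      acc_sensitivity(fpd, const, kappa=0.10, theta0=25))
        </preformat>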
        <p>3. Results</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Data analysis</title>
      <sec id="sec-2-1">
        <p>To what extent are neural network models sensitive to some of the factors accounting for gaze patterns in human natural reading? Are language models able to adapt themselves to both lexical properties and in-context features of a reading text, thus exhibiting a human-like performance?</p>
        <p>Human reading behaviour is shown to be affected by lexical features – e.g. word length and frequency, and morphological complexity – as well as by contextual factors, with a facilitatory effect of contextual redundancy and predictability [18, 19] on reading duration and eye fixations. Accordingly, we modelled human FPDs as a response variable resulting from the interaction of both lexical and contextual predictors: namely, word length, a dichotomous classification of token POS into content versus function words, surprisal of the target word as a measure of how unexpected or unpredictable the word is, and the probability of the word immediately preceding the target word in context (to account for so-called spill-over effects). Additionally, we used a Generalised Additive Model (GAM), with token log-frequency as a smooth term, to model possibly non-linear effects of predictors. Models' coefficients and effect plots are shown in Appendix C (Figure 3 and Table 4).</p>
        <p>GAMs with identical independent variables have been run to model the FPDs predicted by all our neural networks, on both training and test data. Inspection of effect plots and model coefficients – as reported in Appendix C – shows a behavioural alignment of all models with human data for what concerns the modulation of fixation times by lexical features, in both train and test data.</p>
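        <p>For illustration, the structure of this regression can be approximated in Python with an ordinary linear interaction model; the actual analysis used GAMs fitted with gamm4 in R, with a smooth term for log frequency, and the column names below are hypothetical.</p>
        <preformat>
# Illustrative only: a linear-model counterpart of the GAM described above, fitted with
# statsmodels. The real analysis used gamm4 in R; column names (fpd, length, is_function,
# surprisal, prob_minus1, log_freq) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("geco_word_level.csv")      # hypothetical word-level table

model = smf.ols(
    "fpd ~ length * C(is_function) + surprisal * prob_minus1 + log_freq",
    data=df,
).fit()
print(model.summary())                       # coefficient table analogous to Appendix C
        </preformat>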
        <p>token
lstm (training) =0.79*****
)0.5
sd0.4
n
co0.3
(se0.2
PD0.1</p>
        <p>F 0
1 2 3 4 5</p>
        <p>token 104
)0.5 bert FT + layer (training) =0.98*****
sd0.4
n
co0.3
(se0.2
PD0.1
F 0</p>
        <p>token
1
2
3
4</p>
      </sec>
      <sec id="sec-2-2">
        <p>In contrast, all models fail to capture some contextual effects on test data, such as those observed in a context window of – at least – two adjacent words. To illustrate, efficient syntactic chunking (e.g. of noun, verb and prepositional phrases) has been shown to lead to faster and more accurate human reading (see, for example, [20]).</p>
        <p>Conversely, most neural networks show no statistically significant effect on fixation duration of the probability of the immediately preceding word in context. This is observed either in isolation (probMinus1), in LSTMs and transformer-based models with BERT representations (either fine-tuned or not), or in interaction with the unpredictability of the target word (surprisal:probMinus1). The evidence shows that most neural models cannot replicate, among other things, so-called spillover effects of the left context on the reading time of ensuing words [21].</p>
        <p>Table 2: Sensitivity accuracy (accS) values for three bins from the FPD distribution: low (FPD below the 5th percentile = 36 ms), medium (FPD ranging from the 5th to the 95th percentile), and high (FPD above the 95th percentile = 280 ms).</p>
        <p>5. General Discussion</p>
        <p>Transformer-based neural networks appear to reasonably predict fixation probability and first-pass duration of words in human reading of English connected texts. Our present investigation basically supports this conclusion, while providing new evidence on two questions that naturally arise in this context. How accurate are transformer-based predictions compared with the best predictions of other neural network classifiers trained on the same task? How cognitively plausible are the mechanisms underpinning this performance? Here, we addressed both questions by testing various models on the task of predicting human reading measurements from the GECO corpus, using different evaluation metrics and regressing network predictions on a few linguistic factors that are known to account for human reading behaviour.</p>
        <p>Our first observation is that assessing a network's performance by looking at its MAE loss function provides a rather gross evaluation of the effective power of a neural network simulating human reading behaviour. A baseline model assigning each token a constant gaze duration that equals the average of all FPD values attested in GECO achieves a 95.7% loss-based accuracy on both test and training data. That a transformer-based classifier scores 97.2% on the same metric and the same test data cannot be held, as such, as a sign of outstanding performance. In fact, it turns out that the MAE loss function is blind to both the magnitude of a network error and possible biases in the prediction of very low/high target values. Thus, it provides an inflated estimate of a model's accuracy. We suggest that binary evaluation metrics based on a fixed threshold partially overcome these limitations. Yet, as single-word fixation times typically range between tens and hundreds of milliseconds, application of a fixed threshold will differently affect tokens with different fixation times. We conclude that a relative threshold based on each word's fixation time is a fairer way to measure prediction accuracy. Clearly, this comes at a cost. When assessed with a relative threshold, the accuracy of a transformer-based architecture on test data drops from 70% down to 57.8%.</p>
        <p>It turned out that all the other network models tested for the present purposes showed accuracy levels that are comparable to the accuracy of a transformer-based architecture. Since the former are trained on a more restricted set of lexical and contextual input features than the latter, this seems to suggest that word embeddings are of limited use in the task at hand. Although fine-tuned word embeddings actually appear to score much higher on training data (even using accT and accS), we observe that this is due to data overfitting, as clearly shown by the considerably poorer performance of the fine-tuned model on test data.</p>
        <p>An analysis of the psychometric plausibility of the gaze patterns simulated with our neural models reveals that a relatively small set of linguistic factors that are known to account for a sizeable amount of variance in human fixation times can also account for the bulk of variance in the models' behaviour. This is relatively unsurprising, as most of these models were trained on input features that encode at least some of these factors. Nonetheless, we believe that the result is interesting for at least two reasons. First, it shows a promising convergence between computational metrics of model accuracy and quantitative models of psychometric assessment. Secondly, it suggests that one can gain non-trivial insights into a model's behaviour by analysing to what extent that behaviour is sensitive to the same linguistic factors human readers are known to be sensitive to. On the one hand, this is a step towards understanding what information a neural model is actually learning and putting to use for the task. On the other hand, it is instrumental in developing better models, as it shows what type of input information is most needed to successfully carry out a task, at least if one is trying to simulate the way the same task is carried out by speakers.</p>
        <p>In the end, it may well be the case that a 70% fixed-threshold accuracy in simulating average gaze patterns in human reading is not as disappointing as it might seem. Given the wide variability in human reading behaviour (and even in a single reader when confronted with different texts), a considerable amount of variance in our data may simply be accounted for by by-subject (or by-token) random effects. In some experiments not reported here we trained our models to predict single-reader behaviour. All architectures fared rather poorly on the task, a result which is in line with similar disappointing results on other output features reported in [10]. Looking back at Figure 1, it can be noted that all models' predictions fall within a μ_t ± σ_t range, where μ_t and σ_t are, respectively, the by-reader mean and standard deviation of FPD values for token t (see also Table 2). This pattern may suggest that models' predictions are in fact bounded by the standard deviation we observe in human behaviour and cannot reach beyond these bounds. Conversely, this evidence may be interpreted as suggesting that more input features are needed to build more accurate classifiers. Further experiments are needed to test the merits of either conjecture.</p>
        <p>6. Limitations and outlook</p>
        <p>In the present paper, we replicated recent experimental data of transformer-based architectures simulating word fixation duration in reading a connected text [10], with a view to assessing their relative performance compared with reading times by humans and other neural architectures. This justifies our exclusive focus on fixation duration, which is, admittedly, only one behavioural correlate of a complex, inherently multimodal task such as reading. In fact, reading requires the fine coordination of eye movements and articulatory movements for text decoding and comprehension. The eye provides access to the visual stimuli needed for voice articulation to unfold at a relatively constant rate. In turn, articulation can feed back to oculomotor control for eye movements to be directed when and where processing difficulties arise. Incidentally, this is also true of silent reading, as shown by evidence supporting the Implicit Prosody Hypothesis [22], i.e. the idea that, in silent reading, readers activate prosodic representations that are similar to those they would produce when reading the text aloud. Hence, a reader must always rely on a tight control strategy to ensure that fixation and articulation are optimally coordinated.</p>
        <p>A clear limitation of our current work and all experiments reported here is that we are only focusing on one dimension of a complex, multimodal behaviour like reading. Recently, we showed that there is a lot about gaze patterns that we can understand by correlating eye movements with voice articulation [23]. This information, which cannot be represented in a dataset structured at the word level, may be critical for a model to accurately learn and mimic the cognitive mechanisms underlying natural reading. Likewise, as correctly pointed out by one of our reviewers, focusing on fixation times while ignoring saccadic movements may seriously detract from the explanatory power of any computational model of human reading. In fact, this could be tantamount to timing a bike rider's speed while ignoring whether she is climbing up a hill or approaching a sharp turn. More realistic models of reading are bound to include more aspects of reading behaviour in more ecologically valid tasks. In the end, it may well be the case that the task of predicting gaze patterns of human reading should be conceptualized differently, by anchoring these patterns not only to the syntagmatic dimension of a written text, but also to the time-line of the different movements and multimodal processes that unfold during reading.</p>
        <p>A. GeCO FPD data</p>
        <p>Figure 2: A view of FPD data in the GECO dataset, consisting of eye-tracking patterns of 14 adult participants reading the novel "The Mysterious Affair at Styles" by Agatha Christie. Top panel: distributions of FPD data, with chapters grouped into 4 parts, for participant #1 (with 3 more participants showing a similar distribution), participant #2 (with 8 more participants showing a similar distribution) and participant #10; the rightmost box plots show the average distribution across all 14 participants. Bottom panel: plot of all 56410 tokens in the dataset, in ascending order of mean FPD (dashed black line). For each token, the standard deviation calculated on the distribution of the FPDs of the 14 participants is shown both above and below the mean value (gray dots).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>B. FPROP accuracy</title>
    </sec>
    <sec id="sec-4">
      <title>C. Data analysis</title>
      <p>In this section, coefficients of Generalised Additive Models (GAMs) are detailed for each neural model. Statistically non-significant p-values of GAM predictor terms are given in bold face. GAMs are fitted using the gamm4 package, version 0.2-6, of the R statistical software [24], as they do not assume a linear relation between the fitted variable and its predictors. All plots were created with the ggplot2 package, version 3.5.</p>
      <sec id="sec-4-1">
        <title>Human FPD</title>
        <p>parametric coef. | estimate | std. error | t value | Pr(&gt;|t|)
Intercept (content) | 6.960e-02 | 7.858e-04 | 88.568 | &lt; 2e-16
surprisal | 1.928e-03 | 5.002e-05 | 38.539 | &lt; 2e-16
probMinus1 | -1.395e-02 | 1.363e-03 | -10.233 | &lt; 2e-16
Intercept (function) | -2.599e-02 | 1.143e-03 | -22.746 | &lt; 2e-16
length (content) | 1.562e-02 | 1.423e-04 | 109.767 | &lt; 2e-16
length (function) | 5.499e-03 | 2.791e-04 | 19.704 | &lt; 2e-16
surprisal:probMinus1 | 4.692e-04 | 1.776e-04 | 2.642 | &lt; 0.01
s(logFreq) | | | | &lt; 2e-16
R2 | 58.4%</p>
      </sec>
      <sec id="sec-4-2">
        <title>BERT FPD</title>
        <p>parametric coef. | estimate | std. error | t value | Pr(&gt;|t|)
Intercept (content) | 9.626e-02 | 4.765e-04 | 202.020 | &lt; 2e-16
surprisal | 1.319e-03 | 3.027e-05 | 43.586 | &lt; 2e-16
probMinus1 | -4.998e-03 | 8.245e-04 | -6.061 | 1.3e-09
Intercept (function) | -2.293e-02 | 6.937e-04 | -33.053 | &lt; 2e-16
length (content) | 1.019e-02 | 8.616e-05 | 118.232 | &lt; 2e-16
length (function) | 2.892e-03 | 1.693e-04 | 17.085 | &lt; 2e-16
surprisal:probMinus1 | -3.874e-04 | 1.077e-04 | -3.599 | &lt; 0.001
s(logFreq) | | | | &lt; 2e-16
R2 | 75.6%</p>
        <p>Intercept (content) | 0.0960782 | 0.0021829 | 44.014 | &lt; 2e-16
surprisal | 0.0012786 | 0.0001409 | 9.073 | 2.3e-13
probMinus1 | -0.0013508 | 0.0037907 | -0.356 | 0.72
Intercept (function) | -0.0192904 | 0.0030629 | -6.298 | 3.4e-10
length (content) | 0.0102735 | 0.0003941 | 26.069 | &lt; 2e-16
length (function) | 0.0027876 | 0.0007299 | 3.819 | &lt; 0.001
surprisal:probMinus1 | -0.0008111 | 0.0004600 | -1.763 | 0.08
s(logFreq) | | | | &lt; 2e-16
R2 | 73.5%</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gerth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Festman</surname>
          </string-name>
          ,
          <article-title>Reading development, word length and frequency efects: An eye-tracking study with slow and fast readers, Frontiers in Communication 6 (</article-title>
          <year>2021</year>
          )
          <fpage>743113</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Häikiö</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Dickins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hyönä</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Liversedge</surname>
          </string-name>
          ,
          <article-title>Eye movements of children and adults reading in three diferent orthographies</article-title>
          .,
          <source>Journal of Experimental Psychology: Learning, Memory, and Cognition</source>
          <volume>48</volume>
          (
          <year>2022</year>
          )
          <fpage>1518</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Salicchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chersoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <article-title>A study on surprisal and semantic relatedness for eye-tracking data prediction</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>1112365</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hirotani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Frazier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rayner</surname>
          </string-name>
          ,
          <article-title>Punctuation and intonation efects on clause and sentence wrap-up: Evidence from eye movements</article-title>
          ,
          <source>Journal of Memory and Language</source>
          <volume>54</volume>
          (
          <year>2006</year>
          )
          <fpage>425</fpage>
          -
          <lpage>443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Reichle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rayner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pollatsek</surname>
          </string-name>
          ,
          <article-title>The E-Z Reader model of eye-movement control in reading: Comparisons to other models</article-title>
          ,
          <source>Behavioral and Brain Sciences</source>
          <volume>26</volume>
          (
          <year>2003</year>
          )
          <fpage>445</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Engbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nuthmann</surname>
          </string-name>
          , E. Richter,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kliegl</surname>
          </string-name>
          ,
          <article-title>SWIFT: A Dynamical Model of Saccade Generation During Reading</article-title>
          .,
          <source>Psychological review 112</source>
          (
          <year>2005</year>
          )
          <fpage>777</fpage>
          -
          <lpage>813</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>U.</given-names>
            <surname>Cop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dirix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Drieghe</surname>
          </string-name>
          , W. Duyck,
          Presenting GECO:
          <article-title>An eyetracking corpus of monolingual and bilingual sentence reading</article-title>
          ,
          <source>Behavior Research Methods</source>
          <volume>49</volume>
          (
          <year>2017</year>
          )
          <fpage>602</fpage>
          -
          <lpage>615</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] … A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, S. Zhang, M. Suo, P. Tillet, X. Zhao, E. Wang, K. Zhou, R. Zou, X. Wang, A. Mathews, W. Wen, G. Chanan, P. Wu, S. Chintala, PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24, Association for Computing Machinery, 2024, pp. 929-947.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] T. M. Inc., MATLAB version 9.7.0.1190202 (R2019b), 2019.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] The British National Corpus, XML edition, 2007.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners (2019).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. A. Michaelov, B. K. Bergen, Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?, arXiv preprint arXiv:2208.14554 (2022).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019. ArXiv:1810.04805 [cs], version 2.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] K. E. Stanovich, Attentional and automatic context effects in reading, in: Interactive Processes in Reading, Routledge, 2017, pp. 241-267.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] G. B. Simpson, R. R. Peterson, M. A. Casteel, C. Burgess, Lexical and sentence context effects in word recognition, Journal of Experimental Psychology: Learning, Memory, and Cognition 15 (1989) 88.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] K. Rayner, K. H. Chace, T. J. Slattery, J. Ashby, Eye movements as reflections of comprehension processes in reading, Scientific Studies of Reading 10 (2006) 241-255.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] N. J. Smith, R. Levy, The effect of word predictability on reading time is logarithmic, Cognition 128 (2013) 302-319.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] M. Breen, Empirical investigations of the role of implicit prosody in sentence processing, Language and Linguistics Compass 8 (2014) 37-50.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Nadalini, C. Marzi, M. Ferro, L. Taxitari, A. Lento, D. Crepaldi, V. Pirrelli, Eye-voice and finger-voice spans in adults' oral reading of connected texts. Implications for reading research and assessment, The Mental Lexicon (2024). URL: https://benjamins.com/catalog/ml.00025.nad.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2023. URL: https://www.R-project.org/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>