Comparative Evaluation of Computational Models Predicting Eye Fixation Patterns During Reading: Insights from Transformers and Simpler Architectures

Alessandro Lento (1,2), Andrea Nadalini (1), Nadia Khlif (1,3), Vito Pirrelli (1), Claudia Marzi (1) and Marcello Ferro (1,*)

(1) Consiglio Nazionale delle Ricerche, Istituto di Linguistica Computazionale "A. Zampolli", Pisa, Italy
(2) Università Campus Bio-Medico, Roma, Italy
(3) University Mohammed First, Oujda, Morocco

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
(*) Corresponding author.
Emails: alessandro.lento@ilc.cnr.it (A. Lento); andrea.nadalini@ilc.cnr.it (A. Nadalini); nadia.khlif@ilc.cnr.it (N. Khlif); vito.pirrelli@ilc.cnr.it (V. Pirrelli); claudia.marzi@ilc.cnr.it (C. Marzi); marcello.ferro@ilc.cnr.it (M. Ferro)
ORCID: 0000-0002-5581-7451 (V. Pirrelli); 0000-0002-3427-2827 (C. Marzi); 0000-0002-1324-3699 (M. Ferro)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                                                Abstract
                                                Eye tracking records of natural text reading are known to provide significant insights into the cognitive processes underlying
                                                word processing and text comprehension, with gaze patterns, such as fixation duration and saccadic movements, being
                                                modulated by morphological, lexical, and higher-level structural properties of the text being read. Although some of these
                                                effects have been simulated with computational models, it is still not clear how accurately computational modelling can predict
                                                complex fixation patterns in connected text reading. State-of-the-art neural architectures have shown promising results, with
                                                pre-trained transformer-based classifiers having recently been claimed to outperform other competitors, achieving beyond
                                                95% accuracy. However, transformer-based models have neither been compared with alternative architectures nor adequately
                                                evaluated for their sensitivity to the linguistic factors affecting human reading. Here we address these issues by evaluating the
                                                performance of a pool of neural networks in classifying eye-fixation English data as a function of both lexical and contextual
                                                factors. We show that i) accuracy of transformer-based models has largely been overestimated, ii) other simpler models make
                                                comparable or even better predictions, iii) most models are sensitive to some of the major lexical factors accounting for at
                                                least 50% of human fixation variance, iv) most models fail to capture some significant context-sensitive interactions, such
                                                as those accounting for spillover effects in reading. The work shows the benefits of combining accuracy-based evaluation
                                                metrics with non-linear regression modelling of fixed and random effects on both real and simulated eye-tracking data.

                                                Keywords
                                                eye-tracking, eye fixation time prediction, neural network, contextual word embeddings, lexical features



1. Introduction

Eye-tracking records of natural text reading are a valuable window on the cognitive processes underlying word processing and text comprehension. By looking at fixation patterns it is possible to estimate the effects that lexical properties (e.g. length, frequency, orthographic similarity [1, 2]), contextual constraints (e.g. predictability [3]) and higher-level structures (e.g. syntactic structure or prosodic contour [4]) have on human word identification and processing. While psycholinguistic experiments have reliably assessed how such effects modulate reading times, it is not clear to what extent computational models of reading can simulate actual behavioural data such as gaze patterns and fixation durations.

Over the past 30 years, research in this field has made considerable progress, leading to the development of sophisticated computational models accounting for fine-grained aspects of eye movement behaviour during word and sentence reading (e.g. E-Z Reader [5], SWIFT [6]). A significant boost in this area came from large eye-tracking corpora of natural reading (e.g. GECO [7], ZuCo [8], MECO [9]), which allow (deep) learning models to be tested on the task of predicting eye-tracking metrics. Of late, Hollenstein and colleagues [10] reported that fine-tuned, pre-trained transformer language models can make reliable predictions on a wide range of eye-tracking measurements, covering both early and late stages of lexical processing. The evidence suggests that transformers can inherently encode the relative prominence of language units in a text, in ways that accurately replicate human reading skills and their underlying cognitive mechanisms. Although the accuracy of multilingual transformers is validated against eye-tracking evidence from different languages, the paper neither compares the performance of transformers with that of other neural network classifiers trained on the same task, nor does it show what specific knowledge is encoded and put to use by transformers, by looking at the factors affecting their behaviour.




In the present paper, we address both issues by assessing the performance of a pool of neural network classifiers on the English batch of Hollenstein et al.'s [10] data.

In what follows, we first describe the English data set and the pool of tested classifiers. Classifiers were selected so as to include either simpler neural architectures than transformers (as is the case with multi-layer perceptrons) or cognitively more plausible processing models (i.e. sequential long short-term memories). Hybrid models, resulting from the combination of different architectures, were also tested. We then move on to discussing the metrics used in [10] for evaluation, and suggest alternative ways to measure accuracy in a fixation prediction task. Finally, we investigate how sensitive each tested architecture is to a few linguistic factors that are known to account for a sizeable amount of variance in human reading gaze patterns. Although some neural networks turn out to be reasonably good at predicting fixation patterns and at replicating some robust psycholinguistic effects found in human data, it is still unclear whether this ability is due to specific aspects of their architecture, to the type of information they are provided with in input, or to their space of trainable parameters. We conclude that, contrary to recent over-enthusiastic reports, predicting eye-fixation patterns of human natural reading is still a big challenge for currently available neural architectures, including transformer-based ones. For this very reason, we contend that the task is key to understanding the inductive bias of these models, as well as to assessing their cognitive plausibility as models of language behaviour.

2. Data and Experiments

All models described in the following paragraphs were trained, validated, and tested on data from the GECO corpus [7]. We used a 5-fold cross-validation with 95% training, 5% validation and 5% test data. Experiments were conducted using the PyTorch library [11] in Python or in MATLAB [12].

2.1. Dataset

The GECO corpus [7] contains data from 14 English native speakers whose eye movements were recorded while reading Agatha Christie's novel "The Mysterious Affair at Styles" (56410 tokens). Out of the eight word-level eye-tracking measurements used in [10], we focused on i) first-pass duration (FPD), the time spent fixating a word the first time it is encountered, averaged over subjects (see Fig. 2), and ii) fixation proportion (FPROP) or probability, the number of subjects that fixated a word divided by the total number of subjects.

Word tokens in the original dataset were encoded with linguistic information including:

i) character length (removing punctuation)
ii) log frequency (source: BNC [13])
iii) part-of-speech tag (source: Stanza [14])
iv) context surprisal/predictability (source: GPT-2 [15, 16, 3]; see the sketch after this list)
v) distance from the beginning of the sentence (number of intervening tokens)
vi) distance from the end of the sentence (number of intervening tokens)
vii) presence of heavy punctuation after the token
viii) presence of light punctuation after the token.
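As an illustration of how a surprisal feature such as (iv) can be obtained, the following is a minimal sketch of per-token surprisal extraction with GPT-2, assuming the Hugging Face transformers implementation of the model; it is not the exact pipeline used to build the dataset.

```python
# Minimal sketch of GPT-2 surprisal extraction (feature iv). Assumes the
# Hugging Face "gpt2" checkpoint; the dataset's actual pipeline may differ.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(sentence: str) -> list[tuple[str, float]]:
    """Surprisal (in bits) of each GPT-2 subword token given its left context.
    Word-level surprisal can be obtained by summing over a word's subtokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    ids = enc.input_ids[0]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(**enc).logits[0], dim=-1)
    out = []
    for pos in range(1, len(ids)):       # the first token has no left context
        s = -log_probs[pos - 1, ids[pos]].item() / math.log(2)
        out.append((tokenizer.decode(int(ids[pos])), s))
    return out

print(token_surprisals("The butler opened the door without a word."))
```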
2.2. BERT++

To replicate results from [10], we used BERT [17] with a linear layer on top of it. The linear layer takes BERT contextual word embeddings as input, to predict FPD and FPROP.

After sentence padding and tokenization, irrelevant and special subtokens were masked out, so as to enforce a correspondence between each vector in the target sequence and each vector in the output sequence, and to compute the training loss only on relevant tokens. Mean Square Error (MSE) loss was used along with the AdamW optimizer (with no weight decay for the biases). The initial learning rate was set to 5·10⁻⁵, and a linear scheduler was used. We used a batch size of 16 sentences and 100 training epochs, with an early stopping criterion (best model on the validation set). The model was trained both with fine-tuning (i.e. by also training BERT internal weights: bert FT + layer) and without fine-tuning (by only training the final layer weights: bert + layer).

Finally, we also used BERT in combination with a sequential LSTM network. This model (bert + lstm) takes the pre-trained BERT contextual word embeddings (i.e. without fine-tuning) as input, along with the lexical features (i), (ii) and (iv), to predict FPD and FPROP.
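The following is a minimal PyTorch sketch of the bert (FT) + layer setup just described: a linear regression head on top of BERT token embeddings, with the loss restricted to word-aligned positions. The checkpoint name, module names and masking details are our assumptions, not the authors' code; the quoted hyper-parameters follow the text.

```python
# Sketch of the "bert (FT) + layer" regressor described above. Checkpoint name
# and masking details are assumptions; lr and optimizer follow the text.
import torch
import torch.nn as nn
from transformers import BertModel

class BertFixationRegressor(nn.Module):
    def __init__(self, fine_tune: bool = False, n_targets: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        if not fine_tune:                      # "bert + layer": BERT frozen
            for p in self.bert.parameters():
                p.requires_grad = False
        self.head = nn.Linear(self.bert.config.hidden_size, n_targets)  # FPD, FPROP

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)               # (batch, seq_len, n_targets)

def masked_mse(pred, target, relevant_mask):
    """MSE restricted to relevant (non-special, word-initial) subtoken positions."""
    err = (pred - target) ** 2 * relevant_mask.unsqueeze(-1)
    return err.sum() / (relevant_mask.sum() * pred.size(-1))

model = BertFixationRegressor(fine_tune=True)   # "bert FT + layer"
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5)
```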
2.3. LSTM

Reading is inherently sequential. Recurrent neural networks thus appear to offer a promising approach to modelling a fixation prediction task, and a good alternative to transformers. Using the GECO dataset split into pages rather than sentences, we trained an LSTM with 96 hidden units and a single layer, with a feed-forward network using tanh activation functions on top of it. The model (lstm) takes as input the lexical features (i)-(iv) for the target token, the 4 tokens to its left and the 3 tokens to its right, to predict FPD and FPROP of the target token. MSE loss was used along with the AdamW optimizer. The initial learning rate was set to 5·10⁻³, with a linear scheduler and a single batch containing the entire training dataset. The model was trained for 3000 epochs with an early stopping criterion (best model on the validation set).
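A minimal PyTorch sketch of the lstm model just described follows; the window layout and the width of the feed-forward head are our assumptions, while the quoted sizes and optimizer settings follow the text.

```python
# Sketch of the "lstm" model described above: single-layer LSTM with 96 hidden
# units, tanh feed-forward head, AdamW (lr = 5e-3), MSE loss.
import torch
import torch.nn as nn

class LstmFixationModel(nn.Module):
    def __init__(self, n_features: int = 4, hidden: int = 96, n_targets: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            num_layers=1, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, n_targets))

    def forward(self, x):
        # x: (batch, 8, n_features) = 4 left tokens, the target, 3 right tokens
        out, _ = self.lstm(x)
        return self.head(out[:, 4])     # hidden state aligned with the target

model = LstmFixationModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)
loss_fn = nn.MSELoss()
pred = model(torch.randn(2, 8, 4))      # toy forward pass: (2, 2) outputs
```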

2.4. MLP

A multi-layer perceptron (mlp) was trained using the entire set of lexical features (i)-(viii) as input, with an input context consisting of the two words immediately preceding and following the target word. Several instances of this architecture were tested, but only the results of the best-performing instance (with a single hidden layer of 10 units, sigmoidal activation functions, the Adam optimiser, the MSE loss, a constant learning rate of 0.1, and 1000 training epochs) are reported here.

An identical MLP model (mlp UDT) was eventually trained on a subset of the GECO training data, obtained by sampling target features uniformly. This was done to train the network with an equal number of tokens for each bin of fixation times, and to assess the impact of different distributions of input data on the network's performance on test data.
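A minimal PyTorch sketch of the mlp architecture just described follows; the input layout (eight features for the target word and for its immediate neighbours) is our reading of the text, not the authors' code.

```python
# Sketch of the "mlp" model described above: one hidden layer of 10 sigmoid
# units, Adam, MSE, lr = 0.1 (as in the text). The 3-word / 8-feature input
# layout is an assumption about how the context window is encoded.
import torch
import torch.nn as nn

N_FEATURES, N_WORDS = 8, 3          # features (i)-(viii) for target ± 1 word

mlp = nn.Sequential(
    nn.Linear(N_WORDS * N_FEATURES, 10),
    nn.Sigmoid(),
    nn.Linear(10, 2),               # outputs: FPD and FPROP
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
pred = mlp(torch.randn(16, N_WORDS * N_FEATURES))   # toy batch of 16 tokens
```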
2.5. Evaluation

We evaluated the performance of all our models using three accuracy metrics based on the absolute error between the predicted value o_i and the target value t_i on the i-th token of the GECO dataset:

    e_i = |o_i − t_i|

Loss accuracy (accL) is a measure of the overall similarity between predicted and target values, calculated as the complement to 1 of the Mean Absolute Error (MAE) after fitting the target data t_i in the training set into the [0, 1] range with min-max scaling:

    accL(set) = 1 − (1/N_set) · Σ_{i ∈ set} ê_i

where ê_i = |ô_i − t̂_i|, t̂_i = t_i / max_{j ∈ training set} t_j, and ô_i is the model prediction for t̂_i. Loss accuracy is the metric used in [10].

Threshold accuracy (accT) measures how often the predicted value is close to the target value within a fixed threshold, and is calculated as follows:

    accT(set) = 1 − (1/N_set) · Σ_{i ∈ set} θ[e_i − ε]

Sensitivity accuracy (accS) counts how often the predicted value is close to the target value within a threshold dynamically calculated on the basis of the target value: the higher the target value, the higher the threshold. An offset value is needed to obtain a positive threshold also for zero target values. The metric is calculated as follows:

    accS(set) = 1 − (1/N_set) · Σ_{i ∈ set} θ[e_i − (α · t_i + ε)]

where N_set is the number of examples in the training/test set, θ is the Heaviside step function, ε is a threshold and α is a sensitivity coefficient.

As for FPD, which is a duration expressed in seconds, we used ε = 25 ms and α = 10% for accS, and ε = 50 ms for accT. As for FPROP, which is a probability, we used ε = 0.01 and α = 10% for accS, and ε = 0.1 for accT.

Finally, the performance of our models was compared against a baseline model (const) that always outputs the overall mean fixation duration (across both subjects and items) in the training data.
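A compact NumPy sketch of the three metrics and of the const baseline, reconstructed from the definitions above (the defaults use the FPD settings):

```python
# NumPy sketch of accL, accT, accS and the const baseline, reconstructed from
# the definitions above; eps/alpha defaults are the FPD settings (seconds).
import numpy as np

def acc_L(pred, target, target_max):
    """Loss accuracy: 1 - MAE after scaling by the training-set maximum."""
    return 1.0 - np.mean(np.abs(pred - target) / target_max)

def acc_T(pred, target, eps=0.050):
    """Threshold accuracy: share of tokens with |error| below a fixed eps."""
    return np.mean(np.abs(pred - target) < eps)

def acc_S(pred, target, eps=0.025, alpha=0.10):
    """Sensitivity accuracy: share of tokens with |error| below alpha*t + eps."""
    return np.mean(np.abs(pred - target) < alpha * target + eps)

# const baseline: predict the mean training FPD for every test token
train_fpd = np.array([0.12, 0.20, 0.31, 0.08])        # toy FPDs in seconds
test_fpd = np.array([0.18, 0.25, 0.05])
const_pred = np.full_like(test_fpd, train_fpd.mean())
print(acc_L(const_pred, test_fpd, train_fpd.max()),
      acc_T(const_pred, test_fpd),
      acc_S(const_pred, test_fpd))
```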
3. Results

Models' results for FPD prediction are summarised in Table 1 and plotted in Fig. 1. The accL results reported in [10] for bert FT + layer are essentially replicated. However, being a simple average over all test instances, accL is blind to error magnitude, as well as to the possible presence of prediction biases for specific ranges of fixation values. Note that the const model, which predicts the same average FPD for every token in the test set, scores a flattering 95.68% on accL, vs. 36.97% on accS and 48.10% on accT.

Table 2 summarises accS values of all models after binning tokens into three FPD ranges.

                       test                             training
model             accS      accT      accL        accS      accT      accL
const            36.97%    48.10%    95.68%      37.07%    48.06%    95.69%
                 (0.83%)   (1.00%)   (0.05%)     (0.04%)   (0.05%)   (0.00%)
bert + layer     55.02%    67.82%    97.05%      58.11%    70.74%    97.25%
                 (0.86%)   (0.99%)   (0.05%)     (0.82%)   (0.70%)   (0.05%)
mlp UDT          56.41%    67.79%    96.21%      61.21%    72.37%    96.52%
                 (0.35%)   (0.79%)   (1.25%)     (0.95%)   (0.57%)   (1.08%)
bert + lstm      58.49%    70.01%    95.38%      63.64%    75.89%    95.90%
                 (0.91%)   (0.82%)   (0.07%)     (0.48%)   (0.77%)   (0.97%)
bert FT + layer  57.80%    70.03%    97.23%      93.18%    94.81%    98.80%
                 (1.02%)   (1.13%)   (0.05%)     (0.81%)   (0.71%)   (0.05%)
mlp              60.16%    73.05%    97.39%      60.63%    73.31%    97.40%
                 (0.85%)   (0.78%)   (0.04%)     (0.37%)   (0.24%)   (0.01%)
lstm             60.01%    73.18%    97.39%      61.66%    74.27%    97.45%
                 (0.38%)   (0.31%)   (0.03%)     (0.24%)   (0.19%)   (0.01%)

Table 1: Overall FPD prediction accuracy in the GECO dataset. For each model, three different accuracy scores are given, as described in the text; const is used as a baseline; highest accuracies in bold; lowest accuracies in italics.

[Figure 1: Models' predictions (red dots) plotted with target FPD values (black dots), after ordering tokens by increasing FPD. Grey dots represent averaged FPD values plus/minus their standard deviation across participants. Left: training data. Right: test data. From top to bottom: MLP, LSTM, fine-tuned BERT. For each plot, the Spearman-ρ correlation coefficient between predicted and target values is shown along with its significance level (mlp: ρ=0.79 training, ρ=0.78 test; lstm: ρ=0.79 training, ρ=0.78 test; bert FT + layer: ρ=0.98 training, ρ=0.78 test).]

          3-bin FPD accuracy on test
model               low      medium      high
const              0.00%     41.08%     0.00%
bert + layer      21.43%     58.98%    23.02%
mlp UDT           52.33%     56.91%    51.49%
bert + lstm       24.19%     62.17%    26.61%
bert FT + layer   32.86%     62.65%    31.65%
mlp               11.77%     64.38%    32.62%
lstm              19.05%     64.26%    29.45%

Table 2: Sensitivity accuracy (accS) values for three bins from the FPD distribution: low (FPD below the 5th percentile = 36 ms), medium (FPD ranging from the 5th to the 95th percentile), and high (FPD above the 95th percentile = 280 ms).

4. Data analysis

To what extent are neural network models sensitive to some of the factors accounting for gaze patterns in human natural reading? Are language models able to adapt to both the lexical properties and the in-context features of a text being read, thus exhibiting human-like performance?

Human reading behaviour is known to be affected by lexical features – e.g. word length and frequency, and morphological complexity – as well as by contextual factors, with a facilitatory effect of contextual redundancy and predictability [18, 19] on reading duration and eye fixations. Accordingly, we modelled human FPDs as a response variable resulting from the interaction of both lexical and contextual predictors: namely, word length, a dichotomous classification of token POS into content versus function words, surprisal of the target word as a measure of how unexpected or unpredictable the word is, and the probability of the word immediately preceding the target word in context (to account for so-called spill-over effects). Additionally, we used a Generalised Additive Model (GAM), with token log-frequency as a smooth term, to model possibly non-linear effects of predictors. Models' coefficients and effect plots are shown in Appendix C (Figure 3 and Table 4).

GAMs with identical independent variables were run to model the FPDs predicted by all our neural networks, on both training and test data. Inspection of effect plots and model coefficients – as reported in Appendix C – shows a behavioural alignment of all models with human data as far as the modulation of fixation times by lexical features is concerned, in both training and test data.
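The analysis itself was run in R with gamm4 (see Appendix C); the following is a rough Python analogue of the model formula FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq), assuming the pyGAM package and encoding the interactions as explicit product columns. Column names are hypothetical.

```python
# Rough Python analogue (not the authors' R/gamm4 code) of the GAM
# FPD ~ surprisal * probMinus1 + POSgroup * length + s(logFreq),
# assuming the pyGAM package; interactions are explicit product columns.
import numpy as np
import pandas as pd
from pygam import LinearGAM, l, s

def fit_fpd_gam(df: pd.DataFrame) -> LinearGAM:
    """df columns (hypothetical): fpd (seconds), surprisal, probMinus1,
    length, logFreq, is_function (0 = content word, 1 = function word)."""
    X = np.column_stack([
        df["surprisal"], df["probMinus1"],
        df["surprisal"] * df["probMinus1"],      # surprisal:probMinus1
        df["is_function"], df["length"],
        df["is_function"] * df["length"],        # POSgroup:length
        df["logFreq"],
    ])
    gam = LinearGAM(l(0) + l(1) + l(2) + l(3) + l(4) + l(5) + s(6))
    return gam.fit(X, df["fpd"].to_numpy())

# usage: gam = fit_fpd_gam(tokens_df); gam.summary()
```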
In contrast, all models fail to capture some contextual effects on test data, such as those observed in a context window of – at least – two adjacent words. To illustrate, efficient syntactic chunking (e.g. of noun, verb and prepositional phrases) has been shown to lead to faster and more accurate human reading (see, for example, [20]). Conversely, most neural networks show no statistically significant effect on fixation duration of the probability of the immediately preceding word in context. This holds either in isolation (probMinus1), for LSTMs and for transformer-based models with BERT representations (either fine-tuned or not), or in interaction with the unpredictability of the target word (surprisal:probMinus1). The evidence shows that most neural models cannot replicate, among other things, so-called spillover effects of the left context on the reading time of ensuing words [21].

5. General Discussion

Transformer-based neural networks appear to predict reasonably well the fixation probability and first-pass duration of words in human reading of English connected texts. Our present investigation basically supports this conclusion, while providing new evidence on two related questions that naturally arise in this context. How accurate are transformer-based predictions compared with the best predictions of other neural network classifiers trained on the same task? How cognitively plausible are the mechanisms underpinning this performance? Here, we addressed both questions by testing various models on the task of predicting human reading measurements from the GECO corpus, using different evaluation metrics and regressing network predictions on a few linguistic factors that are known to account for human reading behaviour.

Our first observation is that assessing a network's performance by looking at its MAE loss function provides a rather gross evaluation of how effectively a neural network simulates human reading behaviour. A baseline model assigning each token a constant gaze duration, equal to the average of all FPD values attested in GECO, achieves a 95.7% loss-based accuracy on both test and training data. That a transformer-based classifier scores 97.2% on the same metric and the same test data cannot be taken, as such, as a sign of outstanding performance. In fact, it turns out that the MAE loss function is blind both to the magnitude of a network's errors and to possible biases in the prediction of very low/high target values. Thus, it provides an inflated estimate of a model's accuracy. We suggest that binary evaluation metrics based on a fixed threshold partially overcome these limitations. Yet, as single-word fixation times typically range from tens to hundreds of milliseconds, applying a fixed threshold will affect tokens with different fixation times differently. We conclude that a relative threshold based on each word's fixation time is a fairer way to measure prediction accuracy. Clearly, this comes at a cost: when assessed with a relative threshold, the accuracy of a transformer-based architecture on test data drops from 70% down to 57.8%.

It turned out that all other network models tested for the present purposes showed accuracy levels that are comparable to that of a transformer-based architecture. Since the former are trained on a more restricted set of lexical and contextual input features than the latter, this seems to suggest that word embeddings are of limited use in the task at hand. Although fine-tuned word embeddings actually appear to score much higher on training data (even using accT and accS), we observe that this is due to data overfitting, as clearly shown by the considerably poorer performance of the fine-tuned model on test data.

An analysis of the psychometric plausibility of the gaze patterns simulated with our neural models reveals that a relatively small set of linguistic factors known to account for a sizeable amount of variance in human fixation times can also account for the bulk of variance in the models' behaviour. This is relatively unsurprising, as most of these models were trained on input features that encode at least some of these factors. Nonetheless, we believe that the result is interesting for at least two reasons. First, it shows a promising convergence between computational metrics of model accuracy and quantitative models of psychometric assessment. Secondly, it suggests that one can gain non-trivial insights into a model's behaviour by analysing to what extent that behaviour is sensitive to the same linguistic factors human readers are known to be sensitive to. On the one hand, this is a step towards understanding what information a neural model is actually learning and putting to use for the task. On the other hand, it is instrumental in developing better models, as it shows what type of input information is most needed to successfully carry out a task, at least if one is trying to simulate the way the same task is carried out by speakers.

In the end, it may well be the case that a 70% fixed-threshold accuracy in simulating average gaze patterns in human reading is not as disappointing as it might seem. Given the wide variability in human reading behaviour (and even within a single reader confronted with different texts), a considerable amount of variance in our data may simply be accounted for by by-subject (or by-token) random effects. In some experiments not reported here, we trained our models to predict single-reader behaviour. All architectures fared rather poorly on that task, a result which is in line with similarly disappointing results on other output features reported in [10]. Looking back at Figure 1, it can be noted that all models' predictions fall within a μ_i ± σ_i range, where μ_i and σ_i are, respectively, the by-reader mean and standard deviation of FPD values for token i (see also Table 2). This pattern may suggest that models' predictions are in fact bounded by the standard deviation we observe in human behaviour and cannot reach beyond these bounds. Conversely, this evidence may be interpreted as suggesting that more input features are needed to build more accurate classifiers. Further experiments are needed to test the merits of either conjecture.

6. Limitations and outlook

In the present paper, we replicated recent experimental results on transformer-based architectures simulating word fixation duration in the reading of a connected text [10], with a view to assessing their performance relative to human reading times and to other neural architectures. This justifies our exclusive focus on fixation duration, which is, admittedly, only one behavioural correlate of a complex, inherently multimodal task such as reading. In fact, reading requires the fine coordination of eye movements and articulatory movements for text decoding and comprehension. The eye provides access to the visual stimuli needed for voice articulation to unfold at a relatively constant rate. In turn, articulation can feed back into oculomotor control, so that eye movements are directed when and where processing difficulties arise. Incidentally, this is also true of silent reading, as shown by evidence supporting the Implicit Prosody Hypothesis
[22], i.e. the idea that, in silent reading, readers activate prosodic representations that are similar to those they would produce when reading the text aloud. Hence, a reader must always rely on a tight control strategy to ensure that fixation and articulation are optimally coordinated.

A clear limitation of our current work and of all the experiments reported here is that we are only focusing on one dimension of a complex, multimodal behaviour like reading. Recently, we showed that there is a lot about gaze patterns that we can understand by correlating eye movements with voice articulation [23]. This information, which cannot be represented in a dataset structured at the word level, may be critical for a model to accurately learn and mimic the cognitive mechanisms underlying natural reading. Likewise, as correctly pointed out by one of our reviewers, focusing on fixation times while ignoring saccadic movements may seriously detract from the explanatory power of any computational model of human reading. In fact, this would be tantamount to timing a bike rider's speed while ignoring whether she is climbing up a hill or approaching a sharp turn. More realistic models of reading are bound to include more aspects of reading behaviour in more ecologically valid tasks. In the end, it may well be the case that the task of predicting gaze patterns of human reading should be conceptualized differently, by anchoring these patterns not only to the syntagmatic dimension of a written text, but also to the time-line of the different movements and multimodal processes that unfold during reading.

Acknowledgments

The present study has partly been funded by the ReadGround research grant from the National Research Council (CNR), and by the ReMind and Braillet PRIN grants from the Ministry of University and Research (MUR).

Alessandro Lento is a PhD student enrolled in the National PhD in Artificial Intelligence, XXXVII cycle, course on Health and Life Sciences, organized by Università Campus Bio-Medico in Rome.

Nadia Khlif is a PhD student in the Computer Science Research Laboratory, Faculty of Sciences, at the University Mohammed First of Oujda, Morocco.

Andrea Nadalini's work is kindly covered by the "RAISE - Robotics and AI for Socio-economic Empowerment" grant (ECS00000035), funded by the European Union - NextGenerationEU and by the Ministry of University and Research (MUR), National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.5.

References

[1] S. Gerth, J. Festman, Reading development, word length and frequency effects: An eye-tracking study with slow and fast readers, Frontiers in Communication 6 (2021) 743113.
[2] S. Schroeder, T. Häikiö, A. Pagán, J. H. Dickins, J. Hyönä, S. P. Liversedge, Eye movements of children and adults reading in three different orthographies, Journal of Experimental Psychology: Learning, Memory, and Cognition 48 (2022) 1518.
[3] L. Salicchi, E. Chersoni, A. Lenci, A study on surprisal and semantic relatedness for eye-tracking data prediction, Frontiers in Psychology 14 (2023) 1112365.
[4] M. Hirotani, L. Frazier, K. Rayner, Punctuation and intonation effects on clause and sentence wrap-up: Evidence from eye movements, Journal of Memory and Language 54 (2006) 425–443.
[5] E. D. Reichle, K. Rayner, A. Pollatsek, The E-Z Reader model of eye-movement control in reading: Comparisons to other models, Behavioral and Brain Sciences 26 (2003) 445–476.
[6] R. Engbert, A. Nuthmann, E. Richter, R. Kliegl, SWIFT: A dynamical model of saccade generation during reading, Psychological Review 112 (2005) 777–813.
[7] U. Cop, N. Dirix, D. Drieghe, W. Duyck, Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading, Behavior Research Methods 49 (2017) 602–615.
[8] N. Hollenstein, J. Rotsztejn, M. Troendle, A. Pedroni, C. Zhang, N. Langer, ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading, Scientific Data 5 (2018) 180291.
[9] N. Siegelman, S. Schroeder, C. Acartürk, H.-D. Ahn, S. Alexeeva, S. Amenta, R. Bertram, R. Bonandrini, M. Brysbaert, D. Chernova, S. M. Da Fonseca, N. Dirix, W. Duyck, A. Fella, R. Frost, C. A. Gattei, A. Kalaitzi, N. Kwon, K. Lõo, M. Marelli, T. C. Papadopoulos, A. Protopapas, S. Savo, D. E. Shalom, N. Slioussar, R. Stein, L. Sui, A. Taboh, V. Tønnesen, K. A. Usal, V. Kuperman, Expanding horizons of cross-linguistic research on reading: The Multilingual Eye-movement Corpus (MECO), Behavior Research Methods 54 (2022) 2843–2863.
[10] N. Hollenstein, F. Pirovano, C. Zhang, L. Jäger, L. Beinborn, Multilingual language models predict human reading behavior, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 106–123.
[11] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, S. Zhang, M. Suo, P. Tillet, X. Zhao, E. Wang, K. Zhou, R. Zou, X. Wang, A. Mathews, W. Wen, G. Chanan, P. Wu, S. Chintala, PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24, Association for Computing Machinery, 2024, pp. 929–947.
[12] The MathWorks Inc., MATLAB version 9.7.0.1190202 (R2019b), 2019.
[13] BNC Consortium, The British National Corpus, XML edition, 2007.
[14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020.
[15] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners (2019).
[16] J. A. Michaelov, B. K. Bergen, Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?, arXiv preprint arXiv:2208.14554 (2022).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. ArXiv:1810.04805 [cs], version 2.
[18] K. E. Stanovich, Attentional and automatic context effects in reading, in: Interactive Processes in Reading, Routledge, 2017, pp. 241–267.
[19] G. B. Simpson, R. R. Peterson, M. A. Casteel, C. Burgess, Lexical and sentence context effects in word recognition, Journal of Experimental Psychology: Learning, Memory, and Cognition 15 (1989) 88.
[20] K. Rayner, K. H. Chace, T. J. Slattery, J. Ashby, Eye movements as reflections of comprehension processes in reading, Scientific Studies of Reading 10 (2006) 241–255.
[21] N. J. Smith, R. Levy, The effect of word predictability on reading time is logarithmic, Cognition 128 (2013) 302–319.
[22] M. Breen, Empirical investigations of the role of implicit prosody in sentence processing, Language and Linguistics Compass 8 (2014) 37–50.
[23] A. Nadalini, C. Marzi, M. Ferro, L. Taxitari, A. Lento, D. Crepaldi, V. Pirrelli, Eye-voice and finger-voice spans in adults' oral reading of connected texts: Implications for reading research and assessment, The Mental Lexicon (2024). URL: https://benjamins.com/catalog/ml.00025.nad.
[24] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2023. URL: https://www.R-project.org/.

A. GECO FPD data

[Figure 2: A view of FPD data in the GECO dataset, consisting of eye-tracking patterns of 14 adult participants reading the novel "The Mysterious Affair at Styles" by Agatha Christie. Top panel: distributions of FPD data, with chapters grouped into 4 parts, for participant #1 (with 3 more participants showing a similar distribution), participant #2 (with 8 more participants showing a similar distribution) and participant #10. The rightmost box plot shows the average distribution across all 14 participants. Bottom panel: plot of all 56410 tokens in the dataset, in ascending order of mean FPD (dashed black line). For each token, the standard deviation calculated on the distribution of the FPDs of the 14 participants is shown both above and below the mean value (gray dots).]
B. FPROP accuracy

                       test                             training
model             accS      accT      accL        accS      accT      accL
const             2.70%     7.17%    51.44%       2.82%     7.37%    51.71%
                 (0.37%)   (0.70%)   (0.57%)     (0.02%)   (0.04%)   (0.03%)
bert + layer     33.84%    44.86%    86.34%      37.47%    48.84%    87.68%
                 (1.28%)   (0.89%)   (0.15%)     (1.24%)   (1.24%)   (0.28%)
mlp UDT          36.24%    48.75%    86.90%      43.40%    58.64%    89.49%
                 (0.37%)   (0.83%)   (0.21%)     (0.71%)   (0.61%)   (0.09%)
bert + lstm      38.00%    48.46%    87.50%      42.78%    54.78%    89.16%
                 (0.76%)   (1.01%)   (0.43%)     (0.88%)   (0.70%)   (0.12%)
bert FT + layer  36.39%    47.60%    87.00%      75.10%    90.66%    95.28%
                 (1.09%)   (1.23%)   (0.33%)     (1.78%)   (1.85%)   (0.26%)
mlp              38.96%    51.23%    88.10%      39.45%    51.78%    88.34%
                 (1.05%)   (1.08%)   (0.19%)     (0.27%)   (0.15%)   (0.02%)
lstm             37.91%    49.95%    87.93%      39.42%    51.63%    88.34%
                 (0.85%)   (0.78%)   (0.11%)     (0.46%)   (0.42%)   (0.12%)

Table 3: Accuracy values of neural models predicting the fixation probabilities of the GECO dataset. For each model, three different accuracy metrics are used, as described in the paper. The const model was used as a baseline; highest accuracy scores are highlighted in bold; lowest scores are shown in italics.




C. Data analysis

In this section, the coefficients of the Generalised Additive Models (GAMs) are detailed for each neural model. Statistically non-significant p-values of GAM predictor terms are given in bold-face. GAMs are fitted using the package gamm4, version 0.2-6, of the R statistical software [24], as they do not assume a linear relation between the fitted variable and its predictors. All plots were created via the ggplot2 package, version 3.5.

[Figure 3: Effects of surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, and word log-frequency (logFreq) as a smooth term, on human fixation first-pass duration (fixFPD) as a response variable.]

                              Human FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
Intercept (content)      6.960e-02    7.858e-04     88.568     < 2e-16
surprisal                1.928e-03    5.002e-05     38.539     < 2e-16
probMinus1              -1.395e-02    1.363e-03    -10.233     < 2e-16
Intercept (function)    -2.599e-02    1.143e-03    -22.746     < 2e-16
length (content)         1.562e-02    1.423e-04    109.767     < 2e-16
length (function)        5.499e-03    2.791e-04     19.704     < 2e-16
surprisal:probMinus1     4.692e-04    1.776e-04      2.642     < 0.01
s(logFreq)                                                     < 2e-16
R2                       58.4%

Table 4: GAM coefficients fitting human fixation FPD: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

[Figure 4: MLP effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.]
                               MLP FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
training
Intercept (content)      7.252e-02    2.729e-04    265.71      < 2e-16
surprisal                9.028e-04    1.734e-05     52.064     < 2e-16
probMinus1              -1.417e-02    4.723e-04    -29.995     < 2e-16
Intercept (function)    -2.312e-02    3.973e-04    -58.2006    < 2e-16
length (content)         1.651e-02    4.935e-05    334.512     < 2e-16
length (function)        4.324e-03    9.698e-05     44.584     < 2e-16
surprisal:probMinus1     1.810e-04    6.166e-05      2.936     < 0.005
s(logFreq)                                                     < 2e-16
R2                       92.2%
test
Intercept (content)      7.148e-02    1.183e-03     60.42      < 2e-16
surprisal                7.585e-04    7.619e-05      9.956     < 2e-16
probMinus1              -1.061e-02    2.044e-03     -5.188     < 2.2e-07
Intercept (function)    -1.919e-02    1.658e-03    -11.573     < 2e-16
length (content)         1.677e-02    2.136e-04     78.502     < 2e-16
length (function)        3.399e-03    3.963e-04      8.5774    < 2e-16
surprisal:probMinus1    -1.408e-04    2.480e-04     -0.568       0.57
s(logFreq)                                                     < 2e-16
R2                       92.6%

Table 5: GAM coefficients fitting MLP fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

                              LSTM FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
training
Intercept (content)      7.051e-02    3.259e-04    216.317     < 2e-16
surprisal                7.615e-04    2.069e-05     36.802     < 2e-16
probMinus1               2.120e-03    5.644e-04      3.756     < 0.001
Intercept (function)    -1.600e-02    4.778e-04    -33.492     < 2e-16
length (content)         1.649e-02    5.896e-05    279.739     < 2e-16
length (function)        2.801e-03    1.170e-04     23.945     < 2e-16
surprisal:probMinus1    -3.385e-04    7.325e-05     -4.621     < 0.001
s(logFreq)                                                     < 2e-16
R2                       89.6%
test
Intercept (content)      6.812e-02    1.407e-03     48.431     < 2e-16
surprisal                6.837e-04    9.284e-05      7.364     < 2.3e-13
probMinus1               3.293e-03    2.458e-03      1.340       0.18
Intercept (function)    -1.255e-02    1.936e-03     -6.480     < 1.1e-10
length (content)         0.0152041    0.0004032     37.709     < 2e-16
length (function)        0.0042481    0.0007472      5.685     < 1.4e-08
surprisal:probMinus1    -0.0001970    0.0004701     -0.419       0.67
s(logFreq)                                                     < 2e-16
R2                       89.9%

Table 6: GAM coefficients fitting LSTM fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

[Figure 5: LSTM effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.]

[Figure 6: Fine-tuned BERT effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.]
                         BERT+fine-tuning FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
training
Intercept (content)      6.950e-02    8.572e-04     81.075     < 2e-16
surprisal                2.013e-03    5.446e-05     36.9562    < 2e-16
probMinus1              -1.475e-02    1.483e-03     -9.9416    < 2e-16
Intercept (function)    -2.631e-02    1.248e-03    -21.0852    < 2e-16
length (content)         1.570e-02    1.550e-04    101.307     < 2e-16
length (function)        5.528e-03    3.046e-04     18.148     < 2e-16
surprisal:probMinus1     5.024e-04    1.937e-04      2.594     < 0.01
s(logFreq)                                                     < 2e-16
R2                       57.5%
test
Intercept (content)      0.0714503    0.0022332     31.99      < 2e-16
surprisal                0.0014206    0.0001441      9.859     < 2.3e-13
probMinus1              -0.0017461    0.0038742     -0.451       0.65
Intercept (function)    -0.0239773    0.0031336     -7.652     < 2.7e-14
length (content)         1.707e-02    2.499e-04     68.321     < 2e-16
length (function)        1.579e-03    4.627e-04      3.411     < 0.001
surprisal:probMinus1    -5.244e-04    3.561e-04     -1.473       0.14
s(logFreq)                                                     < 2e-16
R2                       78.4%

Table 7: GAM coefficients fitting BERT+fine-tuning fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

                              BERT FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
training
Intercept (content)      9.626e-02    4.765e-04    202.020     < 2e-16
surprisal                1.319e-03    3.027e-05     43.586     < 2e-16
probMinus1              -4.998e-03    8.245e-04     -6.0616    < 1.3e-09
Intercept (function)    -2.293e-02    6.937e-04    -33.053     < 2e-16
length (content)         1.019e-02    8.616e-05    118.232     < 2e-16
length (function)        2.892e-03    1.693e-04     17.0848    < 2e-16
surprisal:probMinus1    -3.874e-04    1.077e-04     -3.599     < 0.001
s(logFreq)                                                     < 2e-16
R2                       75.6%
test
Intercept (content)      0.0960782    0.0021829     44.014     < 2e-16
surprisal                0.0012786    0.0001409      9.073     < 2.3e-13
probMinus1              -0.0013508    0.0037907     -0.356       0.72
Intercept (function)    -0.0192904    0.0030629     -6.298     < 3.4e-10
length (content)         0.0102735    0.0003941     26.069     < 2e-16
length (function)        0.0027876    0.0007299      3.819     < 0.001
surprisal:probMinus1    -0.0008111    0.0004600     -1.763       0.08
s(logFreq)                                                     < 2e-16
R2                       73.5%

Table 8: GAM coefficients fitting BERT fixation FPD for the training (top) and test (bottom) settings: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

[Figure 7: Untuned BERT effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.]