Comparative Evaluation of Computational Models Predicting Eye Fixation Patterns During Reading: Insights from Transformers and Simpler Architectures

Alessandro Lento (1,2), Andrea Nadalini (1), Nadia Khlif (1,3), Vito Pirrelli (1), Claudia Marzi (1) and Marcello Ferro (1,*)

(1) Consiglio Nazionale delle Ricerche, Istituto di Linguistica Computazionale "A. Zampolli", Pisa, Italy
(2) Università Campus Bio-Medico, Roma, Italy
(3) University Mohammed First, Oujda, Morocco

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
(*) Corresponding author.
Emails: alessandro.lento@ilc.cnr.it (A. Lento); andrea.nadalini@ilc.cnr.it (A. Nadalini); nadia.khlif@ilc.cnr.it (N. Khlif); vito.pirrelli@ilc.cnr.it (V. Pirrelli); claudia.marzi@ilc.cnr.it (C. Marzi); marcello.ferro@ilc.cnr.it (M. Ferro)
ORCID: 0000-0002-5581-7451 (V. Pirrelli); 0000-0002-3427-2827 (C. Marzi); 0000-0002-1324-3699 (M. Ferro)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                                                Abstract
                                                Eye tracking records of natural text reading are known to provide significant insights into the cognitive processes underlying
                                                word processing and text comprehension, with gaze patterns, such as fixation duration and saccadic movements, being
                                                modulated by morphological, lexical, and higher-level structural properties of the text being read. Although some of these
                                                effects have been simulated with computational models, it is still not clear how accurately computational modelling can predict
                                                complex fixation patterns in connected text reading. State-of-the-art neural architectures have shown promising results, with
                                                pre-trained transformer-based classifiers having recently been claimed to outperform other competitors, achieving beyond
                                                95% accuracy. However, transformer-based models have neither been compared with alternative architectures nor adequately
                                                evaluated for their sensitivity to the linguistic factors affecting human reading. Here we address these issues by evaluating the
                                                performance of a pool of neural networks in classifying eye-fixation English data as a function of both lexical and contextual
                                                factors. We show that i) accuracy of transformer-based models has largely been overestimated, ii) other simpler models make
                                                comparable or even better predictions, iii) most models are sensitive to some of the major lexical factors accounting for at
                                                least 50% of human fixation variance, iv) most models fail to capture some significant context-sensitive interactions, such
                                                as those accounting for spillover effects in reading. The work shows the benefits of combining accuracy-based evaluation
                                                metrics with non-linear regression modelling of fixed and random effects on both real and simulated eye-tracking data.

                                                Keywords
                                                eye-tracking, eye fixation time prediction, neural network, contextual word embeddings, lexical features



1. Introduction

Eye-tracking records of natural text reading are a valuable window on the cognitive processes underlying word processing and text comprehension. By looking at fixation patterns it is possible to estimate the effects that lexical properties (e.g. length, frequency, orthographic similarity [1, 2]), contextual constraints (e.g. predictability [3]) and higher-level structures (e.g. syntactic structure or prosodic contour [4]) have on human word identification and processing. While psycholinguistic experiments have reliably assessed how such effects modulate reading times, it is not clear to what extent computational models of reading can simulate actual behavioural data such as gaze patterns and fixation durations.

Over the past 30 years, research in this field has made considerable progress, leading to the development of sophisticated computational models accounting for fine-grained aspects of eye movement behaviour during word and sentence reading (e.g. E-Z Reader [5], SWIFT [6]). A significant boost in this area came from large eye-tracking corpora of natural reading (e.g. GECO [7], ZuCo [8], MECO [9]), which allow (deep) learning models to be tested on the task of predicting eye-tracking metrics. Of late, Hollenstein and colleagues [10] reported that fine-tuned, pre-trained transformer language models can make reliable predictions on a wide range of eye-tracking measurements, covering both early and late stages of lexical processing. The evidence suggests that transformers can inherently encode the relative prominence of language units in a text, in ways that accurately replicate human reading skills and their underlying cognitive mechanisms. Although the accuracy of multilingual transformers is validated against eye-tracking evidence from different languages, the paper neither compares the performance of transformers with that of other neural network classifiers trained on the same task, nor does it show what specific knowledge is encoded and put to use by transformers, by looking at the factors affecting their behaviour.




In the present paper, we address both issues by assessing the performance of a pool of neural network classifiers on the English batch of Hollenstein et al.'s [10] data.

In what follows, we first describe the English data set and the pool of tested classifiers. Classifiers were selected so as to include either simpler neural architectures than transformers (as is the case with multi-layer perceptrons) or cognitively more plausible processing models (i.e. sequential long short-term memories). Hybrid models, resulting from the combination of different architectures, were also tested. We then move on to discussing the metrics used in [10] for evaluation, and suggest alternative ways to measure accuracy in a fixation prediction task. Finally, we investigate how sensitive each tested architecture is to a few linguistic factors that are known to account for a sizeable amount of variance in human reading gaze patterns. Although some neural networks turn out to be reasonably good at predicting fixation patterns and at replicating some robust psycholinguistic effects found in human data, it is still unclear whether this ability is due to specific aspects of their architecture, to the type of information they are provided with in input, or to their space of trainable parameters. We conclude that, contrary to recent over-enthusiastic reports, predicting eye-fixation patterns of human natural reading is still a big challenge for currently available neural architectures, including transformer-based ones. For this very reason, we contend that the task is key to understanding the inductive bias of these models, as well as to assessing their cognitive plausibility as models of language behaviour.

2. Data and Experiments

All models described in the following paragraphs were trained, validated, and tested on data from the GECO corpus [7]. We used a 5-fold cross-validation with 95% training, 5% validation and 5% test data. Experiments were conducted using the PyTorch library [11] in Python or in MATLAB [12].

2.1. Dataset

The GECO corpus [7] contains data from 14 English native speakers whose eye movements were recorded while reading Agatha Christie's novel "The Mysterious Affair at Styles" (56410 tokens). Out of the eight word-level eye-tracking measurements used in [10], we focused on i) first-pass duration (FPD), the time spent fixating a word the first time it is encountered, averaged over subjects (see Fig. 2), and ii) fixation proportion (FPROP) or probability, the number of subjects that fixated a word divided by the total number of subjects.

Word tokens in the original dataset were encoded with linguistic information including:

i) character length (removing punctuation)
ii) log frequency (source: BNC [13])
iii) part-of-speech tag (source: Stanza [14])
iv) context surprisal/predictability (source: GPT-2 [15, 16, 3]; see the sketch after this list)
v) distance from the beginning of the sentence (number of intervening tokens)
vi) distance from the end of the sentence (number of intervening tokens)
vii) presence of heavy punctuation after the token
viii) presence of light punctuation after the token.
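As an illustration of how a surprisal feature such as (iv) can be obtained, the following is a minimal sketch of per-token surprisal extraction with GPT-2, assuming the Hugging Face transformers implementation of the model; it is not the exact pipeline used to build the dataset.

```python
# Minimal sketch of GPT-2 surprisal extraction (feature iv). Assumes the
# Hugging Face "gpt2" checkpoint; the dataset's actual pipeline may differ.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(sentence: str) -> list[tuple[str, float]]:
    """Surprisal (in bits) of each GPT-2 subword token given its left context.
    Word-level surprisal can be obtained by summing over a word's subtokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    ids = enc.input_ids[0]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(**enc).logits[0], dim=-1)
    out = []
    for pos in range(1, len(ids)):       # the first token has no left context
        s = -log_probs[pos - 1, ids[pos]].item() / math.log(2)
        out.append((tokenizer.decode(int(ids[pos])), s))
    return out

print(token_surprisals("The butler opened the door without a word."))
```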
2.2. BERT++

To replicate results from [10], we used BERT [17] with a linear layer on top of it. The linear layer takes BERT contextual word embeddings as input, to predict FPD and FPROP.

After sentence padding and tokenization, irrelevant and special subtokens were masked out, so as to enforce a correspondence between each vector in the target sequence and each vector in the output sequence, and to compute the training loss only on relevant tokens. Mean Square Error (MSE) loss was used along with the AdamW optimizer (with no weight decay for the biases). The initial learning rate was set to 5·10⁻⁵, and a linear scheduler was used. We used a batch size of 16 sentences and 100 training epochs, with an early stopping criterion (best model on the validation set). The model was trained both with fine-tuning (i.e. by also training BERT internal weights: bert FT + layer) and without fine-tuning (by only training the final layer weights: bert + layer).

Finally, we also used BERT in combination with a sequential LSTM network. This model (bert + lstm) takes the pre-trained BERT contextual word embeddings (i.e. without fine-tuning) as input, along with the lexical features (i), (ii) and (iv), to predict FPD and FPROP.
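The following is a minimal PyTorch sketch of the bert (FT) + layer setup just described: a linear regression head on top of BERT token embeddings, with the loss restricted to word-aligned positions. The checkpoint name, module names and masking details are our assumptions, not the authors' code; the quoted hyper-parameters follow the text.

```python
# Sketch of the "bert (FT) + layer" regressor described above. Checkpoint name
# and masking details are assumptions; lr and optimizer follow the text.
import torch
import torch.nn as nn
from transformers import BertModel

class BertFixationRegressor(nn.Module):
    def __init__(self, fine_tune: bool = False, n_targets: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        if not fine_tune:                      # "bert + layer": BERT frozen
            for p in self.bert.parameters():
                p.requires_grad = False
        self.head = nn.Linear(self.bert.config.hidden_size, n_targets)  # FPD, FPROP

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)               # (batch, seq_len, n_targets)

def masked_mse(pred, target, relevant_mask):
    """MSE restricted to relevant (non-special, word-initial) subtoken positions."""
    err = (pred - target) ** 2 * relevant_mask.unsqueeze(-1)
    return err.sum() / (relevant_mask.sum() * pred.size(-1))

model = BertFixationRegressor(fine_tune=True)   # "bert FT + layer"
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5)
```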
2.3. LSTM

Reading is inherently sequential. Recurrent neural networks thus appear to offer a promising approach to modelling a fixation prediction task, and a good alternative to transformers. Using the GECO dataset split into pages rather than sentences, we trained an LSTM with 96 hidden units and a single layer, with a feed-forward network using tanh activation functions on top of it. The model (lstm) takes as input the lexical features (i)-(iv) for the target token, the 4 tokens to its left and the 3 tokens to its right, to predict FPD and FPROP of the target token. MSE loss was used along with the AdamW optimizer. The initial learning rate was set to 5·10⁻³, with a linear scheduler and a single batch containing the entire training dataset. The model was trained for 3000 epochs with an early stopping criterion (best model on the validation set).
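A minimal PyTorch sketch of the lstm model just described follows; the window layout and the width of the feed-forward head are our assumptions, while the quoted sizes and optimizer settings follow the text.

```python
# Sketch of the "lstm" model described above: single-layer LSTM with 96 hidden
# units, tanh feed-forward head, AdamW (lr = 5e-3), MSE loss.
import torch
import torch.nn as nn

class LstmFixationModel(nn.Module):
    def __init__(self, n_features: int = 4, hidden: int = 96, n_targets: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            num_layers=1, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, n_targets))

    def forward(self, x):
        # x: (batch, 8, n_features) = 4 left tokens, the target, 3 right tokens
        out, _ = self.lstm(x)
        return self.head(out[:, 4])     # hidden state aligned with the target

model = LstmFixationModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)
loss_fn = nn.MSELoss()
pred = model(torch.randn(2, 8, 4))      # toy forward pass: (2, 2) outputs
```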

2.4. MLP

A multi-layer perceptron (mlp) was trained using the entire set of lexical features (i)-(viii) as input, with an input context consisting of the two words immediately preceding and following the target word. Several instances of this architecture were tested, but only the results of the best-performing instance (with a single hidden layer of 10 units, sigmoidal activation functions, the Adam optimiser, the MSE loss, a constant learning rate of 0.1, and 1000 training epochs) are reported here.

An identical MLP model (mlp UDT) was eventually trained on a subset of the GECO training data, obtained by sampling target features uniformly. This was done to train the network with an equal number of tokens for each bin of fixation times, and to assess the impact of different distributions of input data on the network's performance on test data.
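A minimal PyTorch sketch of the mlp architecture just described follows; the input layout (eight features for the target word and for its immediate neighbours) is our reading of the text, not the authors' code.

```python
# Sketch of the "mlp" model described above: one hidden layer of 10 sigmoid
# units, Adam, MSE, lr = 0.1 (as in the text). The 3-word / 8-feature input
# layout is an assumption about how the context window is encoded.
import torch
import torch.nn as nn

N_FEATURES, N_WORDS = 8, 3          # features (i)-(viii) for target ± 1 word

mlp = nn.Sequential(
    nn.Linear(N_WORDS * N_FEATURES, 10),
    nn.Sigmoid(),
    nn.Linear(10, 2),               # outputs: FPD and FPROP
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
pred = mlp(torch.randn(16, N_WORDS * N_FEATURES))   # toy batch of 16 tokens
```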
2.5. Evaluation

We evaluated the performance of all our models using three accuracy metrics based on the absolute error between the predicted value o_i and the target value t_i on the i-th token of the GECO dataset:

    e_i = |o_i − t_i|

Loss accuracy (accL) is a measure of the overall similarity between predicted and target values, calculated as the complement to 1 of the Mean Absolute Error (MAE) after fitting the target data t_i in the training set into the [0, 1] range with min-max scaling:

    accL(set) = 1 − (1/N_set) · Σ_{i ∈ set} ê_i

where ê_i = |ô_i − t̂_i|, t̂_i = t_i / max_{j ∈ training set} t_j, and ô_i is the model prediction for t̂_i. Loss accuracy is the metric used in [10].

Threshold accuracy (accT) measures how often the predicted value is close to the target value within a fixed threshold, and is calculated as follows:

    accT(set) = 1 − (1/N_set) · Σ_{i ∈ set} θ[e_i − ε]

Sensitivity accuracy (accS) counts how often the predicted value is close to the target value within a threshold dynamically calculated on the basis of the target value: the higher the target value, the higher the threshold. An offset value is needed to obtain a positive threshold also for zero target values. The metric is calculated as follows:

    accS(set) = 1 − (1/N_set) · Σ_{i ∈ set} θ[e_i − (α · t_i + ε)]

where N_set is the number of examples in the training/test set, θ is the Heaviside step function, ε is a threshold and α is a sensitivity coefficient.

As for FPD, which is a duration expressed in seconds, we used ε = 25 ms and α = 10% for accS, and ε = 50 ms for accT. As for FPROP, which is a probability, we used ε = 0.01 and α = 10% for accS, and ε = 0.1 for accT.

Finally, the performance of our models was compared against a baseline model (const) that always outputs the overall mean fixation duration (across both subjects and items) in the training data.
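A compact NumPy sketch of the three metrics and of the const baseline, reconstructed from the definitions above (the defaults use the FPD settings):

```python
# NumPy sketch of accL, accT, accS and the const baseline, reconstructed from
# the definitions above; eps/alpha defaults are the FPD settings (seconds).
import numpy as np

def acc_L(pred, target, target_max):
    """Loss accuracy: 1 - MAE after scaling by the training-set maximum."""
    return 1.0 - np.mean(np.abs(pred - target) / target_max)

def acc_T(pred, target, eps=0.050):
    """Threshold accuracy: share of tokens with |error| below a fixed eps."""
    return np.mean(np.abs(pred - target) < eps)

def acc_S(pred, target, eps=0.025, alpha=0.10):
    """Sensitivity accuracy: share of tokens with |error| below alpha*t + eps."""
    return np.mean(np.abs(pred - target) < alpha * target + eps)

# const baseline: predict the mean training FPD for every test token
train_fpd = np.array([0.12, 0.20, 0.31, 0.08])        # toy FPDs in seconds
test_fpd = np.array([0.18, 0.25, 0.05])
const_pred = np.full_like(test_fpd, train_fpd.mean())
print(acc_L(const_pred, test_fpd, train_fpd.max()),
      acc_T(const_pred, test_fpd),
      acc_S(const_pred, test_fpd))
```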
3. Results

Models' results for FPD prediction are summarised in Table 1 and plotted in Fig. 1. The accL results reported in [10] for bert FT + layer are essentially replicated. However, being a simple average over all test instances, accL is blind to error magnitude, as well as to the possible presence of prediction biases for specific ranges of fixation values. Note that the const model, which predicts the same average FPD for every token in the test set, scores a flattering 95.68% on accL, vs. 36.97% on accS and 48.10% on accT.

Table 2 summarises accS values of all models after binning tokens into three FPD ranges.

                       test                             training
model             accS      accT      accL        accS      accT      accL
const            36.97%    48.10%    95.68%      37.07%    48.06%    95.69%
                 (0.83%)   (1.00%)   (0.05%)     (0.04%)   (0.05%)   (0.00%)
bert + layer     55.02%    67.82%    97.05%      58.11%    70.74%    97.25%
                 (0.86%)   (0.99%)   (0.05%)     (0.82%)   (0.70%)   (0.05%)
mlp UDT          56.41%    67.79%    96.21%      61.21%    72.37%    96.52%
                 (0.35%)   (0.79%)   (1.25%)     (0.95%)   (0.57%)   (1.08%)
bert + lstm      58.49%    70.01%    95.38%      63.64%    75.89%    95.90%
                 (0.91%)   (0.82%)   (0.07%)     (0.48%)   (0.77%)   (0.97%)
bert FT + layer  57.80%    70.03%    97.23%      93.18%    94.81%    98.80%
                 (1.02%)   (1.13%)   (0.05%)     (0.81%)   (0.71%)   (0.05%)
mlp              60.16%    73.05%    97.39%      60.63%    73.31%    97.40%
                 (0.85%)   (0.78%)   (0.04%)     (0.37%)   (0.24%)   (0.01%)
lstm             60.01%    73.18%    97.39%      61.66%    74.27%    97.45%
                 (0.38%)   (0.31%)   (0.03%)     (0.24%)   (0.19%)   (0.01%)

Table 1: Overall FPD prediction accuracy in the GECO dataset. For each model, three different accuracy scores are given, as described in the text; const is used as a baseline; highest accuracies in bold; lowest accuracies in italics.

[Figure 1: Models' predictions (red dots) plotted with target FPD values (black dots), after ordering tokens by increasing FPD. Grey dots represent averaged FPD values plus/minus their standard deviation across participants. Left: training data. Right: test data. From top to bottom: MLP, LSTM, fine-tuned BERT. For each plot, the Spearman-ρ correlation coefficient between predicted and target values is shown along with its significance level (mlp: ρ=0.79 training, ρ=0.78 test; lstm: ρ=0.79 training, ρ=0.78 test; bert FT + layer: ρ=0.98 training, ρ=0.78 test).]

          3-bin FPD accuracy on test
model               low      medium      high
const              0.00%     41.08%     0.00%
bert + layer      21.43%     58.98%    23.02%
mlp UDT           52.33%     56.91%    51.49%
bert + lstm       24.19%     62.17%    26.61%
bert FT + layer   32.86%     62.65%    31.65%
mlp               11.77%     64.38%    32.62%
lstm              19.05%     64.26%    29.45%

Table 2: Sensitivity accuracy (accS) values for three bins from the FPD distribution: low (FPD below the 5th percentile = 36 ms), medium (FPD ranging from the 5th to the 95th percentile), and high (FPD above the 95th percentile = 280 ms).

4. Data analysis

To what extent are neural network models sensitive to some of the factors accounting for gaze patterns in human natural reading? Are language models able to adapt to both the lexical properties and the in-context features of a text being read, thus exhibiting human-like performance?

Human reading behaviour is known to be affected by lexical features – e.g. word length and frequency, and morphological complexity – as well as by contextual factors, with a facilitatory effect of contextual redundancy and predictability [18, 19] on reading duration and eye fixations. Accordingly, we modelled human FPDs as a response variable resulting from the interaction of both lexical and contextual predictors: namely, word length, a dichotomous classification of token POS into content versus function words, surprisal of the target word as a measure of how unexpected or unpredictable the word is, and the probability of the word immediately preceding the target word in context (to account for so-called spill-over effects). Additionally, we used a Generalised Additive Model (GAM), with token log-frequency as a smooth term, to model possibly non-linear effects of predictors. Models' coefficients and effect plots are shown in Appendix C (Figure 3 and Table 4).

GAMs with identical independent variables were run to model the FPDs predicted by all our neural networks, on both training and test data. Inspection of effect plots and model coefficients – as reported in Appendix C – shows a behavioural alignment of all models with human data as far as the modulation of fixation times by lexical features is concerned, in both training and test data.
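The analysis itself was run in R with gamm4 (see Appendix C); the following is a rough Python analogue of the model formula FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq), assuming the pyGAM package and encoding the interactions as explicit product columns. Column names are hypothetical.

```python
# Rough Python analogue (not the authors' R/gamm4 code) of the GAM
# FPD ~ surprisal * probMinus1 + POSgroup * length + s(logFreq),
# assuming the pyGAM package; interactions are explicit product columns.
import numpy as np
import pandas as pd
from pygam import LinearGAM, l, s

def fit_fpd_gam(df: pd.DataFrame) -> LinearGAM:
    """df columns (hypothetical): fpd (seconds), surprisal, probMinus1,
    length, logFreq, is_function (0 = content word, 1 = function word)."""
    X = np.column_stack([
        df["surprisal"], df["probMinus1"],
        df["surprisal"] * df["probMinus1"],      # surprisal:probMinus1
        df["is_function"], df["length"],
        df["is_function"] * df["length"],        # POSgroup:length
        df["logFreq"],
    ])
    gam = LinearGAM(l(0) + l(1) + l(2) + l(3) + l(4) + l(5) + s(6))
    return gam.fit(X, df["fpd"].to_numpy())

# usage: gam = fit_fpd_gam(tokens_df); gam.summary()
```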
In contrast, all models fail to capture some contextual effects on test data, such as those observed in a context window of – at least – two adjacent words. To illustrate, efficient syntactic chunking (e.g. of noun, verb and prepositional phrases) has been shown to lead to faster and more accurate human reading (see, for example, [20]). Conversely, most neural networks show no statistically significant effect on fixation duration of the probability of the immediately preceding word in context. This holds either in isolation (probMinus1), for LSTMs and for transformer-based models with BERT representations (either fine-tuned or not), or in interaction with the unpredictability of the target word (surprisal:probMinus1). The evidence shows that most neural models cannot replicate, among other things, so-called spillover effects of the left context on the reading time of ensuing words [21].

5. General Discussion

Transformer-based neural networks appear to predict reasonably well the fixation probability and first-pass duration of words in human reading of English connected texts. Our present investigation basically supports this conclusion, while providing new evidence on two related questions that naturally arise in this context. How accurate are transformer-based predictions compared with the best predictions of other neural network classifiers trained on the same task? How cognitively plausible are the mechanisms underpinning this performance? Here, we addressed both questions by testing various models on the task of predicting human reading measurements from the GECO corpus, using different evaluation metrics and regressing network predictions on a few linguistic factors that are known to account for human reading behaviour.

Our first observation is that assessing a network's performance by looking at its MAE loss function provides a rather gross evaluation of how effectively a neural network simulates human reading behaviour. A baseline model assigning each token a constant gaze duration, equal to the average of all FPD values attested in GECO, achieves a 95.7% loss-based accuracy on both test and training data. That a transformer-based classifier scores 97.2% on the same metric and the same test data cannot be taken, as such, as a sign of outstanding performance. In fact, it turns out that the MAE loss function is blind both to the magnitude of a network's errors and to possible biases in the prediction of very low/high target values. Thus, it provides an inflated estimate of a model's accuracy. We suggest that binary evaluation metrics based on a fixed threshold partially overcome these limitations. Yet, as single-word fixation times typically range from tens to hundreds of milliseconds, applying a fixed threshold will affect tokens with different fixation times differently. We conclude that a relative threshold based on each word's fixation time is a fairer way to measure prediction accuracy. Clearly, this comes at a cost: when assessed with a relative threshold, the accuracy of a transformer-based architecture on test data drops from 70% down to 57.8%.

It turned out that all other network models tested for the present purposes showed accuracy levels that are comparable to that of a transformer-based architecture. Since the former are trained on a more restricted set of lexical and contextual input features than the latter, this seems to suggest that word embeddings are of limited use in the task at hand. Although fine-tuned word embeddings actually appear to score much higher on training data (even using accT and accS), we observe that this is due to data overfitting, as clearly shown by the considerably poorer performance of the fine-tuned model on test data.

An analysis of the psychometric plausibility of the gaze patterns simulated with our neural models reveals that a relatively small set of linguistic factors known to account for a sizeable amount of variance in human fixation times can also account for the bulk of variance in the models' behaviour. This is relatively unsurprising, as most of these models were trained on input features that encode at least some of these factors. Nonetheless, we believe that the result is interesting for at least two reasons. First, it shows a promising convergence between computational metrics of model accuracy and quantitative models of psychometric assessment. Secondly, it suggests that one can gain non-trivial insights into a model's behaviour by analysing to what extent that behaviour is sensitive to the same linguistic factors human readers are known to be sensitive to. On the one hand, this is a step towards understanding what information a neural model is actually learning and putting to use for the task. On the other hand, it is instrumental in developing better models, as it shows what type of input information is most needed to successfully carry out a task, at least if one is trying to simulate the way the same task is carried out by speakers.

In the end, it may well be the case that a 70% fixed-threshold accuracy in simulating average gaze patterns in human reading is not as disappointing as it might seem. Given the wide variability in human reading behaviour (and even within a single reader confronted with different texts), a considerable amount of variance in our data may simply be accounted for by by-subject (or by-token) random effects. In some experiments not reported here, we trained our models to predict single-reader behaviour. All architectures fared rather poorly on that task, a result which is in line with similarly disappointing results on other output features reported in [10]. Looking back at Figure 1, it can be noted that all models' predictions fall within a μ_i ± σ_i range, where μ_i and σ_i are, respectively, the by-reader mean and standard deviation of FPD values for token i (see also Table 2). This pattern may suggest that models' predictions are in fact bounded by the standard deviation we observe in human behaviour and cannot reach beyond these bounds. Conversely, this evidence may be interpreted as suggesting that more input features are needed to build more accurate classifiers. Further experiments are needed to test the merits of either conjecture.

6. Limitations and outlook

In the present paper, we replicated recent experimental results on transformer-based architectures simulating word fixation duration in the reading of a connected text [10], with a view to assessing their performance relative to human reading times and to other neural architectures. This justifies our exclusive focus on fixation duration, which is, admittedly, only one behavioural correlate of a complex, inherently multimodal task such as reading. In fact, reading requires the fine coordination of eye movements and articulatory movements for text decoding and comprehension. The eye provides access to the visual stimuli needed for voice articulation to unfold at a relatively constant rate. In turn, articulation can feed back into oculomotor control, so that eye movements are directed when and where processing difficulties arise. Incidentally, this is also true of silent reading, as shown by evidence supporting the Implicit Prosody Hypothesis
[22], i.e. the idea that, in silent reading, readers activate prosodic representations that are similar to those they would produce when reading the text aloud. Hence, a reader must always rely on a tight control strategy to ensure that fixation and articulation are optimally coordinated.

A clear limitation of our current work and of all the experiments reported here is that we are only focusing on one dimension of a complex, multimodal behaviour like reading. Recently, we showed that there is a lot about gaze patterns that we can understand by correlating eye movements with voice articulation [23]. This information, which cannot be represented in a dataset structured at the word level, may be critical for a model to accurately learn and mimic the cognitive mechanisms underlying natural reading. Likewise, as correctly pointed out by one of our reviewers, focusing on fixation times while ignoring saccadic movements may seriously detract from the explanatory power of any computational model of human reading. In fact, this would be tantamount to timing a bike rider's speed while ignoring whether she is climbing up a hill or approaching a sharp turn. More realistic models of reading are bound to include more aspects of reading behaviour in more ecologically valid tasks. In the end, it may well be the case that the task of predicting gaze patterns of human reading should be conceptualized differently, by anchoring these patterns not only to the syntagmatic dimension of a written text, but also to the time-line of the different movements and multimodal processes that unfold during reading.

Acknowledgments

The present study has partly been funded by the ReadGround research grant from the National Research Council (CNR), and by the ReMind and Braillet PRIN grants from the Ministry of University and Research (MUR).

Alessandro Lento is a PhD student enrolled in the National PhD in Artificial Intelligence, XXXVII cycle, course on Health and Life Sciences, organized by Università Campus Bio-Medico in Rome.

Nadia Khlif is a PhD student in the Computer Science Research Laboratory, Faculty of Sciences, at the University Mohammed First of Oujda, Morocco.

Andrea Nadalini's work is kindly covered by the "RAISE - Robotics and AI for Socio-economic Empowerment" grant (ECS00000035), funded by the European Union - NextGenerationEU and by the Ministry of University and Research (MUR), National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.5.

References

[1] S. Gerth, J. Festman, Reading development, word length and frequency effects: An eye-tracking study with slow and fast readers, Frontiers in Communication 6 (2021) 743113.
[2] S. Schroeder, T. Häikiö, A. Pagán, J. H. Dickins, J. Hyönä, S. P. Liversedge, Eye movements of children and adults reading in three different orthographies, Journal of Experimental Psychology: Learning, Memory, and Cognition 48 (2022) 1518.
[3] L. Salicchi, E. Chersoni, A. Lenci, A study on surprisal and semantic relatedness for eye-tracking data prediction, Frontiers in Psychology 14 (2023) 1112365.
[4] M. Hirotani, L. Frazier, K. Rayner, Punctuation and intonation effects on clause and sentence wrap-up: Evidence from eye movements, Journal of Memory and Language 54 (2006) 425–443.
[5] E. D. Reichle, K. Rayner, A. Pollatsek, The E-Z Reader model of eye-movement control in reading: Comparisons to other models, Behavioral and Brain Sciences 26 (2003) 445–476.
[6] R. Engbert, A. Nuthmann, E. Richter, R. Kliegl, SWIFT: A dynamical model of saccade generation during reading, Psychological Review 112 (2005) 777–813.
[7] U. Cop, N. Dirix, D. Drieghe, W. Duyck, Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading, Behavior Research Methods 49 (2017) 602–615.
[8] N. Hollenstein, J. Rotsztejn, M. Troendle, A. Pedroni, C. Zhang, N. Langer, ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading, Scientific Data 5 (2018) 180291.
[9] N. Siegelman, S. Schroeder, C. Acartürk, H.-D. Ahn, S. Alexeeva, S. Amenta, R. Bertram, R. Bonandrini, M. Brysbaert, D. Chernova, S. M. Da Fonseca, N. Dirix, W. Duyck, A. Fella, R. Frost, C. A. Gattei, A. Kalaitzi, N. Kwon, K. Lõo, M. Marelli, T. C. Papadopoulos, A. Protopapas, S. Savo, D. E. Shalom, N. Slioussar, R. Stein, L. Sui, A. Taboh, V. Tønnesen, K. A. Usal, V. Kuperman, Expanding horizons of cross-linguistic research on reading: The Multilingual Eye-movement Corpus (MECO), Behavior Research Methods 54 (2022) 2843–2863.
[10] N. Hollenstein, F. Pirovano, C. Zhang, L. Jäger, L. Beinborn, Multilingual language models predict human reading behavior, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 106–123.
[11] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, S. Zhang, M. Suo, P. Tillet, X. Zhao, E. Wang, K. Zhou, R. Zou, X. Wang, A. Mathews, W. Wen, G. Chanan, P. Wu, S. Chintala, PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24, Association for Computing Machinery, 2024, pp. 929–947.
[12] The MathWorks Inc., MATLAB version 9.7.0.1190202 (R2019b), 2019.
[13] BNC Consortium, The British National Corpus, XML edition, 2007.
[14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020.
[15] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners (2019).
[16] J. A. Michaelov, B. K. Bergen, Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?, arXiv preprint arXiv:2208.14554 (2022).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. ArXiv:1810.04805 [cs], version 2.
[18] K. E. Stanovich, Attentional and automatic context effects in reading, in: Interactive Processes in Reading, Routledge, 2017, pp. 241–267.
[19] G. B. Simpson, R. R. Peterson, M. A. Casteel, C. Burgess, Lexical and sentence context effects in word recognition, Journal of Experimental Psychology: Learning, Memory, and Cognition 15 (1989) 88.
[20] K. Rayner, K. H. Chace, T. J. Slattery, J. Ashby, Eye movements as reflections of comprehension processes in reading, Scientific Studies of Reading 10 (2006) 241–255.
[21] N. J. Smith, R. Levy, The effect of word predictability on reading time is logarithmic, Cognition 128 (2013) 302–319.
[22] M. Breen, Empirical investigations of the role of implicit prosody in sentence processing, Language and Linguistics Compass 8 (2014) 37–50.
[23] A. Nadalini, C. Marzi, M. Ferro, L. Taxitari, A. Lento, D. Crepaldi, V. Pirrelli, Eye-voice and finger-voice spans in adults' oral reading of connected texts: Implications for reading research and assessment, The Mental Lexicon (2024). URL: https://benjamins.com/catalog/ml.00025.nad.
[24] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2023. URL: https://www.R-project.org/.

A. GECO FPD data

[Figure 2: A view of FPD data in the GECO dataset, consisting of eye-tracking patterns of 14 adult participants reading the novel "The Mysterious Affair at Styles" by Agatha Christie. Top panel: distributions of FPD data, with chapters grouped into 4 parts, for participant #1 (with 3 more participants showing a similar distribution), participant #2 (with 8 more participants showing a similar distribution) and participant #10. The rightmost box plot shows the average distribution across all 14 participants. Bottom panel: plot of all 56410 tokens in the dataset, in ascending order of mean FPD (dashed black line). For each token, the standard deviation calculated on the distribution of the FPDs of the 14 participants is shown both above and below the mean value (gray dots).]
B. FPROP accuracy

                       test                             training
model             accS      accT      accL        accS      accT      accL
const             2.70%     7.17%    51.44%       2.82%     7.37%    51.71%
                 (0.37%)   (0.70%)   (0.57%)     (0.02%)   (0.04%)   (0.03%)
bert + layer     33.84%    44.86%    86.34%      37.47%    48.84%    87.68%
                 (1.28%)   (0.89%)   (0.15%)     (1.24%)   (1.24%)   (0.28%)
mlp UDT          36.24%    48.75%    86.90%      43.40%    58.64%    89.49%
                 (0.37%)   (0.83%)   (0.21%)     (0.71%)   (0.61%)   (0.09%)
bert + lstm      38.00%    48.46%    87.50%      42.78%    54.78%    89.16%
                 (0.76%)   (1.01%)   (0.43%)     (0.88%)   (0.70%)   (0.12%)
bert FT + layer  36.39%    47.60%    87.00%      75.10%    90.66%    95.28%
                 (1.09%)   (1.23%)   (0.33%)     (1.78%)   (1.85%)   (0.26%)
mlp              38.96%    51.23%    88.10%      39.45%    51.78%    88.34%
                 (1.05%)   (1.08%)   (0.19%)     (0.27%)   (0.15%)   (0.02%)
lstm             37.91%    49.95%    87.93%      39.42%    51.63%    88.34%
                 (0.85%)   (0.78%)   (0.11%)     (0.46%)   (0.42%)   (0.12%)

Table 3: Accuracy values of neural models predicting the fixation probabilities of the GECO dataset. For each model, three different accuracy metrics are used, as described in the paper. The const model was used as a baseline; highest accuracy scores are highlighted in bold; lowest scores are shown in italics.




C. Data analysis

In this section, the coefficients of the Generalised Additive Models (GAMs) are detailed for each neural model. Statistically non-significant p-values of GAM predictor terms are given in bold-face. GAMs are fitted using the package gamm4, version 0.2-6, of the R statistical software [24], as they do not assume a linear relation between the fitted variable and its predictors. All plots were created via the ggplot2 package, version 3.5.

[Figure 3: Effects of surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, and word log-frequency (logFreq) as a smooth term, on human fixation first-pass duration (fixFPD) as a response variable.]

                              Human FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
Intercept (content)      6.960e-02    7.858e-04     88.568     < 2e-16
surprisal                1.928e-03    5.002e-05     38.539     < 2e-16
probMinus1              -1.395e-02    1.363e-03    -10.233     < 2e-16
Intercept (function)    -2.599e-02    1.143e-03    -22.746     < 2e-16
length (content)         1.562e-02    1.423e-04    109.767     < 2e-16
length (function)        5.499e-03    2.791e-04     19.704     < 2e-16
surprisal:probMinus1     4.692e-04    1.776e-04      2.642     < 0.01
s(logFreq)                                                     < 2e-16
R2                       58.4%

Table 4: GAM coefficients fitting human fixation FPD: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

[Figure 4: MLP effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.]
                               MLP FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
training
Intercept (content)      7.252e-02    2.729e-04    265.71      < 2e-16
surprisal                9.028e-04    1.734e-05     52.064     < 2e-16
probMinus1              -1.417e-02    4.723e-04    -29.995     < 2e-16
Intercept (function)    -2.312e-02    3.973e-04    -58.2006    < 2e-16
length (content)         1.651e-02    4.935e-05    334.512     < 2e-16
length (function)        4.324e-03    9.698e-05     44.584     < 2e-16
surprisal:probMinus1     1.810e-04    6.166e-05      2.936     < 0.005
s(logFreq)                                                     < 2e-16
R2                       92.2%
test
Intercept (content)      7.148e-02    1.183e-03     60.42      < 2e-16
surprisal                7.585e-04    7.619e-05      9.956     < 2e-16
probMinus1              -1.061e-02    2.044e-03     -5.188     < 2.2e-07
Intercept (function)    -1.919e-02    1.658e-03    -11.573     < 2e-16
length (content)         1.677e-02    2.136e-04     78.502     < 2e-16
length (function)        3.399e-03    3.963e-04      8.5774    < 2e-16
surprisal:probMinus1    -1.408e-04    2.480e-04     -0.568       0.57
s(logFreq)                                                     < 2e-16
R2                       92.6%

Table 5: GAM coefficients fitting MLP fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

                              LSTM FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
training
Intercept (content)      7.051e-02    3.259e-04    216.317     < 2e-16
surprisal                7.615e-04    2.069e-05     36.802     < 2e-16
probMinus1               2.120e-03    5.644e-04      3.756     < 0.001
Intercept (function)    -1.600e-02    4.778e-04    -33.492     < 2e-16
length (content)         1.649e-02    5.896e-05    279.739     < 2e-16
length (function)        2.801e-03    1.170e-04     23.945     < 2e-16
surprisal:probMinus1    -3.385e-04    7.325e-05     -4.621     < 0.001
s(logFreq)                                                     < 2e-16
R2                       89.6%
test
Intercept (content)      6.812e-02    1.407e-03     48.431     < 2e-16
surprisal                6.837e-04    9.284e-05      7.364     < 2.3e-13
probMinus1               3.293e-03    2.458e-03      1.340       0.18
Intercept (function)    -1.255e-02    1.936e-03     -6.480     < 1.1e-10
length (content)         0.0152041    0.0004032     37.709     < 2e-16
length (function)        0.0042481    0.0007472      5.685     < 1.4e-08
surprisal:probMinus1    -0.0001970    0.0004701     -0.419       0.67
s(logFreq)                                                     < 2e-16
R2                       89.9%

Table 6: GAM coefficients fitting LSTM fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

[Figure 5: LSTM effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.]

[Figure 6: Fine-tuned BERT effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.]
                         BERT+fine-tuning FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
training
Intercept (content)      6.950e-02    8.572e-04     81.075     < 2e-16
surprisal                2.013e-03    5.446e-05     36.9562    < 2e-16
probMinus1              -1.475e-02    1.483e-03     -9.9416    < 2e-16
Intercept (function)    -2.631e-02    1.248e-03    -21.0852    < 2e-16
length (content)         1.570e-02    1.550e-04    101.307     < 2e-16
length (function)        5.528e-03    3.046e-04     18.148     < 2e-16
surprisal:probMinus1     5.024e-04    1.937e-04      2.594     < 0.01
s(logFreq)                                                     < 2e-16
R2                       57.5%
test
Intercept (content)      0.0714503    0.0022332     31.99      < 2e-16
surprisal                0.0014206    0.0001441      9.859     < 2.3e-13
probMinus1              -0.0017461    0.0038742     -0.451       0.65
Intercept (function)    -0.0239773    0.0031336     -7.652     < 2.7e-14
length (content)         1.707e-02    2.499e-04     68.321     < 2e-16
length (function)        1.579e-03    4.627e-04      3.411     < 0.001
surprisal:probMinus1    -5.244e-04    3.561e-04     -1.473       0.14
s(logFreq)                                                     < 2e-16
R2                       78.4%

Table 7: GAM coefficients fitting BERT+fine-tuning fixation FPD in training (top) and test (bottom) data: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

                              BERT FPD
parametric coeff.        estimate     std. error   t value     pr(>|t|)
training
Intercept (content)      9.626e-02    4.765e-04    202.020     < 2e-16
surprisal                1.319e-03    3.027e-05     43.586     < 2e-16
probMinus1              -4.998e-03    8.245e-04     -6.0616    < 1.3e-09
Intercept (function)    -2.293e-02    6.937e-04    -33.053     < 2e-16
length (content)         1.019e-02    8.616e-05    118.232     < 2e-16
length (function)        2.892e-03    1.693e-04     17.0848    < 2e-16
surprisal:probMinus1    -3.874e-04    1.077e-04     -3.599     < 0.001
s(logFreq)                                                     < 2e-16
R2                       75.6%
test
Intercept (content)      0.0960782    0.0021829     44.014     < 2e-16
surprisal                0.0012786    0.0001409      9.073     < 2.3e-13
probMinus1              -0.0013508    0.0037907     -0.356       0.72
Intercept (function)    -0.0192904    0.0030629     -6.298     < 3.4e-10
length (content)         0.0102735    0.0003941     26.069     < 2e-16
length (function)        0.0027876    0.0007299      3.819     < 0.001
surprisal:probMinus1    -0.0008111    0.0004600     -1.763       0.08
s(logFreq)                                                     < 2e-16
R2                       73.5%

Table 8: GAM coefficients fitting BERT fixation FPD for the training (top) and test (bottom) settings: FPD ∼ surprisal × probMinus1 + POSgroup × wordlength + s(logFreq).

[Figure 7: Untuned BERT effects in training (top panel) and test (bottom panel) data, with surprisal, probability of the preceding token (probMinus1), word length (len) as predictors, word log-frequency as a smooth term (logFreq), and fixation first-pass duration as response variable.]