<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparative Evaluation of Computational Models Predicting Eye Fixation Patterns During Reading: Insights from Transformers and Simpler Architectures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Lento</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Nadalini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nadia Khlif</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Pirrelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Marzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcello Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consiglio Nazionale delle Ricerche, Istituto di Linguistica Computazionale "A. Zampolli"</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università Campus Bio-Medico</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Mohammed First</institution>
          ,
          <addr-line>Oujda</addr-line>
          ,
          <country country="MA">Morocco</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Eye tracking records of natural text reading are known to provide significant insights into the cognitive processes underlying word processing and text comprehension, with gaze patterns, such as fixation duration and saccadic movements, being modulated by morphological, lexical, and higher-level structural properties of the text being read. Although some of these effects have been simulated with computational models, it is still not clear how accurately computational modelling can predict complex fixation patterns in connected text reading. State-of-the-art neural architectures have shown promising results, with pre-trained transformer-based classifiers having recently been claimed to outperform other competitors, achieving beyond 95% accuracy. However, transformer-based models have neither been compared with alternative architectures nor adequately evaluated for their sensitivity to the linguistic factors affecting human reading. Here we address these issues by evaluating the performance of a pool of neural networks in classifying eye-fixation English data as a function of both lexical and contextual factors. We show that i) the accuracy of transformer-based models has largely been overestimated, ii) other simpler models make comparable or even better predictions, iii) most models are sensitive to some of the major lexical factors accounting for at least 50% of human fixation variance, iv) most models fail to capture some significant context-sensitive interactions, such as those accounting for spillover effects in reading. The work shows the benefits of combining accuracy-based evaluation metrics with non-linear regression modelling of fixed and random effects on both real and simulated eye-tracking data.</p>
      </abstract>
      <kwd-group>
        <kwd>eye-tracking</kwd>
        <kwd>eye fixation time prediction</kwd>
        <kwd>neural network</kwd>
        <kwd>contextual word embeddings</kwd>
        <kwd>lexical features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Eye-tracking records of natural text reading are a valuable window on the cognitive processes underlying word processing and text comprehension. By looking at fixation patterns it is possible to estimate the effects that lexical properties (e.g. length, frequencies, orthographic similarity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ,
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), contextual constraints (e.g. predictability [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) and higher-level structures (e.g. syntactic structure or prosodic contour [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) can have on human word identification and processing. While psycholinguistic experiments have reliably assessed how such effects modulate reading times, it is not clear to what extent computational models of reading can simulate actual behavioural data such as gaze patterns and fixation durations.
        Over the past 30 years, research in this field has made considerable progress, leading to the development of sophisticated computational models accounting for fine-grained aspects of eye movement behaviour during word and sentence reading (e.g. EZ-Reader [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], SWIFT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). A significant boost in this area came from large eye-tracking corpora of natural reading (e.g. GECO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], ZUCO [8], MECO [9]), which allow for (deep) learning models to be tested in prediction tasks of eye-tracking metrics. Of late, Hollenstein and colleagues [10] reported that fine-tuned, pre-trained transformer language models can make reliable predictions on a wide range of eye-tracking measurements, covering both early and late stages of lexical processing. The evidence suggests that transformers can inherently encode the relative prominence of language units in a text, in ways that accurately replicate human reading skills and their underlying cognitive mechanisms.
      </p>
      <p>Although the accuracy of multilingual transformers is validated across eye-tracking evidence from different languages, the paper neither compares the performance of transformers with the performance of other neural network classifiers trained on the same task, nor does it show what specific knowledge is encoded and put to use by transformers, by looking at the factors affecting their behaviour. In the present paper, we address both issues by assessing the performance of a pool of neural network classifiers on the English batch of Hollenstein et al.'s [10] data.</p>
      <p>
        In what follows, we first describe the English data set and the pool of tested classifiers. Classifiers were selected to include and test either simpler neural architectures than transformers (as is the case with multi-layer perceptrons), or cognitively more plausible processing models (i.e. sequential long short-term memories). Hybrid models, resulting from the combination of different architectures, were also tested. We then move on to discussing the metrics used in [10] for evaluation, to suggest alternative ways to measure accuracy in a fixation prediction task. Finally, we investigate how sensitive each tested architecture is to a few linguistic factors that are known to account for a sizeable amount of variance in human reading gaze patterns. Although some neural networks turn out to be reasonably good at predicting fixation patterns and replicating some robust psycholinguistic effects that are found in human data, it is still unclear whether this ability is due to specific aspects of their architecture, to the type of information they are provided in input, or to their space of trainable parameters. We conclude that, contrary to recent over-enthusiastic reports, predicting eye-fixation patterns of human natural reading is still a big challenge for currently available neural architectures, including transformer-based ones. For this very reason, we contend that the task is key to understanding the inductive bias of these models, as well as assessing their cognitive plausibility as models of language behaviour.
      </p>
      <p>2. Data and Experiments</p>
      <p>All models described in the following paragraphs were trained, validated, and tested on data from the GECO corpus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We used a 5-fold cross-validation with 95% training, 5% validation and 5% test. Experiments were conducted using the PyTorch library [11] in Python or MatLab [12].</p>
      <p>2.1. Dataset</p>
      <p>The GECO corpus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] contains data from 14 English native speakers whose eye movements were recorded while reading Agatha Christie's novel "The Mysterious Affair at Styles" (56410 tokens). Out of the eight word-level eye-tracking measurements used in [10], we focused on i) first-pass duration (FPD), the time spent fixating a word the first time it is encountered, averaged over subjects (see Fig. 2), and ii) fixation proportion (FPROP) or probability, the number of subjects that fixated a word, divided by the total number of subjects.
      </p>
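      <p>Concretely, both measures are simple per-token aggregates over the 14 participants. The following toy sketch is our own illustration of that aggregation; the array layout and the treatment of skipped words are assumptions, not details reported for GECO.</p>
      <preformat>
# Toy sketch (assumed aggregation; not the GECO preprocessing code).
# Rows: subjects; columns: tokens; 0 marks a word the subject never fixated.
import numpy as np

first_pass = np.array([[210., 0., 180.],    # subject 1
                       [190., 150., 0.],    # subject 2
                       [250., 0., 220.]])   # subject 3

fixated = first_pass > 0
# FPD: first-pass duration averaged over subjects (here, over subjects who fixated the word)
fpd = first_pass.sum(axis=0) / np.maximum(fixated.sum(axis=0), 1)
# FPROP: number of subjects who fixated the word, divided by the total number of subjects
fprop = fixated.mean(axis=0)
print(fpd)    # [216.67 150.   200.  ]
print(fprop)  # [1.     0.333  0.667]
      </preformat>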
      <p>Word tokens in the original dataset were encoded with linguistic information including:
i) character length (removing punctuation)
ii) log frequency (source: BNC [13])
iii) part-of-speech tag (source: Stanza [14])
iv) context surprisal/predictability (source: GPT-2 [
        <xref ref-type="bibr" rid="ref3">15, 16, 3</xref>
        ])
v) distance from the beginning of the sentence (number of intervening tokens)
vi) distance from the end of the sentence (number of intervening tokens)
vii) presence of heavy punctuation after the token
viii) presence of light punctuation after the token.</p>
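      <p>By way of illustration, feature (iv) can be derived from an autoregressive language model. The sketch below is an assumption of how per-word surprisal could be computed with GPT-2 (summing sub-token surprisals within a word); it is not the pipeline used to build the dataset.</p>
      <preformat>
# Sketch: per-word surprisal (in bits) from GPT-2; an assumed recipe for feature (iv),
# not the authors' original script.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(words):
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    ids = enc.input_ids[0]
    with torch.no_grad():
        logprobs = torch.log_softmax(lm(**enc).logits[0], dim=-1)
    surp = [0.0] * len(words)
    for pos in range(1, ids.size(0)):              # the first sub-token has no left context
        word_idx = enc.word_ids(0)[pos]
        if word_idx is not None:
            # -log2 P(sub-token | preceding sub-tokens), accumulated per word
            surp[word_idx] += -logprobs[pos - 1, ids[pos]].item() / math.log(2)
    return surp

print(word_surprisals("The mysterious affair at Styles".split()))
      </preformat>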
      <p>2.2. BERT ++</p>
      <p>To replicate results from [10], we used BERT [17] with a linear layer on top of it. The linear layer gets BERT contextual word embeddings as input, to predict FPD and FPROP. After sentence padding and tokenization, irrelevant and special subtokens were masked to enforce a correspondence between each vector in the target sequence and each vector in the output sequence, and to train the loss only on relevant tokens. Mean Square Error (MSE) loss was used along with the AdamW optimizer (with no weight decay for the biases). The initial learning rate was set to 5 · 10⁻⁵, and a linear scheduler was used. We used a batch size of 16 sentences and 100 training epochs, with an early stopping criterion (best model on the validation set). The model was trained both with fine-tuning (i.e. by also training BERT internal weights: bert FT + layer) and without fine-tuning (by only training the final layer weights: bert + layer).</p>
      <p>Finally, we used BERT also in combination with a sequential LSTM network. This model (bert + LSTM) takes the pre-trained BERT contextual word embeddings (i.e. without fine-tuning) as input, along with the lexical features (i), (ii) and (iv), to predict FPD and FPROP.</p>
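      <p>A minimal PyTorch sketch of this setup is given below; it is a reconstruction from the description above, and the model name, head size and masked-loss details are assumptions (the linear scheduler and early stopping are omitted for brevity).</p>
      <preformat>
# Sketch of "bert + layer" / "bert FT + layer": a linear head regressing FPD and FPROP
# from BERT sub-token embeddings, trained with MSE and AdamW (lr = 5e-5), with special
# and padding sub-tokens masked out of the loss. Unstated details are assumptions.
import torch
from torch import nn
from transformers import AutoModel

class BertFixationRegressor(nn.Module):
    def __init__(self, fine_tune=False, n_targets=2):       # FPD and FPROP
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        if not fine_tune:                                    # "bert + layer": freeze BERT
            for p in self.bert.parameters():
                p.requires_grad = False
        self.head = nn.Linear(self.bert.config.hidden_size, n_targets)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)                             # one prediction per sub-token

model = BertFixationRegressor(fine_tune=True)                # "bert FT + layer"
optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=5e-5)

def masked_mse(pred, target, target_mask):
    # irrelevant/special sub-tokens carry no eye-tracking target: exclude them from the loss
    err = ((pred - target) ** 2).mean(dim=-1)
    return (err * target_mask).sum() / target_mask.sum()
      </preformat>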
      <p>2.3. LSTM</p>
      <p>Reading is inherently sequential. Thus, recurrent neural networks appear to offer a promising approach to modelling a fixation prediction task, and a good alternative to transformers. Using the GECO dataset split into pages rather than sentences, we trained an LSTM with 96 hidden units and a single layer, with a feed-forward network using tanh activation functions on top of it. The model (lstm) takes as input the lexical features (i)-(iv) for the target token and 4 tokens to its left and 3 to its right, to predict FPD and FPROP of the target token. MSE loss was used along with the AdamW optimizer. The initial learning rate was set to 5 · 10⁻³, with a linear scheduler and a batch containing the entire training dataset. The model was trained for 3000 epochs with an early stopping criterion (best model on the validation set).</p>
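      <p>A corresponding sketch of this architecture follows; the per-token feature dimensionality, the size of the feed-forward head and the batching are our assumptions.</p>
      <preformat>
# Sketch of the "lstm" model: features (i)-(iv) for the target token plus 4 left / 3 right
# neighbours, fed to a single-layer LSTM (96 hidden units) topped by a tanh feed-forward
# head. Feature dimensionality and head size are assumptions.
import torch
from torch import nn

N_FEATS = 4          # length, log frequency, POS (numerically encoded), surprisal
WINDOW = 4 + 1 + 3   # 4 tokens to the left, the target, 3 tokens to the right

class LstmFixationModel(nn.Module):
    def __init__(self, hidden=96, n_targets=2):              # FPD and FPROP
        super().__init__()
        self.lstm = nn.LSTM(N_FEATS, hidden, num_layers=1, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, n_targets))

    def forward(self, x):                  # x: (batch, WINDOW, N_FEATS)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # predict FPD/FPROP of the target token

model = LstmFixationModel()
optim = torch.optim.AdamW(model.parameters(), lr=5e-3)   # linear schedule, full-batch training
pred = model(torch.randn(8, WINDOW, N_FEATS))             # toy batch -> (8, 2)
      </preformat>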
      <p>2.4. MLP</p>
      <p>A Multi-Layer Perceptron (mlp) was trained using the entire set of lexical features (i)-(viii) as input, with an input context consisting of the two words immediately preceding and following the target word. Several instances of this architecture were tested, but only the results of the best performing instance (with a single hidden layer of 10 units, sigmoidal activation functions, the Adam optimiser, the MSE loss, a constant learning rate of 0.1, and 1000 training epochs) are reported here.</p>
      <p>An identical MLP model (mlp UDT) was eventually trained on a subset of the GECO training data, obtained by sampling target features uniformly. This was done to train the network with an equal number of tokens for each bin of fixation times, and to assess the impact of different distributions of input data on the network's performance on test data.</p>
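      <p>The corresponding sketch is given below (mlp UDT differs only in how the training tokens are sampled); the per-word feature encoding and batch handling are assumptions.</p>
      <preformat>
# Sketch of the "mlp" model: features (i)-(viii) for the target word and the two words
# before and after it, one hidden layer of 10 sigmoid units, MSE loss, Adam with a
# constant learning rate of 0.1, 1000 epochs. Feature encoding is an assumption.
import torch
from torch import nn

N_FEATS = 8                      # features (i)-(viii) per word
CONTEXT = 5                      # target word plus two preceding and two following words

mlp = nn.Sequential(
    nn.Linear(N_FEATS * CONTEXT, 10),
    nn.Sigmoid(),
    nn.Linear(10, 2),            # FPD and FPROP
)
optim = torch.optim.Adam(mlp.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(32, N_FEATS * CONTEXT)       # a toy batch of flattened feature windows
y = torch.rand(32, 2)
for _ in range(1000):
    optim.zero_grad()
    loss = loss_fn(mlp(x), y)
    loss.backward()
    optim.step()
      </preformat>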
      <sec id="sec-1-1">
        <p>Loss accuracy (accL) is a measure of the overall similarity between predicted and target values, calculated as the complement to 1 of the Mean Absolute Error (MAE), after fitting the target data in the training set into the [0, 1] range with min-max scaling: accL = 1 − (1/N) Σ_{t∈T} |ŷ_t − y_t|, where y_t = x_t / max{x} is the scaled target value and ŷ_t is the model prediction for y_t. Loss accuracy is the metric used in [10].</p>
        <p>Threshold accuracy (accT) measures how many times the predicted value is close to the target value within a fixed threshold, and is calculated as follows: accT(θ) = 1 − (1/N) Σ_{t∈T} H(|ŷ_t − y_t| − θ).</p>
      </sec>
      <sec id="sec-1-2">
        <p>Sensitivity accuracy (accS) counts how many times the predicted value is close to the target value within a threshold dynamically calculated on the basis of the target value: the higher the target value, the higher the threshold. An offset value is needed to obtain a positive threshold also for zero target values. This accuracy is calculated as follows: accS(κ, θ₀) = 1 − (1/N) Σ_{t∈T} H(|ŷ_t − y_t| − (κ · y_t + θ₀)), where N is the number of examples in the training/test set, H is the Heaviside step function, θ and θ₀ are (offset) thresholds, and κ is a sensitivity coefficient. As for FPD, which is a duration expressed in milliseconds, we used θ₀ = 25 and κ = 10% for accS, and θ = 50 for accT. As for FPROP, which is a probability, we used θ₀ = 0.01 and κ = 10% for accS, and θ = 0.1 for accT.</p>
        <p>Finally, the performance of our models was compared against a baseline model (const) that always outputs the overall mean fixation duration (across both subjects and items) in the training data.</p>
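        <p>In compact form, the three metrics can be written as in the following sketch (our notation and variable names; the const baseline is included for comparison).</p>
        <preformat>
# Sketch of the evaluation metrics accL, accT and accS (our notation).
import numpy as np

def acc_loss(y, y_hat):
    """accL: 1 - MAE, with targets and predictions scaled into [0, 1] by max(y)."""
    return 1.0 - np.mean(np.abs(y_hat - y))

def acc_threshold(y, y_hat, theta):
    """accT: share of items whose absolute error stays within a fixed threshold theta."""
    return np.mean(np.abs(y_hat - y) &lt;= theta)

def acc_sensitivity(y, y_hat, kappa, theta0):
    """accS: share of items whose error stays within the relative bound kappa*y + theta0."""
    return np.mean(np.abs(y_hat - y) &lt;= kappa * y + theta0)

fpd = np.array([120., 36., 280., 60.])        # toy FPD targets
const = np.full_like(fpd, fpd.mean())         # the "const" baseline prediction
print(acc_threshold(fpd, const, theta=50),
      acc_sensitivity(fpd, const, kappa=0.10, theta0=25))
        </preformat>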
        <p>3. Results</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Data analysis</title>
      <sec id="sec-2-1">
        <p>To what extent are neural network models sensitive to some of the factors accounting for gaze patterns in human natural reading? Are language models able to adapt themselves to both lexical properties and in-context features of a reading text, thus exhibiting a human-like performance?</p>
        <p>Human reading behaviour is shown to be affected by lexical features – e.g. word length and frequency, and morphological complexity – as well as by contextual factors, with a facilitatory effect of contextual redundancy and predictability [18, 19] on reading duration and eye fixations. Accordingly, we modelled human FPDs as a response variable resulting from the interaction of both lexical and contextual predictors: namely, word length, a dichotomous classification of token POS into content versus function words, surprisal of the target word as a measure of how unexpected or unpredictable the word is, and the probability of the word immediately preceding the target word in context (to account for so-called spill-over effects). Additionally, we used a Generalised Additive Model (GAM), with token log-frequency as a smooth term, to model possibly non-linear effects of predictors. Models' coefficients and effect plots are shown in Appendix C (Figure 3 and Table 4).</p>
        <p>GAMs with identical independent variables have been run to model the FPDs predicted by all our neural networks, on both training and test data. Inspection of effect plots and model coefficients – as reported in Appendix C – shows a behavioural alignment of all models with human data for what concerns the modulation of fixation times by lexical features, in both train and test data.</p>
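        <p>For illustration, the structure of this regression can be approximated in Python with an ordinary linear interaction model; the actual analysis used GAMs fitted with gamm4 in R, with a smooth term for log frequency, and the column names below are hypothetical.</p>
        <preformat>
# Illustrative only: a linear-model counterpart of the GAM described above, fitted with
# statsmodels. The real analysis used gamm4 in R; column names (fpd, length, is_function,
# surprisal, prob_minus1, log_freq) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("geco_word_level.csv")      # hypothetical word-level table

model = smf.ols(
    "fpd ~ length * C(is_function) + surprisal * prob_minus1 + log_freq",
    data=df,
).fit()
print(model.summary())                       # coefficient table analogous to Appendix C
        </preformat>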
        <p>token
lstm (training) =0.79*****
)0.5
sd0.4
n
co0.3
(se0.2
PD0.1</p>
        <p>F 0
1 2 3 4 5</p>
        <p>token 104
)0.5 bert FT + layer (training) =0.98*****
sd0.4
n
co0.3
(se0.2
PD0.1
F 0</p>
        <p>token
1
2
3
4</p>
      </sec>
      <sec id="sec-2-2">
        <p>In contrast, all models fail to capture some contextual effects on test data, such as those observed in a context window of – at least – two adjacent words. To illustrate, efficient syntactic chunking (e.g. of noun, verb and prepositional phrases) has been shown to lead to faster and more accurate human reading (see, for example, [20]).</p>
        <p>Conversely, most neural networks show no statistically significant effect on fixation duration of the probability of the immediately preceding word in context. This is observed either in isolation (probMinus1), in LSTMs and transformer-based models with BERT representations (either fine-tuned or not), or in interaction with the unpredictability of the target word (surprisal:probMinus1). The evidence shows that most neural models cannot replicate, among other things, so-called spillover effects of the left context on the reading time of ensuing words [21].</p>
        <p>Table 2: Sensitivity accuracy (accS) values for three bins from the FPD distribution: low (FPD below the 5th percentile = 36 ms), medium (FPD ranging from the 5th to the 95th percentile), and high (FPD above the 95th percentile = 280 ms).</p>
        <p>5. General Discussion</p>
        <p>Transformer-based neural networks appear to reasonably predict fixation probability and first-pass duration of words in human reading of English connected texts. Our present investigation basically supports this conclusion, while providing new evidence on two questions that naturally arise in this context. How accurate are transformer-based predictions compared with the best predictions of other neural network classifiers trained on the same task? How cognitively plausible are the mechanisms underpinning this performance? Here, we addressed both questions by testing various models on the task of predicting human reading measurements from the GECO corpus, using different evaluation metrics and regressing network predictions on a few linguistic factors that are known to account for human reading behaviour.</p>
        <p>Our first observation is that assessing a network's performance by looking at its MAE loss function provides a rather gross evaluation of the effective power of a neural network simulating human reading behaviour. A baseline model assigning each token a constant gaze duration that equals the average of all FPD values attested in GECO achieves a 95.7% loss-based accuracy on both test and training data. That a transformer-based classifier scores 97.2% on the same metric and the same test data cannot be held, as such, as a sign of outstanding performance. In fact, it turns out that the MAE loss function is blind to both the magnitude of a network error and possible biases in the prediction of very low/high target values. Thus, it provides an inflated estimate of a model's accuracy. We suggest that binary evaluation metrics based on a fixed threshold partially overcome these limitations. Yet, as single-word fixation times typically range between tens and hundreds of milliseconds, application of a fixed threshold will differently affect tokens with different fixation times. We conclude that a relative threshold based on each word's fixation time is a fairer way to measure prediction accuracy. Clearly, this comes at a cost. When assessed with a relative threshold, the accuracy of a transformer-based architecture on test data drops from 70% down to 57.8%.</p>
        <p>It turned out that all the other network models tested for the present purposes showed accuracy levels that are comparable to the accuracy of a transformer-based architecture. Since the former are trained on a more restricted set of lexical and contextual input features than the latter, this seems to suggest that word embeddings are of limited use in the task at hand. Although fine-tuned word embeddings actually appear to score much higher on training data (even using accT and accS), we observe that this is due to data overfitting, as clearly shown by the considerably poorer performance of the fine-tuned model on test data.</p>
        <p>An analysis of the psychometric plausibility of the gaze patterns simulated with our neural models reveals that a relatively small set of linguistic factors that are known to account for a sizeable amount of variance in human fixation times can also account for the bulk of variance in the models' behaviour. This is relatively unsurprising, as most of these models were trained on input features that encode at least some of these factors. Nonetheless, we believe that the result is interesting for at least two reasons. First, it shows a promising convergence between computational metrics of model accuracy and quantitative models of psychometric assessment. Secondly, it suggests that one can gain non-trivial insights into a model's behaviour by analysing to what extent that behaviour is sensitive to the same linguistic factors human readers are known to be sensitive to. On the one hand, this is a step towards understanding what information a neural model is actually learning and putting to use for the task. On the other hand, it is instrumental in developing better models, as it shows what type of input information is most needed to successfully carry out a task, at least if one is trying to simulate the way the same task is carried out by speakers.</p>
        <p>In the end, it may well be the case that a 70% fixed-threshold accuracy in simulating average gaze patterns in human reading is not as disappointing as it might seem. Given the wide variability in human reading behaviour (and even in a single reader when confronted with different texts), a considerable amount of variance in our data may simply be accounted for by by-subject (or by-token) random effects. In some experiments not reported here we trained our models to predict single-reader behaviour. All architectures fared rather poorly on the task, a result which is in line with similar disappointing results on other output features reported in [10]. Looking back at Figure 1, it can be noted that all models' predictions fall within a μ_t ± σ_t range, where μ_t and σ_t are, respectively, the by-reader mean and standard deviation of FPD values for token t (see also Table 2). This pattern may suggest that models' predictions are in fact bounded by the standard deviation we observe in human behaviour and cannot reach beyond these bounds. Conversely, this evidence may be interpreted as suggesting that more input features are needed to build more accurate classifiers. Further experiments are needed to test the merits of either conjecture.</p>
        <p>6. Limitations and outlook</p>
        <p>In the present paper, we replicated recent experimental data of transformer-based architectures simulating word fixation duration in reading a connected text [10], with a view to assessing their relative performance compared with reading times by humans and other neural architectures. This justifies our exclusive focus on fixation duration, which is, admittedly, only one behavioural correlate of a complex, inherently multimodal task such as reading. In fact, reading requires the fine coordination of eye movements and articulatory movements for text decoding and comprehension. The eye provides access to the visual stimuli needed for voice articulation to unfold at a relatively constant rate. In turn, articulation can feed back to oculomotor control for eye movements to be directed when and where processing difficulties arise. Incidentally, this is also true of silent reading, as shown by evidence supporting the Implicit Prosody Hypothesis [22], i.e. the idea that, in silent reading, readers activate prosodic representations that are similar to those they would produce when reading the text aloud. Hence, a reader must always rely on a tight control strategy to ensure that fixation and articulation are optimally coordinated.</p>
        <p>A clear limitation of our current work and all experiments reported here is that we are only focusing on one dimension of a complex, multimodal behaviour like reading. Recently, we showed that there is a lot about gaze patterns that we can understand by correlating eye movements with voice articulation [23]. This information, which cannot be represented in a dataset structured at the word level, may be critical for a model to accurately learn and mimic the cognitive mechanisms underlying natural reading. Likewise, as correctly pointed out by one of our reviewers, focusing on fixation times while ignoring saccadic movements may seriously detract from the explanatory power of any computational model of human reading. In fact, this could be tantamount to timing a bike rider's speed while ignoring whether she is climbing up a hill or approaching a sharp turn. More realistic models of reading are bound to include more aspects of reading behaviour in more ecologically valid tasks. In the end, it may well be the case that the task of predicting gaze patterns of human reading should be conceptualized differently, by anchoring these patterns not only to the syntagmatic dimension of a written text, but also to the time-line of the different movements and multimodal processes that unfold during reading.</p>
        <p>A. GeCO FPD data</p>
        <p>Figure 2: A view of FPD data in the GECO dataset, consisting of eye-tracking patterns of 14 adult participants reading the novel "The Mysterious Affair at Styles" by Agatha Christie. Top panel: distributions of FPD data, with chapters grouped into 4 parts, for participant #1 (with 3 more participants showing a similar distribution), participant #2 (with 8 more participants showing a similar distribution) and participant #10; the rightmost box plots show the average distribution across all 14 participants. Bottom panel: plot of all 56410 tokens in the dataset, in ascending order of mean FPD (dashed black line). For each token, the standard deviation calculated on the distribution of the FPDs of the 14 participants is shown both above and below the mean value (gray dots).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>B. FPROP accuracy</title>
    </sec>
    <sec id="sec-4">
      <title>C. Data analysis</title>
      <p>In this section, coefficients of Generalised Additive Models (GAMs) are detailed for each neural model. Statistically non-significant p-values of GAM predictor terms are given in bold face. GAMs are fitted using the gamm4 package, version 0.2-6, of the R statistical software [24], as they do not assume a linear relation between the fitted variable and its predictors. All plots were created with the ggplot2 package, version 3.5.</p>
      <sec id="sec-4-1">
        <title>Human FPD</title>
        <p>parametric coef. | estimate | std. error | t value | Pr(&gt;|t|)
Intercept (content) | 6.960e-02 | 7.858e-04 | 88.568 | &lt; 2e-16
surprisal | 1.928e-03 | 5.002e-05 | 38.539 | &lt; 2e-16
probMinus1 | -1.395e-02 | 1.363e-03 | -10.233 | &lt; 2e-16
Intercept (function) | -2.599e-02 | 1.143e-03 | -22.746 | &lt; 2e-16
length (content) | 1.562e-02 | 1.423e-04 | 109.767 | &lt; 2e-16
length (function) | 5.499e-03 | 2.791e-04 | 19.704 | &lt; 2e-16
surprisal:probMinus1 | 4.692e-04 | 1.776e-04 | 2.642 | &lt; 0.01
s(logFreq) | | | | &lt; 2e-16
R2 | 58.4%</p>
      </sec>
      <sec id="sec-4-2">
        <title>BERT FPD</title>
        <p>parametric coef. | estimate | std. error | t value | Pr(&gt;|t|)
Intercept (content) | 9.626e-02 | 4.765e-04 | 202.020 | &lt; 2e-16
surprisal | 1.319e-03 | 3.027e-05 | 43.586 | &lt; 2e-16
probMinus1 | -4.998e-03 | 8.245e-04 | -6.061 | 1.3e-09
Intercept (function) | -2.293e-02 | 6.937e-04 | -33.053 | &lt; 2e-16
length (content) | 1.019e-02 | 8.616e-05 | 118.232 | &lt; 2e-16
length (function) | 2.892e-03 | 1.693e-04 | 17.085 | &lt; 2e-16
surprisal:probMinus1 | -3.874e-04 | 1.077e-04 | -3.599 | &lt; 0.001
s(logFreq) | | | | &lt; 2e-16
R2 | 75.6%</p>
        <p>Intercept (content) | 0.0960782 | 0.0021829 | 44.014 | &lt; 2e-16
surprisal | 0.0012786 | 0.0001409 | 9.073 | 2.3e-13
probMinus1 | -0.0013508 | 0.0037907 | -0.356 | 0.72
Intercept (function) | -0.0192904 | 0.0030629 | -6.298 | 3.4e-10
length (content) | 0.0102735 | 0.0003941 | 26.069 | &lt; 2e-16
length (function) | 0.0027876 | 0.0007299 | 3.819 | &lt; 0.001
surprisal:probMinus1 | -0.0008111 | 0.0004600 | -1.763 | 0.08
s(logFreq) | | | | &lt; 2e-16
R2 | 73.5%</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gerth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Festman</surname>
          </string-name>
          ,
          <article-title>Reading development, word length and frequency efects: An eye-tracking study with slow and fast readers, Frontiers in Communication 6 (</article-title>
          <year>2021</year>
          )
          <fpage>743113</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Häikiö</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Dickins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hyönä</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Liversedge</surname>
          </string-name>
          ,
          <article-title>Eye movements of children and adults reading in three diferent orthographies</article-title>
          .,
          <source>Journal of Experimental Psychology: Learning, Memory, and Cognition</source>
          <volume>48</volume>
          (
          <year>2022</year>
          )
          <fpage>1518</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Salicchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chersoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <article-title>A study on surprisal and semantic relatedness for eye-tracking data prediction</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>1112365</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hirotani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Frazier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rayner</surname>
          </string-name>
          ,
          <article-title>Punctuation and intonation efects on clause and sentence wrap-up: Evidence from eye movements</article-title>
          ,
          <source>Journal of Memory and Language</source>
          <volume>54</volume>
          (
          <year>2006</year>
          )
          <fpage>425</fpage>
          -
          <lpage>443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Reichle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rayner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pollatsek</surname>
          </string-name>
          ,
          <article-title>The E-Z Reader model of eye-movement control in reading: Comparisons to other models</article-title>
          ,
          <source>Behavioral and Brain Sciences</source>
          <volume>26</volume>
          (
          <year>2003</year>
          )
          <fpage>445</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Engbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nuthmann</surname>
          </string-name>
          , E. Richter,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kliegl</surname>
          </string-name>
          ,
          <article-title>SWIFT: A Dynamical Model of Saccade Generation During Reading</article-title>
          .,
          <source>Psychological review 112</source>
          (
          <year>2005</year>
          )
          <fpage>777</fpage>
          -
          <lpage>813</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>U.</given-names>
            <surname>Cop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dirix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Drieghe</surname>
          </string-name>
          , W. Duyck,
          Presenting GECO:
          <article-title>An eyetracking corpus of monolingual and bilingual sentence reading</article-title>
          ,
          <source>Behavior Research Methods</source>
          <volume>49</volume>
          (
          <year>2017</year>
          )
          <fpage>602</fpage>
          -
          <lpage>615</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] … A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, S. Zhang, M. Suo, P. Tillet, X. Zhao, E. Wang, K. Zhou, R. Zou, X. Wang, A. Mathews, W. Wen, G. Chanan, P. Wu, S. Chintala, PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation, in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24, Association for Computing Machinery, 2024, pp. 929-947.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] T. M. Inc., MATLAB version 9.7.0.1190202 (R2019b), 2019.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] The British National Corpus, XML edition, 2007.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners (2019).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. A. Michaelov, B. K. Bergen, Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?, arXiv preprint arXiv:2208.14554 (2022).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019. ArXiv:1810.04805 [cs], version 2.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] K. E. Stanovich, Attentional and automatic context effects in reading, in: Interactive Processes in Reading, Routledge, 2017, pp. 241-267.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] G. B. Simpson, R. R. Peterson, M. A. Casteel, C. Burgess, Lexical and sentence context effects in word recognition, Journal of Experimental Psychology: Learning, Memory, and Cognition 15 (1989) 88.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] K. Rayner, K. H. Chace, T. J. Slattery, J. Ashby, Eye movements as reflections of comprehension processes in reading, Scientific Studies of Reading 10 (2006) 241-255.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] N. J. Smith, R. Levy, The effect of word predictability on reading time is logarithmic, Cognition 128 (2013) 302-319.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] M. Breen, Empirical investigations of the role of implicit prosody in sentence processing, Language and Linguistics Compass 8 (2014) 37-50.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Nadalini, C. Marzi, M. Ferro, L. Taxitari, A. Lento, D. Crepaldi, V. Pirrelli, Eye-voice and finger-voice spans in adults' oral reading of connected texts. Implications for reading research and assessment, The Mental Lexicon (2024). URL: https://benjamins.com/catalog/ml.00025.nad.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2023. URL: https://www.R-project.org/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>