Multi-Modal Human Cognitive State Recognition during Reading

Nikita Filimonov
Lomonosov Moscow State University
filimonovn160@gmail.com

Abstract. Human cognitive state recognition is an important and challenging task. Various registration technologies can be used to collect physiological data that potentially contain relevant information about the current cognitive state of a human subject. Oculography (eye tracking) and electroencephalography (EEG) are the most popular and well-researched registration technologies. Both have inexpensive commercial variants that do not require laboratory equipment or the involvement of a professional physiologist to collect the data. However, it is still problematic and expensive to obtain large-scale datasets of physiological data of this sort. This work provides a review and analysis of the available open-source physiological data. The task of natural reading is considered, since the work of the eyes and the human brain during reading is of great interest for cognitive science in combination with machine learning. A multi-modal approach that combines EEG and eye-tracking data in a jointly trained artificial neural network is proposed. Intermediate results on encoding EEG signals with a Variational Auto-Encoder (VAE) are presented.

Keywords: Eye tracking · Electroencephalography · Artificial neural networks.

1 Introduction

This work1 is aimed at developing a neural network architecture for assessing the cognitive state of a person while working with a text. The term "cognitive state" in this work does not refer to the emotional state of a person performing a certain task, but to which parts of the brain respond, and how, to a certain stimulus, in this case the information contained in the text. The first part of this work is devoted to an overview of feature selection approaches and a study of data preprocessing methods.
In this work, the following data will be used: electroencephalograms (EEG), eye-tracking data, and vectorized text sets. An EEG is a collection of electrical signals registered on the scalp; generally speaking, EEG is a method for studying the functional state of the brain. Mathematically, EEG data is represented by an M × N matrix, where M is the number of channels of the device used to record the signals, and N is the length of the record. Eye-tracking data is the result of registering a participant's eye movements, for example while reading a text; such a dataset is a sequence of coordinates at which the person's eyes were focused, sampled at a fixed frequency.

1 The work is performed as a master thesis at the master program "Big Data: Infrastructures and Methods for Problem Solving", Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The next stage of the work is architecture development, which will directly depend on which features are selected and how they are grouped in the dataset. Feature selection is an important topic because the research area lies at the intersection of neurophysiology and data science, which leads to certain difficulties with feature engineering. In this research, not only mathematical quality metrics but also medical ones will be taken into account. By medical metrics we mean that, at the data processing and preparation stages, the correspondence of the selected metrics to reality will be checked, and the features themselves will be extracted in accordance with the observations of experts. For example, EEG data have been studied for a long time, and many patterns of how the brain reacts to certain stimuli, and how these patterns look after certain transformations, are already known.
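As a minimal sketch, the two modalities described above can be represented as plain arrays; the sizes below (105 channels, 500 Hz, a 1920×1080 screen) are illustrative assumptions, not values tied to any particular recording:

```python
import numpy as np

# Hypothetical recording: 105 EEG channels sampled at 500 Hz for 10 seconds.
n_channels, sfreq, duration_s = 105, 500, 10
n_samples = sfreq * duration_s

# EEG as an M x N matrix: M channels, N samples.
eeg = np.random.randn(n_channels, n_samples)

# Eye tracking as a sequence of (x, y) screen coordinates at the same rate,
# here scaled to an assumed 1920x1080 screen.
gaze = np.random.rand(n_samples, 2) * np.array([1920.0, 1080.0])

assert eeg.shape == (105, 5000)
assert gaze.shape == (5000, 2)
```

Both modalities share the same time axis here, which is what later makes it possible to align EEG samples with fixation timestamps.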
Later, based on the obtained data, it will be possible to build hypotheses and draw conclusions and, most importantly, to correctly perform batch sampling when training neural networks. A significant part of the research on related topics focuses on a single type of physiological data, either eye-tracking features or electroencephalography features. This work proposes an approach that combines both types of features together with semantic information extracted from the text. Future development of this work will include training and evaluating the proposed approach on various datasets, including open-source ones. Apart from the deep neural network architecture, the work includes data preprocessing, feature extraction, and feature selection steps. As a final step, a quality metric for the neural network in the cognitive state recognition task will be developed. This metric can be viewed as a hypothesis that requires further testing and experiment design.

The rest of this paper is organized as follows: Section 2 discusses related work, Section 3 describes the dataset and data preprocessing issues, the approach is proposed in Section 4, and preliminary results are described in Section 5.

2 Related Work

A significant amount of work has been dedicated to feature extraction, selection, and preprocessing. Usually, word-level characteristics are used to solve classification, regression, and other problems with EEG or eye-tracking data. The following features are usually extracted from raw word-level data: number of fixations; mean fixation duration; gaze duration, the sum of all fixations on the current word in the first-pass reading before the eye moves out of the word; total reading time (TRT), the sum of all fixation durations on the current word; first fixation duration (FFD), the duration of the first fixation on the prevailing word; and go-past time (GPT), the sum of all fixations before the eye moves to the right of the current word.
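The word-level measures just listed can be illustrated with a toy aggregation routine. The fixation format (word index, duration) and the exact boundary conventions, e.g. when go-past time ends, are simplifying assumptions for illustration, not the dataset's official definitions:

```python
def word_level_features(fixations, word_idx):
    """Aggregate a chronological fixation sequence into per-word measures.

    fixations: list of (word_index, duration_ms) tuples.
    Returns number of fixations (nFix), first fixation duration (FFD),
    total reading time (TRT), and go-past time (GPT) for word_idx.
    """
    durs = [d for w, d in fixations if w == word_idx]
    n_fix = len(durs)
    ffd = durs[0] if durs else 0          # duration of the first fixation
    trt = sum(durs)                       # sum of all fixation durations
    # GPT: all fixation time from first entering the word until the eyes
    # first progress to the right of it (regressions to the left included).
    gpt, entered = 0, False
    for w, d in fixations:
        if w == word_idx:
            entered = True
        if entered:
            if w > word_idx:
                break
            gpt += d
    return {"nFix": n_fix, "FFD": ffd, "TRT": trt, "GPT": gpt}

# Toy scanpath: fixate word 1, regress to word 0, return to 1, move to 2.
fx = [(0, 200), (1, 180), (0, 150), (1, 120), (2, 210)]
feats = word_level_features(fx, 1)
assert feats == {"nFix": 2, "FFD": 180, "TRT": 300, "GPT": 450}
```

Note how GPT (450 ms) exceeds TRT (300 ms) here: the regression to word 0 counts toward go-past time but not toward the word's own fixation durations.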
According to [1], both monolingual and multilingual models achieve high accuracy in predicting a range of eye-tracking features across four languages. A comparison of the performance of language-specific and multilingual pretrained transformer models on a regression task was provided. The main task of that work was to predict eye-tracking features in an experiment where participants were reading texts in Dutch, English, German, and Russian. According to the results, Bidirectional Encoder Representations from Transformers (BERT) [2] and Cross-Lingual Language Model (XLM) [3] models show the best performance in restoring eye-tracking data, while XLM models require less data to fine-tune.

In [4], EEG features were used to supervise machine attention. It was also shown that cropping the data with random-forest splits does not reduce model accuracy but considerably reduces the dimensionality of the EEG data. The authors used a Bidirectional Long Short-Term Memory (BiLSTM) model with an attention mechanism. Not only EEG features can be used to tune attention weights in neural networks, but eye-tracking features as well. In [10], human attention derived from eye-tracking data was used to regularize attention functions in recurrent neural networks. First note that the baseline models only attend to one or two coherent text parts. Two conclusions can be drawn from that paper: baseline models mainly focus on stop words rather than on gaze or fixation information, and the regularization by human attention learned from eye-tracking data enables neural networks (a bidirectional LSTM in this case) to better focus on the most relevant aspects of sentences for the target tasks.

3 Dataset

Due to the expensive data collection process and the necessity of involving qualified physiologists, only a very limited amount of such data is available to the research community.
This work focuses on the freely available ZuCo and ZuCo 2.0 datasets [5] [6]. The first version of the dataset has appeared many times in various research works, while the second version has a different semantic specification. In ZuCo 2.0, data had been recorded from 19 participants, but one of them had technical problems with the recording, so the dataset consists of recordings collected from 18 participants. In the ZuCo 2.0 experiment, the participants had to read 739 sentences selected from the Wikipedia dataset. The corpus provides semantic annotations of semantically different tasks. The normal reading task consists of a random set of Wikipedia sentences. The following topics were selected for the task-specific reading part: political affiliation, education, founder, wife/husband, job title, nationality, and employer. The sentences have the same length as in ZuCo 1.0, with similar semantics. In the normal reading task, the participants had to read 349 sentences, and 390 sentences in the task-specific reading task. Furthermore, there is an overlap in the sentences between ZuCo 1.0 and ZuCo 2.0: 100 normal reading and 85 task-specific sentences from ZuCo 2.0 were already recorded in ZuCo 1.0. This provides an opportunity to compare different recording procedures (i.e., session-specific effects) and to perform studies on a larger number of participants (subject-specific effects).

The eye-tracking device captures eye position and pupil size. Recordings were made at a sampling rate of 500 Hz with an EyeLink 1000 Plus (SR Research) device. The eye tracker was calibrated with a 9-point grid at the beginning of the session and re-validated before each block of sentences. In the ZuCo dataset, the following word-level features are extracted from the eye-tracking data:

1. X, Y coordinates of the fixation
2. Fixation durations
3. Gaze duration
4. Total reading time
5. Number of fixations on this word
6. Pupil size

The EEG part of the dataset contains 128 channels of raw data with a sampling frequency of 500 Hz. After the initial filtering and cleaning performed by the dataset maintainers, 23 channels were removed and 105 were left. Table 1 gives a brief description of both task types in the ZuCo 2.0 dataset.

Table 1. ZuCo-2 dataset description.

                 Normal reading              Task-specific reading
Sentences        349                         390
Sentence length  Mean: 19.6; range: 5...53   Mean: 21.3; range: 5...53
Total words      6828                        8310
Word types       2412                        2437
Word length      Mean: 4.9; range: 1...29    Mean: 4.9; range: 1...21

Normal reading (NR): Normal reading was the first task; participants had to read the sentences naturally, without any specific tasks or instructions. Task-specific reading (TSR) was the second and final task, in which participants had to read sentences with a clearly defined topic, for example politics-, economics-, or science-related texts, or questions on the same topics. Participants were instructed to search for a specific relation, from the list of topics, in each sentence they read. Instead of comprehension questions, the participants had to decide for each sentence whether it contains the relation or not, thereby actively annotating each sentence. All sentences within one block involved the same relation type. Each block started with a test round, which described the relation and was followed by three sample sentences, so that the participants would become familiar with the respective relation type. Event-related potentials were calculated for both tasks, based on fixation timestamps.

In the proposed approach, an event is a fixation on a word, so the term fixation-related potential (FRP) is used instead of ERP. The FRP is usually calculated using a window from 600 ms before the fixation to 1 second after it [7].
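This windowed averaging can be sketched as follows, assuming the EEG is stored as a channels × samples array and fixation onsets are given as sample indices (the 500 Hz rate and the window bounds match the description above; epochs that would fall outside the recording are simply dropped):

```python
import numpy as np

def fixation_related_potential(eeg, fix_samples, sfreq=500,
                               pre_ms=600, post_ms=1000):
    """Average EEG epochs time-locked to fixation onsets.

    eeg: (channels, samples) array; fix_samples: fixation onset indices.
    Computes x_bar(t) = (1/N) * sum_k x(t, k) over the N fixations whose
    [-pre_ms, +post_ms] window lies fully inside the recording.
    """
    pre = int(pre_ms * sfreq / 1000)     # samples before fixation onset
    post = int(post_ms * sfreq / 1000)   # samples after fixation onset
    epochs = [eeg[:, s - pre:s + post]
              for s in fix_samples
              if s - pre >= 0 and s + post <= eeg.shape[1]]
    return np.mean(epochs, axis=0)       # (channels, pre + post)

# 105 channels, 60 s at 500 Hz, three fixation onsets.
eeg = np.random.randn(105, 500 * 60)
frp = fixation_related_potential(eeg, [1000, 4000, 9000])
assert frp.shape == (105, 800)           # 300 pre + 500 post samples
```

Averaging over fixations is what suppresses the noise term: signal components time-locked to the fixation survive, while uncorrelated activity averages toward zero.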
x̄(t) = (1/N) ∑_{k=1}^{N} x(t, k) = s(t) + (1/N) ∑_{k=1}^{N} n(t, k),   (1)

where N is the number of events, k is the event number, t is the time passed after the k-th event (i.e., the position within the time window), s(t) is the expected value of the signal, and n(t, k) is noise. Fixation-related potentials are extracted from the training data because they contain significant information about the human cognitive reaction to each word, and together they provide information about each sentence. In other words, an FRP is an averaged response in the EEG signal related to a certain event in real life. Therefore, information about the human cognitive state during the reading of each sentence is contained in the FRPs, while non-informative data between sentences and words is excluded. After extracting FRPs, the alpha (8-12 Hz), beta (12-30 Hz), gamma (30-45 Hz), theta (4-8 Hz), and delta (0.5-4 Hz) frequency bands are extracted [7]. Extracting frequency bands is a standard method of EEG data preprocessing, because the frequency bands reflect cognitive and memory performance. Furthermore, this approach can be used as a dimensionality reduction method, because raw EEG data is represented by highly correlated multidimensional time series.

4 Proposed Approach

4.1 Feature Extraction

Eye-Tracking Features. Feature selection and preprocessing approaches are well-studied topics in the analysis of physiological data. Usually, word-level characteristics are utilized to solve classification, regression, and other tasks using EEG or eye-tracking modalities.
The following features are usually extracted from raw word-level data (EEG and eye-tracking features for every word in the corpus) [5] [6]:

– number of fixations: moments when the eyes do not move, i.e., are fixated on something;
– mean fixation duration;
– gaze duration: the sum of all fixations on the current word in the first-pass reading, before the eye moves out of the word (in seconds);
– total reading time (TRT): the sum of all fixation durations on the current word;
– first fixation duration (FFD): the duration of the first fixation on the prevailing word;
– go-past time (GPT): the sum of all fixations before the eye moves to the right of the current word.

Among all the features described above that can be extracted from eye-tracking data, gaze features are the least researched. The paper [8] studied how gaze features can be used and properly preprocessed in sequence labelling and sequence classification tasks. The gaze features can simply be concatenated to word-level features as multidimensional vectors representing each word. Several works [9] [10] showed that word-level averages of gaze features help more than token-level features, i.e., word-level features in vectorized representation. Using word-level gaze features does not require gaze at test time, e.g., in the test dataset or in experiments. The features can be used in the same way as word embeddings, and several studies have also successfully concatenated type-level gaze features with pretrained word embeddings for a richer representation [11]. In the context of word embeddings, an embedding is a vectorized representation of the text, which can be obtained using multiple techniques. Concerning EEG embeddings, an embedding means a contraction mapping from one feature space to another. EEG data contains a lot of extra information that is not meaningful for this work.
Consequently, generating EEG embeddings is part of the feature reduction step.

Electroencephalography Features. A significant part of the proposed work is planned to be focused on EEG data. Due to the possibility of simultaneous registration of EEG and eye-tracking data, a variety of word-level brain activity feature extraction approaches can be utilized. In [12] it was demonstrated that EEG correlates of semantic processing can complement and enrich eye-tracking data. The eye-tracking data provides millisecond-accurate fixation times for each word. Therefore, it is possible to obtain brain activity representations during each fixation of a word and then extract event-related potentials for a given word. An event-related potential (ERP) is a response of the brain that is the direct result of a specific event, such as a fixation; more formally, it is any response to a stimulus [13]. In this work, 128 channels of EEG data will be used, and ERPs will be extracted from these 128-channel data. Studying the brain in this way provides a noninvasive means of evaluating brain functioning. In [14], EEG brain activity data was used in NLP tasks: embeddings were generated at the sentence level, and sentence data was padded to the maximum sentence size. This is a common technique when working with multi-modal data, because EEG and eye-tracking data are represented by vectors and tensors of different sizes for each word. It was also shown that EEG data can improve performance on classification tasks in addition to the use of BERT embeddings. As the final step, EEG embeddings will be generated from the obtained fixation-related potentials.

EEG data can also be used to fine-tune attention mechanisms in NLP tasks. In [4], EEG features are used to supervise machine attention. It was also shown that cropping the data with random-forest splits does not reduce model accuracy but considerably reduces the dimensionality of the EEG data.
The authors used a Bidirectional LSTM model with an attention mechanism. Not only EEG features can be used to tune attention weights in neural networks, but eye-tracking features as well. In [10], human attention derived from eye-tracking data was used to regularize attention functions in recurrent neural networks. Baseline models mainly focus on stop words rather than on gaze or fixation information, and the regularization by human attention learned from eye-tracking data enables neural networks (a bidirectional LSTM in this case) to better focus on the most relevant aspects of sentences for the target tasks.

4.2 Architecture

The purpose of this work is to propose an approach that can be applied to large-scale datasets consisting of heterogeneous sequential data. This is a typical case for applying neural networks, when it is problematic to achieve acceptable results with popular statistical methods and the size of the dataset is counted in hundreds of gigabytes. Recurrent Neural Networks (RNNs) are especially suited for sequential data, such as eye-tracking or EEG data. The most popular type of RNN is the LSTM, and its modification, the Bidirectional LSTM (BiLSTM), is usually used to work with eye-tracking data. Powerful but heavy and large architectures, such as BERT or XLM, could also be used to make predictions based on eye-tracking data. Nevertheless, these models must be fine-tuned before use, while a BiLSTM [15] can be trained from scratch using much less data than BERT and XLM require [16]. In the proposed approach, instead of using word embeddings and EEG features as in [15], embeddings generated from EEG features will be used, as well as word embeddings generated with a GPT-2 or BERT model and eye-tracking data embeddings. The eye-tracking embeddings will be taken from one of the final layers of a network that will be trained to predict the next fixation based on the previous ones.
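The idea of reusing late-layer activations as embeddings can be sketched with a toy feed-forward stand-in for the fixation-prediction network. The real model described here is recurrent, and all sizes and weights below are illustrative assumptions; the point is only that the penultimate-layer activations, computed on the way to the fixation probability, double as an eye-tracking embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Toy network: input = concatenated features of the last N fixations,
# output = probability of fixating the next word. Sizes are illustrative.
d_in, d_hidden = 32, 16
W1, b1 = rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden)
w2, b2 = rng.normal(size=d_hidden), 0.0

def forward(x):
    h = relu(x @ W1 + b1)                     # penultimate layer
    p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # fixation probability
    return p, h                               # h doubles as the embedding

x = rng.normal(size=d_in)
p, embedding = forward(x)
assert 0.0 < p < 1.0
assert embedding.shape == (d_hidden,)
```

After training on the fixation-prediction objective, the same forward pass would be run once per word and only the embedding vector kept for the downstream classifier.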
A neural network will be trained to solve a binary classification task: recognition of the cognitive states related to normal reading and task-specific reading. The input pipeline of the neural network consists of fully connected layers for each type of input features. These layers perform dimensionality reduction and transform all data into the same shape for further concatenation. Then the concatenated matrices are passed into BERT, a powerful architecture that is able to work well with long sequences, as in this case. The proposed network architecture is illustrated in Fig. 1. The final goal can be divided into three subtasks:

1. Using a sequence of N previous fixations, the word embeddings for these fixations, and the word embedding of the next word, predict the probability of a fixation on the next word. Take the output-layer activations as embeddings for the eye-tracking data.
2. Generate embeddings for the EEG frequency bands.
3. Using the word embeddings, EEG embeddings, and eye-tracking embeddings, solve various classification tasks on the ZuCo 2.0 corpus.

Fig. 1. Task-classification neural-network architecture

In [6], word embeddings and EEG data were used together to classify participants' reading tasks on the ZuCo dataset. The current paper extends this approach by focusing not only on word embeddings, but also using them to generate embeddings from other parts of the dataset.

5 Preliminary Results

As mentioned above, embeddings of the EEG time series, eye-tracking features, and word-level features will be used as an input batch for the neural network. At present, only the EEG input features have been prepared. A popular approach to generating embeddings is the use of autoencoders. In the current work, an approach that utilizes the hidden states of a Variational Autoencoder (VAE) [17] as embeddings is proposed for the EEG frequency band data. A Variational Autoencoder learns feature distribution parameters, transforms the features into a hidden-state vector, and then reconstructs the original space.
Since the input data is not a random set of time series, it can be described as a large mixture of distributions with different parameters, which is the reason why a VAE was chosen. The general idea of the VAE is presented in the following equation:

log P(X) − KL[Q(Z|X, θ1) ‖ P(Z|X)] = E_{Z∼Q}[log P(X|Z, θ2)] − KL[Q(Z|X, θ1) ‖ P(Z)],

where X is the input data, Z are the hidden states, and Q(Z|X, θ1) and P(X|Z, θ2) are the encoder and decoder distributions, respectively. In the proposed approach, LSTM layers were used in both the encoder and decoder parts of the VAE. The input for the VAE is defined as the following vector:

X = ((α1, β1, γ1, θ1, δ1), . . . , (α105, β105, γ105, θ105, δ105))ᵀ,

where α, β, γ, θ, δ are the EEG frequency bands of the respective channel. The main criterion in reconstruction is to minimize the reconstruction loss and to make the time series generated by the VAE realistic. There is a noticeable difference between the original and reconstructed time series: even with the minimal batch size, it was impossible to train the VAE to reconstruct time series of the same smoothness as the originals. The VAE successfully reconstructed all peaks and learned the structure of the signal spectrum, but because the reconstructed time series are not as smooth as the original ones, it was impossible to achieve a good R2 score. The scores were computed on the full time series over all channels, while Fig. 2 shows only the first 10 channels. The results are presented in Table 2.

Table 2. VAE reconstruction statistics.

        Mean squared error  Mean absolute error  R2
alpha   0.042               0.015                0.40
beta    0.033               0.011                0.60
gamma   0.037               0.012                0.70
theta   0.027               0.006                0.43
delta   0.031               0.007                0.44

To make sure that the generated embeddings capture the information from the original data well enough, the VAE reconstructions are compared with the original data. In the charts in Fig. 2, the EEG frequency bands are presented: the left column shows the original data, and the reconstructed time series are in the right column.

Fig. 2. Reconstruction of EEG frequency bands with Variational Auto-Encoder

6 Conclusions and Future Work

At present, only the EEG signal embedding method has been chosen and tested. Work with the eye-tracking features has started but is still in progress. Future development of this work will include the generation of embeddings from eye-tracking features by sequentially predicting the next fixation or its probability. A neural network architecture based on the multi-modal embeddings will be trained on the available datasets. The main focus will be on the dimensions of the multi-modal embeddings, in order to reduce the padding proportion.

Acknowledgement. This work is supervised by Ivan Shanin, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences.

References

1. Hollenstein, N., Pirovano, F., Zhang, C., Jäger, L., Beinborn, L.: Multilingual Language Models Predict Human Reading Behavior. arXiv preprint arXiv:2104.05433 (2021)
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
3. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzman, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V.: Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint arXiv:1911.02116 (2020)
4. Muttenthaler, L., Hollenstein, N., Barrett, M.: Human brain activity for machine attention. arXiv preprint arXiv:2006.05113 (2020)
5. Hollenstein, N., Rotsztejn, J., Troendle, M., Pedroni, A., Zhang, C., Langer, N.: ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading. Scientific Data, 5(1), 1-13 (2018)
6. Hollenstein, N., Troendle, M., Zhang, C., Langer, N.: ZuCo 2.0: A dataset of physiological recordings during natural reading and annotation. arXiv preprint arXiv:1912.00903 (2019)
7. Kropotov, J.D.: Quantitative EEG, Event-Related Potentials and Neurotherapy. ISBN: 978-0-12-374512-5 (2009)
8. Barrett, M., Hollenstein, N.:
Sequence labelling and sequence classification with gaze: Novel uses of eye-tracking data for Natural Language Processing. Language and Linguistics Compass, 14(11), 1-16 (2020)
9. Culotta, A., McCallum, A., Betz, J.: Integrating probabilistic extraction models and data mining to discover relations and patterns in text. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 296-303 (2006)
10. Barrett, M., Bingel, J., Hollenstein, N., Rei, M., Søgaard, A.: Sequence classification with human attention. In: Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 302-312 (2018)
11. Barrett, M., Keller, F., Søgaard, A.: Cross-lingual transfer of correlations between parts of speech and gaze features. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1330-1339 (2016)
12. Barrett, M., Søgaard, A.: Reading behavior predicts syntactic categories. In: Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pp. 345-349 (2015)
13. Dimigen, O., Sommer, W., Hohlfeld, A., Jacobs, A.M., Kliegl, R.: Coregistration of eye movements and EEG in natural reading: analyses and review. Journal of Experimental Psychology: General, 140(4), 552 (2011)
14. Barrett, M., Bingel, J., Keller, F., Søgaard, A.: Weakly supervised part-of-speech tagging using eye-tracking data. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 579-584 (2016)
15. Hollenstein, N.: Leveraging Cognitive Processing Signals for Natural Language Understanding. Doctoral dissertation, ETH Zurich (2021)
16. Hollenstein, N., Renggli, C., Glaus, B., Barrett, M., Troendle, M., Langer, N., Zhang, C.: Decoding EEG Brain Activity for Multi-Modal Natural Language Processing. arXiv preprint arXiv:2102.08655 (2021)
17. Kingma, D.P., Welling, M.: Auto-encoding variational bayes.
arXiv preprint arXiv:1312.6114 (2013)