
Data Processing Pipeline for Eye-Tracking Analysis

Jennifer Landes

Sonja Köppl

Meike Klettke

Hochschule Neu-Ulm, Germany; University of Regensburg, Germany

The overarching topic of this research project is academic misconduct in online assessments, aiming to understand students' behavior and methods. To gain deeper insights, an eye-tracking experiment was conducted to capture when and how students engage in academic misconduct. Data from this experiment will reveal irregularities in cheating behavior. This paper presents a data engineering pipeline, implemented in Python, that prepares the eye-tracking data for the future analysis, together with the reasoning for the chosen order of steps. The pipeline includes Feature Selection, Data Preparation, Outlier Detection and Treatment, Filtering, Smoothing, and Normalization. We describe the data set, the setting and conduct of the experiment, and the data engineering pipeline. This article contributes to the current discussion of the preprocessing and analysis of eye-tracking data.

Keywords: Data Pipeline, Data Preprocessing, Machine Learning

1. Introduction

Academic misconduct still persists as a challenge to the integrity of higher education, especially within the context of digital examinations. The importance of robust methodologies to detect and prevent such behaviors is further underscored. While much attention is understandably focused on the act of cheating itself, it is equally critical to recognize the significance of preparing data for subsequent analysis.

This paper aims to address this crucial intersection between academic misconduct and data analysis, specifically focusing on the preprocessing pipeline tailored for eye-tracking data collected during an experiment. With the rise of digital assessments, accelerated by the COVID-19 pandemic [1], there is an urgent need to explore cheating behaviors through novel avenues. However, before we can deliver insights, raw data must undergo several preprocessing steps to ensure its quality and usability.

This study presents an approach to prepare eye-tracking data for an upcoming analysis with machine learning models. While the overarching project endeavors to understand and mitigate academic misconduct among students, this specific endeavor delves into the process of data preparation and refinement, a critical prerequisite for meaningful analysis.

The challenge inherent in eye-tracking data adds an additional layer of complexity to the preprocessing task. The nature of eye-tracking data, which often includes noise, calibration errors, and variability between participants, underscores the necessity of rigorous preprocessing to ensure accurate and reliable analysis results. Therefore, addressing these challenges in preparing eye-tracking data is essential for effectively detecting patterns indicative of academic misconduct and advancing our understanding of cheating behaviors in digital examination settings.

The subsequent sections of this paper outline the various preprocessing steps undertaken to cleanse and enhance the eye-tracking dataset, laying the foundation for subsequent analyses. Section 2 provides a comprehensive review of relevant literature, emphasizing the significance of preprocessing. Section 3 delves into the experiment design and dataset characteristics, setting the stage for the preprocessing procedures detailed in Section 4. Finally, Section 5 offers a summary of findings and discusses potential avenues for future research, emphasizing the pivotal role of preprocessing in the broader landscape of academic misconduct detection.

1.1. Prior Work and Motivation

This study is part of the broader project ii.oo (Digitales Kompetenzorientiertes Prüfen implementieren), which aims to address academic misconduct among students in digital examination settings. The primary objective is to delve into the various factors influencing cheating behaviors and to identify the methods students employ.

The project unfolds in distinct phases. Initially, a quantitative survey on the basis of [2] was conducted to gain insights into students' cheating behaviors and the contextual factors influencing these behaviors [3]. The survey encompassed various tasks and cheating scenarios, probing students' preferences and motivations. Drawing from the survey findings, an eye-tracking experiment was designed to gain deeper insights into cheating patterns.

35th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), May 22-24, 2024, Herdecke, Germany
jennifer.landes@hnu.de (J. Landes); sonja.köppl@hnu.de (S. Köppl); meike.klettke@ur.de (M. Klettke)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The eye-tracking experiment involved the collection of data from 20 participants during the examination. This paper focuses specifically on the data preparation phase of the project, which is integral to the subsequent analysis and pattern detection. The preprocessing steps undertaken ensure that the collected data is refined and ready for analysis.

In summary, the project first maps the landscape of academic misconduct through a quantitative survey and then uses an eye-tracking experiment to examine cheating behaviors more closely. The preprocessing of the collected data is a crucial preparatory step, paving the way for meaningful analysis and detection of patterns indicative of academic misconduct.

2. Related Work

The preprocessing literature discusses several steps, such as the detection and treatment of noise, outliers or missing values, feature selection to reduce the dimension, and the important step of normalization. For sensor data such as eye tracking, noise requires the use of filters in the preprocessing stage. These filters are able to eliminate instances in the dataset that may lead to misclassification issues [4]. Methods for the identification and management of outliers are categorized as statistics-based, distance-based, and density-based. A statistics-based method by Huang et al. [5] assumes a statistical model for the dataset, and outliers are detected using statistical tests. Other methods, such as that of Buzzi-Ferraris and Manenti [6], also evaluate mean, variance, and outlier values. For large datasets, Angiulli and Pizzuti [7] introduced a distance-based outlier detection algorithm, HilOut, to identify the top outliers in a dataset. HilOut computes the weight of a point as the sum of distances to its k-nearest neighbors, identifying outliers as the points with the highest weight. Ghoting et al. [8] present the RBRP algorithm for mining distance-based outliers in high-dimensional datasets.

Handling missing values stands out as a critical challenge during data preprocessing. Zhang et al. [9] proposed the NIIA imputation approach, an iterative scheme imputing missing data using information within incomplete instances. Luengo et al. [10] addressed the missing value problem through various imputation methods, focusing on a classification task and demonstrating improved accuracy with specific imputation methods. Lobato et al. [11] presented a solution by combining evolutionary computation techniques, specifically genetic algorithms (GA), for data imputation. Their multi-objective GA, named MOGAImp, is designed for mixed-attribute datasets.

Data normalization is important for classifiers, neural networks and SVMs and comprises methods such as Min-Max normalization, Z-score normalization, and unit length scaling. These techniques are essential for ensuring optimal classifier performance, especially when dealing with large differences between feature values [4]. Discretization methods, discussed by Dougherty et al. [12], include equal size and equal frequency methods. These methods are relevant for handling continuous attributes in the preprocessing phase. Additionally, the significance of feature selection in preprocessing cannot be overstated. It is a crucial step, for example for the k-NN procedure, and involves identifying and eliminating irrelevant features, thereby reducing data dimensionality. The selected literature gives an insight into current preprocessing steps and strategies, which inform the choice and detailed design of the preprocessing steps that follow.

3. Experiment and Dataset

The following sections provide a detailed description of the experimental design, the question types, and the cheating methods allowed during the test.

3.1. Eye-Tracking Technology

With eye-tracking it is possible to record, measure and analyze a person's eye movements to get insights into the visual focus and gaze patterns. It is used in psychology, user experience, and marketing. The resulting data provides an understanding of visual interaction with stimuli, which can be explored with analytic techniques. By employing tools such as heatmaps and saliency maps, the data can be visualized. During the recording, data such as fixations, saccades and their durations are captured in a structured format for analysis [13].

3.2. Experimental Design

The experiments were conducted in the Eye-Tracking Laboratory at Hochschule Neu-Ulm in two data collection sessions on May 9th and June 15th, 2023, with 20 bachelor students from Prof. Sonja Köppl's lecture in Industrial Engineering, ranging from 1st to 5th semester. Two external Tobii eye-tracking devices are employed in the laboratory. The students are organized into groups of five, while one participant serves as a supervisor. This arrangement is implemented to simulate a real examination scenario. During the experiment, participants choose from three methods of cheating: first, a cheat sheet containing information relevant to the test questions; second, a mobile phone; and third, collaboration with the neighbor during the test. After each section of the test, the participants mark which cheating method they have used. They also have the option to mark that they have not cheated. The participants complete a 20-minute test comprising five distinct question types, with 2.5 to 4 minutes per question. The time limit per task was controlled by the software iMotions.

3.3. Dataset and Feature Selection

Feature selection (FS) is an important step, especially for k-NN, SVMs and neural network training. FS identifies irrelevant and redundant features and reduces the dimensionality of the data to enhance efficiency. Features are generally categorized into relevant, irrelevant and redundant. A typical approach combines a selection algorithm, which generates candidate feature subsets in search of an optimal subset, with an evaluation algorithm that assesses the quality of each proposed subset by returning a 'measure of goodness' to the selection algorithm [15].

The recording of eye movements is conducted using the iMotions software. The original output dataset includes several features; the selected features are shown in Table 1. The features are chosen with a focus on their significance for analysis. This selection is an iterative process: as the analysis progresses, the set of features may be adjusted. The data set from 20 participants is split by task, which results in a total data size of 100 (20 participants, five tasks each).

4. Preprocessing

[...] early in the process. Outliers, which represent unusual eye movement data points, can significantly impact the analysis if not properly addressed. By identifying and addressing outliers early on, the preprocessing pipeline ensures the integrity of the dataset and enhances the accuracy of subsequent analyses. Following outlier detection, a low-pass filter is employed to reduce high-frequency noise or rapid fluctuations in the eye-tracking data. This step is essential for smoothing out erratic variations in the data, thereby improving its overall quality and coherence. Additionally, data smoothing is applied to further reduce noise, particularly in cases where minor fluctuations in eye-tracking measurements may obscure underlying patterns. Smoother data enhances the visibility of meaningful patterns and facilitates more accurate analysis. Towards the end of the preprocessing pipeline, normalization is performed to scale the data to a common range. Normalization is crucial for ensuring that all features are on a level playing field, facilitating easier comparison between different features or datasets. This step enhances the effectiveness of subsequent analyses by standardizing the data and mitigating the impact of varying scales.

4.1. Data Cleaning

In the data cleaning process, several steps are followed to ensure the dataset's integrity. The initial step standardizes column names by removing any leading or trailing whitespace. The values of several columns with details of the technical specification are deleted or converted to a numeric format, with non-numeric values being coerced into NaN (Not-a-Number) values. Rows containing NaN values are eliminated from the dataset. Data is filtered based on the validity columns for the left and the right eye, retaining only the rows where data for both eyes are valid (both columns have a value of 1). Furthermore, filters are applied to the two duration columns, preserving rows where these durations fall within predefined minimum and maximum thresholds. Any duplicate columns present in the DataFrame are eliminated [17].

4.2. Missing Values
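As a hedged illustration of how the cleaning steps of Section 4.1, including the coercion of non-numeric entries to NaN and the removal of incomplete rows, might look in pandas, consider the following minimal sketch. The function name `clean_gaze_frame`, the column names, and the duration thresholds are hypothetical placeholders chosen for the example and are not taken from the original dataset. Duplicate columns are dropped early here so that column selection is unambiguous, which differs from the order in which the steps are listed in the text.

```python
import pandas as pd


def clean_gaze_frame(df, numeric_cols, validity_cols, duration_limits):
    """Sketch of the cleaning steps; all column names are supplied by the caller."""
    out = df.copy()
    # Standardize column names by stripping leading/trailing whitespace.
    out.columns = out.columns.str.strip()
    # Drop duplicate columns, keeping the first occurrence of each name.
    out = out.loc[:, ~out.columns.duplicated()].copy()
    # Coerce non-numeric entries to NaN, then drop rows that contain NaN.
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out = out.dropna(subset=numeric_cols)
    # Keep only rows where every validity column equals 1 (both eyes valid).
    for col in validity_cols:
        out = out[out[col] == 1]
    # Keep rows whose durations lie within the given [minimum, maximum] range.
    for col, (low, high) in duration_limits.items():
        out = out[out[col].between(low, high)]
    return out.reset_index(drop=True)


# Hypothetical toy data: only the first row passes all filters.
raw = pd.DataFrame({
    " Gaze X ": ["10", "20", "bad", "40"],
    "Validity Left": [1, 1, 1, 0],
    "Validity Right": [1, 1, 1, 1],
    "Fixation Duration": [150, 5000, 200, 180],
})
cleaned = clean_gaze_frame(
    raw,
    numeric_cols=["Gaze X"],
    validity_cols=["Validity Left", "Validity Right"],
    duration_limits={"Fixation Duration": (100, 1000)},
)
print(cleaned)
```

Dropping incomplete rows is the simplest strategy and discards data; whether imputation is preferable for individual features depends on the later analysis.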

4.3. Outlier

Outliers also impact dataset integrity. Outlier detection makes it possible to identify and treat data points that deviate significantly from the majority of observations. Such outliers can arise from various factors, including measurement errors, participant distractions, or genuine deviations in gaze behavior. The methods for identifying outliers are categorized into statistics-based, distance-based, and density-based methods [18].

For this data set, the method used for identifying outliers is the "Z-score" of a data point, which measures the number of standard deviations by which the data point deviates from the mean. A high Z-score suggests that the data point may be a potential outlier.

\[ z = \frac{x - \mu}{\sigma} \]

• $z$ - Z-score.
• $x$ - individual data point.
• $\mu$ - mean of the data.
• $\sigma$ - standard deviation of the data.

[Figure: ... acceleration by one student completing the task type ...]
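The Z-score rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `zscore_outlier_mask` and the threshold of 3 standard deviations are assumptions, since the text does not state which cutoff or which treatment (removal, clipping, replacement) was used.

```python
import numpy as np


def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation
    if std == 0:
        # All values are identical, so no point deviates from the mean.
        return np.zeros(values.shape, dtype=bool)
    z = (values - mean) / std
    return np.abs(z) > threshold


# Hypothetical gaze coordinates: twenty stable samples and one extreme value.
gaze_x = np.array([500.0] * 20 + [5000.0])
mask = zscore_outlier_mask(gaze_x)
gaze_x_clean = gaze_x[~mask]  # one possible treatment: drop flagged points
print(mask.sum(), gaze_x_clean.size)
```

Because the mean and standard deviation are themselves influenced by extreme values, a single pass can miss outliers in small samples; a lower threshold or a median-based variant are common alternatives.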

4.4. Low-pass Filter

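Eye-tracking data is affected by noise, which can impact the accuracy of results. Filtering with a low-pass filter provides a method to reduce noise and enhance fixation stability. In the case of eye-tracking data, a low-pass filter attenuates high-frequency components while preserving low-frequency components to identify fixations and reduce noise. A commonly used low-pass filter is the Butterworth filter, an IIR (infinite impulse response) filter used in signal processing [20]. The following excerpt shows the filter design; cutoff_freq and nyquist_freq are defined in the part of the listing that is not reproduced here.

    from scipy import signal

    # Normalize the cutoff frequency by the Nyquist frequency
    normal_cutoff = cutoff_freq / nyquist_freq
    # Create a 4th-order low-pass Butterworth filter
    b, a = signal.butter(4, normal_cutoff, btype='low')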
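To make the filter design concrete, the following self-contained sketch applies a 4th-order low-pass Butterworth filter to a synthetic signal. The sampling rate of 60 Hz, the cutoff of 5 Hz, the function name `butter_lowpass`, and the use of zero-phase filtering via `signal.filtfilt` are assumptions made for illustration; the text does not specify these values or how the coefficients were applied.

```python
import numpy as np
from scipy import signal


def butter_lowpass(samples, cutoff_freq, sampling_rate, order=4):
    """Design a low-pass Butterworth filter and apply it forward and backward."""
    nyquist_freq = 0.5 * sampling_rate           # highest representable frequency
    normal_cutoff = cutoff_freq / nyquist_freq   # cutoff as a fraction of Nyquist
    b, a = signal.butter(order, normal_cutoff, btype="low")
    # filtfilt runs the filter in both directions, which avoids a phase shift.
    return signal.filtfilt(b, a, samples)


# Synthetic example: a slow 1 Hz movement plus 25 Hz jitter, sampled at 60 Hz.
sampling_rate = 60.0
t = np.arange(240) / sampling_rate
clean = np.sin(2 * np.pi * 1.0 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 25.0 * t)
filtered = butter_lowpass(noisy, cutoff_freq=5.0, sampling_rate=sampling_rate)

err_noisy = np.mean(np.abs(noisy - clean))
err_filtered = np.mean(np.abs(filtered - clean))
print(err_noisy, err_filtered)
```

The cutoff has to stay below the Nyquist frequency, and a cutoff that is too low would also flatten rapid but genuine eye movements such as saccades.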

4.5. Smoothing

The smoothing is done with the Fourier Transformation to manipulate signals in the frequency domain. Fourier Transformation decomposes the original signal into sinusoidal components, each characterized by a specific frequency. It is defined by the following formula for continuous signals:

\[ F(\omega) = \int_{-\infty}^{\infty} f(t) \, e^{-i \omega t} \, dt \]

• $F(\omega)$ - frequency-domain representation.
• $f(t)$ - time-domain signal.
• $\omega$ - angular frequency ($2\pi$ times the frequency).

When applied to eye-tracking data, it converts the temporal gaze coordinates into the frequency domain, where a cutoff frequency is selected. Unwanted frequency components are identified by their frequency characteristics and reduced in influence, preserving fixation-related components. The filtered data is then transformed back to the time domain for further analysis. This method is chosen for eye-tracking data due to its ability to analyze both periodic and non-periodic signals, making it suitable for data containing a mix of fixations and noise. The Gaze X and Gaze Y columns are cleaned by removing date entries and rows with NaN values. Fourier transformation is applied to both columns to smooth the data, reducing high-frequency noise [21], [22].

    from scipy.fft import fft, ifft  # assumption: numpy.fft provides the same functions

    # Smoothing using Fourier transformation
    x = df['Gaze X'].to_numpy()
    y = df['Gaze Y'].to_numpy()
    # Note: components above the cutoff must be attenuated between fft and ifft;
    # an unmodified round trip returns the input signal.
    x_smoothed = ifft(fft(x))
    y_smoothed = ifft(fft(y))
    df['Gaze X Smoothed'] = x_smoothed.real
    df['Gaze Y Smoothed'] = y_smoothed.real
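The frequency-domain smoothing described above, in which components above a cutoff are suppressed before transforming back, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `fourier_smooth`, the hard cutoff (setting coefficients to zero rather than merely reducing them), the sampling rate of 60 Hz, and the cutoff of 5 Hz are all hypothetical choices, not values from the text.

```python
import numpy as np


def fourier_smooth(samples, cutoff_freq, sampling_rate):
    """Zero all frequency components above cutoff_freq and transform back."""
    samples = np.asarray(samples, dtype=float)
    spectrum = np.fft.rfft(samples)  # real-input FFT: non-negative frequencies only
    freqs = np.fft.rfftfreq(samples.size, d=1.0 / sampling_rate)
    spectrum[freqs > cutoff_freq] = 0.0  # hard low-pass mask
    return np.fft.irfft(spectrum, n=samples.size)


# Synthetic example: a slow 1 Hz movement plus 25 Hz jitter, sampled at 60 Hz.
sampling_rate = 60.0
t = np.arange(240) / sampling_rate
clean = np.sin(2 * np.pi * 1.0 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 25.0 * t)
smoothed = fourier_smooth(noisy, cutoff_freq=5.0, sampling_rate=sampling_rate)
print(np.max(np.abs(smoothed - clean)))
```

In this constructed case both frequencies fall exactly on FFT bins, so the jitter is removed almost perfectly; real gaze data spreads energy across bins, and a hard cutoff can introduce ringing around abrupt changes such as saccades.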

4.6. Normalization

Disparities between feature values need careful treatment to ensure that all attribute values are on comparable scales. Data normalization is crucial for various classifiers, including neural networks, SVMs, k-NN algorithms, and fuzzy classifiers. The primary normalization methods for addressing this issue are Min-Max normalization (feature scaling to [0, 1] or [a, b]), Z-score normalization (standardization), and unit length scaling. By normalizing eye-tracking data, different stimulus presentation durations are accounted for, allowing gaze data to exist within a uniform coordinate system [4].

Here, the normalization is performed with the MinMaxScaler from scikit-learn. Min-Max scaling was chosen for normalizing eye-tracking data due to its simplicity, its interpretability, and its preservation of the shape of the data distribution. Because the minimum and maximum are sensitive to extreme values, it is applied after the outlier treatment. The resulting normalized gaze coordinates are represented as values between 0 and 1. For each gaze coordinate (X or Y) in the dataset:

\[ X_{\mathrm{normalized}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \]

• $X$ - original gaze coordinate.
• $X_{\min}$ - minimum value of coordinate.
• $X_{\max}$ - maximum value of coordinate.
• $X_{\mathrm{normalized}}$ - normalized gaze coordinate.

    from sklearn.preprocessing import MinMaxScaler

    # Normalize the selected numeric columns
    scaler = MinMaxScaler()
    data[numeric_columns] = scaler.fit_transform(data[numeric_columns])

5. Conclusion

In this paper, a data preparation process is designed and described using Python in preparation for the upcoming analysis of eye-tracking data. The process consists of several steps. First, a feature selection is performed to identify the features most relevant for analysis. The data cleaning phase comprises data preparation to handle any missing values and to ensure data quality. One of the characteristics of eye-tracking data is its inherent noise and the presence of outliers. To address these issues, an outlier detection and treatment step is conducted, which helps in mitigating the effects of extreme data points. Additionally, low-pass filtering and data smoothing are applied to enhance the interpretability of the gaze data. As eye-tracking data often contains high-frequency fluctuations, effective filtering helps extract meaningful insights. Normalization is an essential step to ensure that the data is on a consistent scale for comparisons. The resulting clean and processed dataset is then ready for the upcoming in-depth analysis.

5.1. Future Research

The next phase of the research is to evaluate and optimize the preprocessing steps applied to the eye-tracking data. This means experimenting with different sequences of these steps to ensure that the resulting cleansed data is primed for analysis. This iterative process will require further investigation and refinement, with the goal of achieving optimal data quality. A further aim is to generalize the results on the ordering of the steps, so that the preprocessing pipeline is also suitable for preparing other eye-tracking data for analysis. The first observations reveal that the filtering and smoothing process needs to be refined so that no data is lost. Furthermore, the handling of missing data remains a detailed process, so different strategies will be applied to different features.

Looking ahead to the future analyses, the choice of analytical techniques and the research questions at hand will determine what can be revealed about patterns and predictions of cheating behaviour. Subsequently, the focus will shift towards the analysis phase, where K-Means and classifiers like random forest or SVM will be employed and compared.

The study is limited by the current sample size: the data set comprises sensor data from 20 participants, split by task into 100 data sets in total. In the near future, more data will be collected in upcoming experiments. Furthermore, the pipeline will be tested on other data sets, so that its generalizability can be assessed.

[...]plications, Prentice-Hall, Inc, USA, 1996.
[22] J. Makhoul, A fast cosine transform in one and two dimensions, IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (1980) 27–34. URL: http://ieeexplore.ieee.org/document/1163351/. doi:10.1109/TASSP.1980.1163351.