<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Processing Pipeline for Eye-Tracking Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jennifer Landes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonja Köppl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meike Klettke</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hochschule Neu-Ulm</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Regensburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The overarching topic of this research project is academic misconduct in online assessments, aiming to understand students' behavior and methods. To gain deeper insights, an eye-tracking experiment was conducted to capture when and how students engage in academic misconduct. Data from this experiment will reveal irregularities in cheating behavior. This paper presents a data engineering pipeline, implemented in Python, that prepares the eye-tracking data for future analysis, together with the reasoning for the chosen order of steps. The pipeline includes Feature Selection, Data Preparation, Outlier Detection and Treatment, Filtering, Smoothing, and Normalization. We describe the data set, the setting and conduct of the experiment, and the data engineering pipeline. This article contributes to the current discussion of the preprocessing and analysis of eye-tracking data.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Pipeline</kwd>
        <kwd>Data Preprocessing</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Academic misconduct still persists as a challenge to the integrity of higher education, especially within the context of digital examinations. The importance of robust methodologies to detect and prevent such behaviors is further underscored. While much attention is understandably focused on the act of cheating itself, it's equally critical to recognize the significance of preparing data for subsequent analysis.</p>
      <p>This paper aims to address this crucial intersection between academic misconduct and data analysis, specifically focusing on the preprocessing pipeline tailored for eye-tracking data collected during an experiment. With the rise of digital assessments, accelerated by the COVID-19 pandemic [1], there's an urgent need to explore cheating behaviors through novel avenues. However, before we can deliver insights, raw data must undergo several preprocessing steps to ensure its quality and usability.</p>
      <p>This study presents an approach to prepare eye-tracking data for an upcoming analysis with machine learning models. While the overarching project endeavors to understand and mitigate academic misconduct among students, this specific endeavor delves into the process of data preparation and refinement, a critical prerequisite for meaningful analysis.</p>
      <p>The challenge inherent in eye-tracking data adds an additional layer of complexity to the preprocessing task. The nature of eye-tracking data, which often includes noise, calibration errors, and variability between participants, underscores the necessity of rigorous preprocessing to ensure accurate and reliable analysis results. Therefore, addressing these challenges in preparing eye-tracking data is essential for effectively detecting patterns indicative of academic misconduct and advancing our understanding of cheating behaviors in digital examination settings.</p>
      <p>The subsequent sections of this paper outline the various preprocessing steps undertaken to cleanse and enhance the eye-tracking dataset, laying the foundation for subsequent analyses. Chapter 2 provides a comprehensive review of relevant literature, emphasizing the significance of preprocessing. Chapter 3 delves into the experiment design and dataset characteristics, setting the stage for the preprocessing procedures detailed in Chapter 4. Finally, Chapter 5 offers a summary of findings and discusses potential avenues for future research, emphasizing the pivotal role of preprocessing in the broader landscape of academic misconduct detection.</p>
      <p>35th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), May 22-24, 2024, Herdecke, Germany. jennifer.landes@hnu.de (J. Landes); sonja.köppl@hnu.de (S. Köppl); meike.klettke@ur.de (M. Klettke). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <sec>
        <title>1.1. Prior Work and Motivation</title>
        <p>This study is part of the broader project ii.oo (Digitales Kompetenzorientiertes Prüfen implementieren), which aims to address academic misconduct among students in digital examination settings. The primary objective is to delve into the various factors influencing cheating behaviors and to identify the methods students employ. The project unfolds in distinct phases. Initially, a quantitative survey on the basis of [2] was conducted to gain insights into students' cheating behaviors and the contextual factors influencing these behaviors [3]. The survey encompassed various tasks and cheating scenarios, probing students' preferences and motivations. Drawing from the survey findings, an eye-tracking experiment was designed to gain deeper insights into cheating patterns.</p>
        <p>The eye-tracking experiment involved the collection of data from 20 participants during the examination. This paper focuses specifically on the data preparation phase of the project, which is integral to the subsequent analysis and pattern detection. The preprocessing steps undertaken ensure that the collected data is refined and ready for analysis.</p>
        <p>In summary, the project is structured to first understand the landscape of academic misconduct through a quantitative survey, followed by the design and implementation of an eye-tracking experiment to delve deeper into cheating behaviors. The preprocessing of collected data is a crucial preparatory step, paving the way for meaningful analysis and detection of patterns indicative of academic misconduct.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The preprocessing literature discusses several steps such as the detection and treatment of noise, outliers or missing values, feature selection to reduce the dimension, and the important step of normalizing. In the case of sensing data like eye tracking, the challenge of noise requires the use of filters in the preprocessing stage. These filters are able to eliminate instances in the dataset which may lead to misclassification issues [4].</p>
      <p>Methods for the identification and management of outliers are categorized as statistics-based, distance-based, and density-based. A statistics-based method by Huang et al. [5] assumes a statistical model for the dataset, and outliers are detected using statistical tests. Other methods, such as the one by Buzzi-Ferraris and Manenti [6], also evaluate mean, variance, and outlier values. For large datasets, Angiulli and Pizzuti [7] introduced a distance-based outlier detection algorithm, HilOut, to identify the top outliers in a dataset. HilOut computes the weight of a point as the sum of distances to its k-nearest neighbors, identifying outliers as points with the highest weight. An approach by Ghoting et al. [8] presents the RBRP algorithm for mining distance-based outliers in high-dimensional datasets.</p>
      <p>Handling missing values stands out as a critical challenge during data preprocessing. Zhang et al. [9] proposed the NIIA imputation approach, an iterative scheme imputing missing data using information within incomplete instances. Luengo et al. [10] addressed the missing value problem through various imputation methods, focusing on a classification task and demonstrating improved accuracy with specific imputation methods. Lobato et al. [11] presented a solution by combining evolutionary computation techniques, specifically genetic algorithms (GA), for data imputation. Their multi-objective GA, named MOGAImp, is designed for mixed-attribute datasets.</p>
      <p>Data normalization is important for classifiers, neural networks and SVMs and comprises methods such as Min-Max normalization, Z-score normalization, and unit length scaling. These techniques are essential for ensuring optimal classifier performance, especially when dealing with large differences between feature values [4]. Discretization methods, discussed by Dougherty et al. [12], include equal-size and equal-frequency methods. These methods are relevant for handling continuous attributes in the preprocessing phase. Additionally, the significance of feature selection in preprocessing cannot be overstated. It is a crucial step, for example for the k-NN procedure, and involves identifying and eliminating irrelevant features, thereby reducing data dimensionality. The selected literature gives an insight into current preprocessing steps and their strategies, which informs the upcoming choice of preprocessing steps and their detailed processing.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiment and Dataset</title>
      <p>The following sections provide a detailed description of the experimental design, the question types, and the cheating methods allowed during the test.</p>
      <sec>
        <title>3.1. Eye-Tracking Technology</title>
        <p>With eye-tracking it is possible to record, measure and analyze a person's eye movements to get insights into the visual focus and gaze patterns. It is used in psychology, user experience, and marketing. The resulting data provides an understanding of visual interaction with stimuli, which can be explored with analytic techniques. By employing tools such as heatmaps and saliency maps, the data can be visualized. During the recording, data such as fixations, saccades and their durations are captured in a structured format for analysis [13].</p>
      </sec>
      <sec>
        <title>3.2. Experimental Design</title>
        <p>The experiments were conducted in the Eye-Tracking Laboratory at Hochschule Neu-Ulm in two data collection sessions on May 9th and June 15th, 2023 with 20 bachelor students from Prof. Sonja Köppl's lecture in Industrial Engineering, ranging from 1st to 5th semester. Two external Tobii eye-tracking devices are employed in the laboratory. The students are organized into groups of five, while one participant serves as a supervisor. This arrangement is implemented to simulate a real examination scenario. During the experiment, participants choose from three methods of cheating: first, a cheat sheet containing information relevant to the test questions; second, a mobile phone; and third, collaboration with the neighbor during the test. After each section of the test, the participants mark which cheating method they have used. They also have the option to mark that they have not cheated. The participants complete a 20-minute test consisting of five distinct question types, with 2.5 to 4 minutes per question. The time limit per task was controlled by the software iMotions.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.3. Dataset and Feature Selection</title>
        <p>Feature selection (FS) is an important step, especially for k-NN, SVMs and neural network training. FS identifies irrelevant and redundant features and reduces the dimensionality of the data to enhance efficiency. Features are generally categorized into relevant, irrelevant and redundant. A typical approach combines a selection algorithm, which generates candidate feature subsets in search of an optimal subset, with an evaluation algorithm that assesses the quality of each proposed subset and returns a 'measure of goodness' to the selection algorithm [15].</p>
        <p>The recording of eye movements is conducted using the iMotions software. The original output dataset includes several features; the selected features are shown in Table 1. The features are chosen with a focus on their significance for analysis. This selection is an iterative process: as the analysis progresses, the set of features may be adjusted. The data set from 20 participants is split by task, resulting in a total data size of 100 (20 participants, five tasks each).
</p>
        <p>early in the process. Outliers, which represent unusual eye movement data points, can significantly impact the analysis if not properly addressed. By identifying and addressing outliers early on, the preprocessing pipeline ensures the integrity of the dataset and enhances the accuracy of subsequent analyses. Following outlier detection, a low-pass filter is employed to reduce high-frequency noise or rapid fluctuations in the eye-tracking data. This step is essential for smoothing out erratic variations in the data, thereby improving its overall quality and coherence. Additionally, data smoothing is applied to further reduce noise, particularly in cases where minor fluctuations in eye-tracking measurements may obscure underlying patterns. Smoother data enhances the visibility of meaningful patterns and facilitates more accurate analysis. Towards the end of the preprocessing pipeline, normalization is performed to scale the data to a common range. Normalization is crucial for ensuring that all features are on a level playing field, facilitating easier comparison between different features or datasets. This step enhances the effectiveness of subsequent analyses by standardizing the data and mitigating the impact of varying scales.</p>
        <sec>
          <title>4.1. Data Cleaning</title>
          <p>In the data cleaning process, several steps are followed to ensure the dataset's integrity. The initial step standardizes column names by removing any leading or trailing whitespace. The values of several columns with details of technical specification are deleted or converted to a numeric format, with non-numeric values being coerced into NaN (Not-a-Number) values. Rows containing NaN values are eliminated from the dataset. Data is filtered based on the two per-eye validity columns, retaining only the rows where data for both eyes are valid (both columns have a value of 1). Furthermore, filters are applied to the two duration columns, preserving rows where these durations fall within predefined minimum and maximum thresholds. Any duplicate columns present in the DataFrame are eliminated [17].</p>
        </sec>
        <sec>
          <title>4.2. Missing Values</title>
        </sec>
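        <p>The cleaning steps described in Section 4.1, including the coercion of non-numeric entries to NaN and the removal of rows with missing values, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name <monospace>clean_gaze_data</monospace>, the column names and the duration limits are hypothetical, since the original column names are not recoverable from the text. Duplicate columns are dropped early here only so that column access stays unambiguous.</p>
        <preformat><![CDATA[
```python
import pandas as pd


def clean_gaze_data(df, numeric_cols, validity_cols, duration_limits):
    """Apply the cleaning steps of Section 4.1 to a raw gaze DataFrame.

    All column names passed in are assumptions made for illustration.
    """
    df = df.copy()
    # Standardize column names by stripping leading/trailing whitespace.
    df.columns = df.columns.str.strip()
    # Drop duplicate columns (keeps the first occurrence of each name).
    df = df.loc[:, ~df.columns.duplicated()].copy()
    # Convert to numeric; non-numeric entries become NaN.
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Remove rows that contain NaN in the numeric columns.
    df = df.dropna(subset=numeric_cols)
    # Keep only rows where every validity column equals 1.
    for col in validity_cols:
        df = df[df[col] == 1]
    # Keep only rows whose durations lie within [low, high].
    for col, (low, high) in duration_limits.items():
        df = df[df[col].between(low, high)]
    return df.reset_index(drop=True)


# Tiny synthetic example with hypothetical column names.
raw = pd.DataFrame({
    " Gaze X ": ["100", "abc", "300", "400"],
    "Valid Left": [1, 1, 0, 1],
    "Valid Right": [1, 1, 1, 1],
    "Fixation Duration": [200, 250, 300, 5000],
})
cleaned = clean_gaze_data(
    raw,
    numeric_cols=["Gaze X", "Fixation Duration"],
    validity_cols=["Valid Left", "Valid Right"],
    duration_limits={"Fixation Duration": (50, 1000)},
)
```
]]></preformat>
        <p>In this toy example only the first row survives: the second row has a non-numeric gaze value, the third fails the validity check, and the fourth exceeds the assumed duration limit.</p>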
      </sec>
      <sec id="sec-3-2">
        <title>4.3. Outlier</title>
        <p>Outliers also impact dataset integrity. Outlier detection makes it possible to identify and treat data points that deviate significantly from the majority of observations. These outliers can arise due to various factors, including measurement errors, participant distractions, or genuine deviations in gaze behavior. The methods for identifying outliers are categorized into statistics-based, distance-based, and density-based methods [18].</p>
        <p>For this data set, the method used for identifying outliers is the Z-score of a data point, which measures the number of standard deviations by which the data point deviates from the mean. A high absolute Z-score suggests that the data point may be a potential outlier.
<disp-formula><tex-math>Z = \frac{X - \mu}{\sigma}</tex-math></disp-formula></p>
        <p>• Z - Z-score.
• X - individual data point.
• μ - mean of the data.
• σ - standard deviation of the data.
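The Z-score rule above can be illustrated with a short sketch. The threshold of 3 standard deviations and the function name are assumptions chosen for illustration; the text does not state which threshold or treatment the pipeline applies.
<preformat><![CDATA[
```python
import numpy as np


def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask marking points with |Z| above the threshold.

    The threshold is an assumed, commonly used default.
    """
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # A constant signal has no outliers under this rule.
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold


# Synthetic gaze coordinates: 20 stable samples and one extreme value.
gaze_x = np.array([500.0] * 20 + [5000.0])
mask = zscore_outlier_mask(gaze_x)
gaze_x_without_outliers = gaze_x[~mask]
```
]]></preformat>
Here only the last sample is flagged; whether flagged points are removed, clipped or imputed is a separate design decision.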
</p>
        <p>Figure (caption only partially recoverable): ... acceleration by one student completing the task type ...</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.4. Low-pass Filter</title>
        <p>Eye-tracking data is affected by noise, which can impact the accuracy of results. Filtering with a low-pass filter provides a method to reduce noise and enhance fixation stability. In the case of eye-tracking data, a low-pass filter attenuates high-frequency components while preserving low-frequency components to identify fixations and reduce noise. A commonly used low-pass filter is the Butterworth filter, an IIR (infinite impulse response) filter used in signal processing [20].</p>
        <preformat>from scipy import signal

# cutoff_freq and nyquist_freq are defined in the earlier lines of the listing (not shown)
normal_cutoff = cutoff_freq / nyquist_freq
# Create a 4th-order low-pass Butterworth filter
b, a = signal.butter(4, normal_cutoff, btype='low')</preformat>
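        <p>To make the filter design concrete, the following sketch designs a 4th-order Butterworth low-pass filter and applies it to a synthetic signal. The sampling rate (60 Hz), the cutoff (5 Hz) and the use of zero-phase filtering via <monospace>scipy.signal.filtfilt</monospace> are assumptions for illustration; the text does not state these values or how the filter coefficients are applied.</p>
        <preformat><![CDATA[
```python
import numpy as np
from scipy import signal


def butter_lowpass(values, sampling_rate_hz, cutoff_hz, order=4):
    """Low-pass filter a 1-D signal with a Butterworth filter."""
    nyquist_freq = 0.5 * sampling_rate_hz
    # butter() expects the cutoff as a fraction of the Nyquist frequency.
    normal_cutoff = cutoff_hz / nyquist_freq
    b, a = signal.butter(order, normal_cutoff, btype="low")
    # filtfilt runs the filter forward and backward (zero phase shift).
    return signal.filtfilt(b, a, values)


# Synthetic example: slow 0.5 Hz movement plus 20 Hz jitter, 10 s at 60 Hz.
t = np.arange(600) / 60.0
slow = np.sin(2 * np.pi * 0.5 * t)
noisy = slow + 0.5 * np.sin(2 * np.pi * 20.0 * t)
filtered = butter_lowpass(noisy, sampling_rate_hz=60.0, cutoff_hz=5.0)
```
]]></preformat>
        <p>The slow component passes almost unchanged while the 20 Hz jitter is strongly attenuated; a suitable cutoff for real gaze data depends on the device sampling rate.</p>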
      </sec>
      <sec id="sec-3-4">
        <title>4.5. Smoothing</title>
        <sec id="sec-3-4-1">
          <p>The smoothing is done with the Fourier transformation to manipulate signals in the frequency domain. The Fourier transformation decomposes the original signal into sinusoidal components, each characterized by a specific frequency. It is defined by the following formula for continuous signals:
<disp-formula><tex-math>F(\omega) = \int_{-\infty}^{\infty} f(t) \cdot e^{-i \omega t} \, dt</tex-math></disp-formula>
• F(ω) - frequency-domain representation.
• f(t) - time-domain signal.
• ω - angular frequency (2π times the frequency).</p>
          <p>When applied to eye-tracking data, it converts temporal gaze coordinates into the frequency domain, where a cutoff frequency is selected. Unwanted frequency components are identified by their frequency characteristics and reduced in influence, preserving fixation-related components. The filtered data is then transformed back to the time domain for further analysis. This method is chosen for eye-tracking data due to its ability to analyze both periodic and non-periodic signals, making it suitable for data containing a mix of fixations and noise. The Gaze X and Gaze Y columns are cleaned by removing date entries and rows with NaN values. Fourier transformation is applied to both columns to smooth the data, reducing high-frequency noise [22], [21].</p>
          <preformat>from numpy.fft import fft, ifft

# Smoothing using Fourier transformation
x = df['Gaze X'].to_numpy()
y = df['Gaze Y'].to_numpy()
# Forward and inverse transform; as listed, no coefficients are attenuated
# in between, so the round trip reproduces the input signal.
x_smoothed = ifft(fft(x))
y_smoothed = ifft(fft(y))
# Keep the real part (the imaginary part is numerical round-off)
df['Gaze X Smoothed'] = x_smoothed.real
df['Gaze Y Smoothed'] = y_smoothed.real</preformat>
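          <p>The attenuation of unwanted frequency components between the forward and inverse transform can be sketched as follows. This is one simple variant (a hard cutoff that zeroes all components above a chosen frequency) and not necessarily the authors' procedure; the function name, the sampling rate and the cutoff value are assumptions for illustration.</p>
          <preformat><![CDATA[
```python
import numpy as np


def fft_lowpass(values, sampling_rate_hz, cutoff_hz):
    """Zero all frequency components above cutoff_hz and transform back."""
    values = np.asarray(values, dtype=float)
    # Real-input FFT: one coefficient per non-negative frequency.
    spectrum = np.fft.rfft(values)
    freqs = np.fft.rfftfreq(values.size, d=1.0 / sampling_rate_hz)
    # Hard cutoff: discard everything above the chosen frequency.
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=values.size)


# Synthetic example: 1 Hz movement plus 20 Hz noise, 2 s at 100 Hz.
t = np.arange(200) / 100.0
slow = np.sin(2 * np.pi * 1.0 * t)
noisy = slow + 0.3 * np.sin(2 * np.pi * 20.0 * t)
smoothed = fft_lowpass(noisy, sampling_rate_hz=100.0, cutoff_hz=5.0)
```
]]></preformat>
          <p>Because both test frequencies fall exactly on FFT bins in this example, the 20 Hz component is removed completely; on real gaze data a hard cutoff can introduce ringing, so a gentler roll-off may be preferable.</p>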
        </sec>
      </sec>
      <sec>
        <title>4.6. Normalization</title>
        <p>Disparities between feature values need careful treatment to ensure that all attribute values become appropriate. Data normalization is crucial for various classifiers, including neural networks, SVMs, k-NN algorithms, and fuzzy classifiers. The primary normalization methods for addressing this issue are Min-Max normalization (feature scaling to [0, 1] or [a, b]), Z-score normalization (standardization), and unit length scaling. By normalizing eye-tracking data, different stimulus presentation durations are accounted for, allowing gaze data to exist within a uniform coordinate system [4].</p>
        <p>Here, the normalization is performed with the MinMaxScaler from scikit-learn. Min-Max scaling was chosen for normalizing eye-tracking data due to its simplicity and interpretability and because it preserves the shape of the data distribution. Since Min-Max scaling is sensitive to extreme values, it is applied towards the end of the pipeline, after outlier treatment. The resulting normalized gaze coordinates are represented as values between 0 and 1. For each gaze coordinate (X or Y) in the dataset:
<disp-formula><tex-math>X_{\mathrm{normalized}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}</tex-math></disp-formula>
where:
• X - original gaze coordinate.
• X_min - minimum value of coordinate.
• X_max - maximum value of coordinate.
• X_normalized - normalized gaze coordinate.</p>
        <preformat>from sklearn.preprocessing import MinMaxScaler

# Normalize the selected numeric columns to the range [0, 1]
scaler = MinMaxScaler()
data[numeric_columns] = scaler.fit_transform(data[numeric_columns])</preformat>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <sec id="sec-4-1">
        <p>In this paper, a data preparation process is designed and described using Python in preparation for the upcoming analysis of eye-tracking data. The process consists of several steps. First, a feature selection is performed to identify the most relevant features for analysis. The data cleaning phase comprises data preparation to handle any missing values and to ensure data quality. One of the characteristics of eye-tracking data is its inherent noise and the presence of outliers. To address these issues, an outlier detection and treatment step is conducted, which helps in mitigating the effects of extreme data points. Additionally, filtering and smoothing techniques, such as low-pass filtering and data smoothing, are applied to enhance the interpretability of the gaze data. As eye-tracking data often contains high-frequency fluctuations, effective filtering helps extract meaningful insights. Normalization is an essential step to ensure that the data is on a consistent scale for comparisons. The resulting clean and processed dataset is then ready for the upcoming in-depth analysis.</p>
        <sec id="sec-4-1-1">
          <title>5.1. Future Research</title>
          <p>The next phase of the research is to evaluate and optimize the preprocessing steps applied to the eye-tracking data. This means experimenting with different sequences of these steps to ensure that the resulting cleansed data is primed for analysis. This iterative process will require further investigation and refinement, with the goal of achieving optimal data quality. A further aim is to generalize the findings on the ordering of the preprocessing steps so that the pipeline is also suitable for preparing other eye-tracking data for analysis. The first observations reveal that the filtering and smoothing processes need to be refined so that no data will be lost. Furthermore, the handling of missing data is still a detailed process, so that different strategies will be applied to different features.</p>
          <p>Looking ahead to the future analyses, the choice of analytical techniques and the research questions at hand will determine what can be revealed about patterns and predictions of cheating behaviour. Subsequently, the focus will shift towards the analysis phase, where K-Means and classifiers like random forest or SVM will be employed and compared.</p>
          <p>The study is limited by the current sample size: the data set comprises sensor data from 20 participants with a split for each task, 100 in total. In the near future, more data will be collected in upcoming experiments. Furthermore, the pipeline will be tested on other data sets, so that its generalizability can be assessed.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>plications, Prentice-Hall, Inc, USA, 1996.
[22] J. Makhoul, A fast cosine transform in one and two dimensions, IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (1980) 27–34. URL: http://ieeexplore.ieee.org/document/1163351/. doi:10.1109/TASSP.1980.1163351.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>