<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A Complex CNN&amp;LSTM Neural Network Model for Determining a Person's Drowsiness State by Their Face</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anatolii Nikolenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena Arsirii</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svitlana Antoshchuk</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oksana Babilunha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Odesa Polytechnic National University</institution>
          ,
          <addr-line>Shevchenko Ave. 1, 65044, Odesa, Ukraine</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>The article analyzes the capabilities of modern Advanced Driver Assistance Systems (ADAS) from well-known car manufacturers. It is shown that the main drawback of ADAS is the rather low accuracy of recognizing a person's drowsiness from frames and video sequences of facial images. In addition, the shortcomings of computer vision systems related to industrial safety are shown, where a tired worker is a source of increased risk in such areas as metallurgy, the chemical industry, and energy. On the other hand, the capabilities of classical convolutional neural network models and their hybrids with SVM and Random Forest layers for processing additional sensory signals for binary classification of drowsiness are analyzed. The prospects for using recurrent neural networks (RNN) and LSTM for recognizing a person's drowsiness from video sequences of facial images are shown. The work proposes the architecture of a complex neural network model, CNN&amp;LSTM, for determining the state of human drowsiness, which combines a multilayer convolutional neural network (CNN) for analyzing the context of images and a long short-term memory (LSTM) model for analyzing sequential data on the state of human drowsiness over time. The combination of multilayer CNN and LSTM makes it possible to combine the advantages of both approaches. To increase the accuracy of determining the state of human drowsiness, it is proposed to transmit to the LSTM input, in addition to the data obtained for each frame of the video sequence as a result of its processing by the CNN, the values of the EAR, MAR, PUC, and MOE indicators calculated by MediaPipe Face Mesh. The complex CNN&amp;LSTM model is implemented in Python using Keras/TensorFlow in the Google Colab environment. CNN&amp;LSTM experiments using the Drowsiness Detection Dataset version 1 and video sequences of EAR, MAR, PUC, and MOE values showed an increase in the accuracy of determining drowsiness, measured by the Accuracy and F1 metrics, to 96-97%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Solving the problem of recognizing drowsiness from a human face image is relevant in computer vision systems related to transport safety, which depends, among other things, on the timely determination of the degree of driver fatigue, and to industrial safety, where a tired worker is a source of increased risk in such areas as metallurgy, the chemical industry, and energy. Analysis of the state of the eyes may be the only available biomarker in patient condition monitoring systems in intensive care, as well as in user condition monitoring systems in smartphones, laptops, and VR/AR headsets for automatic blocking or pausing in case of fatigue while viewing content.</p>
      <p>It is known that the use of modern neural network models with deep learning opens up new opportunities for increasing the accuracy of detecting the state of the driver, operator, patient, and other users of such computer vision systems. Therefore, the current task solved within the framework of this work is the analysis of the capabilities of recognizing human face images using the architecture of a classical convolutional neural network and its hybrids with additional SVM and Random Forest layers for processing additional sensory signals for binary classification of the state of drowsiness.</p>
      <p>A further task is the development of a complex neural network model that combines the advantages of convolutional networks (CNN) with long short-term memory (LSTM), which allows detecting the process of slow eye closure over several frames.</p>
      <p>The creation of such a complex CNN&amp;LSTM model will make it possible to increase the accuracy of determining the state of drowsiness from images and video sequences of a person's face by identifying long-term dependencies, which is important in preventing false positives of the classifier.</p>
      <sec id="sec-1-1">
        <title>2. Literature overview</title>
        <p>Recognition of human drowsiness is important in computer vision systems aimed at driver safety, because driver drowsiness is one of the main causes of road accidents worldwide. Eye monitoring systems using video analysis of a person's face make it possible to recognize closed or half-closed eyes, determine the degree of fatigue, and generate alarms in real time. Such technologies are actively implemented in ADAS (Advanced Driver Assistance Systems) [1, 2, 3, 4, 5].</p>
        <p>One of the well-known ADAS is the Audi Rest Recommendation System, which is an innovative
feature introduced by Audi to improve the comfort and well-being of drivers and passengers during
long journeys. The system uses various technologies, including sensors and artificial intelligence
algorithms, to analyze driver behavior, external factors such as traffic and road conditions, and
vehicle status data [1]. Unfortunately, the system overloads the driver with unnecessary information
and the accuracy of driver status recognition is 92%.</p>
        <p>BMW Active Driving Assistant with Attention Assistant analyses driving behaviour and, if
necessary, advises the driver to take a break. Advice on taking a break is provided in the form of
graphic symbols on the control display [2]. The system is sensitive to the angle of the driver's face
in relation to the control panel and to adverse weather conditions, with an accuracy of 90% in
detecting driver drowsiness.</p>
        <p>The Bosch Driver Drowsiness Detection system [3] is a proactive approach to reducing the risks
of drowsy driving using advanced sensors, algorithms and real-time analysis of driver behavior and
intervention if necessary. The system has a disadvantage of false positives. The accuracy of the
system is 91%.</p>
        <p>Ford Driver Alert is a driver assistance feature designed to help prevent accidents caused by driver fatigue or inattention [4]. Ford Driver Alert is specifically designed to detect signs of driver fatigue or drowsiness. The system tracks various parameters such as steering, lane departure, vehicle speed, and even time of day to assess attentiveness. It looks for patterns in the driving behavior that may indicate reduced alertness, such as uneven steering movements, delayed reactions, or lane departures. The disadvantage is that the system cannot assess the driver's condition well, focusing only on driving patterns. The system has an accuracy of 86%.</p>
        <p>The Honda driver monitoring system tracks eye and eyelid movements to detect signs of fatigue. By analyzing patterns such as prolonged eye closure or frequent blinking, the system can detect signs of drowsiness or distraction [5]. Honda can use machine learning algorithms to analyze data from various sensors and cameras in real time. These algorithms can learn patterns of driver behavior and identify deviations that may indicate fatigue or distraction. The disadvantage is that the system presents the driver with an overloaded interface and can block the car if there is no response to the warning. The system has an accuracy of 93%.</p>
        <p>In computer vision systems aimed at industrial safety in metallurgical, chemical, and energy environments, worker state recognition modules operate using cameras in rooms or in helmets, analyzing eyes in real time. The CORE for Tech software product [6] detects states of lethargy, drowsiness, and fatigue in high-risk environments based on biometric indicators of heart rate variability (HRV), skin conductance, and eye movements, using algorithms that allow early detection and intervention. The disadvantage is the use of inconvenient sensors that affect user comfort.</p>
        <p>An example of monitoring the user's state in mobile and consumer applications, in smartphones and laptops, with automatic blocking in case of fatigue is the use of the Vigo smart Bluetooth headset [7]. Vigo is equipped with sensors that track the user's eye and head movements in real time, recording factors such as blinking, duration of eye closure, and head nodding. The disadvantage is the mandatory wearing of a headset. The system's recognition accuracy is 91%.</p>
        <p>Analysis of the above safety-related computer vision systems shows that the accuracy of recognizing the user's drowsiness depends on the quality of the neural network models and the machine and deep learning algorithms that are the main components of such systems. Convolutional networks in their modern form appeared in the works of Yann LeCun's group at the end of the 1980s [8, 9, 10], and they have since been used quite successfully for image recognition and many other tasks [11, 12]. An overview of deep learning methods and neural networks is given in [13].</p>
        <p>Combining CNN with other types of neural networks can improve recognition results. In particular, recurrent neural networks (RNN) and long short-term memory networks (LSTM) are used to solve the task of determining a person's state of drowsiness from their face. The biggest advantage of CNN is the extraction of features of the eye image, while RNN can effectively obtain information about the timing of the blinking process [14]. This ability allows RNN to take into account the previous state of the eye when detecting blinking, to better determine whether blinking is currently occurring. And because the act of blinking is sequential, RNN can effectively extract temporal features such as the duration and frequency of blinks. This helps to more accurately distinguish between blinking and other similar actions [15, 16]. However, RNN and LSTM are sequential: each time step must be calculated one after another, which leads to low computational efficiency. In particular, this leads to some delay when working with longer sequences or real-time applications [17].</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Research aim statement</title>
        <p>The aim of the work is to increase the accuracy of determining the state of drowsiness from images
and video sequences of a human face with the additional use of EAR, MAR, PUC, MOE indicators
calculated using MediaPipe Face Mesh by developing a complex neural network model based on CNN
and LSTM.</p>
        <p>The development of this complex neural network model consists of the following stages:
• creating a 3D model of a human face based on "landmarks" using the MediaPipe library to obtain control points for further calculations of EAR, MAR, PUC, and MOE;
• developing a complex CNN&amp;LSTM neural network model combining drowsiness recognition from images and video sequences;
• conducting an experimental study on the accuracy of binary classification of drowsiness from human facial images with different neural network models based on a classical convolutional neural network, and studying the accuracy of the CNN&amp;LSTM model in recognizing drowsiness.</p>
        <sec id="sec-1-2-1">
          <title>3.1. Developing a 3D model of a human face based on "landmarks" using the MediaPipe library</title>
          <p>The Drowsiness Detection Dataset [18], which is based on the MRL [19] and Closed Eyes in the Wild (CEW) [20, 21] datasets, was selected as the input data for the study. This large-scale dataset, containing images of both closed and open human eyes, can be used for eye detection on faces and, in addition, for drowsiness detection (Figure 1). The datasets contain images of faces with closed and open eyes without glasses.</p>
          <p>The images for the dataset were acquired at different lighting, distance, resolution, face angle, and eye angle parameters. There are different versions of the dataset [19]. Version 1 contains 10,000 images, divided into 5,000 images each for closed and open eyes. The degree of eye openness is determined using the numerical measure of eye aspect ratio (EAR). This measure is key for many computer vision applications, such as detecting blinks, drowsiness, or assessing human attention. The eye is represented by a hexagon (Fig. 2) with vertices P1-P6, with points P1 and P4 defining the extreme left and right corners of the eye, P2 and P3 defining the two upper points, and P6 and P5 defining the two lower points of the eye, respectively. The EAR measure is calculated as</p>
          <p>
            EAR = (|P2 − P6| + |P3 − P5|) / (2 ∙ |P1 − P4|). (
            <xref ref-type="bibr" rid="ref1">1</xref>
            )
          </p>
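<p>The EAR formula (1) can be sketched directly in Python (a minimal illustration; the six landmark coordinates here are hypothetical 2D points rather than MediaPipe output):</p>

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR = (|P2 - P6| + |P3 - P5|) / (2 * |P1 - P4|).

    P1 and P4 are the eye corners; P2, P3 are the upper lid points;
    P6, P5 are the lower lid points.
    """
    return (euclidean(p2, p6) + euclidean(p3, p5)) / (2.0 * euclidean(p1, p4))

# Open eye: lids far apart relative to eye width -> high EAR.
open_ear = eye_aspect_ratio((0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1))
# Closed eye: upper and lower lid points nearly coincide -> EAR drops toward 0.
closed_ear = eye_aspect_ratio((0, 0), (1, 0.05), (2, 0.05), (3, 0), (2, -0.05), (1, -0.05))
```

On these sample points the open eye gives EAR ≈ 0.67 while the closed eye gives EAR ≈ 0.03, matching the sharp drop described for Fig. 2.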
          <p>If the EAR value is close to 0, then this is the case of a closed eye [23]. Fig. 2 shows the tracking
of the EAR value (ordinate axis) over a certain time, represented by the abscissa axis. We see that
when the eye is closed, the EAR value drops sharply, almost to zero.</p>
          <p>But for the final determination of the state of drowsiness, it is necessary to distinguish it from ordinary blinking. To do this, the duration of eye closure is determined: for ordinary blinking this interval is short, while during drowsiness the eyes remain closed much longer. In addition, drowsiness is often accompanied by other visible signs of fatigue, namely yawning (frequent and wide opening of the mouth), head tilts, facial micromovements, instability of posture, and lack of focus of gaze.</p>
          <p>Drowsiness detection systems use various metrics, algorithms, and technologies to analyze these features. For example, for efficient feature calculation, a 3D model of the face surface is used, created based on a regression approach [24, 25]. Technologically, such a model is created using Google's open computer vision and machine learning library MediaPipe [26]. To detect drowsiness on a person's face, MediaPipe creates a mesh of 468 landmarks, called the MediaPipe Face Mesh. This comprehensive solution from MediaPipe makes it possible to accurately determine 468 3D coordinates (landmarks) on the face surface in real time. These landmarks cover the contour of the face, eyebrows, eyes, nose, and mouth, providing a detailed representation of facial expressions and shape (Figure 3, a).</p>
          <p>Determination of drowsiness using the created MediaPipe Face Mesh is based on the EAR, MAR, PUC, and MOE indicators, which indicate fatigue or drowsiness. Let's consider each of these indicators in more detail.</p>
          <p>
            As mentioned, EAR (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) is a key indicator that measures the degree of eye openness. Its value ranges from high for an open eye to low for a closed eye (Fig. 3, b and c). In addition, the duration of eye closure is tracked, as well as the number of blinks per minute. An extremely high or, conversely, an extremely low blink rate is an indicator of fatigue.
          </p>
          <p>A significant indicator of sleepiness is the Mouth Aspect Ratio (MAR), which measures the degree
of mouth openness and is calculated similarly to EAR, but for landmarks around the mouth (usually
12-20 landmarks are used). At the same time, a high MAR value over a certain period of time indicates
yawning (see Fig. 3, b and c), which is one of the most obvious and direct physical signs of sleepiness.</p>
          <p>Pupil Circularity (PUC) is a measure of the degree of roundness of the pupil or the overall shape of the pupil and iris. It can be calculated based on the distances from the center of the pupil to points along its contour using the MediaPipe Face Mesh landmarks. The shape of the pupil and its apparent size change for a sleepy person. A decrease in the PUC (or a decrease in its apparent diameter/area) is an indicator that the eye is not fully open. It complements the EAR by allowing the detection of "heavy" eyelids or a decrease in the pupil aperture, which often occurs with fatigue. It should be noted that the calculation of PUC is more complex than that of EAR and MAR, as it requires more accurate detection of the pupil and is sensitive to lighting and image quality.</p>
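<p>The text does not fix an exact PUC formula; a common way to measure contour roundness, used here as an assumption, is the circularity 4πA/P², which equals 1 for a perfect circle and decreases as the contour flattens:</p>

```python
import math

def pupil_circularity(points):
    """Circularity 4*pi*Area / Perimeter^2 of a closed contour.

    `points` is an ordered list of (x, y) contour points, e.g. pupil
    landmarks. A perfect circle gives 1; a half-closed ("squashed")
    pupil gives a noticeably smaller value.
    """
    n = len(points)
    area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1              # shoelace formula
        perimeter += math.hypot(x2 - x1, y2 - y1)
    area = abs(area) / 2.0
    return 4.0 * math.pi * area / perimeter ** 2

# 64-point polygon approximating a circular (fully open) pupil.
circle = [(math.cos(2 * math.pi * k / 64), math.sin(2 * math.pi * k / 64)) for k in range(64)]
# Flattened contour, as under a "heavy" eyelid.
squashed = [(math.cos(2 * math.pi * k / 64), 0.3 * math.sin(2 * math.pi * k / 64)) for k in range(64)]
```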
          <p>Mouth Over Eye Ratio (MOE) is a composite measure that combines MAR and EAR. It is calculated as the ratio of MAR to EAR, which makes it possible to distinguish drowsiness from other facial expressions that only affect one of the measures; for example, smiling increases MAR but does not affect EAR. Calculating MOE helps account for the interaction between eye and mouth movements, making it a powerful aggregate measure for drowsiness classifiers.</p>
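<p>Assuming MOE is the ratio MAR/EAR, as its name and the description above suggest, the measure can be sketched as follows (the MAR and EAR values below are hypothetical):</p>

```python
def mouth_over_eye(mar, ear, eps=1e-6):
    """MOE = MAR / EAR: grows when the mouth opens (yawning) and/or
    the eyes close, both of which signal drowsiness. `eps` guards
    against division by zero for a fully closed eye."""
    return mar / (ear + eps)

# Alert face: closed mouth, open eyes -> low MOE.
alert = mouth_over_eye(mar=0.2, ear=0.35)
# Drowsy face: yawning mouth, half-closed eyes -> high MOE.
drowsy = mouth_over_eye(mar=0.8, ear=0.12)
```

A smile (higher MAR, unchanged EAR) raises MOE only moderately, while a yawn combined with drooping eyelids raises it sharply, which is why MOE works as an aggregate signal.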
          <p>It should be noted that practical applications of computer vision systems to determine human
drowsiness usually use a combination of all the listed metrics and other behavioral features (e.g.,
head tracking, blink rate, gaze duration, etc.). These indicators are collected during a certain time
window and fed as input to machine learning algorithms (e.g., SVM, Random Forest, convolutional
and recurrent neural networks, and their combinations), which then classify the human state as
normal or drowsy. Such a comprehensive approach significantly increases the accuracy and
reliability of the system.</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>3.2. Development of the architecture of a complex neural network model</title>
          <p>From the analysis of scientific sources and analogues, it was found that modern systems for
determining the state of human drowsiness are implemented based on a neural network approach
using the CNN model. But some of them use only the average values of features over a period of
time, which leads to the loss of information about the dynamics of events, and others can analyze
the state of drowsiness only at certain points in time without taking into account the context and
sequence, which leads to the loss of important details about the human state.</p>
          <p>In the work, a complex neural network model architecture is proposed for determining the human
state, combining a multilayer convolutional neural network (CNN) for analyzing the context of
images and a long short-term memory model (LSTM) for analyzing sequential data about the state
of human drowsiness over time [17, 28, 29, 30]. The combination of multilayer CNN and LSTM allows
combining the advantages of both approaches. It is proposed to increase the accuracy of determining
the state of human drowsiness at the LSTM input, in addition to the data obtained for each frame of
the video sequence as a result of its processing by CNN, to additionally transmit the values of the
EAR, MAR, PUC, MOE indicators calculated by MediaPipe Face Mesh. Thus, when using this model,
it is proposed to consider two streams of input data, namely, the stream of images of each frame of
the video sequence (224×224×3, scalable RGB images) and the stream of EAR, MAR, PUC, MOE
indicators calculated by MediaPipe Face Mesh. In the first stream, each frame of the video sequence
is processed, while the EAR, MAR, PUC, MOE indicators are updated only every 10 frames, which
allows you to match the duration of data updates with MediaPipe Face Mesh and eye closure during
drowsiness.</p>
          <p>The main idea of creating a complex neural network model CNN&amp;LSTM is explained by the following algorithm:</p>
          <p>Step 1. Each of the N frames of the first image stream is processed by a CNN to extract spatial
features.</p>
          <p>Step 2. A flow of metrics is generated. In the preparatory stage, the EAR, MAR, PUC, and MOE metrics are calculated for every 10th frame of the video sequence and repeated for the intermediate 9 frames. This creates a sequence of metrics of the same length N as the image sequence.</p>
          <p>Step 3. For each frame, the vector of values obtained at the CNN output and the corresponding
(calculated or repeated) EAR, MAR, PUC, MOE indicators are combined.</p>
          <p>Step 4. The combined feature vector is fed to a single LSTM to detect temporal patterns.</p>
          <p>Step 5. Final fully connected layers are added to perform binary classification to obtain a conclusion about the person's sleepiness state.</p>
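<p>Steps 2 and 3 above can be sketched in pure Python (a minimal illustration with hypothetical stand-in values; in the real pipeline the 128-dimensional vectors come from the CNN and the metric values from MediaPipe Face Mesh):</p>

```python
def repeat_metrics(sparse_metrics, sequence_length, stride=10):
    """Step 2: metrics computed for every `stride`-th frame are repeated
    for the intermediate frames to match the image sequence length."""
    return [sparse_metrics[i // stride] for i in range(sequence_length)]

def merge_features(cnn_features, metrics):
    """Step 3: per-frame concatenation of the CNN vector (128,) with the
    (EAR, MAR, PUC, MOE) vector (4,), giving (132,) per frame."""
    return [f + m for f, m in zip(cnn_features, metrics)]

N = 64
# One hypothetical 4-vector per 10th frame: 7 are needed to cover 64 frames.
sparse = [[0.3, 0.2, 0.9, 0.7]] * 7
metrics = repeat_metrics(sparse, N)           # length N, one 4-vector per frame
cnn_out = [[0.0] * 128 for _ in range(N)]     # stand-in for per-frame CNN outputs
combined = merge_features(cnn_out, metrics)   # N vectors of length 132
```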
          <p>The design of the complex CNN&amp;LSTM neural network model was carried out using the Python
programming language.</p>
          <p>Input data for the model:
• flow 1 (input_images): a sequence of raw RGB images, shape (batch_size, sequence_length=64, 224, 224, 3);
• flow 2 (input_mp_features): a sequence of EAR, MAR, PUC, MOE metrics that have already been preprocessed (i.e., values are repeated for intermediate frames to match sequence_length=64), shape (batch_size, sequence_length=64, 4).</p>
          <p>
            CNN&amp;LSTM architecture for a sequence of 64 frames:
1. CNN architecture (Image Feature Extractor) for flow 1:
a. input image (224, 224, 3) (one frame at a time via TimeDistributed);
b. convolution blocks:
• Conv2D(32, (3,3), activation='relu'), Conv2D(32, (3,3), activation='relu'), MaxPooling2D((2,2)), Dropout(0.25);
• Conv2D(64, (3,3), activation='relu'), Conv2D(64, (3,3), activation='relu'), MaxPooling2D((2,2)), Dropout(0.25);
• Conv2D(128, (3,3), activation='relu'), Conv2D(128, (3,3), activation='relu'), MaxPooling2D((2,2)), Dropout(0.25);
c. flatten layer Flatten();
d. dense layer Dense(128, activation='relu');
e. output: a 128-dimensional vector for each frame.
2. flow of indicators 2 (Feature Flow). As noted earlier, the indicators have already been calculated at the preparatory stage: the four indicators for intermediate frames repeat the values calculated for frames 0, 10, 20, respectively. A vector (4,) (EAR, MAR, PUC, MOE) is formed for each frame in the sequence.
3. frame-level feature merging (Concatenation per Frame). The 128-dimensional vector output by the CNN flow is concatenated with the 4-dimensional vector of metrics (EAR, MAR, PUC, MOE) for each corresponding frame, giving a combined vector (132,) for each frame.
4. temporal analysis using LSTM (Temporal Feature Learner):
a. input: a sequence of 64 concatenated feature vectors (sequence_length=64, features_per_frame=132);
b. LSTM layer LSTM(128, return_sequences=False);
c. dropout layer Dropout(0.5).
5. final classification (Classification Head):
a. dense layer Dense(64, activation='relu');
b. dropout layer Dropout(0.25);
c. output layer Dense(1, activation='sigmoid') (for binary classification).
          </p>
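<p>As a sanity check on the listed architecture, the per-frame tensor shapes can be traced in a few lines (assuming 'valid' padding for the 3×3 convolutions, which is the Keras default):</p>

```python
def conv_valid(size, kernel=3):
    """Spatial size after a 'valid' (no-padding) convolution."""
    return size - (kernel - 1)

def pool(size, window=2):
    """Spatial size after max pooling (floor division)."""
    return size // window

s = 224
for _ in range(3):                     # the three convolution blocks
    s = conv_valid(conv_valid(s))      # two 3x3 convolutions per block
    s = pool(s)                        # one 2x2 max pooling per block
# Spatial size after the blocks: 224 -> 110 -> 53 -> 24
flat = s * s * 128                     # Flatten() over 128 channels
dense = 128                            # Dense(128) per-frame feature vector
combined = dense + 4                   # plus (EAR, MAR, PUC, MOE)
```

The trace confirms that each frame is reduced to a 128-dimensional vector which, after concatenation with the four indicators, yields the 132 features per frame fed to the LSTM.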
          <p>The output layer receives the LSTM outputs and performs binary classification. The complex neural network model is trained using human face data labeled in the datasets [20, 21]. The use of the ReLU activation function helps to prevent the problem of gradient decay and speeds up training. The Dropout method is used to prevent overfitting of the model and increase its generalization ability.</p>
          <p>To optimize the model, the error backpropagation method with the Adam optimization algorithm and a learning rate of 0.1 is used.</p>
          <p>After training the model, its performance is evaluated on the test dataset [20, 21] to determine the
classification accuracy and the F1-Score metric [13].</p>
        </sec>
        <sec id="sec-1-2-4">
          <title>3.3. Conducting an experimental study on the accuracy of binary classification of drowsiness based on human faces</title>
          <p>In order to experimentally prove the correctness of the creation and feasibility of using the proposed complex CNN&amp;LSTM neural network model, it is necessary to conduct experiments on the same data set and compare it with the results of other models according to accuracy criteria (Accuracy, F1-Score).</p>
          <p>
            The following systems were considered as analogues:
1. system with a single-layer CNN. The system is implemented as follows:
a. Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)):
• Conv2D is a convolutional layer (2D convolution);
• the number of filters (neurons) in this layer is 32, each of which extracts its own features from the image;
• the size of each filter is 3×3 pixels;
• the ReLU (Rectified Linear Unit) activation function activation='relu' adds nonlinearity to the model, helping it learn more complex patterns;
• the input image shape input_shape=(64, 64, 1): the input data is a 64×64 pixel image with one color channel (e.g., grayscale);
b. MaxPooling2D((2, 2)):
• a max pooling layer MaxPooling2D reduces the image size by selecting the maximum value from each 2×2 pixel window;
• the sampling window size is (2, 2);
c. Dense(128, activation='relu'):
• Dense is a fully connected (or dense) layer in which each neuron is connected to all neurons in the previous layer;
• the number of neurons in this layer is 128.
          </p>
          <p>• the activation function is ReLU (activation='relu').
2. a hybrid system of a single-layer CNN, SVM, and Random Forest. The difference of the hybrid architecture is the combination of deep learning of the convolutional neural network for feature extraction with the traditional machine learning methods of the support vector machine and the random forest to improve the classification results. That is, the following layers are added to the CNN layer described in the previous model:
a. SVM training:
• svm_model = SVC(kernel='linear');
• svm_model.fit(sensor_data_scaled, labels);
b. Random Forest training:
• rf_model = RandomForestClassifier(n_estimators=100);
• rf_model.fit(sensor_data, labels).
3. system using a single-layer CNN and LSTM model.</p>
          <p>Models using a single-layer CNN with LSTM typically build on the CNN structure described in the previous sections and add an LSTM layer followed by concatenation:
a. lstm_out = LSTM(50):
• LSTM(50) is an LSTM layer with 50 neurons. LSTM is used to process sequential data, storing information about long-term dependencies in the data;
b. combined = concatenate([cnn_out, lstm_out]):
• concatenate is a merging function that combines the outputs from two different types of layers (in this case, CNN and LSTM);
• [cnn_out, lstm_out] is the list of outputs to merge; cnn_out and lstm_out represent the outputs of the respective previous layers;
c. combined = Dense(50, activation='relu')(combined):
• Dense(50, activation='relu') is a fully connected layer with 50 neurons and a ReLU activation function;
• (combined) means this layer uses the result of merging the outputs from the previous step as its input;
d. output = Dense(2, activation='softmax')(combined):
• Dense(2, activation='softmax') is the last fully connected layer that produces the final predictions. It has 2 neurons (probabilities for the two classes) and uses a softmax activation function to convert the outputs into probabilities;
• (combined) means this layer uses the output of the previous fully connected layer as input.
4. complex multilayer CNN&amp;LSTM model. The architecture of the system based on this model is discussed above in section 3.2.</p>
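<p>The hybrid scheme of analogue 2 (CNN-extracted features passed to SVM and Random Forest classifiers) can be sketched end-to-end with scikit-learn; here the feature vectors are synthetic stand-ins rather than real CNN or sensor outputs:</p>

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

random.seed(0)

# Synthetic stand-in features: class 1 ("drowsy") has systematically
# higher values than class 0 ("alert"), so the classes are separable.
X = [[random.uniform(0, 1) for _ in range(4)] for _ in range(40)] + \
    [[random.uniform(1, 2) for _ in range(4)] for _ in range(40)]
labels = [0] * 40 + [1] * 40

# SVM head, as in the description above.
svm_model = SVC(kernel='linear')
svm_model.fit(X, labels)

# Random Forest head, as in the description above.
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit(X, labels)

svm_acc = svm_model.score(X, labels)
rf_acc = rf_model.score(X, labels)
```

In the real hybrid system the rows of X would be the CNN feature vectors (optionally with scaled sensor data), and the two classifiers' votes would be combined for the final drowsiness decision.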
          <p>The results of determining the driver's drowsiness state by systems based on the considered neural network models on a dataset of 10 thousand frames [20, 21] are given in Table 1. The compared systems are: the single-layer CNN; the hybrid system of a single-layer CNN, SVM, and Random Forest; the system using a single-layer CNN and LSTM model; and the complex multilayer CNN&amp;LSTM model.</p>
          <p>From the data analysis, it can be concluded that the results improved when using the complex CNN&amp;LSTM model relative to the best analogue, both in Accuracy (by 3%) and in F1-Score (by 1.5%). This experimentally confirms the feasibility, performance, and competitiveness of the developed model.</p>
          <p>When implementing a neural network model, an additional increase in the accuracy of its
operation can be ensured by the correct selection and adjustment of the network hyperparameters.</p>
          <p>During the computer experiment, the following hyperparameters were set for training LSTM:
• Batch Size: 32;
• Number of Epochs: 10;
• Learning Rate: 0.1 (the step at which model parameters are updated during optimization);
• Number of LSTM Layers: 2;
• Number of Hidden Units in LSTM: 64;
• Optimizer: Adam;
• Dropout: 0.25.</p>
          <p>When calculating the resulting F1-Score value (see Table 1), the error values were taken into account, namely, False Positive and False Negative:
• a type I error, False Positive, means that a person is considered asleep when they are actually awake;
• a type II error, False Negative, means that a person's state is considered normal (active) when they are already falling asleep.</p>
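<p>The way these error counts enter the Accuracy and F1-Score metrics can be made explicit (a generic sketch with hypothetical confusion-matrix counts, not the values from Table 1):</p>

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Accuracy and F1-Score from confusion-matrix counts.

    fp: awake person classified as asleep (type I error);
    fn: drowsy person classified as awake (type II error, the costly one).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # falls as type II errors (fn) grow
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical counts with fewer type II errors (fn) than type I (fp),
# matching the stated goal of minimizing missed drowsiness.
acc, f1 = accuracy_and_f1(tp=470, tn=480, fp=30, fn=20)
```

Because recall has fn in its denominator, driving type II errors down raises F1 even when accuracy changes little, which is why F1 is tracked alongside Accuracy here.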
          <p>Since the potential risk of type I errors is incomparably smaller than the potential risk of type II errors, it is necessary to minimize type II errors while minimizing the loss of accuracy. For this purpose, we adjust the Learning rate, the step with which the model parameters are updated during optimization.</p>
          <p>We sequentially change the Learning rate parameter from the default 0.1 to 1, 0.01, and 0.001
and evaluate Accuracy and F1-Score. The results of varying the Learning rate, the estimates of
type I and type II errors, and the calculated accuracy metrics are given in Table 2.</p>
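          <p>The selection procedure described above can be sketched as a small sweep loop. Here <code>train_and_eval</code> is a hypothetical stand-in for retraining the CNN&amp;LSTM model at a given Learning rate and measuring Accuracy, F1-Score, and the False Negative count on the validation split.</p>

```python
# Sketch of the Learning-rate sweep: among the candidate values, prefer
# the one with the fewest type II errors (False Negatives), breaking ties
# by the higher F1-Score. train_and_eval(lr) is a hypothetical callback
# returning (accuracy, f1, false_negatives) for a model trained at lr.
def select_learning_rate(candidates, train_and_eval):
    results = {}
    for lr in candidates:
        accuracy, f1, false_negatives = train_and_eval(lr)
        results[lr] = (accuracy, f1, false_negatives)
    # minimize missed drowsiness events first, then maximize F1-Score
    best = min(results, key=lambda lr: (results[lr][2], -results[lr][1]))
    return best, results
```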
          <p>At Learning rate = 0.001, both metrics change: Accuracy is 0.952 and F1-Score is 0.973. These
values are slightly lower than with the default Learning rate setting; however, they are still higher
than those of any of the considered analogues, and they fulfill the main goal, namely, minimizing
type II errors.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Conclusions</title>
      <p>Thus, the goal of the work was achieved: the accuracy of determining the state of drowsiness
from images and video sequences of a human face was increased by developing a complex neural
network model based on CNN and LSTM that additionally uses the EAR, MAR, PUC, and MOE
indicators calculated by MediaPipe Face Mesh.</p>
      <p>It is shown that recognizing the state of drowsiness from an image of a human face is relevant to
computer vision systems related to transport safety, which depends, among other things, on timely
determination of the degree of driver fatigue, and to industrial safety, where a tired worker is a
source of increased risk in areas such as metallurgy, the chemical industry, energy, etc. The analysis
of such systems showed that their main disadvantages are the rather low accuracy of recognizing the
state of drowsiness from frames and video sequences of human facial images, as well as the
complexity and inconvenience of additional sensors and overload with unnecessary information.</p>
      <p>Analysis of modern computer vision systems related to safety has shown that the accuracy of
recognizing the user's state of drowsiness depends on the quality of the neural network models and
the machine learning and deep learning algorithms that are the main components of such systems.
Combining a CNN with other types of neural networks improves the recognition results. In
particular, recurrent neural networks (RNN) and long short-term memory networks (LSTM) are
used to determine a person's state of drowsiness from their face. Combining the advantages of
convolutional neural networks for recognizing closed eyes in a single frame of a face image with
long short-term memory (LSTM) makes it possible to detect the process of slow eye closure over
several frames of a video sequence and to increase the accuracy of the binary classification of
drowsiness.</p>
      <p>The calculation of features for the video sequence is implemented using a 3D model of the face
surface created on the basis of a regression approach. For its implementation, MediaPipe, the open
computer vision and machine learning library from Google, was used. The EAR, MAR, PUC, and
MOE indicators are calculated using the created MediaPipe Face Mesh.</p>
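      <p>As an illustration of one of these indicators, the Eye Aspect Ratio (EAR) can be computed from six eye landmarks using the standard formula of Soukupova and Cech (2016); the landmark ordering and the sample coordinates below are illustrative assumptions, not the article's MediaPipe index mapping.</p>

```python
import math

def ear(eye):
    """Eye Aspect Ratio from six (x, y) eye landmarks ordered p1..p6:
    p1/p4 are the horizontal eye corners, (p2, p6) and (p3, p5) are the
    two vertical landmark pairs. EAR = (|p2-p6| + |p3-p5|) / (2 |p1-p4|)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    p1, p2, p3, p4, p5, p6 = eye
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

# An open eye yields a noticeably higher EAR than a nearly closed one:
open_eye   = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]
```

A drowsiness pipeline would track how EAR evolves over consecutive frames: a sustained low EAR indicates slow eye closure rather than a blink.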
      <p>To determine the human state, the architecture of a complex neural network model was created,
which combines a multilayer convolutional neural network (CNN) for analyzing the context of
images and a model with long short-term memory (LSTM) for analyzing sequential data on the state
of human drowsiness over time. To increase the accuracy of determining the state of human
drowsiness, it is proposed to transmit to the LSTM input, in addition to the data obtained for each
frame of the video sequence as a result of its processing by the CNN, the values of the calculated
EAR, MAR, PUC, and MOE indicators. Thus, the proposed model considers two streams of input
data, namely, the stream of images of each frame of the video sequence (scalable 224×224×3 RGB
images) and the stream of EAR, MAR, PUC, and MOE indicators calculated by MediaPipe Face
Mesh. In the first stream, each frame of the video sequence is processed, while the EAR, MAR, PUC,
and MOE indicators are updated only every 10 frames.</p>
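      <p>A hedged Keras sketch of this two-stream layout is shown below: a per-frame CNN processes the 224×224×3 frames, its features are concatenated at each timestep with the four indicators, and an LSTM aggregates the sequence. The CNN layer sizes and the sequence length of 10 frames are illustrative assumptions; only the input shapes and the fusion scheme come from the text.</p>

```python
# Two-stream sketch: per-frame CNN features + EAR/MAR/PUC/MOE indicators
# fused at each timestep, then aggregated by an LSTM for binary output.
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 10  # assumed number of frames per analyzed sequence

frames = keras.Input(shape=(SEQ_LEN, 224, 224, 3), name="frames")
indicators = keras.Input(shape=(SEQ_LEN, 4), name="ear_mar_puc_moe")

# Small per-frame CNN (illustrative sizes), applied across the time axis
cnn = keras.Sequential([
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])
frame_features = layers.TimeDistributed(cnn)(frames)

# Fuse CNN features with the MediaPipe indicators at every timestep
fused = layers.Concatenate()([frame_features, indicators])
x = layers.LSTM(64)(fused)
x = layers.Dropout(0.25)(x)
output = layers.Dense(1, activation="sigmoid", name="drowsy")(x)

model = keras.Model([frames, indicators], output)
```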
      <p>The complex CNN&amp;LSTM model is implemented in the Python language using Keras/TensorFlow
in the Google Colab environment.</p>
      <p>To study the advantages of the developed CNN&amp;LSTM model on the Drowsiness Detection
Dataset, a comparative experimental analysis of drowsiness recognition accuracy was conducted for
a system with a single-layer CNN, a hybrid system of a single-layer CNN with SVM and Random
Forest, and the complex CNN&amp;LSTM model. It was shown that CNN&amp;LSTM improved the
results by 3% in Accuracy and by 1.5% in F1-Score.</p>
      <p>It was shown that when implementing the CNN&amp;LSTM neural network model, an additional
increase in the accuracy of its operation can be ensured by correctly selecting and tuning the
network hyperparameters. An experiment was conducted in which the Learning rate was varied. It
was experimentally proven that the rate of type II errors in drowsiness detection systems can be
significantly reduced by a rational choice of the Learning rate value.</p>
    </sec>
    <sec id="sec-3">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Audi: Assistance Systems, 2023. URL: https://www.audi-mediacenter.com/en/assistancesystems-237.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] BMW: Driver assistance, 2024. URL: https://www.press.bmwgroup.com/global/article/search/driver%20drowsiness/topic:5243.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Bosch: Driver drowsiness detection, 2024. URL: https://www.boschmobility.com/en/solutions/assistance-systems/driver-drowsiness-detection/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Ford: Driver Alert monitoring system, 2023. URL: https://www.ukcar-discount.co.uk/news/fords-driver-alert-monitoring-system.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Honda: Access Your Info. Find Your Honda, 2025. URL: https://mygarage.honda.com/s/find-honda.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Core for Tech: Develops software technology that analyzes heart rate variability and provides early signs of drowsiness to drivers and vehicles, 2025. URL: -d217-4ae9-8aa8-9f4805a7c696.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Catherine Shu, Bluetooth Headset Vigo Knows When You Are Tired Before You Do, 2014. URL: https://techcrunch.com/2014/01/17/vigo/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>