Towards Infusing Auxiliary Knowledge for Distracted Driver Detection

Ishwar B Balappanawar1, Ashmit Chamoli1, Ruwan Wickramarachchi2, Aditya Mishra1, Ponnurangam Kumaraguru1 and Amit Sheth2
1 International Institute of Information Technology, Hyderabad
2 AI Institute, University of South Carolina, Columbia, SC

Abstract
Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) that infuses auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates scene graphs and the driver's pose information with the visual cues in video frames to create a holistic representation of the driver's actions. Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information. The source code for KiD3 is available at: https://github.com/ishwarbb/KiD3.

Keywords
Knowledge Infusion, Distracted Driving, Scene Graphs, Pose Estimation, Object Detection, Classification

KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain.
ishwar.balappanawar@students.iiit.ac.in (I. B. Balappanawar); ashmit.chamoli@students.iiit.ac.in (A. Chamoli); ruwan@email.sc.edu (R. Wickramarachchi); aditya.mishra@students.iiit.ac.in (A. Mishra); pk.guru@iiit.ac.in (P. Kumaraguru); amit@sc.edu (A. Sheth)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Distracted driving is a leading cause of road accidents globally, posing significant challenges to road safety. According to the National Highway Traffic Safety Administration (NHTSA)1, approximately 3,308 people lost their lives in the United States in 2022 due to distracted driving, and nearly 290,000 people were injured. Almost 20% of those killed in distracted driving-related crashes were pedestrians, cyclists, and others outside the vehicle. In addition to the loss of lives and injuries, the financial burden from distracted driving crashes amounted to $98 billion in 2019 alone, highlighting the urgency of developing effective detection methods.

1 https://www.nhtsa.gov/speeches-presentations/distracted-driving-event-put-phone-away-or-pay-campaign

The task of identifying distracted driving involves reliably detecting and classifying various forms of driver distraction, such as texting, eating, or using other objects/devices, from in-vehicle camera feeds. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. Traditionally, the DDD task has been solved using various end-to-end learning and computer vision techniques, including, but not limited to, object detection, pose estimation, and action recognition. On the other hand, recent advancements in knowledge infusion [1] and Neurosymbolic AI [2] provide new opportunities for challenging tasks in scene understanding [3, 4, 5] and context understanding [6]. Hence, we posit that there is valuable auxiliary knowledge that can be computed or derived from the visual inputs. Specifically, we hypothesize that infusing such knowledge into current computer vision models would improve the overall detection capabilities and robustness while not requiring the heavy computational demands of ultra-high-parameter models.

To this end, we propose KiD3, a novel, simple method for distracted driver detection that infuses auxiliary knowledge about inherent semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates scene graphs and the driver's pose information with visual information to enhance the model's understanding of distraction behaviors (see Figure 1). Conducting experiments on a real-world, open dataset, our results indicate that incorporating such auxiliary knowledge with visual information significantly improves detection accuracy. KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline, demonstrating the effectiveness of integrating semantic and pose information in DDD tasks. This improvement highlights the potential of our method to contribute to safer driving environments by providing a more reliable, efficient, and scalable solution that does not demand the use of expensive high-parameter models.

Figure 1: This figure illustrates the process of extracting detailed information from a scene to analyze driver behavior. The extreme left panel shows an image of a driver sampled from the video. The middle left panel presents the corresponding estimated pose, highlighting how structured representations can be derived from raw image data. The middle right panel presents the object information obtained via object detection. The extreme right panel provides a sample relation from the scene graph, capturing the relationships between different objects and actions.

Contributions of this paper are as follows:

1. A novel, simple method for distracted driver detection that incorporates auxiliary knowledge computed/estimated from vision inputs without the need for high-parameter, computationally heavy models.
2. A demonstration of the effectiveness of infusing different types of auxiliary knowledge over vision-only baselines using real-world distracted driving data.
2. Related Work

Distracted Driver Detection is generally formulated as one of two tasks: Action Recognition/Classification and Temporal Action Localization (TAL). Action recognition is a computer vision task that involves classifying a given image or video into a pre-defined set of actions or classes. TAL, on the other hand, detects activities being performed in a video stream and outputs their start and end timestamps. In this paper, we focus on solving the action recognition task by classifying frames into various distracted driver activities. Here, we explore related work along two directions: (1) methods for distracted driver identification and (2) methods for generating/encoding semantic graphs from visual scenes.

Existing Methods for DDD: Vats et al. [7] propose Key Point-Based Driver Activity Recognition, which extracts static and movement-based features from driver pose and facial features and trains a frame classification model for action recognition. A merge procedure is then used to identify robust activity segments while ignoring outlier frame activity predictions.

In their work, Tran et al. [8] utilize multi-view synchronization across videos by training an ensemble 3D action recognition model on each view and taking the average probability over all the views as the final output. The outputs are then post-processed to predict the action label and the temporal localization of the predicted action. This work utilizes the X3D family of networks [9] for video classification instead of relying on manual feature engineering. Zhou et al. [10] improve upon this work by fine-tuning large pre-trained models instead of training from scratch and by empirically selecting specific camera views for specific distracted action classes.

Previous works mainly focus on the use of sophisticated post-processing algorithms, larger encoder-decoder architectures, and multi-view synchronization to improve action recognition and TAL performance. In contrast, our work aims to improve classification performance by incorporating auxiliary knowledge (e.g., semantic entities/relationships of a frame, pose information) that can be derived and infused as graphs into the encoder side of our architecture. Next, we explore the state-of-the-art methods for scene graph generation.

Scene Graph Generation (SGG) refers to the task of automatically mapping an image or a video into a semantic structural scene graph, which requires the correct labeling of detected objects and their relationships [11]. Cong et al. [12] pose SGG as a set prediction problem and propose an end-to-end SGG model, RelTR, with an encoder-decoder architecture. In contrast to most existing scene graph generation methods used as its benchmarks, such as Neural Motif, VCTree, and Graph R-CNN [13, 14, 15], RelTR is a one-stage method that predicts sparse scene graphs directly using only visual appearance, without combining entities and labeling all possible predicates. Due to its simplicity, efficiency, and state-of-the-art performance, we selected RelTR to generate scene graphs for our experiments.

Additionally, inspired by the work of Ping et al. [16], we incorporate atomic action information extracted from the objects detected in the scene and the estimated pose of the driver.

3. Methodology

In this section, we formally define the DDD problem, describe the datasets used and the preprocessing steps, and delve into the technical details of each sub-component of the proposed approach (see Figure 3).
3.1. Problem Statement

Given a video frame x ∈ R^(m×n×3) sampled from a video, where m denotes the height of the frame, n denotes the width of the frame, and 3 corresponds to the color channels (RGB), the learning objective is to classify it into one of 18 predefined activities C = {C_1, C_2, ..., C_18}. We define a classifier model f : R^(m×n×3) → [0, 1]^18 that maps a video frame to a probability distribution over the 18 activities. Specifically, f(x) = p, where p = [p_1, p_2, ..., p_18] and p_i represents the probability that the frame x belongs to class C_i, such that Σ_{i=1}^{18} p_i = 1 and 0 ≤ p_i ≤ 1 for all i ∈ {1, ..., 18}. The predicted class Ĉ for the frame x can therefore be determined by Ĉ = arg max_{C_i ∈ C} p_i.

3.2. Datasets for DDD

Real-world datasets for distracted driver identification typically include annotated video sequences from cameras mounted inside the vehicle. While several open datasets are available, such as the StateFarm dataset2, we selected SynDD1 [17] for our experiments due to its higher number of distracted behavior classes and its diversity, including variations in lighting conditions, driver appearances, and the use of objects and people in the background. SynDD1 consists of 30 video clips in the training set and 30 videos in the test set. The dataset consists of images collected using three in-vehicle cameras positioned on the dashboard, near the rear-view mirror, and at the top right-side window corner, as shown in Figure 2. The video sequences are sampled at 30 frames per second at a resolution of 1920×1080 and are manually synchronized across the three camera views. Each video is approximately 10 minutes long and contains all 18 distracted activities shown in Table 1. The drivers executed these activities, with or without an appearance block such as a hat or sunglasses, in random order for a random duration. There are six videos for each driver: three with an appearance block and three without any appearance block.

2 https://www.kaggle.com/competitions/state-farm-distracted-driver-detection

Figure 2: Camera mounting setup for the three views in the SynDD1 dataset: 1. Dashboard, 2. Behind rear view mirror, and 3. Top right side window.

Table 1: The list of distracted driving activities in the SynDD1 dataset.

Sr. no. | Distracted driver behavior
1  | Normal forward driving
2  | Drinking
3  | Phone call (right)
4  | Phone call (left)
5  | Eating
6  | Texting (right)
7  | Texting (left)
8  | Hair / makeup
9  | Reaching behind
10 | Adjusting control panel
11 | Picking up from floor (driver)
12 | Picking up from floor (passenger)
13 | Talking to passenger at the right
14 | Talking to passenger at backseat
15 | Yawning
16 | Hand on head
17 | Singing with music
18 | Shaking or dancing with music

3.3. Data Preprocessing

From the dataset, we selected the Dashboard view, resulting in 10 videos for training and 10 videos for testing. Sets of (frame, label) pairs were created by sampling frames from the videos at regular intervals and obtaining the corresponding labels from the annotations. The publicly available dataset contains various inconsistencies in the annotation format provided as CSV files. These inconsistencies, such as different naming conventions, variations in capitalization, and extra spaces in names, were resolved to ensure consistency across all data splits.

Next, we outline the technical details of each sub-component in our approach, shown in Figure 3.

Figure 3: Workflow of our proposed method. The figure illustrates the integration of an Image Encoder, Scene Graph Generator, GCN Graph Encoder, and Pose Estimator within our pipeline.

3.4. Image Encoding

3.4.1. Background

To classify video frames into one of the predefined activities, the first step is to obtain robust image embeddings that effectively capture the visual features in raw pixel data in a more manageable and informative representation. Possible methods for this transformation include using pre-trained Convolutional Neural Networks (CNNs) such as VGGNet [18], ResNet [19], or Inception [20]. Out of these, we selected VGG16, a variant of VGGNet, due to its simplicity and effectiveness in extracting deep features from images. VGG16 has been extensively used and validated in various image classification tasks, making it a reliable choice for our purpose.

3.4.2. Technical Details

VGGNet, particularly VGG16, is a deep convolutional network known for its simple yet effective architecture, consisting of 16 weight layers. The network is structured as multiple convolutional layers followed by fully connected layers. Each convolutional layer uses small receptive fields (3×3) and applies multiple filters to extract features at different levels of abstraction. The fully connected layers then process these features for classification. VGG16's design focuses on depth and simplicity, making it an ideal candidate for transfer learning.

3.4.3. Pre-processing and Adaptation

To adapt VGG16 for our task, we fine-tuned the model to obtain image embeddings. Specifically, we discarded the last two classifier layers of the pre-trained VGG16 model and retained the base model along with the first four classifier layers. This configuration results in a 4096-dimensional image embedding vector. The rationale for discarding the last two layers is that the final layer reduces the dimensionality to only 18, which is insufficient for our needs. Additionally, the earlier layers capture more general features, which are beneficial for transfer learning. These embeddings are then used for further processing and classification.
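To make the adaptation concrete, the following is a minimal PyTorch sketch of how such a 4096-dimensional embedding extractor can be obtained from torchvision's VGG16. This is our illustration of the description above, not the authors' released code; the exact slicing of the classifier head and the use of ImageNet weights are assumptions.

import torch
import torch.nn as nn
from torchvision import models

# Load VGG16 (here with ImageNet weights; the paper first fine-tunes it on the
# DDD classification task) and truncate the classifier head so that the model
# outputs a 4096-dimensional embedding instead of class logits.
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

embedder = nn.Sequential(
    vgg16.features,                              # convolutional base
    vgg16.avgpool,
    nn.Flatten(),
    *list(vgg16.classifier.children())[:4],      # keep up to the second Linear(4096, 4096)
)
embedder.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)          # a preprocessed video frame
    embedding = embedder(frame)                  # shape: (1, 4096)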
These that would effectively capture the visual features in raw embeddings are then used for further processing and pixel data into a more manageable and informative rep- classification tasks. resentation. Possible methods for this transformation in- clude using pre-trained Convolutional Neural Networks 3.5. Scene Graph Generation and (CNNs) like VGGNet [18], ResNet [19], or Inception [20]. Encoding Out of these methods, we selected VGG16, a variant of VGGNet, due to its simplicity and effectiveness in ex- 3.5.1. Background tracting deep features from images. VGG16 has been Scene graphs structurally represent the relationships be- extensively used and validated in various image classifi- tween various objects in a given image. Each node in the cation tasks, making it a reliable choice for our purpose. graph represents an object, while edges denote the rela- tionships between these objects; for example consider the 3.4.2. Technical Details triple: “« man holding phone »”. Scene graphs capture VGGNet, particularly VGG16, is a deep convolutional the high-level contextual and semantic information of the network known for its simple yet effective architecture, scene, going beyond pixel-level data. They are also essen- consisting of 16 weight layers. The network is struc- tial for scene understanding and reasoning and allow us tured with multiple convolutional layers followed by fully to explicitly inject knowledge into the pipeline. For exam- connected layers. Each convolutional layer uses small ple, considering DDD task, a scene graph containing the receptive fields (3x3) and applies multiple filters to ex- triple “« person drinking_from bottle »” might indicate tract features at different levels of abstraction. The fully distracted driving activity. Modeling such important rela- connected layers then process these features for classifi- tions can otherwise be achieved implicitly using methods cation. VGG16’s design focuses on depth and simplicity, such as convolutional-network-based image encoders, making it an ideal candidate for transfer learning. with some uncertainty. 3.4.3. Pre-processing and Adaptation 3.5.2. Technical Details To adapt VGG16 for our task, we fine-tuned the model to To generate the scene graph for a given frame, we use obtain image embeddings. Specifically, we discarded the the RelTr architecture [12]. Then, we use a Graph Convo- last 2 classifier layers of the pre-trained VGG16 model and lutional Network (GCN) [21] layer followed by a 𝑇 𝑎𝑛ℎ retained the base model along with the first 4 classifier activation to obtain representations for each node in the layers. This configuration results in a 4096-dimensional graph. We take the mean of all the node embeddings to abstraction. The fully connected layers then process these features 3.6.2 Technical Details. We utilized OpenPose [1], a state-of-the- for classi�cation. VGG16’s design focuses on depth and simplicity, art 2D pose estimation model, to extract pose information. Open- making it an ideal candidate for transfer learning. Pose can detect and output a set of key points corresponding to various body parts, such as the head, shoulders, elbows, and hands. 3.4.3 Pre-processing and Adaptation. To adapt VGG16 for our task, These key points are represented as coordinates in a 2D space. The we �ne-tuned the model to obtain image embeddings. 
3.6. Pose Estimation

3.6.1. Background

Pose estimation is a critical component in understanding the spatial configuration of a subject's body, which in this case is the driver. By capturing the positions of key body parts, pose estimation provides valuable information about the driver's posture and movements. This information is essential for accurately classifying the driver's activities. Various methods can be employed for pose estimation, including 2D and 3D approaches. We opted to use a state-of-the-art 2D pose estimation technique to effectively capture the required spatial data.

3.6.2. Technical Details

We utilized OpenPose [22], a state-of-the-art 2D pose estimation model, to extract pose information. OpenPose detects and outputs a set of key points corresponding to various body parts, such as the head, shoulders, elbows, and hands. These key points are represented as coordinates in 2D space. The process involves detecting the spatial locations of these joints and constructing a pose structure that reflects the driver's body configuration. Mathematically, each key point can be represented as k_i = (x_i, y_i), where k_i denotes the i-th key point and x_i and y_i are its coordinates in the image frame.

3.6.3. Pre-processing and Adaptation

To adapt the pose estimation data for our task, we pre-processed the key point coordinates obtained from OpenPose. The key points were normalized and structured to consistently represent the driver's pose. Additionally, we derived features such as the distance between the hands and the eyes/face, the angle formed by the eyes with the neck, and the distance between the hands and objects such as a phone or bottle (if detected using YOLO [23]). These features were crucial for enhancing the model's ability to accurately interpret and classify the driver's activities.
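A small sketch of this feature derivation is given below; the key-point names, the use of the nose as a face proxy, and the exact normalization are our assumptions, since the paper only names the derived features themselves.

import numpy as np
from typing import Optional

def derive_pose_features(keypoints: dict, frame_w: int, frame_h: int,
                         phone_box: Optional[tuple] = None) -> np.ndarray:
    # keypoints maps body-part names to (x, y) pixel coordinates from OpenPose;
    # phone_box is an optional (x1, y1, x2, y2) bounding box from YOLO.
    def norm(p):
        return np.array([p[0] / frame_w, p[1] / frame_h])    # scale to [0, 1]

    nose, neck = norm(keypoints["nose"]), norm(keypoints["neck"])
    r_wrist, l_wrist = norm(keypoints["right_wrist"]), norm(keypoints["left_wrist"])
    r_eye = norm(keypoints["right_eye"])

    # Distance between each hand and the face (nose used as a face proxy).
    d_rhand_face = np.linalg.norm(r_wrist - nose)
    d_lhand_face = np.linalg.norm(l_wrist - nose)

    # Angle formed by the eye with the neck, measured from the vertical axis.
    eye_vec = r_eye - neck
    eye_neck_angle = np.arctan2(eye_vec[0], -eye_vec[1])

    # Distance between the right hand and a detected object such as a phone.
    d_hand_obj = -1.0                                         # sentinel when no object is detected
    if phone_box is not None:
        cx = (phone_box[0] + phone_box[2]) / 2 / frame_w
        cy = (phone_box[1] + phone_box[3]) / 2 / frame_h
        d_hand_obj = np.linalg.norm(r_wrist - np.array([cx, cy]))

    return np.array([d_rhand_face, d_lhand_face, eye_neck_angle, d_hand_obj])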
3.7. Unified Pipeline

We construct a simple machine-learning pipeline to combine the latent encodings of the above modules. Each module takes an image as input and processes it into a meaningful vector representation. We concatenate these representations and use a feed-forward MLP to classify the input image. Algorithm 1 succinctly outlines the main steps of this pipeline.

Algorithm 1 KiD3 Pipeline
Require: Training Dataset, a collection of images and labels.
for image, label in Training Dataset do
    visualEncoding ← ImageEncoder(image)
    sgEncoding ← SceneGraphModule(image)
    poseFeatures ← PoseInformationModule(image)
    concatenated ← [visualEncoding; sgEncoding; poseFeatures]
    logits ← Softmax(MLP(concatenated))
    loss ← CrossEntropy(logits, label)
    loss.BackPropagate()    ▷ Propagate errors to the linear classifier and GCNs
end for

3.7.1. Training

We first fine-tune the pre-trained image encoder on the distracted driver classification task to obtain task-suitable embeddings. During training, we freeze the Image Encoding and Pose Information modules and only train the linear classifier and the GCN graph encoder in the Scene Graph Encoding module. We use a Softmax activation in the final layer of the feed-forward MLP and the Cross-Entropy loss function.
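The following PyTorch sketch illustrates this fusion and training step. The hidden size, learning rate, and feature dimensions are illustrative assumptions, and nn.CrossEntropyLoss is applied to raw logits since it computes the softmax internally.

import torch
import torch.nn as nn

class KiD3FusionHead(nn.Module):
    # Concatenate the image embedding, the scene-graph encoding, and the pose
    # features, then classify with a feed-forward MLP (a sketch of the
    # classification step in Algorithm 1; sizes are illustrative).
    def __init__(self, img_dim=4096, sg_dim=256, pose_dim=4, n_classes=18, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + sg_dim + pose_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_emb, sg_emb, pose_feats):
        fused = torch.cat([img_emb, sg_emb, pose_feats], dim=-1)
        return self.mlp(fused)                   # raw logits

head = KiD3FusionHead()
criterion = nn.CrossEntropyLoss()                # applies log-softmax internally
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)  # plus the GCN encoder's parameters in practice

# One illustrative training step on dummy tensors.
img_emb, sg_emb = torch.randn(8, 4096), torch.randn(8, 256)
pose_feats, labels = torch.randn(8, 4), torch.randint(0, 18, (8,))
loss = criterion(head(img_emb, sg_emb, pose_feats), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()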
4. Experiments

We outline the following experimental setup to evaluate the proposed approach's overall performance and the contribution of each sub-component.

4.1. Method 1 - Vision Only

In the first experiment, we utilized existing computer vision (CV) models to establish a baseline performance for the frame classification task. We fine-tuned the VGG-16 model to assess the performance of traditional CV models. To achieve this, we froze the weights of the entire model and unfroze only the classification layers (model.classifier[1...6]). The sixth classification layer, nn.Linear(4096, 1000), was replaced with nn.Linear(4096, 18) to match the number of activity classes. The modified model was then fine-tuned on our classification task, allowing the classification layers to adapt to the specific features of our dataset (a minimal sketch of this configuration is given at the end of this section).

4.2. Method 2 - Vision + Scene Graphs

In the second experiment, we use VGG-16 as in Method 1; however, out of the last six classifier layers, we discarded the last two and used the base model with the first four classifier layers to obtain a 4096-dimensional image embedding vector. The rationale is that the final layer could not be utilized because it reduces the image embedding to only 18 dimensions, which is insufficient for capturing the rich features needed for our task. Moreover, earlier layers in the network capture more general features beneficial for transfer learning. We then integrate the image embeddings with scene graphs encoded using a Graph Convolutional Network (GCN) [21]. The embeddings derived from the GCN are concatenated with the image embeddings obtained from the VGG-16 model. Linear layers are used as a head to combine these information streams, forming a unified representation. This combined model was trained on the same classification objective, leveraging both the visual and relational features present in the data.

4.3. Method 3 - Vision + Scene Graphs + Pose Information

In the final experiment, we further enrich the scene representation by incorporating pose information, enhancing the model's ability to understand the driver's activities. The pose details include the location of objects via bounding boxes and the outline of the human skeleton with the coordinates of key points such as the eyes, nose, and fists. We engineered additional features based on external knowledge, including the distance between the hand and the face and the distance between the hand and a phone or bottle (if detected using YOLO [23]). These engineered features were added to the concatenation of image embeddings and scene graph embeddings. The model is then re-trained on the classification task with these additional features, providing a holistic understanding of the driver's activities.
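As referenced in Section 4.1, the following is a minimal sketch of the Vision Only fine-tuning setup; the ImageNet weights and the Adam optimizer are our assumptions, while the layer replacement follows the description above.

import torch
import torch.nn as nn
from torchvision import models

# Method 1 baseline (sketch): freeze the VGG-16 backbone and fine-tune only the
# classifier layers, with the final 1000-way layer replaced by an 18-way layer.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

for param in model.features.parameters():        # freeze the convolutional backbone
    param.requires_grad = False

model.classifier[6] = nn.Linear(4096, 18)        # 18 distracted-driving classes

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)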
5. Results

Table 2 summarizes the results of our experiments on the test set and the ablation studies across the different method variations. We evaluate performance using two metrics: accuracy and the F1 score. The vision-only model achieves 79.64% overall accuracy and a 0.81 F1 score. With the inclusion of scene graphs, the accuracy and the F1 score increase by 11.88% and 9.88%, respectively. Finally, the complete model incorporating both scene graphs and pose information achieves the peak performance of 90.5% accuracy and a 0.91 F1 score.

Table 2: Performance of the three methods on the test set.

Method | Accuracy | F1 Score
Vision Only | 79.64 ± 2.17% | 0.81
Vision + Scene Graphs | 89.1 ± 1.61% (↑ 11.88%) | 0.89 (↑ 9.88%)
Vision + Scene Graphs + Pose Information | 90.5 ± 1.32% (↑ 13.64%) | 0.91 (↑ 12.35%)

Figure 4: F1 scores and support for individual activity (i.e., Class 1 - 18) prediction across the three methods, with Method 2 (i.e., Vision + SGG) and Method 3 (i.e., Vision + SGG + Pose Info) showing improvements over Method 1 (i.e., Vision only).

We have observed (see Figure 4) that our methods are particularly effective in identifying classes such as Eating (class 5), Adjusting Control Panel (class 10), and Singing with Music (class 17). We interpret this as evidence that our approach successfully incorporates auxiliary knowledge, enhancing our model's performance for these classes.

6. Discussion

Our results clearly support the initial hypothesis that the inclusion of valuable auxiliary knowledge with visual features enhances performance on the DDD task. The ablation study further establishes each auxiliary knowledge type's role in the overall performance. Scene graphs provided the most significant auxiliary knowledge, highlighting the importance of explicitly encoding semantic information and infusing it with visual features. By incorporating pose information about driver actions, we were able to further improve overall accuracy and robustness. However, several limitations of our approach warrant further investigation.

6.1. Limitations

One limitation is the reliance on annotated data for training. While we used a combination of supervised and unsupervised learning techniques to mitigate this issue, the availability of annotated data remains a key constraint. Additionally, our method may struggle with complex and highly variable driving scenarios where the relationships between objects and actions are less clear. Finally, we have not considered using foundation models such as Vision Language Models (VLMs) in our experiments. Our main focus in this work is to evaluate the impact of auxiliary knowledge on the DDD task without the need for complex, high-parameter models.
7. Conclusions and Future Work

In this paper, we proposed a novel, simple approach to distracted driver detection by infusing two types of auxiliary knowledge with visual information. Our method leverages scene graphs and estimated pose information together with visual embeddings to comprehensively represent driver actions. Our experimental results showcase the effectiveness of infusing each type of auxiliary knowledge with visual features, achieving a peak performance of 90.5% on the DDD task.

Future work will address the limitations mentioned above, such as the reliance on annotated data and the handling of complex driving scenarios. Additionally, we plan to explore the integration of other types of knowledge representations, such as temporal graphs, to further enhance the performance of distracted driver detection systems. Further, we plan to investigate the role of VLMs in this task.
References

[1] A. Sheth, M. Gaur, U. Kursuncu, R. Wickramarachchi, Shades of knowledge-infused learning for enhancing deep learning, IEEE Internet Computing 23 (2019) 54–63. doi:10.1109/MIC.2019.2960071.
[2] A. Sheth, K. Roy, M. Gaur, Neurosymbolic artificial intelligence (why, what, and how), IEEE Intelligent Systems 38 (2023) 56–62. doi:10.1109/MIS.2023.3268724.
[3] R. Wickramarachchi, C. Henson, A. Sheth, Knowledge-infused learning for entity prediction in driving scenes, Frontiers in Big Data 4 (2021) 759110. doi:10.3389/fdata.2021.759110.
[4] R. Wickramarachchi, C. Henson, A. Sheth, Knowledge-based entity prediction for improved machine perception in autonomous systems, IEEE Intelligent Systems (2022). doi:10.1109/MIS.2022.3181015.
[5] R. Wickramarachchi, C. Henson, A. Sheth, CLUE-AD: A context-based method for labeling unobserved entities in autonomous driving data, Proceedings of the AAAI Conference on Artificial Intelligence 37 (2023) 16491–16493. URL: https://ojs.aaai.org/index.php/AAAI/article/view/27089. doi:10.1609/aaai.v37i13.27089.
[6] A. Oltramari, J. Francis, C. Henson, K. Ma, R. Wickramarachchi, Neuro-symbolic architectures for context understanding, in: Knowledge Graphs for eXplainable Artificial Intelligence: Foundations, Applications and Challenges, IOS Press, 2020, pp. 143–160.
[7] A. Vats, D. C. Anastasiu, Key point-based driver activity recognition, in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022.
[8] M. T. Tran, M. Quan Vu, N. D. Hoang, K.-H. Nam Bui, An effective temporal localization method with multi-view 3d action recognition for untrimmed naturalistic driving videos, in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022, pp. 3167–3172. doi:10.1109/CVPRW56347.2022.00357.
[9] C. Feichtenhofer, X3D: Expanding architectures for efficient video recognition, CoRR abs/2004.04730 (2020). URL: https://arxiv.org/abs/2004.04730. arXiv:2004.04730.
[10] W. Zhou, Y. Qian, Z. Jie, L. Ma, Multi-view action recognition for distracted driver behavior localization, 2023. doi:10.1109/CVPRW59228.2023.00567.
[11] G. Zhu, L. Zhang, Y. Jiang, Y. Dang, H. Hou, P. Shen, M. Feng, X. Zhao, Q. Miao, S. A. A. Shah, M. Bennamoun, Scene graph generation: A comprehensive survey, 2022. arXiv:2201.00443.
[12] Y. Cong, M. Y. Yang, B. Rosenhahn, RelTR: Relation transformer for scene graph generation, 2023. arXiv:2201.11460.
[13] R. Zellers, M. Yatskar, S. Thomson, Y. Choi, Neural motifs: Scene graph parsing with global context, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] K. Tang, H. Zhang, B. Wu, W. Luo, W. Liu, Learning to compose dynamic tree structures for visual contexts, CoRR abs/1812.01880 (2018). URL: http://arxiv.org/abs/1812.01880. arXiv:1812.01880.
[15] J. Yang, J. Lu, S. Lee, D. Batra, D. Parikh, Graph R-CNN for scene graph generation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[16] P. Ping, C. Huang, W. Ding, Y. Liu, M. Chiyomi, T. Kazuya, Distracted driving detection based on the fusion of deep learning and causal reasoning, Information Fusion 89 (2023) 121–142. URL: https://www.sciencedirect.com/science/article/pii/S1566253522001014. doi:10.1016/j.inffus.2022.08.009.
[17] M. S. Rahman, A. Venkatachalapathy, A. Sharma, J. Wang, S. V. Gursoy, D. Anastasiu, S. Wang, Synthetic distracted driving (SynDD1) dataset for analyzing distracted behaviors and various gaze zones of a driver, Data in Brief 46 (2023) 108793. doi:10.1016/j.dib.2022.108793.
[18] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[19] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[21] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, 2017. arXiv:1609.02907.
[22] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2d pose estimation using part affinity fields, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[23] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.