<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Infusing Auxiliary Knowledge for Distracted Driver Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ishwar B Balappanawar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashmit Chamoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruwan Wickramarachchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aditya Mishra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ponnurangam Kumaraguru</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Sheth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AI Institute, University of South Carolina</institution>
          ,
          <addr-line>Columbia, SC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates the scene graphs, and driver's pose information with the visual cues in video frames to create a holistic representation of the driver's actions. Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information. The source code for KiD3 is available at: https://github.com/ishwarbb/KiD3.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Knowledge Infusion</kwd>
        <kwd>Distracted Driving</kwd>
        <kwd>Scene Graphs</kwd>
        <kwd>Pose Estimation</kwd>
        <kwd>Object Detection</kwd>
        <kwd>Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Distracted driving is a leading cause of road accidents
globally, posing significant challenges to road safety.
According to the National Highway Trafic Safety
Administration (NHTSA)1 approximately 3,308 people lost their
lives in the United States in 2022 due to distracted driving,
and nearly 290,000 people were injured. Almost 20% of
those killed in distracted driving-related crashes were
pedestrians, cyclists, and others outside the vehicle. In
addition to the loss of lives and injuries, the financial
burden from distracted driving crashes collectively amounts
to $98 billion in 2019 alone, highlighting the urgency of
developing efective detection methods.
ing and computer vision techniques, including, but not
limited to, object detection, pose estimation, and action
recognition. On the other hand, recent advancements in
knowledge infusion [1] and Neurosymbolic AI [2]
provide new opportunities for challenging tasks in scene
understanding [3, 4, 5] and context understanding [6].
Hence, we posit that there is valuable auxiliary
knowledge that can be either computed/ derived from the visual
inputs. Specifically, we hypothesize that by infusing such
knowledge with current computer vision models would
improve the overall detection capabilities and robustness
while not requiring the heavy computation demands of
ultra-high parameter models.</p>
      <sec id="sec-1-1">
        <title>To this end, we propose KiD3, a novel, simplistic</title>
        <p>The task of identifying distracted driving involves re- method for distracted driver detection that infuses
auxliably detecting and classifying various forms of driver iliary knowledge about inherent semantic relations
bedistraction, such as texting, eating, or using other ob- tween entities in a scene and the structural configuration
jects/devices from in-vehicle camera feeds. This task is of the driver’s pose. Specifically, we construct a unified
challenging due to the need for robust models that can framework that integrates scene graphs and the driver’s
generalize to a diverse set of driver behaviors without pose information with visual information to enhance
requiring extensive annotated datasets. Traditionally, the the model’s understanding of distraction behaviors (see
DDD task has been solved using various end-to-end learn- Figure 1).</p>
      </sec>
      <sec id="sec-1-2">
        <title>Conducting experiments on a real-world, open dataset,</title>
        <p>our results indicate that incorporating such auxiliary
knowledge with visual information significantly
improves detection accuracy. KiD3 achieves a 13.64%
accuracy improvement over the vision-only baseline,
demonstrating the efectiveness of integrating semantic and
cient and scalable solution that does not demand the use
of expensive high-parameter models.</p>
        <p>Contributions of this paper are as follows:
action recognition model on each view and taking the
average probability over all the views as the final output.</p>
        <p>The outputs are then post-processed for predicting the
action label and temporal localization of the predicted
1. A novel, simple method for distracted driver de- action. This work utilizes the X3D family of networks
tection that incorporates the auxiliary knowl- [9] for video classification instead of relying on manual
edge computed/estimated with vision inputs with- feature engineering. Wei Zhou et al. [10] improve upon
out the need for high-parameter, computational this work by fine-tuning large pre-trained models instead
heavy models. of training from scratch and by empirically selecting
spe2. A demonstration of the efectiveness of infus- cific camera views for specific distracted action classes.
ing diferent types of auxiliary knowledge over
vision-only baselines using real-world distracted
driving data.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>Distracted Driver Detection is generally formulated as</title>
        <p>one of 2 tasks: Action Recognition/Classification and
Temporal Action Localization (TAL). Action
recognition is a computer vision task that involves classifying
a given image or a video into a set of pre-defined set
of actions or classes. TAL, on the other hand detects
activities being performed in a video streams and outputs
start and end timestamps. In this paper, we focus on
solving the action recognition task by classifying frames
into various distracted driver activities. Here, we explore
related work considering two directions: (1) methods
for distracted driver identification and (2) methods for
generating/encoding semantic graphs from visual scenes.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Existing Methods for DDD: Vats et al.[7] proposes Key</title>
        <p>Point-Based Driver Activity Recognition that extracts
static and movement-based features from driver pose and
facial features and trains a frame classification model for
action recognition. Then, a merge procedure is used to
identify robust activity segments while ignoring outlier
frame activity predictions.</p>
      </sec>
      <sec id="sec-2-3">
        <title>In their work, Tran et al. [8] utilize multi-view syn</title>
        <p>chronization across videos by training an ensemble 3D
Previous works mainly focus on the use of
sophisticated post-processing algorithms, use of larger
encoder-decoder architectures and multi-view
synchronization to improve action recognition and TAL
performance. In contrast, our work aims to improve
classification performance by incorporating auxiliary
knowledge (e.g., semantic entities/relationships of a frame,
pose information) that can be derived and infused as
graphs into the encoder side of our architecture. Next,
we will explore the state-of-the-art methods for scene
graph generation.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Scene Graph Generation (SGG) refers to the task of au</title>
        <p>tomatically mapping an image or a video into a semantic
structural scene graph, which requires the correct
labeling of detected objects and their relationships [11]. Yuren
Cong et al. [12] pose SGG as a set prediction problem.
They propose an end-to-end SGG model, RelTR, with
an encoder-decoder architecture. In contrast to most
existing scene graph generation methods, such as Neural
Motif, VCTree, and Graph R-CNN, [13, 14, 15] which
RelTR used as benchmarks, RelTR is a one-stage method
that predicts sparse scene graphs directly only using
visual appearance without combining entities and labeling
all possible predicates. Due to its simplicity, eficiency
and SOTA performance, we selected RelTR to generate
SGGs for our experiments.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Additionally, inspired by the work of Pen Ping et al. [16] we incorporate atomic action information extracted</title>
        <p>from the objects detected in the scene and the estimated
pose of the driver.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>In this section, we formally define the DDD problem, the datasets used, preprocessing steps, and delve deep into the technical details of each sub-component in the proposed approach (see Figure 3).</title>
        <sec id="sec-3-1-1">
          <title>3.1. Problem Statement</title>
          <p>Given a video frame x ∈ R× × 3 sampled from a video
where  denotes the height of the frame,  denotes
the width of the frame, and 3 corresponds to the color
channels (RGB), the learning objective is to classify it into
one of 18 predefined activities  = {1, 2, . . . , 18}.
ˆ = arg max∈ .</p>
          <p>We define a classifier model  : R× × 3 → [0, 1]18
that maps a video frame to a probability distribution
over the 18 activities. Specifically,  (x) = p, where
p = [1, 2, . . . , 18] and  represents the probability
that the frame x belongs to class , such that ∑︀1=8 1  =
1 and 0 ≤  ≤ 1 ∀ ∈ {1, . . . , 18}. The predicted
class ˆ for the frame x can therefore be determined by:</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Datasets for DDD</title>
          <p>The real-world datasets for distracted driver
identification typically include annotated video sequences from
cameras mounted inside the vehicle. While several open
datasets are available, such as StateFarmDataset2, we
have selected SynDDv1 [17] to be used for experiments
due to the higher number of distracted behavior classes
and the diversity, including variations in lighting
conditions, driver appearances, and the use of objects and</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>2https://www.kaggle.com/competitions/state-farm-distracted</title>
        <p>driver-detection
people in the background. SynDDv1 consists of 30 video
clips in the training set and 30 videos in the test set. The
dataset consists of images collected using three in-vehicle
cameras positioned at locations: on the dashboard, near
the rear-view mirror, and on the top right-side window
corner, as shown in Table 1 and Figure 1. The video
sequences are sampled at 30 frames per second at a
resolution of 1920×1080 and are manually synchronized for
the three camera views. Each video is approximately
10 minutes long and contains all 18 distracted activities
shown in Table 2. The driver executed these activities
with or without an appearance block, such as a hat or
sunglasses, in random order for a random duration. There
are six videos for each driver: three videos with an
appearance block and three videos without any appearance
block.</p>
        <sec id="sec-3-2-1">
          <title>3.3. Data Preprocessing</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>From the dataset, we selected the Dashboard variant, re</title>
        <p>sulting in 10 videos for training and 10 videos for testing.
Sets of (frame, label) were created by sampling frames
from the videos at regular intervals and obtaining the
corresponding labels from the annotations. The publicly
available dataset contains various inconsistencies in the
annotation format provided as CSV files. These
inconsistencies, such as diferent naming conventions, variations
in capitalization, and extra spaces in names, have been
resolved to ensure consistency across all data splits.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Next, we will outline the technical details for each sub-component in our approach, shown in Figure 3.</title>
        <p>Pose
Estimator</p>
        <p>Pose Information
Scene Graph
Generator</p>
        <p>Scene
Graph</p>
        <p>GCN Graph</p>
        <p>Encoder</p>
        <p>Graph
Encoding</p>
        <p>Linear
Classifier</p>
        <p>Label
3.4. Image Encoding image embedding vector. The rationale for discarding
the last 2 layers is that the final layer reduces the
dimen3.4.1. Background sionality to only 18, which is insuficient for our needs.
To classify video frames into one of the predefined activ- Additionally, the earlier layers capture more general
feaities, the first step is to obtain robust image embeddings tures, which are beneficial for transfer learning. These
that would efectively capture the visual features in raw embeddings are then used for further processing and
pixel data into a more manageable and informative rep- classification tasks.
resentation. Possible methods for this transformation
include using pre-trained Convolutional Neural Networks 3.5. Scene Graph Generation and
(CNNs) like VGGNet [18], ResNet [19], or Inception [20]. Encoding
Out of these methods, we selected VGG16, a variant of
VGGNet, due to its simplicity and efectiveness in ex- 3.5.1. Background
tracting deep features from images. VGG16 has been
extensively used and validated in various image classifi- Scene graphs structurally represent the relationships
becation tasks, making it a reliable choice for our purpose. tween various objects in a given image. Each node in the
graph represents an object, while edges denote the
relationships between these objects; for example consider the
3.4.2. Technical Details triple: “« man holding phone »”. Scene graphs capture
VGGNet, particularly VGG16, is a deep convolutional the high-level contextual and semantic information of the
network known for its simple yet efective architecture, scene, going beyond pixel-level data. They are also
essenconsisting of 16 weight layers. The network is struc- tial for scene understanding and reasoning and allow us
tured with multiple convolutional layers followed by fully to explicitly inject knowledge into the pipeline. For
examconnected layers. Each convolutional layer uses small ple, considering DDD task, a scene graph containing the
receptive fields (3x3) and applies multiple filters to ex- triple “« person drinking_from bottle »” might indicate
tract features at diferent levels of abstraction. The fully distracted driving activity. Modeling such important
relaconnected layers then process these features for classifi- tions can otherwise be achieved implicitly using methods
cation. VGG16’s design focuses on depth and simplicity, such as convolutional-network-based image encoders,
making it an ideal candidate for transfer learning. with some uncertainty.
3.4.3. Pre-processing and Adaptation</p>
      </sec>
      <sec id="sec-3-5">
        <title>To adapt VGG16 for our task, we fine-tuned the model to</title>
        <p>obtain image embeddings. Specifically, we discarded the
last 2 classifier layers of the pre-trained VGG16 model and
retained the base model along with the first 4 classifier
layers. This configuration results in a 4096-dimensional
3.5.2. Technical Details
To generate the scene graph for a given frame, we use
the RelTr architecture [12]. Then, we use a Graph
Convolutional Network (GCN) [21] layer followed by a  ℎ
activation to obtain representations for each node in the
graph. We take the mean of all the node embeddings to
for classi cation. VGG16’s design focuses on depth and simplicity,
making it an ideal candidate for transfer learning.
art 2D pose estimation model, to extract pose information.
OpenPose can detect and output a set of key points corresponding to
various body parts, such as the head, shoulders, elbows, and hands.</p>
        <p>These key points are represented as coordinates in a 2D space. The
process involves detecting the spatial locations of these joints and
constructing a pose structure that re ects the driver’s body
conguration. Mathematically, each key point can be represented as:
k8 = (G8, ~8 ) where k8 denotes the 8-th key point with G8 and ~8
being its coordinates in the image frame.
3.4.3 Pre-processing and Adaptation. To adapt VGG16 for our task,
we ne-tuned the model to obtain image embeddings. Speci cally,
we discarded the last 2 classi er layers of the pre-trained VGG16
model and retained the base model along with the rst 4 classi er
layers. This con guration results in a 4096-dimensional image
embedding vector. The rationale for discarding the last 2 layers is that
tohebtanianl laaygerrarepdhu-cleesvtehel rdeimpernessieonnatalittyiotonoannlyd1t8r,ewahticthhiissivnseucft-or
acsietnhtefogr roauprhneeendsc.oAddidnigti.onally, the earlier layers capture more
general features, which are bene cial for transfer learning. These
embeddings are then used for further processing and classi cation
t3as.k5s..3. Pre-processing and Adaptation
33..67.3. UPren-pirfoiceesdsinPgiapndeAldinapetation. To adapt the pose
estimation data for our task, we pre-processed the key point coordinates
Wobetacinoendstfrroumct OapseinmPposlee.mThaechkienyep-loeianrtsnwinegrepnipoermlinaleizetod
caondmbstirnuectuthreed ltaotceonntsiestnenctolydirnepgrsesoenfttthhee darbivoevr’es pmosoed.ules. Each
moAddudlietiotnaaklleys,waendiemrivaegdefeaastuirneps usutcahnads tphreodcisetsanseces biettiwneteona
A scene graph output from RelTr [12] is in the form of mtheeahnanindsgafnudl evyeecst/foarcer,ethperaensgelnetfaortmioend.bWytehetheyeens wciothncthaetennecakt,e
tt33wtih.o.r5e52ine.1pgsaghlreraeiSBpetptscsahancebokrnoelegftidprewsotretheueGseosneenrdfana.fnvrtopSaesdrrclhaimaeornntuGieioso(sbegnojtrnbeshajcepeetrch,tarwss,etsrhi=litneiaorlluetaan(ictgeotiduianov1grnne,aebn,dlrsle,yidmtEerwanen2gpeo)certeoe.)ewnsE.dtehahintEceenhthsregrtsenhemeloeand.tr1teieoiTliaananhl--nliyds,
tsmteabhihnnofaeeythdtisadlnteetnrhhic(sverieitfeenderdpigi’pnessrtttspehaeasuccnoettetcfeimnveditiomthtbuadieeisatestsiilwgn.o’psgeeni.aeYpsbnAOeiulLtllihigstOnyieone[trh9og.ia]ta)nha.cdmTcfsuhearee1ansdtedse-uflfoeyocbacirtjnueiwtncreetacrssprtllwdriyeketMeroaeaunLctdprPluhicnctolioaenasslescfiltoofahyrrsesfhoiprsmbaettwiesencothnevseeorbtejedctst;ofoar elxisatmoplfe ecodngseidse,rwthheetrrieplee:d"g«emsanare
hroeldpirnegspehnotneed»a".sScpeaniersgroapfhnsocadpetsu.reTthheishiigsh-plervoevl icdoendtexttouatlhe Algorithm 1 KiD3 Pipeline
aGndCsNemeannctoicdienrfotromoatbiotanionf athgersacpenhe-,lgeovienlgrbeepyroensdepnitxaetli-olenve.l Require: Training Dataset, a collection of images and labels.
data. They are also essential for scene understanding and reasoning for 8&lt;064, ;014; in Training Dataset do
and allow us to explicitly inject knowledge into the pipeline. For E8BD0;⇢=2&gt;38=6 &lt;064⇢=2&gt;34A(8&lt;064)
e3xa.6m.pleP,coonsseideErinsgtiDmDDattaisko, nascene graph containing the triple B6⇢=2&gt;38=6 (24=4⌧A0?⌘"&gt;3D;4 (8&lt;064)
"« person drinking_from bottle »" might indicate distracted driv- ?&gt;B4 40CDA4B %&gt;B4 =5 &gt;A&lt;0C8&gt;="&gt;3D;4(8&lt;064)
i3ng.6a.c1t.iviBtya.cMkogderloinugnsduch important relations can otherwise be
abPcahoseisedeviemedsaitgmiempelianctcitioloydneurssii,snwgaimtcherstiohtmiocdeaslusnucccoehmrtaaspincootnyn.evnoltutiinonuanl-ndeetrwsotrakn-d- ;2&gt;&gt;6=82C0BC4=0(C&gt;435C&lt;0G[E(8"BD!0%;⇢(2=&gt;2=&gt;2308=C46=; 0B6C4⇢3=)2)&gt;38=6; ?&gt;B4 40CDA4B]
ing the spatial configuration of a subject’s body, which ;&gt;BB ⇠A&gt;BB⇢=CA&gt;?~(;&gt;68CB, ;014;)
3i.n5.2thiTsecchansiecailsDtehtaeilsd.rTivoegre.nBeryatecathpetuscreinneggtrhapehpfoorsiatigoivnesnof
fkraemye,bwoedyusepathretsR, eplTorsaercehsitteimctuarteio[2n].pTrhoevni,dweesuvsealauGabralephin- ;&gt;BB.BackPropagate() ù Propagate errors to the linear
CfoonrmvoaluttiioonnalaNbeotuwtortkhe(GdCrNi)v[e5r]’slaypeorsftoullroeweadnbdy ma)o0v=e⌘macetni-ts. classi er and GCNs
vation to obtain representations for each node in the graph. We end for
tTakheisthienmfoeramnoaftaiollnthiesneosdseeenmtbiaelddfionrgsatcocoubrtaatinelaygcralapshs-liefvyeilng
rtehperedsernivtaetrio’sn aancdtivtrietaitesth.iVsvaerciotoursasmtheethgoradpshceanncobdeinegm.ployed
for pose estimation, including 2D and 3D approaches.</p>
        <p>We opted to use a state-of-the-art 2D pose estimation
technique to efectively capture the required spatial data. 3.7.1. Training
3.6.2. Technical Details
We utilized OpenPose [22], a state-of-the-art 2D pose
estimation model, to extract pose information. OpenPose
can detect and output a set of key points corresponding
to various body parts, such as the head, shoulders,
elbows, and hands. These key points are represented as
coordinates in a 2D space. The process involves detecting
the spatial locations of these joints and constructing a
pose structure that reflects the driver’s body configura- 4. Experiments
tion. Mathematically, each key point can be represented
as: k = (, ) where k denotes the -th key point We outline the following experimental setup to evaluate
with  and  being its coordinates in the image frame. the proposed approach’s overall performance and the
contribution of each sub-component.</p>
      </sec>
      <sec id="sec-3-6">
        <title>We first fine-tune the pre-trained image encoder on the</title>
        <p>distracted driver classification task to obtain task-suitable
embeddings. During training, we freeze the Image
Encoding and Pose Information modules and only train the
linear classifier and the GCN graph encoder in the Scene
Graph Encoding module. We use   activation
in the final layer of the feed-forward MLP and use the
Cross-Entropy loss function.
3.6.3. Pre-processing and Adaptation
To adapt the pose estimation data for our task, we pre- 4.1. Method 1 - Vision Only
processed the key point coordinates obtained from Open- In the first experiment, we utilized existing computer
viPose. The key points were normalized and structured to sion (CV) models to establish a baseline performance
consistently represent the driver’s pose. for the frame classification task. We fine-tuned the</p>
        <p>Additionally, we derived features such as the distance VGG-16 model to assess the performance of traditional
between the hands and eyes/face, the angle formed by CV models. To achieve this, we froze the weights of
the eyes with the neck, and the distance between the the entire model and unfroze only the classification
hands and objects like a phone or bottle (if detected using layers (model.classifier[1...6]). The sixth classification
YthOeLmOo[d2e3l]’)s. aTbhielistey fteoataucrceusrawteerlye cinrutecriparleftoranendhcalanscsiinfgy layer nn.Linear(4096, 1000) was replaced with
the driver’s activities. intny.cLlaisnseesa.rT(h4e0m96o,difie1d8)mtoodeml awtcahs tthheennfinume-btuenreodf
oacntiv</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>our classification task, allowing the classification layers
to adapt to the specific features of our dataset.</p>
      <p>Table 2 summarizes the results of our experiments on the
4.2. Method 2 - Vision + Scene Graphs test set and the ablation studies across diferent method
variations. We evaluate the performance using two
metIn the second experiment, we use the VGG-16 similar to rics: accuracy and the F1 score. The vision-only model
how it was used in Method 1; however, out of the last achieves 79.64 overall accuracy and 0.81 F1 score,
respecsix classifier layers, we discarded the last two layers and tively. With the inclusion of scene graphs, the accuracy
used the base model with the first four classifier layers and the F1 score increased by 11.88% and 9.88%,
respecto obtain a 4096-dimensional image embedding vector. tively. Finally, the complete model incorporating both
The rationale is that the final layer could not be utilized scene graphs and pose information achieves the peak
because it reduces the image embedding to only 18 di- performance of 90.5% accuracy and 0.91 F1 score,
respecmensions, which is insuficient for capturing the rich tively.
features needed for our task. Moreover, earlier layers in
the network capture more general features beneficial for
transfer learning. Then, we integrate image embeddings
with scene graphs encoded using a Graph Convolutional
Network (GCN) [21]. The embeddings derived from the
GCN are concatenated with the image embeddings
obtained from the VGG-16 model. Linear layers are used as
a head to combine these information streams, forming a
unified representation. This combined model was trained
on the same classification objective, leveraging both the
visual and relational features present in the data.</p>
      <sec id="sec-4-1">
        <title>4.3. Method 3 - Vision + Scene Graphs +</title>
      </sec>
      <sec id="sec-4-2">
        <title>Pose Information</title>
        <p>In the final experiment, we further enrich the scene
representation by incorporating pose information,
enhancing its ability to understand the driver’s activities. The
pose details included the location of objects via
bounding boxes and the outline of the human skeleton with
coordinates of key points such as the eyes, nose, and
ifsts. We engineered additional features based on
external knowledge, including the distance between the hand
and face and the distance between the hand and a phone
or bottle (if detected using YOLO [23]). These engineered
features were added to the concatenation of image
embeddings and scene graph embeddings. The model is
then re-trained on the classification task with these
additional features, providing a holistic understanding of the
driver’s activities.</p>
        <sec id="sec-4-2-1">
          <title>We have observed (see Figure 4) that our methods</title>
          <p>are particularly efective in identifying classes such as
Eating (class 5), Adjusting Control Panel (class 10), and
Singing with Music (class 17). We interpret this as
evidence that our approach successfully incorporates auxil- References
iary knowledge, enhancing our model’s performance for
these classes.
[1] A. Sheth, M. Gaur, U. Kursuncu, R.
Wickramarachchi, Shades of knowledge-infused learning
for enhancing deep learning, IEEE Internet
Com6. Discussion puting 23 (2019) 54–63. doi:10.1109/MIC.2019.
2960071.</p>
          <p>Our results clearly support the initial hypothesis that [2] A. Sheth, K. Roy, M. Gaur, Neurosymbolic artificial
the inclusion of valuable auxiliary knowledge with vi- intelligence (why, what, and how), IEEE Intelligent
sual features would enhance the performance of the DDD Systems 38 (2023) 56–62. doi:10.1109/MIS.2023.
task. The ablation study further establishes each auxiliary 3268724.
knowledge type’s role in the overall performance. Scene [3] R. Wickramarachchi, C. Henson, A. Sheth,
graphs provided the most significant auxiliary knowl- Knowledge-infused Learning for Entity Prediction
edge, highlighting the importance of explicitly encoding in Driving Scenes, Frontiers in Big Data 4 (2021)
semantic information and infusing it with visual features. 759110. doi:10.3389/fdata.2021.759110.
By incorporating pose information of driver actions, we [4] R. Wickramarachchi, C. Henson, A. Sheth,
were able to further enrich overall accuracy and robust- Knowledge-based entity prediction for improved
ness. However, several limitations to our approach war- machine perception in autonomous systems,
rant further investigation. IEEE Intelligent Systems (2022). doi:10.1109/MIS.
2022.3181015.
6.1. Limitations [5] R. Wickramarachchi, C. Henson, A. Sheth, Clue-ad:
A context-based method for labeling unobserved
One limitation is the reliance on annotated data for train- entities in autonomous driving data, Proceedings of
ing. While we used a combination of supervised and un- the AAAI Conference on Artificial Intelligence 37
supervised learning techniques to mitigate this issue, the (2023) 16491–16493. URL: https://ojs.aaai.org/index.
availability of annotated data remains a key constraint. php/AAAI/article/view/27089. doi:10.1609/aaai.
Additionally, our method may struggle with complex and v37i13.27089.
highly variable driving scenarios where the relationships [6] A. Oltramari, J. Francis, C. Henson, K. Ma, R.
Wickbetween objects and actions are less clear. Finally, we ramarachchi, Neuro-symbolic architectures for
conhave not considered using foundation models like Vi- text understanding, in: Knowledge Graphs for
eXsion Language Models (VLMs) for our experiments. Our plainable Artificial Intelligence: Foundations,
Apmain focus in this work is to evaluate the impact of aux- plications and Challenges, IOS Press, 2020, pp. 143–
iliary knowledge on the DDD task without the need for 160.
complex, high-parameter models. [7] A. Vats, D. C. Anastasiu, Key point-based driver
activity recognition, in: 2022 IEEE/CVF
Confer7. Conclusions and Future Work ence on Computer Vision and Pattern Recognition
Workshops (CVPRW), 2022.</p>
          <p>In this paper, we proposed a novel, simple approach to [8] M. T. Tran, M. Quan Vu, N. D. Hoang, K.-H.
distracted driver detection by infusing two types of aux- Nam Bui, An efective temporal localization
iliary knowledge with visual information. Our method method with multi-view 3d action recognition for
leverages scene graphs and estimated pose information untrimmed naturalistic driving videos, in: 2022
with visual embeddings to comprehensively represent IEEE/CVF Conference on Computer Vision and
driver actions. Our experimental results showcase the ef- Pattern Recognition Workshops (CVPRW), 2022,
fectiveness of infusing each type of auxiliary knowledge pp. 3167–3172. doi:10.1109/CVPRW56347.2022.
with visual features to achieve 90.5% peak performance 00357.
on the DDD task. [9] C. Feichtenhofer, X3D: expanding
architec</p>
          <p>Future work will address the limitations mentioned tures for eficient video recognition, CoRR
above, such as the reliance on annotated data and the abs/2004.04730 (2020). URL: https://arxiv.org/abs/
handling of complex driving scenarios. Additionally, we 2004.04730. arXiv:2004.04730.
plan to explore the integration of other types of knowl- [10] W. Zhou, Y. Qian, Z. Jie, L. Ma, Multi view action
edge representations, such as temporal graphs, to further recognition for distracted driver behavior
localenhance the performance of distracted driver detection ization, 2023. doi:10.1109/CVPRW59228.2023.
systems Further, we plan to investigate the role of VLMs 00567.
in this task. [11] G. Zhu, L. Zhang, Y. Jiang, Y. Dang, H. Hou, P. Shen,
M. Feng, X. Zhao, Q. Miao, S. A. A. Shah, M.
Bennamoun, Scene graph generation: A comprehensive
survey, 2022. arXiv:2201.00443.
[12] Y. Cong, M. Y. Yang, B. Rosenhahn, Reltr:
Relation transformer for scene graph generation, 2023.</p>
          <p>arXiv:2201.11460.
[13] R. Zellers, M. Yatskar, S. Thomson, Y. Choi, Neural
motifs: Scene graph parsing with global context, in:
Proceedings of the IEEE Conference on Computer</p>
          <p>Vision and Pattern Recognition (CVPR), 2018.
[14] K. Tang, H. Zhang, B. Wu, W. Luo, W. Liu,
Learning to compose dynamic tree structures for visual
contexts, CoRR abs/1812.01880 (2018). URL: http:
//arxiv.org/abs/1812.01880. arXiv:1812.01880.
[15] J. Yang, J. Lu, S. Lee, D. Batra, D. Parikh, Graph
r-cnn for scene graph generation, in: Proceedings
of the European Conference on Computer Vision
(ECCV), 2018.
[16] P. Ping, C. Huang, W. Ding, Y. Liu, M.
Chiyomi, T. Kazuya, Distracted driving detection
based on the fusion of deep learning and causal
reasoning, Information Fusion 89 (2023) 121–
142. URL: https://www.sciencedirect.com/science/
article/pii/S1566253522001014. doi:https://doi.</p>
          <p>org/10.1016/j.inffus.2022.08.009.
[17] M. S. Rahman, A. Venkatachalapathy, A. Sharma,</p>
          <p>J. Wang, S. V. Gursoy, D. Anastasiu, S. Wang,
Synthetic distracted driving (syndd1) dataset for
analyzing distracted behaviors and various gaze zones of
a driver, Data in Brief 46 (2023) 108793. doi:https:
//doi.org/10.1016/j.dib.2022.108793.
[18] K. Simonyan, A. Zisserman, Very deep
convolutional networks for large-scale image recognition,
arXiv preprint arXiv:1409.1556 (2014).
[19] K. He, X. Zhang, S. Ren, J. Sun, Deep residual
learning for image recognition, in: Proceedings of the
IEEE Conference on Computer Vision and Pattern</p>
          <p>Recognition (CVPR), 2016.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,</p>
          <p>D. Anguelov, D. Erhan, V. Vanhoucke, A.
Rabinovich, Going deeper with convolutions, in:
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2015.
[21] T. N. Kipf, M. Welling, Semi-supervised
classification with graph convolutional networks, 2017.</p>
          <p>arXiv:1609.02907.
[22] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime
multi-person 2d pose estimation using part afinity
ifelds, in: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR),
2017.
[23] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You
only look once: Unified, real-time object detection,
in: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2016.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>