<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards Infusing Auxiliary Knowledge for Distracted Driver Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ishwar</forename><forename type="middle">B</forename><surname>Balappanawar</surname></persName>
							<email>ishwar.balappanawar@students.iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ashmit</forename><surname>Chamoli</surname></persName>
							<email>ashmit.chamoli@students.iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ruwan</forename><surname>Wickramarachchi</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">AI Institute</orgName>
								<orgName type="institution">University of South Carolina</orgName>
								<address>
									<settlement>Columbia</settlement>
									<region>SC</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aditya</forename><surname>Mishra</surname></persName>
							<email>aditya.mishra@students.iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ponnurangam</forename><surname>Kumaraguru</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">International Institute of Information Technology</orgName>
								<address>
									<settlement>Hyderabad</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Amit</forename><surname>Sheth</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">AI Institute</orgName>
								<orgName type="institution">University of South Carolina</orgName>
								<address>
									<settlement>Columbia</settlement>
									<region>SC</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards Infusing Auxiliary Knowledge for Distracted Driver Detection</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C7B233DBF54D139569FEE27B529C98CA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Knowledge Infusion</term>
					<term>Distracted Driving</term>
					<term>Scene Graphs</term>
					<term>Pose Estimation</term>
					<term>Object Detection</term>
					<term>Classification</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) that infuses auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates scene graphs and the driver's pose information with the visual cues in video frames to create a holistic representation of the driver's actions. Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information. The source code for KiD3 is available at: https://github.com/ishwarbb/KiD3.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Distracted driving is a leading cause of road accidents globally, posing significant challenges to road safety. According to the National Highway Traffic Safety Administration (NHTSA; https://www.nhtsa.gov/speeches-presentations/distracteddriving-event-put-phone-away-or-pay-campaign), approximately 3,308 people lost their lives in the United States in 2022 due to distracted driving, and nearly 290,000 people were injured. Almost 20% of those killed in distracted driving-related crashes were pedestrians, cyclists, and others outside the vehicle. Beyond the loss of lives and injuries, the financial burden from distracted driving crashes collectively amounted to $98 billion in 2019 alone, highlighting the urgency of developing effective detection methods.</p><p>The task of identifying distracted driving involves reliably detecting and classifying various forms of driver distraction, such as texting, eating, or using other objects/devices, from in-vehicle camera feeds. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. Traditionally, the distracted driver detection (DDD) task has been approached using various end-to-end learning and computer vision techniques, including, but not limited to, object detection, pose estimation, and action recognition. On the other hand, recent advancements in knowledge infusion <ref type="bibr" target="#b0">[1]</ref> and Neurosymbolic AI <ref type="bibr" target="#b1">[2]</ref> provide new opportunities for challenging tasks in scene understanding <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref> and context understanding <ref type="bibr" target="#b5">[6]</ref>. Hence, we posit that valuable auxiliary knowledge can be computed or derived from the visual inputs. 
Specifically, we hypothesize that infusing such knowledge into current computer vision models would improve overall detection capabilities and robustness while avoiding the heavy computational demands of ultra-high-parameter models.</p><p>To this end, we propose KiD3, a novel, simple method for distracted driver detection that infuses auxiliary knowledge about inherent semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates scene graphs and the driver's pose information with visual information to enhance the model's understanding of distraction behaviors (see Figure <ref type="figure" target="#fig_0">1</ref>).</p><p>Conducting experiments on a real-world, open dataset, our results indicate that incorporating such auxiliary knowledge with visual information significantly improves detection accuracy. KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline, demonstrating the effectiveness of integrating semantic and pose information in DDD tasks. This improvement highlights the potential of our method to contribute to safer driving environments by providing a more reliable, efficient, and scalable solution that does not demand the use of expensive high-parameter models.</p><p>The contributions of this paper are as follows:</p><p>1. A novel, simple method for distracted driver detection that incorporates auxiliary knowledge computed/estimated from vision inputs without the need for high-parameter, computationally heavy models. 2. A demonstration of the effectiveness of infusing different types of auxiliary knowledge over vision-only baselines using real-world distracted driving data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Existing Methods for DDD: Vats et al. <ref type="bibr" target="#b6">[7]</ref> propose Key Point-Based Driver Activity Recognition, which extracts static and movement-based features from driver pose and facial features and trains a frame classification model for action recognition. A merge procedure is then used to identify robust activity segments while ignoring outlier frame activity predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In their work, Tran et al. <ref type="bibr" target="#b7">[8]</ref> utilize multi-view synchronization across videos by training an ensemble 3D action recognition model on each view and taking the average probability over all the views as the final output. The outputs are then post-processed to predict the action label and the temporal localization of the predicted action. This work utilizes the X3D family of networks <ref type="bibr" target="#b8">[9]</ref> for video classification instead of relying on manual feature engineering. Zhou et al. <ref type="bibr" target="#b9">[10]</ref> improve upon this work by fine-tuning large pre-trained models instead of training from scratch and by empirically selecting specific camera views for specific distracted action classes.</p><p>Previous works mainly focus on the use of sophisticated post-processing algorithms, larger encoder-decoder architectures, and multi-view synchronization to improve action recognition and temporal action localization (TAL) performance. In contrast, our work aims to improve classification performance by incorporating auxiliary knowledge (e.g., semantic entities/relationships of a frame, pose information) that can be derived and infused as graphs into the encoder side of our architecture. Next, we will explore the state-of-the-art methods for scene graph generation.</p><p>Scene Graph Generation (SGG) refers to the task of automatically mapping an image or a video into a semantic structural scene graph, which requires the correct labeling of detected objects and their relationships <ref type="bibr" target="#b10">[11]</ref>. Cong et al. <ref type="bibr" target="#b11">[12]</ref> pose SGG as a set prediction problem. They propose an end-to-end SGG model, RelTR, with an encoder-decoder architecture. 
In contrast to most existing scene graph generation methods, such as Neural Motifs, VCTree, and Graph R-CNN <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>, which RelTR uses as benchmarks, RelTR is a one-stage method that predicts sparse scene graphs directly from visual appearance alone, without combining entities and labeling all possible predicates. Due to its simplicity, efficiency, and state-of-the-art performance, we selected RelTR to generate scene graphs for our experiments.</p><p>Additionally, inspired by the work of Peng Ping et al. <ref type="bibr" target="#b15">[16]</ref>, we incorporate atomic action information extracted from the objects detected in the scene and the estimated pose of the driver.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>In this section, we formally define the DDD problem, the datasets used, preprocessing steps, and delve deep into the technical details of each sub-component in the proposed approach (see Figure <ref type="figure" target="#fig_3">3</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Problem Statement</head><p>Given a video frame x ∈ R 𝑚×𝑛×3 sampled from a video, where 𝑚 denotes the height of the frame, 𝑛 denotes the width of the frame, and 3 corresponds to the color channels (RGB), the learning objective is to classify it into one of 18 predefined activities 𝒞 = {𝐶1, 𝐶2, . . . , 𝐶18}.</p><p>We define a classifier model 𝑓 : R 𝑚×𝑛×3 → [0, 1]¹⁸ that maps a video frame to a probability distribution over the 18 activities. Specifically, 𝑓 (x) = p, where p = [𝑝1, 𝑝2, . . . , 𝑝18] and 𝑝𝑖 represents the probability that the frame x belongs to class 𝐶𝑖, such that ∑𝑖 𝑝𝑖 = 1 and 0 ≤ 𝑝𝑖 ≤ 1 for all 𝑖 ∈ {1, . . . , 18}. The predicted class 𝐶̂ for the frame x can therefore be determined by 𝐶̂ = arg max𝐶𝑖∈𝒞 𝑝𝑖.</p></div>
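The decision rule above can be made concrete in a few lines. The sketch below (NumPy, with illustrative logits; the real classifier scores come from the trained model) just shows the softmax-and-argmax mapping from raw scores to a class prediction:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(logits: np.ndarray):
    """Map raw scores for one frame to (p, C_hat).

    p is a distribution over the 18 activity classes; C_hat is the
    arg-max class index (0-based here, 1-based in the paper).
    """
    p = softmax(logits)
    return p, int(np.argmax(p))

# With all-zero logits the distribution is uniform over the 18 classes.
p, c_hat = predict(np.zeros(18))
```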
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Datasets for DDD</head><p>The real-world datasets for distracted driver identification typically include annotated video sequences from cameras mounted inside the vehicle. While several open datasets are available, such as StateFarmDataset<ref type="foot" target="#foot_0">2</ref>, we have selected SynDDv1 <ref type="bibr" target="#b16">[17]</ref> for our experiments due to its higher number of distracted behavior classes and its diversity, including variations in lighting conditions, driver appearances, and the use of objects and people in the background. SynDDv1 consists of 30 video clips in the training set and 30 videos in the test set. The dataset consists of images collected using three in-vehicle cameras positioned at three locations: on the dashboard, near the rear-view mirror, and on the top right-side window corner, as shown in Table <ref type="table" target="#tab_1">1</ref> and Figure <ref type="figure" target="#fig_0">1</ref>. The video sequences are sampled at 30 frames per second at a resolution of 1920×1080 and are manually synchronized for the three camera views. Each video is approximately 10 minutes long and contains all 18 distracted activities shown in Table <ref type="table" target="#tab_3">2</ref>. The driver executed these activities with or without an appearance block, such as a hat or sunglasses, in random order for a random duration. There are six videos for each driver: three videos with an appearance block and three videos without any appearance block.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Data Preprocessing</head><p>From the dataset, we selected the Dashboard variant, resulting in 10 videos for training and 10 videos for testing. Sets of (frame, label) were created by sampling frames from the videos at regular intervals and obtaining the corresponding labels from the annotations. The publicly available dataset contains various inconsistencies in the annotation format provided as CSV files. These inconsistencies, such as different naming conventions, variations in capitalization, and extra spaces in names, have been resolved to ensure consistency across all data splits.</p><p>Next, we will outline the technical details for each sub-component in our approach, shown in Figure <ref type="figure" target="#fig_3">3</ref>.  </p></div>
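The preprocessing described above can be sketched as follows. The normalization rules and the frame-sampling helper are illustrative assumptions, not the exact cleanup applied to the SynDDv1 CSV files:

```python
def normalize_label(raw: str) -> str:
    """Collapse the kinds of naming inconsistencies described above:
    extra spaces, capitalization differences, and separator variants.
    (Illustrative rules; the actual fixes depend on the annotation files.)
    """
    label = " ".join(raw.strip().split())   # trim and collapse whitespace
    label = label.lower().replace("_", " ")
    return label

def frames_with_labels(num_frames: int, fps: int, step: int, label_for_second):
    """Sample every `step`-th frame and pair it with its label.

    `label_for_second` is a hypothetical lookup that maps a timestamp
    in seconds to the annotated activity for that moment.
    """
    return [(i, label_for_second(i // fps)) for i in range(0, num_frames, step)]
```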
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Image Encoding</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.1.">Background</head><p>To classify video frames into one of the predefined activities, the first step is to obtain robust image embeddings that effectively capture the visual features in raw pixel data in a more manageable and informative representation. Possible methods for this transformation include using pre-trained Convolutional Neural Networks (CNNs) like VGGNet <ref type="bibr" target="#b17">[18]</ref>, ResNet <ref type="bibr" target="#b18">[19]</ref>, or Inception <ref type="bibr" target="#b19">[20]</ref>. Out of these methods, we selected VGG16, a variant of VGGNet, due to its simplicity and effectiveness in extracting deep features from images. VGG16 has been extensively used and validated in various image classification tasks, making it a reliable choice for our purpose.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2.">Technical Details</head><p>VGGNet, particularly VGG16, is a deep convolutional network known for its simple yet effective architecture, consisting of 16 weight layers. The network is structured with multiple convolutional layers followed by fully connected layers. Each convolutional layer uses small receptive fields (3×3) and applies multiple filters to extract features at different levels of abstraction. The fully connected layers then process these features for classification. VGG16's design focuses on depth and simplicity, making it an ideal candidate for transfer learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.3.">Pre-processing and Adaptation</head><p>To adapt VGG16 for our task, we fine-tuned the model to obtain image embeddings. Specifically, we discarded the last 2 classifier layers of the pre-trained VGG16 model and retained the base model along with the first 4 classifier layers. This configuration results in a 4096-dimensional image embedding vector. The rationale for discarding the last 2 layers is that the final layer reduces the dimensionality to only 18, which is insufficient for our needs. Additionally, the earlier layers capture more general features, which are beneficial for transfer learning. These embeddings are then used for further processing and classification tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Scene Graph Generation and Encoding</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.1.">Background</head><p>Scene graphs structurally represent the relationships between various objects in a given image. Each node in the graph represents an object, while edges denote the relationships between these objects; for example, consider the triple «man holding phone». Scene graphs capture the high-level contextual and semantic information of the scene, going beyond pixel-level data. They are also essential for scene understanding and reasoning and allow us to explicitly inject knowledge into the pipeline. For example, considering the DDD task, a scene graph containing the triple «person drinking_from bottle» might indicate distracted driving activity. Modeling such important relations can otherwise be achieved only implicitly, with some uncertainty, using methods such as convolutional-network-based image encoders.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.2.">Technical Details</head><p>To generate the scene graph for a given frame, we use the RelTr architecture <ref type="bibr" target="#b11">[12]</ref>. Then, we use a Graph Convolutional Network (GCN) <ref type="bibr" target="#b20">[21]</ref> layer followed by a Tanh activation to obtain representations for each node in the graph. We take the mean of all the node embeddings to obtain a graph-level representation and treat this vector as the graph encoding.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.3.">Pre-processing and Adaptation</head><p>A scene graph output from RelTr <ref type="bibr" target="#b11">[12]</ref> is in the form of triplets of the form (𝑛𝑜𝑑𝑒, relation, 𝑛𝑜𝑑𝑒). Essentially, we get a list of relations 𝑅𝑖 = (𝑛1, r, 𝑛2) where 𝑛1 and 𝑛2 are nodes and r is the relation between them. This format is converted to a list of edges, where edges are represented as pairs of nodes. This is provided to the GCN encoder to obtain a graph-level representation.</p></div>
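A minimal sketch of such a graph encoder in plain PyTorch. The symmetric adjacency normalization and the 8/16 feature dimensions are our assumptions; the paper's GCN layer and node-feature scheme may differ:

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """One GCN layer (A_hat @ X @ W) with Tanh, then mean pooling over
    nodes, following the description above."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, edges):
        n = x.shape[0]
        a = torch.eye(n)                      # adjacency with self-loops
        for i, j in edges:                    # symmetric edges
            a[i, j] = a[j, i] = 1.0
        d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
        a_hat = d_inv_sqrt @ a @ d_inv_sqrt   # normalized adjacency
        h = torch.tanh(a_hat @ self.lin(x))   # node embeddings
        return h.mean(dim=0)                  # graph-level encoding

# Triplets (n1, r, n2) from RelTr reduced to an edge list, e.g.
# ("person", "holding", "phone") becomes the node-index pair (0, 1).
enc = GraphEncoder(in_dim=8, out_dim=16)
g = enc(torch.randn(3, 8), edges=[(0, 1), (1, 2)])
```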
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.">Pose Estimation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.1.">Background</head><p>Pose estimation is a critical component in understanding the spatial configuration of a subject's body, which in this case is the driver. By capturing the positions of key body parts, pose estimation provides valuable information about the driver's posture and movements. This information is essential for accurately classifying the driver's activities. Various methods can be employed for pose estimation, including 2D and 3D approaches. We opted to use a state-of-the-art 2D pose estimation technique to effectively capture the required spatial data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.2.">Technical Details</head><p>We utilized OpenPose <ref type="bibr" target="#b21">[22]</ref>, a state-of-the-art 2D pose estimation model, to extract pose information. OpenPose can detect and output a set of key points corresponding to various body parts, such as the head, shoulders, elbows, and hands. These key points are represented as coordinates in a 2D space. The process involves detecting the spatial locations of these joints and constructing a pose structure that reflects the driver's body configuration. Mathematically, each key point can be represented as: k𝑖 = (𝑥𝑖, 𝑦𝑖) where k𝑖 denotes the 𝑖-th key point with 𝑥𝑖 and 𝑦𝑖 being its coordinates in the image frame.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.3.">Pre-processing and Adaptation</head><p>To adapt the pose estimation data for our task, we preprocessed the key point coordinates obtained from Open-Pose. The key points were normalized and structured to consistently represent the driver's pose. Additionally, we derived features such as the distance between the hands and eyes/face, the angle formed by the eyes with the neck, and the distance between the hands and objects like a phone or bottle (if detected using YOLO <ref type="bibr" target="#b22">[23]</ref>). These features were crucial for enhancing the model's ability to accurately interpret and classify the driver's activities.</p></div>
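A small NumPy sketch of this feature derivation. The keypoint and object names below are hypothetical placeholders, not the OpenPose/YOLO output schema:

```python
import numpy as np

def keypoint_features(kp: dict, objects: dict) -> dict:
    """Derive illustrative pose features from 2D keypoints.

    `kp` maps part names to (x, y) coordinates; `objects` maps detected
    object names (from a detector such as YOLO) to box centers.
    """
    def dist(a, b):
        return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

    feats = {"hand_face": dist(kp["right_wrist"], kp["nose"])}
    if "phone" in objects:                    # hand-object distance, if detected
        feats["hand_phone"] = dist(kp["right_wrist"], objects["phone"])
    # angle of the eye line's midpoint relative to the neck, as in 3.6.3
    eye_mid = (np.asarray(kp["left_eye"], float) + np.asarray(kp["right_eye"], float)) / 2
    v = eye_mid - np.asarray(kp["neck"], float)
    feats["eye_neck_angle"] = float(np.arctan2(v[1], v[0]))
    return feats
```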
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.7.">Unified Pipeline</head><p>We construct a simple machine-learning pipeline to combine the latent encodings of the above modules. Each module takes an image as input and processes it into a meaningful vector representation. We then concatenate these representations and pass them through a feed-forward MLP to classify the input image. Algorithm 1 succinctly outlines the main steps of this pipeline.</p></div>
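An illustrative sketch of the fusion step in PyTorch; the hidden size and the graph/pose dimensions are assumptions (only the 4096-d image embedding and the 18 classes come from the paper). Note that the Softmax is folded into the loss here, since `CrossEntropyLoss` applies log-softmax internally:

```python
import torch
import torch.nn as nn

IMG_D, GRAPH_D, POSE_D, N_CLASSES = 4096, 16, 8, 18  # GRAPH_D/POSE_D assumed

# Fusion head: concatenate the three module outputs, classify with an MLP.
head = nn.Sequential(
    nn.Linear(IMG_D + GRAPH_D + POSE_D, 512),
    nn.ReLU(),
    nn.Linear(512, N_CLASSES),
)
loss_fn = nn.CrossEntropyLoss()  # log-softmax + NLL in one op

img = torch.randn(2, IMG_D)      # stand-ins for the module encodings
graph = torch.randn(2, GRAPH_D)
pose = torch.randn(2, POSE_D)
logits = head(torch.cat([img, graph, pose], dim=1))
loss = loss_fn(logits, torch.tensor([3, 11]))
```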
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.7.1.">Training</head><p>We first fine-tune the pre-trained image encoder on the distracted driver classification task to obtain task-suitable embeddings. During training, we freeze the Image Encoding and Pose Information modules and only train the linear classifier and the GCN graph encoder in the Scene Graph Encoding module. We use a Softmax activation in the final layer of the feed-forward MLP and use the Cross-Entropy loss function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>We outline the following experimental setup to evaluate the proposed approach's overall performance and the contribution of each sub-component.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Method 1 - Vision Only</head><p>In the first experiment, we utilized existing computer vision (CV) models to establish a baseline performance for the frame classification task. We fine-tuned the VGG-16 model to assess the performance of traditional CV models. To achieve this, we froze the weights of the entire model and unfroze only the classification layers (model.classifier[1...6]). The sixth classification layer nn.Linear(4096, 1000) was replaced with nn.Linear(4096, 18) to match the number of activity classes. The modified model was then fine-tuned on our classification task, allowing the classification layers to adapt to the specific features of our dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Method 2 - Vision + Scene Graphs</head><p>In the second experiment, we use VGG-16 as in Method 1; however, out of the last six classifier layers, we discarded the last two layers and used the base model with the first four classifier layers to obtain a 4096-dimensional image embedding vector. The rationale is that the final layer could not be utilized because it reduces the image embedding to only 18 dimensions, which is insufficient for capturing the rich features needed for our task. Moreover, earlier layers in the network capture more general features beneficial for transfer learning. Then, we integrate image embeddings with scene graphs encoded using a Graph Convolutional Network (GCN) <ref type="bibr" target="#b20">[21]</ref>. The embeddings derived from the GCN are concatenated with the image embeddings obtained from the VGG-16 model. Linear layers are used as a head to combine these information streams, forming a unified representation. This combined model was trained on the same classification objective, leveraging both the visual and relational features present in the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Method 3 - Vision + Scene Graphs + Pose Information</head><p>In the final experiment, we further enrich the scene representation by incorporating pose information, enhancing the model's ability to understand the driver's activities. The pose details included the location of objects via bounding boxes and the outline of the human skeleton with coordinates of key points such as the eyes, nose, and fists. We engineered additional features based on external knowledge, including the distance between the hand and face and the distance between the hand and a phone or bottle (if detected using YOLO <ref type="bibr" target="#b22">[23]</ref>). These engineered features were added to the concatenation of image embeddings and scene graph embeddings. The model is then re-trained on the classification task with these additional features, providing a holistic understanding of the driver's activities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>Table <ref type="table" target="#tab_3">2</ref> summarizes the results of our experiments on the test set and the ablation studies across different method variations. We evaluate the performance using two metrics: accuracy and the F1 score. The vision-only model achieves 79.64% overall accuracy and a 0.81 F1 score. With the inclusion of scene graphs, the accuracy and the F1 score increased by 11.88% and 9.88%, respectively. Finally, the complete model incorporating both scene graphs and pose information achieves the peak performance of 90.5% accuracy and a 0.91 F1 score. We have observed (see Figure <ref type="figure" target="#fig_4">4</ref>) that our methods are particularly effective in identifying classes such as Eating (class 5), Adjusting Control Panel (class 10), and Singing with Music (class 17). We interpret this as evidence that our approach successfully incorporates auxiliary knowledge, enhancing our model's performance for these classes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>Our results clearly support the initial hypothesis that the inclusion of valuable auxiliary knowledge with visual features would enhance the performance of the DDD task. The ablation study further establishes each auxiliary knowledge type's role in the overall performance. Scene graphs provided the most significant auxiliary knowledge, highlighting the importance of explicitly encoding semantic information and infusing it with visual features. By incorporating pose information of driver actions, we were able to further enrich overall accuracy and robustness. However, several limitations to our approach warrant further investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Limitations</head><p>One limitation is the reliance on annotated data for training. While we used a combination of supervised and unsupervised learning techniques to mitigate this issue, the availability of annotated data remains a key constraint. Additionally, our method may struggle with complex and highly variable driving scenarios where the relationships between objects and actions are less clear. Finally, we have not considered using foundation models such as Vision Language Models (VLMs) in our experiments, as our main focus in this work is to evaluate the impact of auxiliary knowledge on the DDD task without resorting to complex, high-parameter models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Future Work</head><p>In this paper, we proposed a novel, simple approach to distracted driver detection that infuses two types of auxiliary knowledge with visual information. Our method leverages scene graphs and estimated pose information together with visual embeddings to comprehensively represent driver actions. Our experimental results showcase the effectiveness of infusing each type of auxiliary knowledge with visual features, achieving 90.5% peak performance on the DDD task.</p><p>Future work will address the limitations mentioned above, such as the reliance on annotated data and the handling of complex driving scenarios. Additionally, we plan to explore the integration of other types of knowledge representations, such as temporal graphs, to further enhance the performance of distracted driver detection systems. Further, we plan to investigate the role of VLMs in this task.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: This figure illustrates the process of extracting detailed information from a scene to analyze driver behavior. The leftmost panel shows an image of a driver sampled from the video. The middle-left panel presents the corresponding estimated pose, highlighting how structured representations can be derived from raw image data. The middle-right panel presents the object information obtained via object detection. The rightmost panel provides a sample relation from the scene graph, capturing the relationships between different objects and actions.</figDesc><graphic coords="2,399.69,92.88,100.01,65.29" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Camera mounting setup for the three views in the SynDD1 dataset: 1. Dashboard, 2. Behind rear view mirror, and 3. Top right side window.</figDesc><graphic coords="3,96.20,84.19,187.51,111.09" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Workflow of our proposed method. The figure illustrates the integration of an Image Encoder, Scene Graph Generator, GCN Graph Encoder, and Pose Estimators within our pipeline.</figDesc><graphic coords="4,90.84,117.35,136.62,76.85" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: F1 scores and support for individual activity (i.e., Class 1 -18) prediction across three methods, with Method 2 (i.e., Vision + SGG) and Method 3 (i.e., Vision + SGG + Pose Info) showing improvements over Method 1 (i.e., Vision only).</figDesc><graphic coords="6,302.62,328.97,208.33,231.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>The list of distracted driving activities in the SynDD1 dataset.</figDesc><table><row><cell>Sr. no.</cell><cell>Distracted driver behavior</cell></row><row><cell>1</cell><cell>Normal forward driving</cell></row><row><cell>2</cell><cell>Drinking</cell></row><row><cell>3</cell><cell>Phone call (right)</cell></row><row><cell>4</cell><cell>Phone call (left)</cell></row><row><cell>5</cell><cell>Eating</cell></row><row><cell>6</cell><cell>Texting (right)</cell></row><row><cell>7</cell><cell>Texting (left)</cell></row><row><cell>8</cell><cell>Hair / makeup</cell></row><row><cell>9</cell><cell>Reaching behind</cell></row><row><cell>10</cell><cell>Adjusting control panel</cell></row><row><cell>11</cell><cell>Picking up from floor (driver)</cell></row><row><cell>12</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Performance of the three methods on the test set</figDesc><table><row><cell>Method</cell><cell>Accuracy</cell><cell>F1 Score</cell></row><row><cell>Vision Only</cell><cell>79.64 ± 2.17%</cell><cell>0.81</cell></row><row><cell>Vision + Scene Graphs</cell><cell>89.1 ± 1.61% (↑ 11.88%)</cell><cell>0.89 (↑ 9.88%)</cell></row><row><cell cols="3">Vision + Scene Graphs + Pose Information 90.5 ± 1.32% (↑ 13.64%) 0.91 (↑ 12.35%)</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://www.kaggle.com/competitions/state-farm-distracted-driver-detection</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Shades of knowledge-infused learning for enhancing deep learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gaur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Kursuncu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wickramarachchi</surname></persName>
		</author>
		<idno type="DOI">10.1109/MIC.2019.2960071</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Internet Computing</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="54" to="63" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Neurosymbolic artificial intelligence (why, what, and how)</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gaur</surname></persName>
		</author>
		<idno type="DOI">10.1109/MIS.2023.3268724</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Intelligent Systems</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="56" to="62" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Knowledge-infused Learning for Entity Prediction in Driving Scenes</title>
		<author>
			<persName><forename type="first">R</forename><surname>Wickramarachchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Henson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sheth</surname></persName>
		</author>
		<idno type="DOI">10.3389/fdata.2021.759110</idno>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Big Data</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page">759110</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Knowledge-based entity prediction for improved machine perception in autonomous systems</title>
		<author>
			<persName><forename type="first">R</forename><surname>Wickramarachchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Henson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sheth</surname></persName>
		</author>
		<idno type="DOI">10.1109/MIS.2022.3181015</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Intelligent Systems</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Clue-ad: A context-based method for labeling unobserved entities in autonomous driving data</title>
		<author>
			<persName><forename type="first">R</forename><surname>Wickramarachchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Henson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sheth</surname></persName>
		</author>
		<idno type="DOI">10.1609/aaai.v37i13.27089</idno>
		<ptr target="https://ojs.aaai.org/index.php/AAAI/article/view/27089" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="16491" to="16493" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Neuro-symbolic architectures for context understanding</title>
		<author>
			<persName><forename type="first">A</forename><surname>Oltramari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Henson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wickramarachchi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Graphs for eXplainable Artificial Intelligence: Foundations, Applications and Challenges</title>
				<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="143" to="160" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Key point-based driver activity recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vats</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">C</forename><surname>Anastasiu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">An effective temporal localization method with multi-view 3d action recognition for untrimmed naturalistic driving videos</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Quan</forename><surname>Vu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Hoang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-H. Nam</forename><surname>Bui</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPRW56347.2022.00357</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3167" to="3172" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">X3D: expanding architectures for efficient video recognition</title>
		<author>
			<persName><forename type="first">C</forename><surname>Feichtenhofer</surname></persName>
		</author>
		<idno>CoRR abs/2004.04730</idno>
		<ptr target="https://arxiv.org/abs/2004.04730" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Multi view action recognition for distracted driver behavior localization</title>
		<author>
			<persName><forename type="first">W</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPRW59228.2023.00567</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Miao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A A</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bennamoun</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2201.00443</idno>
		<title level="m">Scene graph generation: A comprehensive survey</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Cong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rosenhahn</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2201.11460</idno>
		<title level="m">Reltr: Relation transformer for scene graph generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Neural motifs: Scene graph parsing with global context</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zellers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yatskar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thomson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Learning to compose dynamic tree structures for visual contexts</title>
		<author>
			<persName><forename type="first">K</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<idno>CoRR abs/1812.01880</idno>
		<ptr target="http://arxiv.org/abs/1812.01880" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Graph r-cnn for scene graph generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision (ECCV)</title>
				<meeting>the European Conference on Computer Vision (ECCV)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Distracted driving detection based on the fusion of deep learning and causal reasoning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ping</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chiyomi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kazuya</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.inffus.2022.08.009</idno>
		<ptr target="https://doi.org/10.1016/j.inffus.2022.08.009" />
	</analytic>
	<monogr>
		<title level="j">Information Fusion</title>
		<imprint>
			<biblScope unit="volume">89</biblScope>
			<biblScope unit="page" from="121" to="142" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Synthetic distracted driving (syndd1) dataset for analyzing distracted behaviors and various gaze zones of a driver</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Rahman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Venkatachalapathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">V</forename><surname>Gursoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anastasiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.dib.2022.108793</idno>
		<ptr target="https://doi.org/10.1016/j.dib.2022.108793" />
	</analytic>
	<monogr>
		<title level="j">Data in Brief</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page">108793</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.1556</idno>
		<title level="m">Very deep convolutional networks for large-scale image recognition</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Going deeper with convolutions</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rabinovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Kipf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1609.02907</idno>
		<title level="m">Semi-supervised classification with graph convolutional networks</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Realtime multi-person 2d pose estimation using part affinity fields</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Simon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-E</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">You only look once: Unified, real-time object detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Divvala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
