<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Real-time detection and tracking of pedestrians in CCTV images using a deep convolutional neural network</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Debaditya</forename><surname>Acharya</surname></persName>
							<email>acharyad@student.unimelb.edu.au</email>
							<affiliation key="aff0">
								<orgName type="department">Infrastructure Engineering</orgName>
								<orgName type="institution">The University of Melbourne</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stephan</forename><surname>Winter</surname></persName>
							<email>winter@unimelb.edu.au</email>
							<affiliation key="aff0">
								<orgName type="department">Infrastructure Engineering</orgName>
								<orgName type="institution">The University of Melbourne</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Real-time detection and tracking of pedestrians in CCTV images using a deep convolutional neural network</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CBDE6D004A1F8F664DE56099743538A4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T15:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this work, deep convolutional neural networks are used to automate the process of feature extraction from CCTV images. The extracted features serve as a strong basis for a variety of object recognition tasks and are used to address a tracking problem. The approach is to match the extracted features of individual detections in subsequent frames, hence creating a correspondence of detections across multiple frames. The developed framework is able to address challenges like cluttered scenes, change in illumination, shadows and reflection, change in appearances and partial occlusions. However, total occlusion and similar persons in the same frame remain a challenge to be addressed. The framework is able to generate the detection and the tracking results at the rate of four frames per second.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Pedestrian tracking has gained a significant interest in the last two decades. The increasing interest is due to the availability of high-quality inexpensive CCTV video cameras and the need for automated video analysis. Recognising human actions in real-world environments finds applications in intelligent video surveillance, knowing customer attributes, customer shopping behaviour analysis <ref type="bibr" target="#b1">(Chen et al., 2016)</ref>, homeland security, crime prevention, hospitals, elderly and child care <ref type="bibr" target="#b16">(Wang, 2013)</ref> and can be used for management of public places and handling emergency situations as well.</p><p>There is a rich literature <ref type="bibr" target="#b17">(Yilmaz et al., 2006;</ref><ref type="bibr" target="#b13">Smeulders et al., 2014)</ref> that follows the conventional paradigm of pattern recognition that includes extraction of hand-crafted features (pre-defined features such as Histogram of oriented Gradients (HOG)) from the images for detecting pedestrians in a scene and their subsequent classification, using classifiers. The drawback of using such hand-crafted features for a tracking task is the limited ability of the hand-crafted features to adapt to variations of object appearance that are complex, highly non-linear and time-varying <ref type="bibr" target="#b17">(Yilmaz et al., 2006;</ref><ref type="bibr" target="#b1">Chen et al., 2016)</ref>. Additionally, to achieve accurate recognition, major challenges that are required to be addressed include occlusions, cluttered backgrounds, viewpoint variations, changes in appearance (scale, pose and shape), similar appearing pedestrians, illumination variations and unpredictable nature of pedestrian movements <ref type="bibr" target="#b6">(Ji et al., 2013;</ref><ref type="bibr" target="#b1">Chen et al., 2016)</ref>. However, most of the state-of-the-art trackers address specific challenges and the generalisation abilities of the trackers are not sufficient <ref type="bibr" target="#b4">(Feris et al., 2013)</ref>. Re-identification of pedestrians (in single camera and multi-camera views) still remains an open challenge.</p><p>In this work, pedestrians are detected in each frame of CCTV images using a state-of-the-art object detection framework Faster R-CNN <ref type="bibr" target="#b12">(Ren et al., 2015)</ref>. Subsequently, to overcome the limitations of using hand-crafted features, automatic feature extraction from the detected pedestrians with deep convolutional neural networks (CNNs) is performed. <ref type="bibr" target="#b2">Donahue et al. (2014)</ref> state that the activations of the neurons in the late layers of a deep CNN serve as strong features for a variety of object recognition tasks. The hypothesis behind this work is that the extracted activations from the late layers of a deep CNN can be used to distinguish detected pedestrians across the frames and can be used to address a tracking-by-detection problem accurately. So, in a novel way features are used to address a tracking problem. Tracking is formulated as the correspondence of the detections across multiple frames and is achieved by matching the extracted features of individual detections in subsequent frames. 
The main contributions are:</p><list><item>A framework for real-time pedestrian detection and tracking using CNNs is developed.</item><item>A new algorithm is developed to establish correspondence between the detections across frames.</item></list><p>The framework addresses challenges such as partial occlusions, variations in illumination, changes in the pose, shape and scale of pedestrians, cluttered backgrounds and total occlusions of short duration. It is not able to handle total occlusions of long duration, and it fails when similar-appearing persons are present in the same frame.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>Tracking is defined as the creation of trajectory of an object in an image plane and a tracker assigns correct labels to the tracked objects in different frames of a video. There are three fundamental aspects of pedestrian tracking that are analogous to object tracking: 1) detection of the pedestrian in the video frame, 2) tracking of the detection, and 3) analysis of the tracks for the specified purpose <ref type="bibr" target="#b17">(Yilmaz et al., 2006)</ref>. In the literature, for object detection point detectors, background subtraction methods, segmentation and supervised learning methods have been used. For accurate tracking, selection of suitable features plays a vital role and is related to object representation. Subsequently, the task of establishing correspondence of the detections is performed. This has been done in the past using deterministic or probabilistic motion models and appearance based kernel tracking models. Additionally, on-line adaptation methods have been proposed that adapt detectors to handle the variations in the appearances of the tracked objects over time. The detectors are trained and updated on-line during tracking, however these usually require a large number of instances for learning, which may not always be available. <ref type="bibr" target="#b1">(Chen et al., 2016;</ref><ref type="bibr" target="#b4">Feris et al., 2013)</ref>.</p><p>Recently, there has been a significant performance improvement in the field of image category classification and recognition by training a deep CNN with millions of images of different classes <ref type="bibr" target="#b8">(Krizhevsky et al., 2012)</ref>. The CNNs <ref type="bibr" target="#b9">(Lecun et al., 1998)</ref> are a machine learning method that exploits the local spatial information in an image and learns a hierarchy of increasingly complex features, thus automating the process of feature construction. CNNs are relatively insensitive to certain variations on the inputs <ref type="bibr" target="#b6">(Ji et al., 2013)</ref>.</p><p>Motivated by the success of image classification and recognition, attempts have been made to exploit the usefulness of deep CNN for tracking tasks. <ref type="bibr" target="#b3">Fan et al. (2010)</ref> design a CNN tracker with shift-variant architecture. The features are learned during off-line training that extracts both spatial and temporal information by considering image pairs of two consecutive frames rather than a single frame. The tracker extracts both local and global features to address partial occlusions and change in views. <ref type="bibr" target="#b6">Ji et al. (2013)</ref> use a 3D CNN model for pedestrian action recognition. The model extracts features from both spatial and temporal dimensions by performing 3D convolutions and captures motion information across multiple frames. <ref type="bibr" target="#b7">Jin et al. (2013)</ref> introduce a deep CNN for the task of tracking, which extracts features and transforms images to high dimensional vectors. A confidence map is generated by computing the similarities of two matches by using a radial basis function. <ref type="bibr" target="#b5">Hong et al. (2015)</ref> propose using outputs from the last layer of a pre-trained CNN to learn discriminative appearance models using an on-line Support Vector Machine (SVM). 
Subsequently, tracking is performed using sequential Bayesian filtering with a target-specific saliency map, which is computed by back-projecting the outputs of the last layer. <ref type="bibr" target="#b14">Wang et al. (2015)</ref> use features learned by a pre-trained CNN for on-line tracking. The CNN is fine-tuned during on-line tracking to adapt to the appearance of an object specified in the first frame of the sequence, and a probability map is generated instead of simple class labels. <ref type="bibr" target="#b15">Wang and Yeung (2013)</ref> train a stacked de-noising auto-encoder off-line and transfer knowledge from off-line training to the on-line tracking process to adapt to appearance changes of a moving target. <ref type="bibr" target="#b11">Nam and Han (2015)</ref> propose a tracking algorithm that learns domain-independent representations during pre-training and captures domain-specific information through on-line learning during tracking. The network has a simple architecture compared to those designed for image classification tasks. The entire network is pre-trained off-line, and the later fully connected layers, including a single domain-specific layer, are fine-tuned on-line. <ref type="bibr" target="#b10">Li et al. (2016)</ref> propose a tracking algorithm that uses a CNN to automatically learn the most useful feature representation of a particular target object. A tracking-by-detection strategy is followed to distinguish the target object from its background. The CNN generates scores for all possible hypotheses of object locations in a frame, and the tracker learns from the samples obtained from the current image sequence. <ref type="bibr" target="#b1">Chen et al. (2016)</ref> train a deep CNN, transfer the learned parameters to the tracking task and construct an object appearance model, using initial and on-line training to update this model. Despite the success of CNNs, only a limited number of tracking algorithms exploiting CNNs (discussed above) have been proposed so far in the literature. Moreover, previous works have not integrated detection and tracking simultaneously with CNNs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head><p>The developed framework uses CNNs both to detect pedestrians within the frames and track across the frames. A state-of-the-art object detection framework, Faster R-CNN <ref type="bibr" target="#b12">(Ren et al., 2015)</ref> is used for the detection of pedestrians. The features used for the tracking are derived from a pre-trained CNN (Fig. <ref type="figure" target="#fig_0">1</ref>) and serve as a strong basis for object recognition. The proposed algorithm<ref type="foot" target="#foot_1">1</ref> for creating correspondence is closest to the appearance based kernel tracking, but a robust representation is developed by imposing weights for appearance and spatial information.  A simplified layout of the framework is provided in Fig. <ref type="figure" target="#fig_1">2</ref>. The CCTV image frames are input to the detector that detects and localises individual pedestrians. Features from the cropped images of the pedestrians are extracted by a pre-trained CNN. The developed algorithm is used to make correspondences of the detections across the frames and ids are allocated to individual detections. Tracking results are shown by overlaying the ids of the detections on the respective frames.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Definitions</head><p>The last layer before the classification layer (FC-8) of the CNN generates a vector of 1000 features for each input image. Individual detections from each frame are in form of a bounding box around pedestrians. Subsequently, the detections are cropped and fed to the CNN that generates a matrix of feature vectors. Mathematically this can be represented as Eq. 1, where F V i(k) denotes the matrix of feature vectors of the i detections for a single Frame k, the set {A (1,i(k)) , ...., A (1000,i(k)) } denotes the activations of i th detection in Frame k, and i(k) denotes the number of detections in Frame k.</p><formula xml:id="formula_0">F V i(k) =    A (1,1) . . . A (1,i(k)) . . . . . . . . . A (1000,1) . . . A (1000,i(k))    (1) P C i(k) =    x (1,k) . . . x (i,k) . . . . . . . . . y (1,k) . . . y (i,k)    (2)</formula><p>The centroids of the detections can be expressed by Eq. 2. Where P C i(k) denotes the matrix of x and y coordinates of the centroids of i detections in Frame k. Correspondence is established by calculating a feature distance and a pixel distance between every pair of detections in two consecutive frames. Let F V i(k) and F V j(k+1) denote respectively the feature vectors for i and j detections in Frame k and Frame k +1. The normalised feature distance between the two detections F d(i(k),j(k+1)) is expressed as Eq. 3, where |F V | denotes l 2 -norm of a real vector FV. Let P C i(k) and P C j(k+1) denote the centroids for i and j detections in Frame k and Frame k + 1 respectively. Similarly the normalised pixel distance between the two detections P d(i(k),j(k+1)) is expressed as Eq. 4.</p><formula xml:id="formula_1">F d(i(k),j(k+1)) = | F V i(k) − F V j(k+1) | | F V i(k) || F V j(k+1) | (3) P d(i(k),j(k+1)) = | P C i(k) − P C j(k+1) | | P C i(k) || P C j(k+1) |<label>(4)</label></formula><p>A distance matrix F d(k+1) for the feature vectors is generated from the normalised pairwise feature distances and is represented by Eq. 5. A distance matrix for the pixel distances P d(k+1) is generated from the normalised pairwise pixel distances and is represented by Eq. 6. The matrices F d(k+1) and P d(k+1) are combined using a weight w (0 ≤ w ≤ 1). The combination result is called a tracking matrix T d(k+1) and is defined by Eq. 7.</p><p>Where t i(k),j(k+1) represents the weighted additions of F d(i(k),j(k+1)) and P d(i(k),j(k+1)) .</p><formula xml:id="formula_2">F d(k+1) =    F d(1,1) . . . F d(1,j(k+1)) . . . . . . . . . F d(i(k),1) . . . F d(i(k),j(k+1))    (5) P d(k+1) =    P d(1,1) . . . P d(1,j(k+1)) . . . . . . . . . P d(i(k),1) . . . P d(i(k),j(k+1))    (6) T d(k+1) = (w)P d(k+1) + (1 − w)E d(k+1) =    t (1,1) . . . t (1,j(k+1)) . . . . . . . . . t (i(k),1) . . . t (i(k),j(k+1))    (7)</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Algorithm</head><p>In the first frame, the ids are generated randomly and tracked in the subsequent frames. The number of generated ids in the first frame is equal to the number of detections. For the detections in the subsequent frames either an id is assigned from the previous frame (which involves the matching based on the minimum distance criteria) or a new id is generated (which is for the case a new person enters the frame). Let the set {t (1,i(k)) , ...., t (i(k),j(k+1)) } denote the weighted distances from the detection j in Frame k + 1 to all detections in Frame k. The minimum value of the set {t (1,i(k)) , ...., t (i(k),j(k+1) } is used to make correspondence of j(k + 1) th detection in Frame k + 1 to the 1 st , ...., i(k) th detections in Frame k, only if this minimum value is below a threshold. Fig. <ref type="figure" target="#fig_2">3</ref>(a) illustrates the process of establishing correspondences for this case, where detection 1 of frame k + 1 is compared with i(k) detections of Frame k for a correspondence. If the minimum value of the set {t (1,i(k)) , ...., t (i(k),j(k+1)) } for a detection j(k + 1) in Frame k + 1 is above the threshold, no correspondence is made to the Frame k, but the detection is compared to the detections of previous z frames for a match. This is explained in Fig. <ref type="figure" target="#fig_2">3(b)</ref>, where detection 1 of frame k + 1 is compared with all the detections from Frame k to Frame k − z and each frame can contain different number of detections (a,b,g,h,i and j).</p><p>If a match is found, a correspondence of j(k + 1) th detection is made to the corresponding id of the detection in (k − z) th frame . If there is no match after comparing the previous z frames, the detection is assumed as a new pedestrian entering the frame. The new pedestrian is allocated a new id and it is tracked in the subsequent frames. If a pedestrian leaves the scene or is totally occluded in Frame k + 1, the the corresponding detection in Frame k will not have any match in Frame k + 1, but, that id will be stored in the database for future correspondences. However, if the algorithm is able to re-identify the pedestrian after total occlusion in the z previous frames, it is allocated the corresponding id of the detection in the (k − z) th frame.</p><p>Multiple correspondences from j(k + 1) detections to i(k) th detection might happen, if j(k + 1) &gt; i(k). Such situations may be resolved by creating a correspondence of j(k + 1) t h detection to the n(k) th detection having the least value of the set {t (1,i(k)) , ...., t (i(k),j(k+1)) }. The correspondence of unallocated detections is done by using the second least value of the set {t (1,i(k)) , ...., t (i(k),j(k+1)) } if it is below the threshold. If not, then the unallocated detections are compared to the detections of previous z frames for a match.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head><p>Town centre dataset<ref type="foot" target="#foot_2">2</ref> was used for evaluation in the study. First 30 seconds of the video was used at a reduced frame rate of 8 frames per second. The detection and tracking are evaluated separately using tracking matrices. The results of the detections and tracking are shown in Fig. <ref type="figure" target="#fig_3">4</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Evaluation of the detector and the tracker</head><p>The average precision and recall of the detector for used data are 0.93 and 0.80 respectively. High precision means that most detected objects are actually pedestrians and high recall means that most pedestrians in the scene are detected. The variation of precision and recall with frames is shown in Fig. <ref type="figure" target="#fig_4">5</ref>. Multi-object tracking precision (MOTP) and multi-object tracking accuracy (MOTA) <ref type="bibr" target="#b0">(Bernardin and Stiefelhagen, 2008)</ref> matrices are used for the evaluation of the tracker. The matrices are used for objective comparison of tracker characteristics on their precision in estimating object locations, their accuracy in recognising object configuration and their ability to consistently label objects over time. MOTP and MOTA are expressed mathematically as:</p><formula xml:id="formula_3">M OT P = i,t d i t t C t (8) M OT A = 1 − t (m t + f p t + mme t ) t g t<label>(9)</label></formula><p>Where d i t is the distance between the detection and the i t h pedestrian (from the ground truth) and C t is the number of matches found in time t. m t , f p t and mme t are the number of misses in detection (false negatives), number of false positives and the number of mismatches in the correspondence respectively, and g t represents the number of pedestrians present at time t. MOTP is the total error in estimated position of detections over all frames, averaged by the number of correspondences made. Higher value of MOTP signifies low accuracy of the bounding boxes around the object. Higher values of MOTA signifies high accuracy in tracking. Experimental results of MOTP and MOTA for the dataset are 27.92 pixels and 71.13 % respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion</head><p>Tracking is achieved by creating a correspondence of detections of two consecutive frames only (provided that there are no multiple correspondences or correspondence to z previous frames). Hence, the appearance of pedestrians are updated over time and the framework is robust to change in appearance (pose, shape and scale). The detector misses some of the detections due to total occlusions and hence explains the low value of recall. Another contributor to lower recall values is that the detector misses pedestrians that appear smaller due to their distance to the camera. This can be alleviated in a multi-camera setting, where pedestrians that are missed in one camera are likely to be detected in another camera. On a closer observation, the low value of precision is due to the false detections created by the reflection of the pedestrians in a glass panel that is present in the dataset. High value of MOTP is due to the inaccuracy of the bounding boxes of the detected pedestrians. This is insignificant considering the high resolution of the dataset. Low value of MOTA is mainly due to the large number of misses in the detection and partially due to the false detections and mismatches in the correspondence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>A framework is developed for real-time detection and tracking of pedestrians in CCTV image frames using CNNs. A new algorithm is developed for making correspondence of the detections across multiple frames. The detector is able to overcome the challenges of variations in the illumination, cluttered backgrounds, partial occlusions and changes in the scale. The tracking algorithm is able to track pedestrians with 71.13 % accuracy and addresses the problem of changes in appearance (pose and shape) and total occlusions for short periods. However, total occlusions for longer periods remains a challenge to be addressed for future work. To improve the accuracy, it is proposed to perform the evaluation and estimation of pedestrians' future trajectories from past observations (e.g. Kalman filtering) for overcoming the problem of unpredictable pedestrian movements. To address the problem of total occlusions and similar persons, an average representation of individual pedestrians (for all the tracked frames) can be used.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A simplified architecture of the deep CNN (after Krizhevsky et al. (2012)). Fully connected -8 (FC-8) layer is the last layer before the classification, from which the features are extracted.</figDesc><graphic coords="3,190.19,191.55,231.90,102.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A simplified layout of the framework.</figDesc><graphic coords="3,65.55,326.65,484.50,87.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The algorithm used for correspondence. Green colour represents the unallocated detections. Blue colour represents the detections of the previous frame/ frames for comparing. Yellow colour represents the rest of the detections that are present in the database.Let the set {t (1,i(k)) , ...., t (i(k),j(k+1)) } denote the weighted distances from the detection j in Frame k + 1 to all detections in Frame k. The minimum value of the set {t (1,i(k)) , ...., t (i(k),j(k+1) } is used to make correspondence of j(k + 1) th detection in Frame k + 1 to the 1 st , ...., i(k) th detections in Frame k, only if this minimum value is below a threshold. Fig.3(a) illustrates the process of establishing correspondences for this case, where detection 1 of frame k + 1 is compared with i(k) detections of Frame k for a correspondence. If the minimum value of the set {t (1,i(k)) , ...., t (i(k),j(k+1)) } for a detection j(k + 1) in Frame k + 1 is above the threshold, no correspondence is made to the Frame k, but the detection is compared to the detections of previous z frames for a match. This is explained in Fig.3(b), where detection 1 of frame k + 1 is compared with all the detections from Frame k to Frame k − z and each frame can contain different number of detections (a,b,g,h,i and j).If a match is found, a correspondence of j(k + 1) th detection is made to the corresponding id of the detection in (k − z) th frame . If there is no match after comparing the previous z frames, the detection is assumed as a new pedestrian entering the frame. The new pedestrian is allocated a new id and it is tracked in the subsequent frames. If a pedestrian leaves the scene or is totally occluded in Frame k + 1, the the corresponding detection in Frame k will not have any match in Frame k + 1, but, that id will be stored in the database for future correspondences. However, if the algorithm is able to re-identify the pedestrian after total occlusion in the z previous frames, it is allocated the corresponding id of the detection in the (k − z) th frame.Multiple correspondences from j(k + 1) detections to i(k) th detection might happen, if j(k + 1) &gt; i(k). Such situations may be resolved by creating a correspondence of j(k + 1) t h detection to the n(k) th detection having</figDesc><graphic coords="4,88.89,353.11,434.52,105.12" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: (a) shows the detections in video sequence that are 10 frames apart. (b) shows the tracking results. The number denoting each pedestrian is generated randomly in the first frame.</figDesc><graphic coords="5,64.80,161.91,485.00,78.41" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The variation of precision and recall with frames.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">Proc. of the 4th Annual Conference of Research@Locate</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1">For MATLAB implementation visit https://github.com/debaditya − unimelb/CN N pedestrian tracking/. Proc. of the 4th Annual Conference of Research@Locate</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2">available at: http://www.robots.ox.ac.uk/ActiveV ision/Research/P rojects/2009bbenf old headpose/project.html Proc. of the 4th Annual Conference of Research@Locate</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This research was supported by a Research Engagement Grant from the Melbourne School of Engineering and the Melbourne Research Scholarship. The authors thank Active Vision Laboratory, Department of Engineering Science, University of Oxford for the publicly available dataset and ground-truth data.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Evaluating multiple object tracking performance: The clear mot metrics</title>
		<author>
			<persName><forename type="first">K</forename><surname>Bernardin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stiefelhagen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">EURASIP Journal on Image and Video Processing</title>
		<imprint>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">246309</biblScope>
			<date type="published" when="2008">2008. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Cnntracker: Online discriminative object tracking via deep convolutional neural network</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Soft Computing</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="1088" to="1098" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Decaf: A deep convolutional activation feature for generic visual recognition</title>
		<author>
			<persName><forename type="first">J</forename><surname>Donahue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tzeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="647" to="655" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Human tracking using convolutional neural networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="1610" to="1623" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Boosting object detection performance in crowded surveillance videos</title>
		<author>
			<persName><forename type="first">R</forename><surname>Feris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Datta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pankanti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Workshop on Applications of Computer Vision</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="427" to="432" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Online tracking by learning discriminative saliency map with convolutional neural network</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>You</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kwak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Han</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1502.06796</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">3d convolutional neural networks for human action recognition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="221" to="231" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Tracking with deep neural networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dundar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Farabet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Culurciello</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Sciences and Systems (CISS), 2013 47th Annual Conference on</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Imagenet classification with deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 25</title>
				<editor>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">J C</forename><surname>Burges</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1097" to="1105" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Gradient-based learning applied to document recognition</title>
		<author>
			<persName><forename type="first">Y</forename><surname>LeCun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Haffner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE</title>
				<meeting>the IEEE</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="2278" to="2324" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Deeptrack: Learning discriminative feature representations online for robust visual tracking</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Porikli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Image Processing</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="1834" to="1848" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Learning multi-domain convolutional neural networks for visual tracking</title>
		<author>
			<persName><forename type="first">H</forename><surname>Nam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Han</surname></persName>
		</author>
		<idno>Repository abs/1510.07945</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">Computing Research</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Faster r-cnn: Towards real-time object detection with region proposal networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 28</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Cortes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Lawrence</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sugiyama</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="91" to="99" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Visual tracking: An experimental survey</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Smeulders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cucchiara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Calderara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dehghan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="1442" to="1468" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Transferring rich feature hierarchies for robust visual tracking</title>
		<author>
			<persName><forename type="first">N</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yeung</surname></persName>
		</author>
		<idno>Repository abs/1501.04587</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">Computing Research</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Learning a deep compact image representation for visual tracking</title>
		<author>
			<persName><forename type="first">N</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Y</forename><surname>Yeung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 26</title>
				<editor>
			<persName><forename type="first">C</forename><forename type="middle">J C</forename><surname>Burges</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="809" to="817" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Intelligent multi-camera video surveillance: A review</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="3" to="19" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Object tracking: A survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Yilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Javed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
