<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">AUTH-Sheep: An Annotated Video Dataset for Detection and Tracking of Sheep in UAV Imagery</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Oliver</forename><surname>Doll</surname></persName>
							<email>oliver.doll@idmt.fraunhofer.de</email>
						</author>
						<author>
							<persName><forename type="first">Alexander</forename><surname>Loos</surname></persName>
							<email>alexander.loos@idmt.fraunhofer.de</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Audio-Visual Systems</orgName>
								<orgName type="institution">Fraunhofer IDMT</orgName>
								<address>
									<addrLine>Ehrenbergstr. 31</addrLine>
									<postCode>98693</postCode>
									<settlement>Ilmenau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">International Workshop on Camera Traps, AI, and Ecology</orgName>
								<address>
									<addrLine>September 5-6</addrLine>
									<postCode>2024</postCode>
									<settlement>Hagenberg</settlement>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">AUTH-Sheep: An Annotated Video Dataset for Detection and Tracking of Sheep in UAV Imagery</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6E208344988F1CEC6C6F112EA907C590</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T20:07+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>dataset</term>
					<term>OBB</term>
					<term>sheep detection</term>
					<term>MOT</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Object detection and tracking in drone imagery is still an open research field, especially for livestock monitoring and when detection is carried out on the drone itself. In this paper, we present the first annotated aerial video dataset of sheep, which we will make publicly available to the research community to foster further research in this field. Our AUTH-Sheep dataset consists of 4 videos with frame-accurate annotations of oriented bounding boxes and consistent track IDs per object and video. Furthermore, we developed a full detection and tracking pipeline as a baseline implementation to give other researchers a reference approach to compare their algorithms against. To this end, we compared horizontal and oriented bounding box detection for the task at hand, utilizing a YOLOv8 nano detector pre-trained on a different dataset. To be able to train this detector on oriented bounding boxes, we semi-automatically created new oriented annotations for an existing dataset of sheep images.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Recently, unmanned aerial vehicles (UAVs) equipped with camera systems and edge computing devices have become a promising alternative to ground-based camera traps for monitoring both wild and livestock animals. Their technical capabilities open up new perspectives on monitoring: UAVs can typically only fly and record for several minutes, but at the same time they can cover a larger area than camera traps. In this paper, we focus on the livestock farming use-case. In particular, we consider free-ranging sheep living unattended on the island of Lesvos, Greece. The goal is to develop a system for autonomous detection and tracking of sheep to enable reliable counting and monitoring of the flock. Usually, flocks of sheep are supervised by a shepherd who is continuously present to keep track of their numbers, health and position. In the case of free-range sheep, there is no such authority and those responsible must carry out checks at regular intervals. These inspections can be difficult to carry out on terrain that is hard to access and where visibility is limited. UAVs are well suited to overcoming these difficulties, as they are not restricted by the terrain on the ground. However, for drones to be a practical solution for the task at hand, the information obtained from them must be accurate and reliable. Instead of manual inspection of the obtained video footage, recent developments in deep-learning based computer vision methods for object detection and tracking have paved the way for fast and accurate automatic analysis. One possible way to realize this is to stream videos from the drone to the ground and use dedicated hardware as well as large, cutting-edge deep learning models for sheep detection and tracking. 
Unfortunately, streaming high-quality videos in real-time from a drone to the ground is often not trivial and hardly feasible, especially in areas without suitable infrastructure. An arguably more practical approach is to integrate the necessary computer vision algorithms on the UAV itself and only stream the resulting metadata to the ground, which requires drastically less bandwidth than streaming the video directly. However, this means that the complexity of the algorithms must be kept to a minimum, as the available computing power is limited.</p><p>In this paper, we present the first publicly available annotated dataset of aerial videos of sheep recorded at the University Farm of the Aristotle University of Thessaloniki (A.U.Th.). Our AUTH-Sheep dataset consists of 4 videos with frame-accurate ground truth annotations of oriented bounding boxes and consistent track IDs per object and video. By providing such a dataset together with a baseline implementation of a full detection and tracking pipeline, we hope to stimulate further research in this field. As object detector, we build on the YOLOv8 nano model, which we found to be most suitable in our previous work <ref type="bibr" target="#b0">[1]</ref>. In our experiments, we compare the utilization of horizontal bounding boxes (HBB) and oriented bounding boxes (OBB) for detection directly on the drone. Sheep are often clustered in flocks and their bounding boxes overlap heavily, which often introduces ambiguity during tracking. We argue that using OBB instead of HBB greatly reduces this ambiguity, so more accurate results can be expected. On top of that, a state-of-the-art tracking algorithm is tested on the obtained detections in order to assign unique object IDs to the detected sheep. 
This allows for more accurate counting and possibly even additional traits such as animal welfare assessment.</p><p>To enable a comparison of horizontal and oriented object detection, we semi-automatically created new annotations based on the available rectangular ground truth regions of a publicly available UAV image dataset of sheep named SheepCounter <ref type="bibr" target="#b1">[2]</ref>.</p><p>The dataset and scripts will be made publicly available at https://github.com/idmt-odoll/AUTH-Sheep/.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Animal Detection in Aerial Images</head><p>Despite advancements in object detection, detecting animals in UAV imagery is still a challenging task and requires accurate detection models. At the same time, energy-efficient models are desired to enable implementation on edge devices. Thus, recent trends in computer vision investigate possibilities for smaller and more efficient models which do not suffer from a significant drop in accuracy.</p><p>In <ref type="bibr" target="#b2">[3]</ref>, YOLOv4 and YOLOv5 models were compared for counting cattle at various altitudes from 20 to 100 m, with YOLOv5 outperforming YOLOv4 and all models exceeding a precision of 92 %. Interestingly, the simpler YOLOv5-s model outperformed the more complex YOLOv5-m model. Wang et al. <ref type="bibr" target="#b3">[4]</ref> enhanced the YOLOX nano model for small object detection, a common weakness of YOLO detectors, enabling detection of cattle, sheep and horses at an altitude of 300 m. They found that detection performance decreased as the scale difference to the training data increased, though to a different degree for each class. For common cranes, <ref type="bibr" target="#b4">[5]</ref> showed that automatic counting with the YOLOv3 model (99.91 % precision, 94.59 % recall) was more accurate than manual counting for RGB images at daylight. In <ref type="bibr" target="#b5">[6]</ref>, YOLOv4 outperformed YOLOv3 and SSD in detecting deer, achieving 86 % precision and 75 % recall. 
A different approach in <ref type="bibr" target="#b6">[7]</ref> used a segmentation algorithm based on species-specific sRGB color profiles, achieving 100 % precision and 98.87 % recall for Arabian Oryx.</p><p>In our own previous work, we presented initial findings by comparing the performance of different state-of-the-art object detectors on publicly available UAV images of sheep <ref type="bibr" target="#b0">[1]</ref> in order to be able to better pre-select potential object detectors for the task at hand. In this paper, we will build on our previous work, where we showed that the nano version of the YOLOv8 model series is best suited for sheep detection in aerial imagery on edge devices. It will thus be utilized throughout the experiments in this paper as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Multiple Object Tracking</head><p>Multiple object tracking (MOT) is the task of detecting and associating objects of a specific class across a video. One approach to accomplish this is to use heuristic information such as spatial-based and appearance-based cues. In our work, we focus on trackers that rely primarily on spatial information. For short time intervals between frames, the movement of an object is likely to be small and can usually be treated as linear. Most of these works, pioneered by SORT (Simple Online and Realtime Tracking) <ref type="bibr" target="#b7">[8]</ref>, utilize a Kalman filter <ref type="bibr" target="#b8">[9]</ref> to predict the location of an object in the new frame based on its previous movement. The association is then performed using the Intersection over Union (IoU) metric. ByteTrack improves this approach by introducing a two-stage association step <ref type="bibr" target="#b9">[10]</ref>. In the first step, the high-confidence detections are matched. A new feature is the matching of low-confidence detections in the second step, which can include partially occluded and motion-blurred objects. BoT-SORT builds on ByteTrack and introduces an improved Kalman filter and camera motion compensation, resulting in better predictions of the object positions in new frames <ref type="bibr" target="#b10">[11]</ref>. OC-SORT, on the other hand, improves the prediction of new object positions during occlusion and non-linear movement <ref type="bibr" target="#b11">[12]</ref>. It computes a virtual trajectory using measurements of the object detector and allows matching with lost tracks.</p></div>
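As an illustration of this spatial association principle, a minimal greedy IoU matcher can be sketched as follows. This is a simplification for illustration only; SORT and its successors solve the assignment with Hungarian matching on a full IoU cost matrix, and all names here are our own:

```python
def iou(a, b):
    """IoU of two horizontal boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match predicted track boxes to new detections by IoU."""
    matches, unmatched = [], list(range(len(detections)))
    for t_idx, t_box in enumerate(tracks):
        # pick the unmatched detection with the highest overlap
        ranked = sorted(unmatched, key=lambda d: iou(t_box, detections[d]),
                        reverse=True)
        if ranked and iou(t_box, detections[ranked[0]]) > iou_threshold:
            matches.append((t_idx, ranked[0]))
            unmatched.remove(ranked[0])
    return matches, unmatched
```

Unmatched detections then spawn new tracks, and tracks without a match are kept alive for a few frames before being discarded.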
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Datasets</head><p>Two different datasets were used in this work. The SheepCounter dataset was used for training and validation of the YOLO detectors. For testing, the AUTH-Sheep dataset was used, which will be discussed in detail in section 3.2. The images of both datasets have a resolution of 3840 x 2160 pixels, while there are some images with a resolution of 4096 x 2160 pixels in SheepCounter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">SheepCounter</head><p>The SheepCounter dataset is available at Roboflow and consists of 1727 images. They have green meadows as backgrounds with different lighting conditions, saturation and shadow lengths. The images stem from several flights, but only selected frames were kept. Most of the sheep are white. Besides sheep, a few cows appear, but they are not annotated. The original annotations contain 55 435 instances of sheep.</p><p>We used these rectangular annotations as a basis and transformed them into oriented bounding boxes to be able to train and evaluate OBB detectors. First, we utilized Meta AI's Segment Anything Model (SAM) to generate one segmentation mask per bounding box <ref type="bibr" target="#b12">[13]</ref>. If multiple masks were generated, only the largest was kept. The image moments of the object are then calculated, which allow the determination of a major and minor axis and the orientation of the object with respect to its major axis <ref type="bibr" target="#b13">[14]</ref>. In the next step, this orientation is used to align the major axis with the x-axis. This allows the smallest box around the region contour and parallel to the major axis to be determined as a simple horizontal box. By reversing the alignment, the oriented bounding box is obtained.</p><p>Subsequently, the new bounding boxes were manually verified using CVAT, a publicly available tool commonly used by researchers for ground truth annotation of images and videos <ref type="bibr" target="#b14">[15]</ref>. A common problem was that a single sheep was covered by multiple bounding boxes while the original bounding boxes of other sheep had been erased. Other problems were multiple animals per bounding box or shadows segmented by SAM as part of the animal. A few images had no annotations at all or not all animals were annotated. 
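The moment-based orientation estimation described above can be sketched as follows. This is a simplified NumPy illustration of the general idea, not the actual annotation script:

```python
import numpy as np

def obb_from_mask(mask):
    """Derive an oriented bounding box (cx, cy, w, h, theta) from a binary
    segmentation mask via image moments."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    x, y = xs - cx, ys - cy
    # second-order central moments of the region
    mu20, mu02, mu11 = (x * x).mean(), (y * y).mean(), (x * y).mean()
    # orientation of the major axis
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    # rotate the pixel coordinates so the major axis aligns with the x-axis,
    # then take the extent of the now axis-aligned region
    c, s = np.cos(-theta), np.sin(-theta)
    xr, yr = c * x - s * y, s * x + c * y
    return cx, cy, xr.max() - xr.min(), yr.max() - yr.min(), theta
```

Reversing the rotation of the resulting four corner points then yields the oriented box in image coordinates.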
The new annotations include 56 681 oriented bounding boxes and also cover partial sheep that appear at the edge of the image. Horizontal bounding boxes for the comparison of OBB vs. HBB algorithms were created by taking the minimum and maximum pixel positions of the oriented box in each direction. These new horizontal annotations were created to ensure that the annotations from SheepCounter and the new AUTH-Sheep dataset have similar label quality. For AUTH-Sheep, there are no such best-fitting horizontal annotations to work with; they would have to be created from scratch, which would have meant a lot of extra work on top of the oriented bounding boxes.</p><p>The size of objects is directly related to the altitude of the UAV. To better estimate the altitude at which the detectors can reliably detect sheep, objects are classified by their bounding box size in each frame. Inspired by the COCO dataset, five scale groups were defined and evaluated separately <ref type="bibr" target="#b15">[16]</ref>. We found it necessary to define new groups because the area sizes for COCO were introduced for an image size of 640 x 480 pixels. The imagery used in this work has a minimum resolution of 3840 x 2160 pixels, which results in a completely different scale of objects. The five new scale groups are named nano, small, medium, large and extended. Objects in the nano group cover less than 64² pixels. The thresholds for small, medium and large objects are 96², 128² and 160², respectively, while all objects larger than 160² are considered extended. Since young animals are usually smaller than older ones, they are categorized as correspondingly smaller objects for most of the recording altitudes, which can introduce a certain bias. For our work, we ignore this because we only evaluate object size without paying attention to the age of animals. SheepCounter was used for training and validation, but not for testing. 
Also, the frames of the original videos seem to be evenly distributed among the predefined training, validation, and test sets. This leads to similar frames in all three subsets, which is not beneficial for testing the generalization capability of the model. In an attempt to correct this, the SheepCounter dataset was restructured. The restructured dataset consists of a training and a validation split only. All frames were sorted into their original source videos based on their naming and content, resulting in five source videos. These videos were then manually split between the two subsets, which reduces the amount of very similar samples in both of them.</p><p>The new dataset annotations are broken down in more detail in Table <ref type="table" target="#tab_0">1</ref>. As expected, the horizontal bounding boxes are more often categorized into larger groups than the oriented boxes. While for oriented boxes the most common objects are categorized as small or medium, most horizontal boxes are categorized as medium or large. Oriented boxes have an average area of 10 021 pixels, while horizontal boxes have an average area of 18 618 pixels.</p></div>
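The conversion of an oriented box into its enclosing horizontal box and the assignment to the five scale groups could look like this (illustrative helper functions under the thresholds stated above, not the original scripts):

```python
import math

def obb_to_hbb(cx, cy, w, h, theta):
    """Smallest axis-aligned box (x1, y1, x2, y2) around an oriented box."""
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    half_w = (w * c + h * s) / 2.0
    half_h = (w * s + h * c) / 2.0
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h

def scale_group(area):
    """Assign a box area (in pixels) to one of the five new scale groups."""
    if area >= 160 ** 2:
        return "extended"
    if area >= 128 ** 2:
        return "large"
    if area >= 96 ** 2:
        return "medium"
    if area >= 64 ** 2:
        return "small"
    return "nano"
```

As the horizontal box always encloses the oriented one, its area is larger or equal, which explains why HBB instances shift into larger scale groups than their OBB counterparts.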
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">AUTH-Sheep</head><p>The new dataset we present in this paper consists of four videos recorded at the University Farm of the Aristotle University of Thessaloniki (A.U.Th.). Figure <ref type="figure" target="#fig_0">1</ref> shows the first, middle, and last frame of each video, which gives some idea of the movement of the drone and objects. The drone was moving in all the videos, constantly changing its position and altitude, but with different patterns. Videos 1 and 2 were recorded at the same location but at different times, with goats also present in video 2. Video 3 was recorded at a different location, and both the animals and the camera movement are the least dynamic of all the videos. Video 4 seems to be the most challenging recording, with the highest altitude and most clustered sheep. The combined length of all the videos in the dataset is 2:58 minutes, or 5328 frames, and contains a total of 152 837 annotated instances. The annotations consist of oriented bounding boxes and unique object IDs, which allow the evaluation of tracking algorithms. For each video, the ID of an object remains the same, even if the object leaves the frame or is occluded for some time. Four different classes are annotated, namely goats, horses, humans and sheep. Unfortunately, metadata with accurate information about the drone's altitude, speed, and orientation is missing.</p><p>A more detailed overview of the dataset is presented in Table <ref type="table" target="#tab_1">2</ref>, including the length of the videos and the number of instances for each class.</p><p>For our experiments, we focus only on the sheep class with a total of 95 533 object instances. In Table <ref type="table" target="#tab_2">3</ref>, these instances are broken down per video and scale group. Since video 4 contains almost exclusively nano and small instances, it can be said with a high degree of certainty that this video was recorded at the highest average flight altitude. Video 2 also has only 2 extended instances, but is more balanced in the remaining four groups than video 4. 
The most balanced video seems to be video 1. The lowest average altitude can be expected in video 3, where almost all instances are medium to extended.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Object Detection</head><p>For the detection task, two different variants of the YOLOv8-nano model were compared. The first was pre-trained on the COCO dataset <ref type="bibr" target="#b15">[16]</ref> and predicts horizontal bounding boxes, hence this version will be referred to as the HBB model. A second variant was pre-trained on the DOTAv1 dataset <ref type="bibr" target="#b16">[17]</ref> and predicts oriented bounding boxes, hence this variant is called the OBB model. All pre-trained models used were provided by Ultralytics, whose framework was also used for the transfer learning for the task at hand. Both models were fine-tuned via transfer learning and validated on the restructured SheepCounter dataset described in section 3.1. The loss was monitored on the validation set until convergence: if no improvement in the mAP50-95 score was observed for the last 50 epochs, the training was stopped. The model layers were not frozen and all weights could be adjusted. For augmentations during transfer learning, the standard Ultralytics hyperparameters optimized for the COCO challenge were used. These augmentations include translation, scaling, left-right flipping, altering of the HSV color space, and erasing random portions of the image. Only the mosaic augmentation was disabled, as previous experiments showed that this improves the learning process for our use case. The batch size was set to 16 and the AdamW optimizer was used with an initial learning rate of 0.002 and a momentum of 0.9. As a post-processing step, only predictions with a confidence of 0.25 or higher were kept and non-maximum suppression was performed with an IoU threshold of 0.6.</p></div>
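Within the Ultralytics framework, the described training setup corresponds roughly to the following call. This is a sketch only; the dataset YAML name is a placeholder, and default values are assumed for all settings not mentioned in the text:

```python
from ultralytics import YOLO

# OBB variant pre-trained on DOTAv1; the HBB model uses "yolov8n.pt" (COCO) instead
model = YOLO("yolov8n-obb.pt")

model.train(
    data="sheepcounter.yaml",  # placeholder for the restructured SheepCounter config
    batch=16,
    optimizer="AdamW",
    lr0=0.002,       # initial learning rate
    momentum=0.9,
    mosaic=0.0,      # disable the mosaic augmentation
    patience=50,     # stop if the validation score does not improve for 50 epochs
)
```

No layers are frozen here, so all weights remain trainable, matching the setup described above.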
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Metrics</head><p>The main metric used was the COCO variant of the mean average precision (mAP). The mAP is the mean value of the average precision (AP) over all classes averaged over ten IoU thresholds 𝐼𝑜𝑈 = 0.5, 0.55, ..., 0.95. In accordance with the Ultralytics framework <ref type="bibr" target="#b17">[18]</ref> used for the experiments, this metric is called mAP50-95 in the following. In addition, the mean average recall (mAR) is also used in the same version, resulting in the mAR50-95 score. To evaluate the oriented bounding boxes, they were treated as segmentation masks. For better insight, the mAP50-95 is also calculated for the five scale groups defined in section 3.1. The evaluation was performed using pycocotools, an API for the evaluation methods used for COCO.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Comparison of two model input sizes with a similar amount of pixels in terms of used area when the input image has a 16:9 aspect ratio. It is assumed that the image is padded to the full model input size. MACs (in billions), as a measure of computational effort, have been calculated for ONNX models.</p></div>
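The averaging behind the mAP50-95 and mAR50-95 scores can be written down compactly (the per-threshold AP values in the usage example are illustrative):

```python
import numpy as np

# ten IoU thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)

def map50_95(ap_per_threshold):
    """COCO-style mAP50-95: the AP (already averaged over all classes)
    averaged again over the ten IoU thresholds."""
    assert len(ap_per_threshold) == len(IOU_THRESHOLDS)
    return float(np.mean(ap_per_threshold))
```

Since only one class (sheep) is evaluated here, the mAP equals the per-class AP before the threshold averaging.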
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Model Input Size</head><p>When applying the object detector on the edge, power is limited and hence should be used optimally. Typically, deep learning models expect square images as input, while the actual images are often non-square. These images are then typically padded to fit the input size of the deep learning model, which introduces unnecessary data and thus avoidable overhead. Since we already knew that most of the training images and all the test videos had a 16:9 aspect ratio, a fixed new input size with a similar aspect ratio was calculated. Two boundary conditions were taken into account. First, the YOLO model used has five downsampling layers, which requires the input size to be a multiple of 2⁵ = 32. Second, the new input should not contain more pixels than the original input size of 640 x 640.</p><p>The new model input size was set to 832 x 480 pixels. Table <ref type="table">4</ref> shows the theoretical comparison with the standard input size of 640 x 640. While the amount of pixels for the new input size is 2.5 % lower, the percentage of the input area used increases by 41.25 percentage points to a total of 97.5 %. Therefore, only minimal additional padding at the edge of the image is to be expected. As expected, the computational effort, expressed in MACs (Multiply-Accumulate Operations), decreases by 2.5 %, proportional to the amount of pixels.</p><p>Comparing the actual results on the validation set, it is clear that the new model input size improves performance for both model types and for all metrics used. For the HBB model, all mAP metrics were improved by about 0.06 for all scale groups, except the nano objects. The mAR50-95 score also increased by 0.053. For the OBB model, however, the improvement is not as significant. mAR50-95 improved by 0.043 and mAP50-95 by 0.042. 
The largest gain was seen for small objects (0.056) and the smallest gain for extended objects (0.012). A notable result is that although the OBB model is better than the HBB model in all scale groups except medium objects, the value for mAP50-95 (all) is lower. This can be attributed to the fact that there are more small and nano objects for OBB, and the models generally perform worse on these compared to medium to large objects. </p></div>
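The choice of 832 x 480 can be reproduced with a small search over stride-aligned sizes (an illustrative script with assumed search bounds, not the original calculation):

```python
def best_input_size(aspect=(16, 9), budget=640 * 640, stride=32):
    """Find the largest stride-aligned (w, h) near the target aspect ratio
    whose pixel count stays within the budget."""
    best = None
    for w in range(stride, 2048, stride):
        # height closest to the target aspect ratio, rounded to the stride
        h = round(w * aspect[1] / aspect[0] / stride) * stride
        if h == 0 or w * h > budget:
            continue
        if best is None or w * h > best[0] * best[1]:
            best = (w, h)
    return best
```

For a 16:9 image letterboxed into 832 x 480, the used area is 468/480 = 97.5 % of the input, matching the value in Table 4.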
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>Detection results of the HBB model on the AUTH-Sheep dataset. For videos 2 and 4, there are no mAP50-95 results for extended objects because there were only 1 and 2 ground truth instances, respectively. There were no nano objects in video 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Results on AUTH-Sheep</head><p>The AUTH-Sheep dataset is used for the final evaluation. In three cases there are no mAP50-95 results for a particular scale group because there were not enough or no ground truth instances. These cases are extended objects in videos 2 and 4 and nano objects in video 3. Table <ref type="table">6</ref> shows the detection results of the HBB model for each video. Compared to the results on the validation set of SheepCounter, the model performs worse. The only exception is that nano objects are detected much more reliably in all videos than on the validation set, with an increase of 0.379 for video 1. In the same video, nano to medium objects are better detected than large and extended objects, which is completely different from the training results. Similar observations can be made for video 2, but without the nano objects. For video 3, the mAP50-95 score is the most balanced across all scale groups. For video 4, the model seems to fail completely.</p><p>The trend of results for the OBB model, shown in Table <ref type="table" target="#tab_5">7</ref>, is comparable to that of the HBB model. In general, the OBB model performed worse than the HBB model for all metrics on all videos, with the only exceptions being extended objects in videos 1 and 3, and also large objects in video 2. While the OBB model performed better on nano objects in training than the HBB model, the opposite is true for the test set.</p><p>There are several possible reasons why the detection performance is worse on the test set. One reason could be overfitting of the models. The YOLOv8 nano models used are quite small and the diversity of training data was limited. In addition, the AUTH-Sheep dataset is quite different from the SheepCounter data used for training. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Object Tracking</head><p>For the tracking task, we used the BoT-SORT algorithm without the re-identification module.</p><p>As for the detection task, the implementation of the Ultralytics framework was used, since it includes the tracking of oriented bounding boxes. While the Kalman filter itself was not changed, the matching algorithm and the tracklet representation include the rotation of the boxes. The Kalman filter uses a constant-velocity model to predict the bounding box in the next frame. Camera motion can interfere with these predictions, resulting in an incorrect location of the predicted box. BoT-SORT includes a camera motion compensation model to counteract this problem. An optional re-identification module was not used because no such pre-trained module was available and it would have a high impact on the computational complexity anyway. A main objective of our work is an application on the edge, which demands lightweight algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Metrics</head><p>Tracking performance is evaluated using three metrics, namely the CLEAR metrics <ref type="bibr" target="#b18">[19]</ref>, IDF1 <ref type="bibr" target="#b19">[20]</ref> and the Higher-Order Tracking Accuracy (HOTA) <ref type="bibr" target="#b20">[21]</ref>. For testing, all frames were used consecutively without skipping any frames. The evaluation tool used was the TrackEval framework <ref type="bibr" target="#b21">[22]</ref> and all tracking results were transformed into the MOTS format <ref type="bibr" target="#b22">[23]</ref>. The most important score of the CLEAR metrics is MOTA (multiple object tracking accuracy), which focuses more on detection performance than on identity association. IDF1 focuses more on the identity association performance of the tracker, while HOTA is a metric that considers both detection and identification almost equally. In addition to these specific metrics, the number of detections, ground truth objects, associated IDs, and ground truth IDs were considered.</p></div>
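Of the CLEAR metrics, MOTA can be written as a simple function of the accumulated error counts (following the standard definition; the counts themselves come from the frame-by-frame matching):

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple object tracking accuracy:
    MOTA = 1 - (FN + FP + IDSW) / GT,
    where GT is the total number of ground truth objects over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt
```

Since all three error types are weighted equally and identity switches are usually rare compared to misses, MOTA is dominated by detection errors, which is why it reflects detection more than association quality.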
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Results on AUTH-Sheep</head><p>The results for the tracking task are less one-sided than those for detection. Overall, the HBB model (Table <ref type="table" target="#tab_6">8</ref>) outperformed the OBB model (Table <ref type="table" target="#tab_7">9</ref>) in HOTA, MOTA, and IDF1 scores. For both models, the performance is best on video 3, followed by videos 1 and 2, and worst on video 4, which mirrors the mAP50-95 scores for detection. Comparing the performance with the distribution of sheep in different scale groups in Table <ref type="table" target="#tab_2">3</ref>, the results correspond to the share of medium to extended objects per video. Video 3 has almost only medium to extended objects (98.8 %) and shows the best tracking results. At the same time, only a fraction of objects are medium to extended (5 %) in video 4, for which both models show equally poor performance. This suggests that the models are able to track sheep in UAV images when the sheep are large enough. The seemingly increased detection ability, in the form of the MOTA score, compared to the pure detection results from section 4 can be explained by a lower confidence threshold during tracking: BoT-SORT includes detections with a confidence of 0.1 or higher, while for detection the threshold was set to 0.25. Also, the MOTA and IDF1 scores were only calculated for an IoU threshold of 0.5, so the localization performance was not taken into account.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this study, we presented AUTH-Sheep, the first UAV video dataset of sheep with frame-accurate annotations of oriented bounding boxes and track IDs, which we will make publicly available to the scientific community. Furthermore, we also investigated two methods for object detection and tracking of sheep on the drone itself. The primary focus was on evaluating the performance of detection and tracking when using horizontal and oriented bounding boxes. For this purpose, the YOLOv8-nano model was used and tuned for specific input sizes. Surprisingly, and against our expectations, the HBB model outperformed the OBB model for detection and tracking. While the detection performance clearly favors the HBB model, the tracking results are less clear and vary depending on the video and metric. This behavior needs further investigation in future work.</p><p>The restructured SheepCounter dataset, with its new annotations for horizontal and oriented bounding boxes, significantly contributed to the training process. The manual verification step ensured the accuracy of bounding boxes and ID tracks for both datasets used. The BoT-SORT algorithm, without the re-identification module, was effective for tracking. However, the tracking performance varied significantly between videos, indicating the influence of factors such as flight altitude and flight patterns.</p><p>The limited amount of data and the inherent variations in flight altitude, lighting conditions, and object size posed significant challenges. 
This was evident in the performance drop when the models were tested on the AUTH-Sheep dataset, which differed from the training dataset in several respects.</p><p>To further improve the robustness and accuracy of object detection and tracking in UAV videos, we propose increasing the dataset's size and diversity to cover more varied environmental conditions and flight parameters, which should improve model generalization. Despite the computational overhead, incorporating re-identification modules could improve tracking performance, especially in scenarios with frequent occlusions and object reappearances.</p><p>In conclusion, although the study demonstrates promising results in object detection and tracking of sheep in UAV videos, there is room for improvement. Addressing the identified challenges and pursuing the recommended future work will pave the way for more reliable and efficient systems, with broader applications in wildlife monitoring and agricultural management.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Sample frames from the 4 videos of AUTH-Sheep, including the first, middle, and last frame of each video.</figDesc><graphic coords="6,138.05,294.03,112.50,63.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>New annotations for the restructured SheepCounter dataset, broken down by the five new scale groups. Data are provided for oriented bounding boxes (OBB) and horizontal bounding boxes (HBB).</figDesc><table><row><cell></cell><cell cols="2">train</cell><cell cols="2">valid</cell><cell cols="2">all</cell></row><row><cell>images</cell><cell cols="2">1203</cell><cell cols="2">350</cell><cell cols="2">1727</cell></row><row><cell>instances</cell><cell cols="2">43 730</cell><cell cols="2">12 951</cell><cell cols="2">56 681</cell></row><row><cell></cell><cell>OBB</cell><cell>HBB</cell><cell>OBB</cell><cell>HBB</cell><cell>OBB</cell><cell>HBB</cell></row><row><cell>nano</cell><cell>2759</cell><cell>1772</cell><cell>649</cell><cell>538</cell><cell>3408</cell><cell>2310</cell></row><row><cell>small</cell><cell>24 416</cell><cell>6576</cell><cell>5278</cell><cell>1320</cell><cell>29 694</cell><cell>7896</cell></row><row><cell>medium</cell><cell>12 471</cell><cell>18 075</cell><cell>6219</cell><cell>4515</cell><cell>18 690</cell><cell>22 590</cell></row><row><cell>large</cell><cell>1302</cell><cell>11 696</cell><cell>356</cell><cell>4638</cell><cell>1658</cell><cell>16 334</cell></row><row><cell>extended</cell><cell>2782</cell><cell>5611</cell><cell>449</cell><cell>1940</cell><cell>3231</cell><cell>7551</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Overview of the annotations per video and class of AUTH-Sheep.</figDesc><table><row><cell></cell><cell>frames</cell><cell>instances</cell><cell>sheep</cell><cell>human</cell><cell>goat</cell><cell>horse</cell></row><row><cell>video 1</cell><cell>1198</cell><cell>21 336</cell><cell>20 814</cell><cell>522</cell><cell>-</cell><cell>-</cell></row><row><cell>video 2</cell><cell>929</cell><cell>31 408</cell><cell>16 509</cell><cell>1638</cell><cell>13 261</cell><cell>-</cell></row><row><cell>video 3</cell><cell>1406</cell><cell>46 644</cell><cell>26 598</cell><cell>62</cell><cell>19 984</cell><cell>-</cell></row><row><cell>video 4</cell><cell>1795</cell><cell>53 449</cell><cell>31 612</cell><cell>10 402</cell><cell>-</cell><cell>11 435</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Distribution of the sheep annotations from AUTH-Sheep per video, with respect to the scale group. The sheep instances are analyzed by size in the same way as for the SheepCounter dataset. Based on the number of instances per scale group, it can be seen that the four videos were recorded at different flight altitudes and with different flight patterns. Video 4 has the highest number of nano and small bounding boxes and the lowest number of medium to extended instances.</figDesc><table><row><cell></cell><cell cols="5">sheep per scale group</cell></row><row><cell></cell><cell>extended</cell><cell>large</cell><cell>medium</cell><cell>small</cell><cell>nano</cell></row><row><cell>video 1</cell><cell>6109</cell><cell>1925</cell><cell>2889</cell><cell>5208</cell><cell>4683</cell></row><row><cell>video 2</cell><cell>2</cell><cell>1719</cell><cell>5863</cell><cell>5579</cell><cell>3346</cell></row><row><cell>video 3</cell><cell>17 448</cell><cell>6005</cell><cell>2833</cell><cell>312</cell><cell>-</cell></row><row><cell>video 4</cell><cell>1</cell><cell>407</cell><cell>1171</cell><cell>13 545</cell><cell>16 488</cell></row></table></figure>
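The scale-group breakdown in Table 3 can be computed directly from bounding-box sizes. The sketch below bins a box into the five groups used in the paper by pixel area; the group names come from the dataset, but the area thresholds are illustrative assumptions, since the paper's exact cut-offs are not restated in this section.

```python
from collections import Counter

# Bin a bounding box into one of the paper's five scale groups by area.
# Group names follow the dataset; the thresholds below are ILLUSTRATIVE
# assumptions, not the paper's actual cut-offs.
SCALE_BINS = [
    ("nano", 0),             # smallest boxes
    ("small", 16 ** 2),      # >= 256 px^2
    ("medium", 32 ** 2),     # >= 1024 px^2
    ("large", 96 ** 2),      # >= 9216 px^2
    ("extended", 192 ** 2),  # >= 36 864 px^2
]

def scale_group(width: float, height: float) -> str:
    """Return the scale-group name for a box of the given pixel size."""
    area = width * height
    group = SCALE_BINS[0][0]
    for name, lower_bound in SCALE_BINS:
        if area >= lower_bound:
            group = name  # keep the largest group whose bound is met
    return group

# Tally a set of boxes per group, as in Table 3 (toy box sizes).
boxes = [(10, 12), (30, 40), (40, 40), (120, 100), (210, 200)]
counts = Counter(scale_group(w, h) for w, h in boxes)
```

Applied to every sheep annotation of a video, such a tally yields one row of Table 3 under whatever thresholds the dataset actually defines.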
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Comparison of model performance when the model input size is adjusted to match the aspect ratio of the input images, while maintaining similar computational complexity. Results are for the validation set of the restructured SheepCounter dataset.</figDesc><table><row><cell></cell><cell></cell><cell cols="6">mAP50-95 (per scale group)</cell></row><row><cell></cell><cell>mAR50-95</cell><cell>all</cell><cell>extended</cell><cell>large</cell><cell>medium</cell><cell>small</cell><cell>nano</cell></row><row><cell>HBB (640 x 640)</cell><cell>0.706</cell><cell>0.665</cell><cell>0.759</cell><cell>0.717</cell><cell>0.649</cell><cell>0.463</cell><cell>0.044</cell></row><row><cell>HBB (832 x 480)</cell><cell>0.759</cell><cell>0.724</cell><cell>0.817</cell><cell>0.781</cell><cell>0.711</cell><cell>0.527</cell><cell>0.055</cell></row><row><cell>OBB (640 x 640)</cell><cell>0.644</cell><cell>0.604</cell><cell>0.818</cell><cell>0.752</cell><cell>0.650</cell><cell>0.573</cell><cell>0.093</cell></row><row><cell>OBB (832 x 480)</cell><cell>0.687</cell><cell>0.646</cell><cell>0.830</cell><cell>0.787</cell><cell>0.687</cell><cell>0.629</cell><cell>0.122</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 6</head><label>6</label><figDesc>Detection results of the HBB model on the AUTH-Sheep dataset.</figDesc><table><row><cell>HBB model</cell><cell></cell><cell cols="6">mAP50-95 (per scale group)</cell></row><row><cell></cell><cell>mAR50-95</cell><cell>all</cell><cell>extended</cell><cell>large</cell><cell>medium</cell><cell>small</cell><cell>nano</cell></row><row><cell>video 1</cell><cell>0.403</cell><cell>0.298</cell><cell>0.076</cell><cell>0.336</cell><cell>0.462</cell><cell>0.497</cell><cell>0.434</cell></row><row><cell>video 2</cell><cell>0.302</cell><cell>0.220</cell><cell>-</cell><cell>0.190</cell><cell>0.325</cell><cell>0.281</cell><cell>0.154</cell></row><row><cell>video 3</cell><cell>0.371</cell><cell>0.301</cell><cell>0.307</cell><cell>0.295</cell><cell>0.352</cell><cell>0.268</cell><cell>-</cell></row><row><cell>video 4</cell><cell>0.093</cell><cell>0.057</cell><cell>-</cell><cell>0.015</cell><cell>0.034</cell><cell>0.109</cell><cell>0.082</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7</head><label>7</label><figDesc>Detection results of the OBB model on the AUTH-Sheep dataset. For videos 2 and 4, there are no mAP50-95 results for extended objects because there were only 2 and 1 ground truth instances, respectively. There were no nano objects in video 3. AUTH-Sheep differs from the SheepCounter dataset used for training and is more challenging: it is more dynamic, including new perspectives, backgrounds, classes, and object scales. Sheep are more often occluded, with only small parts visible, making them more difficult to detect. Another factor is that the training images almost exclusively contained sheep, so the model did not learn to discriminate sheep from other classes. One aspect to consider is that the annotations for the horizontal bounding boxes were generated from the rotated boxes. This resulted in coarser boxes that include more background and parts of other objects. It can be assumed that this affected the adaptability of the HBB model to the new dataset.</figDesc><table><row><cell>OBB model</cell><cell></cell><cell cols="6">mAP50-95 (per scale group)</cell></row><row><cell></cell><cell>mAR50-95</cell><cell>all</cell><cell>extended</cell><cell>large</cell><cell>medium</cell><cell>small</cell><cell>nano</cell></row><row><cell>video 1</cell><cell>0.353</cell><cell>0.259</cell><cell>0.121</cell><cell>0.294</cell><cell>0.394</cell><cell>0.355</cell><cell>0.254</cell></row><row><cell>video 2</cell><cell>0.261</cell><cell>0.193</cell><cell>-</cell><cell>0.277</cell><cell>0.294</cell><cell>0.179</cell><cell>0.103</cell></row><row><cell>video 3</cell><cell>0.327</cell><cell>0.272</cell><cell>0.318</cell><cell>0.225</cell><cell>0.210</cell><cell>0.139</cell><cell>-</cell></row><row><cell>video 4</cell><cell>0.041</cell><cell>0.025</cell><cell>-</cell><cell>0.010</cell><cell>0.015</cell><cell>0.031</cell><cell>0.030</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 8</head><label>8</label><figDesc>Tracking results for the HBB model on the AUTH-Sheep dataset. Looking at individual videos, the OBB model performed better on video 3 and slightly better on video 4. Especially for video 3, with mostly extended and large objects, the OBB model scored high for IDF1 (92.23) and MOTA (93.40). The performance on video 4 is very low for both models on all scores, which leads to the conclusion that tracking failed completely in this case.</figDesc><table><row><cell>HBB model</cell><cell>HOTA</cell><cell>MOTA</cell><cell>IDF1</cell><cell>Dets</cell><cell>GT-Dets</cell><cell>IDs</cell><cell>GT-IDs</cell></row><row><cell>video 1</cell><cell>49.53</cell><cell>60.44</cell><cell>72.84</cell><cell>17 801</cell><cell>20 814</cell><cell>63</cell><cell>19</cell></row><row><cell>video 2</cell><cell>40.75</cell><cell>63.74</cell><cell>53.08</cell><cell>14 601</cell><cell>16 509</cell><cell>84</cell><cell>19</cell></row><row><cell>video 3</cell><cell>63.37</cell><cell>85.21</cell><cell>90.18</cell><cell>28 504</cell><cell>26 598</cell><cell>70</cell><cell>19</cell></row><row><cell>video 4</cell><cell>6.91</cell><cell>6.17</cell><cell>12.19</cell><cell>5348</cell><cell>31 612</cell><cell>98</cell><cell>19</cell></row><row><cell>combined</cell><cell>45.64</cell><cell>49.95</cell><cell>61.09</cell><cell>66 254</cell><cell>95 533</cell><cell>315</cell><cell>76</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 9</head><label>9</label><figDesc>Tracking results for the OBB model on the AUTH-Sheep dataset.</figDesc><table><row><cell>OBB model</cell><cell>HOTA</cell><cell>MOTA</cell><cell>IDF1</cell><cell>Dets</cell><cell>GT-Dets</cell><cell>IDs</cell><cell>GT-IDs</cell></row><row><cell>video 1</cell><cell>39.62</cell><cell>55.64</cell><cell>67.43</cell><cell>12 788</cell><cell>20 814</cell><cell>32</cell><cell>19</cell></row><row><cell>video 2</cell><cell>35.14</cell><cell>51.68</cell><cell>54.72</cell><cell>11 746</cell><cell>16 509</cell><cell>77</cell><cell>19</cell></row><row><cell>video 3</cell><cell>66.68</cell><cell>93.40</cell><cell>92.23</cell><cell>27 574</cell><cell>26 598</cell><cell>51</cell><cell>19</cell></row><row><cell>video 4</cell><cell>7.40</cell><cell>6.36</cell><cell>11.91</cell><cell>8287</cell><cell>31 612</cell><cell>134</cell><cell>19</cell></row><row><cell>combined</cell><cell>42.65</cell><cell>49.16</cell><cell>59.53</cell><cell>60 395</cell><cell>95 533</cell><cell>294</cell><cell>76</cell></row></table></figure>
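For reading the tracking tables, it may help to recall the standard CLEAR MOT definition of MOTA, which the paper's evaluation follows: MOTA = 1 − (FN + FP + IDSW) / GT, computed here at an IoU threshold of 0.5. The sketch below uses toy numbers, not values from the tables.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT (can be negative)."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Toy numbers for illustration: 100 GT boxes, 30 misses, 10 false
# positives, and 2 identity switches.
score = mota(false_negatives=30, false_positives=10, id_switches=2, num_gt=100)
```

Because missed ground-truth boxes enter the numerator directly, a video where most sheep are never detected (such as video 4, with far fewer tracker detections than ground-truth boxes) collapses toward zero or below.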
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Funded by HORIZON Europe HE-2022: SPADE -101060778 ©2023 IEEE. We thank our student worker Touseef Ashraf, who heavily contributed to the annotation of AUTH-Sheep. We also wish to extend our appreciation to Professor Bossis of Aristotle University of Thessaloniki (https://www.auth.gr/, http://www.agroctima.auth.gr/en/) and his team for organizing the first SPADE Livestock Trial and recording the videos of the presented AUTH-Sheep dataset.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Comparison of Object Detection Algorithms for Livestock Monitoring of Sheep in UAV images</title>
		<author>
			<persName><forename type="first">O</forename><surname>Doll</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Loos</surname></persName>
		</author>
		<idno type="DOI">10.24406/publica-2164</idno>
	</analytic>
	<monogr>
		<title level="m">Camera traps, AI, and Ecology -3rd International Workshop</title>
				<meeting><address><addrLine>Jena</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Nolan</surname></persName>
		</author>
		<ptr target="https://universe.roboflow.com/riisprivate/sheepcounter" />
		<title level="m">SheepCounter Dataset</title>
				<imprint>
			<date type="published" when="2023">2023. 2024-08-29</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Counting cattle in UAV images using convolutional neural network</title>
		<author>
			<persName><forename type="first">F</forename><surname>De Lima Weber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">A</forename><surname>De Moraes Weber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H</forename><surname>De Moraes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">T</forename><surname>Matsubara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M B</forename><surname>Paiva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D N B</forename><surname>Gomes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">O F</forename><surname>De Oliveira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>De Medeiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Cagnin</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.rsase.2022.100900</idno>
	</analytic>
	<monogr>
		<title level="j">Remote Sensing Applications: Society and Environment</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page">100900</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A Lightweight and High Accuracy Deep Learning Method for Grassland Grazing Livestock Detection Using UAV Imagery</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ouyang</surname></persName>
		</author>
		<idno type="DOI">10.3390/rs15061593</idno>
	</analytic>
	<monogr>
		<title level="j">Remote Sensing</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page">1593</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Using Computer Vision, Image Analysis and UAVs for the Automatic Recognition and Counting of Common Cranes (Grus grus)</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jacob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Shoshani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Charter</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jenvman.2022.116948</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Environmental Management</title>
		<imprint>
			<biblScope unit="volume">328</biblScope>
			<biblScope unit="page">116948</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Animal Detection and Counting from UAV Images Using Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">K</forename><surname>Rančić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Blagojević</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bezdan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ivošević</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tubić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vranešević</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pejak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Crnojević</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Marko</surname></persName>
		</author>
		<idno type="DOI">10.3390/drones7030179</idno>
	</analytic>
	<monogr>
		<title level="j">Drones</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page">179</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Semi-automated detection of ungulates using UAV imagery and reflective spectrometry</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>De Kock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Pohůnek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hejcmanová</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jenvman.2022.115807</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Environmental Management</title>
		<imprint>
			<biblScope unit="volume">320</biblScope>
			<biblScope unit="page">115807</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Simple online and realtime tracking</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bewley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ramos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Upcroft</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICIP.2016.7533003</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd IEEE International Conference on Image Processing (ICIP)</title>
				<meeting>the 23rd IEEE International Conference on Image Processing (ICIP)</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="3464" to="3468" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Contributions to the theory of optimal control</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">E</forename><surname>Kalman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Boletin Sociedad Matematica Mexicana</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="102" to="119" />
			<date type="published" when="1960">1960</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Byte-Track: Multi-object Tracking by Associating Every Detection Box</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-20047-2_1</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision (ECCV)</title>
				<meeting>the European Conference on Computer Vision (ECCV)</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1" to="21" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Aharon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Orfaig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B.-Z</forename><surname>Bobrovsky</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2206.14651</idno>
		<title level="m">BoT-SORT: Robust Associations Multi-Pedestrian Tracking</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Khirodkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kitani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="9686" to="9696" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Segment anything</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kirillov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mintun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ravi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rolland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gustafson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Whitehead</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-Y</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCV51070.2023.00371</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision (ICCV)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="3992" to="4003" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Jähne</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-662-06731-4</idno>
		<title level="m">Digitale Bildverarbeitung</title>
				<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">5</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<ptr target="https://github.com/cvat-ai/cvat" />
		<title level="m">Computer Vision Annotation Tool (CVAT)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
		<respStmt>
			<orgName>CVAT.ai Corporation</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Microsoft COCO: Common Objects in Context</title>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-10602-1_48</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision (ECCV)</title>
				<meeting>the European Conference on Computer Vision (ECCV)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="740" to="755" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">DOTA: A Large-Scale Dataset for Object Detection in Aerial Images</title>
		<author>
			<persName><forename type="first">G.-S</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Datcu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pelillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00418</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="3974" to="3983" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Jocher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chaurasia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qiu</surname></persName>
		</author>
		<ptr target="https://github.com/ultralytics/ultralytics" />
		<title level="m">YOLO by Ultralytics</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics</title>
		<author>
			<persName><forename type="first">K</forename><surname>Bernardin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stiefelhagen</surname></persName>
		</author>
		<idno type="DOI">10.1155/2008/246309</idno>
	</analytic>
	<monogr>
		<title level="j">EURASIP Journal on Image and Video Processing</title>
		<imprint>
			<biblScope unit="volume">2008</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Performance Measures and a Data Set for Multi-target, Multi-camera Tracking</title>
		<author>
			<persName><forename type="first">E</forename><surname>Ristani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Solera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cucchiara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tomasi</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-48881-3_2</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision (ECCV)</title>
				<meeting>the European Conference on Computer Vision (ECCV)</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="17" to="35" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">HOTA: A Higher Order Metric for Evaluating Multi-object Tracking</title>
		<author>
			<persName><forename type="first">J</forename><surname>Luiten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Osep</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dendorfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Torr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Geiger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Leal-Taixé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Leibe</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11263-020-01375-2</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">129</biblScope>
			<biblScope unit="page" from="548" to="578" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Luiten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hoffhues</surname></persName>
		</author>
		<ptr target="https://github.com/JonathonLuiten/TrackEval" />
		<title level="m">TrackEval</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">MOTS: Multi-Object Tracking and Segmentation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Voigtlaender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Osep</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Luiten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">B G</forename><surname>Sekar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Geiger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Leibe</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2019.00813</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
		<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="7942" to="7951" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
