<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Differentiable Temporal Anchor Consensus with Graph Neural Anchor Matching for Robust UAV Object Tracking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vasyl Tereshchenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Embedded AI</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>UAV Tracking, Visual Object Tracking, Transformers, Graph Neural Networks, Test-time Adaptation,</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>Akademika Hlushkova Av. 4d, 03680 Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Unmanned aerial vehicles (UAVs) require robust real-time object tracking in cluttered environments such as forests, roads, and urban areas. Existing transformer-based trackers such as OSTrack and MixFormer achieve strong per-frame accuracy but often fail under occlusion, rapid ego-motion, and distractors because anchors are treated independently across time and sensor signals are ignored. We propose AnchorFormer-UAV, a fully differentiable tracker that treats anchors as temporal entities and unifies: (i) an Anchor Tokenizer that fuses appearance, geometry, motion, attention priors, and IMU cues; (ii) AM-GNN for inter-frame anchor matching with Sinkhorn-based soft assignments; (iii) a STAT spatio-temporal transformer for temporal and spatial refinement; and (iv) a Reliability &amp; Consensus head that down-weights failed anchors and fuses predictions. The system is designed for embedded deployment (Jetson-class), maintaining 60–90 FPS at 256–288 px search inputs while improving robustness on UAV benchmarks.</p>
      </abstract>
      <kwd-group>
        <kwd>UAV Tracking</kwd>
        <kwd>Visual Object Tracking</kwd>
        <kwd>Transformers</kwd>
        <kwd>Graph Neural Networks</kwd>
        <kwd>Test-time Adaptation</kwd>
        <kwd>Embedded AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Trackers with temporal reasoning and memory (KeepTrack [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ToMP [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) introduce memory
and optimization for temporal robustness. However, they neither formulate anchors as temporal
entities nor fuse them by learned consensus.
      </p>
      <p>
        Graph neural networks (GNNs) have improved data association in multi-object tracking [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], learning
to connect detections across frames. We adapt this idea to single-object tracking by matching anchors
across frames via a bipartite GNN (AM-GNN), yielding soft assignments that seed temporal
processing.
      </p>
      <p>
        Recent unified and state-of-the-art trackers and benchmarks (2023–2025) include MixFormerV2 for efficient
fully-transformer tracking [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], OneTracker that leverages foundation models and efficient tuning
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Un-Track for any-modality tracking [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and SUTrack that unifies five SOT tasks in a single
model [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. End-to-end transformer heads such as DETRack [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and design variants like FETrack
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], IAC-Tracker [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and TATrack [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] push accuracy/efficiency. New large-scale or
domain-specific resources (VastTrack [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and CST Anti-UAV [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]) increase category coverage and UAV
difficulty. Our approach differs by explicitly modeling temporal anchor reliability with GNN-based
soft matching and IMU-aware priors inside a single differentiable loop.
      </p>
      <p>
        UAV123 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], UAVDT [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and Anti-UAV [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] expose small objects, motion blur, and occlusions.
Few works integrate UAV IMU/VIO signals into the neural network itself. Our design encodes IMU priors for motion
gating and feature biasing.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Statement</title>
      <p>Robust UAV tracking requires temporal anchor stabilization, learned reliability to down-weight
failed anchors, and motion priors from IMU/VIO. Our goal is therefore to introduce the tracker
AnchorFormer-UAV, which unifies these components in a single differentiable pipeline. To achieve this
goal, we solved the following tasks:
• a temporal anchor representation: anchors become tokens augmented with motion,
attention, and IMU features;
• AM-GNN: a graph neural module for inter-frame anchor matching with Sinkhorn-based soft
assignments;
• STAT: a spatio-temporal transformer that refines matched anchors across time and space;
• Reliability &amp; Consensus: learned per-anchor trust and soft fusion producing robust
predictions under occlusion;
• a practical training recipe with occlusion survival, anchor/frame dropout, and Jetson-friendly
deployment.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our pipeline (Figure 1) comprises a transformer backbone with detection heads, the Anchor Tokenizer,
AM-GNN for inter-frame matching, STAT for temporal/spatial refinement, and the Reliability and Consensus
heads. Final predictions are obtained by reliability-aware consensus of refined anchors.</p>
      <sec id="sec-3-1">
        <title>Anchor Tokenization (Step A: turning proposals into temporal tokens)</title>
        <p>Goal. Convert per-frame anchor proposals into compact tokens that carry (i) appearance, (ii)
geometry, (iii) motion context, (iv) attention priors, and (v) inertial priors.</p>
        <p>Inputs. For each top-M anchor i at frame t from the detection heads we have a feature vector $\phi(f_i^t)$,
a box $b_i^t = (x, y, \log w, \log h)$, a classification score $s_i^t$, an IoU score $u_i^t$, and an attention prior $a_i^t$ obtained
by average pooling the backbone attention weights over the anchor region. IMU/VIO readings in a
small time window around $t$ are encoded into $m^t$ (yaw/pitch/roll deltas and planar velocities) by a
two-layer MLP.
Motion deltas. We compute $\Delta b_i^t = b_i^t - b_{\pi(i)}^{t-1}$, where $\pi(i)$ is the best anchor continuation from frame $t-1$
(initially the nearest center; later replaced by AM-GNN soft matches) (Figure 2). This provides a velocity
proxy without explicit optical flow.</p>
        <p>Token. The final token is $z_i^t = [\phi(f_i^t),\, b_i^t,\, \Delta b_i^t,\, s_i^t,\, u_i^t,\, a_i^t,\, m^t] \in \mathbb{R}^{D}$, passed through
LayerNorm and a linear projection to size d (typically d = 128).</p>
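        <p>As a concrete illustration, a minimal PyTorch sketch of the tokenizer is given below. It is a sketch under stated assumptions: the module name, the feature and IMU dimensions, and the six-channel IMU input are illustrative choices rather than the exact implementation.</p>
        <preformat>
# Illustrative Anchor Tokenizer sketch (names and dimensions are assumptions).
import torch
import torch.nn as nn

class AnchorTokenizer(nn.Module):
    def __init__(self, feat_dim=256, imu_dim=16, d=128):
        super().__init__()
        # Two-layer MLP encoding the IMU/VIO window (yaw/pitch/roll deltas, planar velocities)
        self.imu_mlp = nn.Sequential(nn.Linear(6, imu_dim), nn.ReLU(),
                                     nn.Linear(imu_dim, imu_dim))
        # phi + box + delta-box + cls + IoU + attention prior + IMU embedding
        in_dim = feat_dim + 4 + 4 + 1 + 1 + 1 + imu_dim
        self.proj = nn.Sequential(nn.LayerNorm(in_dim), nn.Linear(in_dim, d))

    def forward(self, phi, box, dbox, s, u, a, imu):
        # phi: (M, feat_dim); box, dbox: (M, 4) as (x, y, log w, log h);
        # s, u, a: (M, 1); imu: (6,) shared per frame
        m = self.imu_mlp(imu).expand(phi.size(0), -1)   # per-frame IMU embedding m_t
        z = torch.cat([phi, box, dbox, s, u, a, m], dim=-1)
        return self.proj(z)                             # (M, d) anchor tokens
</preformat>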
      </sec>
      <sec id="sec-3-2">
        <title>AM-GNN (Step B: inter-frame anchor matching)</title>
        <p>Nearest-neighbor matching by IoU fails under fast motion and occlusion. We learn a bipartite
association between anchors at frames (t-1) and t that combines geometry, appearance, reliability, and IMU
priors.</p>
        <p>Graph construction. We build a bipartite graph with nodes {i} at t-1 and {j} at t. For efficiency, we
keep only k candidates per node using an IMU-stabilized motion gate (e.g., k = 16).</p>
        <p>Edge features:
$e_{ij} = [\Delta x,\, \Delta y,\, \Delta \log w,\, \Delta \log h,\, \cos(f_i, f_j),\, g_{ij},\, r_i^{t-1}]$, (1)
where $\cos(f_i, f_j)$ is the cosine similarity of the head features, $g_{ij}$ is the residual after compensating
rotation/translation using the IMU, and $r_i^{t-1}$ is the previous-frame reliability (bootstrapped as $r_i = 1$ at
$t = 1$).</p>
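        <p>A minimal sketch of assembling the edge features of Eq. (1) is shown below; the function name is illustrative, and how the IMU-compensated residual $g_{ij}$ is computed upstream is assumed rather than specified here.</p>
        <preformat>
# Sketch of AM-GNN edge features, Eq. (1); names are assumptions.
import torch

def edge_features(box_prev, box_cur, f_prev, f_cur, g_imu, r_prev):
    # box_*: (4,) = (x, y, log w, log h); f_*: (D,) head features;
    # g_imu: IMU-compensated motion residual (scalar tensor); r_prev: reliability.
    d = box_cur - box_prev                                  # (dx, dy, dlog w, dlog h)
    cos_f = torch.cosine_similarity(f_prev, f_cur, dim=0)   # appearance similarity
    return torch.cat([d, cos_f.reshape(1), g_imu.reshape(1), r_prev.reshape(1)])
</preformat>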
        <p>Message passing. Two or three layers of edge-aware attention update node embeddings and
produce edge affinities $\alpha_{ij} = \mathrm{MLP}([h_i^{t-1} \,\|\, h_j^{t} \,\|\, e_{ij}])$.</p>
        <p>Sinkhorn assignment with null. We form costs $C_{ij} = -\alpha_{ij}$, append a null column to allow
unmatched anchors, and compute a doubly-stochastic soft assignment $P$. The Sinkhorn temperature $\tau$ is
annealed during training; rows with high entropy are treated as uncertain matches.</p>
        <p>Seeding STAT. Soft-matched seeds are
$\hat{b}_j^{t} = \sum_i P_{ij}\, b_i^{t-1}$ and $\hat{f}_j^{t} = \sum_i P_{ij}\, \phi(f_i^{t-1})$,
which replace naive continuation and reduce ID switches.</p>
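        <p>The log-domain Sinkhorn step with a null column can be sketched as follows (a common formulation; the null cost value and the seeding line are assumptions consistent with the equations above):</p>
        <preformat>
# Sketch of Sinkhorn soft assignment with a null column; defaults are assumptions.
import torch

def sinkhorn_with_null(affinity, n_iters=6, tau=0.2, null_cost=0.0):
    # affinity: (N_prev, N_cur) edge affinities alpha_ij; costs C = -alpha.
    cost = torch.cat([-affinity,
                      torch.full((affinity.size(0), 1), null_cost)], dim=1)
    log_p = -cost / tau                      # temperature tau is annealed in training
    for _ in range(n_iters):                 # alternating row/column normalization
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p.exp()                       # (N_prev, N_cur + 1) soft matches P

# Seeding usage (drop the null column): b_hat_j = sum_i P_ij * b_prev_i
# P = sinkhorn_with_null(alpha); b_hat = P[:, :-1].T @ b_prev
</preformat>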
      </sec>
      <sec id="sec-3-3">
        <title>STAT (Step C: spatio-temporal refinement)</title>
        <p>Inputs. For a window of T frames, STAT receives the matched tokens $\{\hat{z}_i^t, \hat{b}_i^t, \hat{f}_i^t, \hat{s}_i^t, \hat{u}_i^t, m^t\}$
for $i = 1 \ldots M$ and $t = t_0 - T + 1 \ldots t_0$.</p>
        <p>Temporal block (causal). Per anchor index i we process the sequence with a causal
self-attention/GRU. We add relative positional biases in time to prefer smooth motion:
$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + \mathcal{B}\right) V .$</p>
        <p>Spatial block (per frame). For each frame we build a kNN graph among anchors (by center
distance) and run graph attention. Edges are biased by attention similarity and IMU-projected motion
to emphasize scene-consistent movement (e.g., anchors on the same object).</p>
        <p>Neural motion refinement. A small MLP predicts residuals on top of a constant-velocity prior:
$b_i^{t} = b_i^{t-1} + (b_i^{t-1} - b_i^{t-2}) + \Delta\hat{b}_i^{t} .$</p>
        <p>Optionally we predict a diagonal uncertainty $\Sigma_i^t$ from pooled features to quantify confidence.</p>
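        <p>A compact sketch of this refinement step follows; the residual and variance heads are illustrative assumptions on top of the constant-velocity prior stated above.</p>
        <preformat>
# Sketch of STAT motion refinement: constant-velocity prior + learned residual.
import torch
import torch.nn as nn

class MotionRefiner(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))
        self.log_var = nn.Linear(d, 4)        # optional diagonal uncertainty head

    def forward(self, h, b_prev, b_prev2):
        # h: (M, d) refined anchor embeddings; b_prev, b_prev2: boxes at t-1, t-2
        cv = b_prev + (b_prev - b_prev2)      # constant-velocity extrapolation
        b = cv + self.residual(h)             # learned residual correction
        sigma = self.log_var(h).exp()         # per-coordinate variance (diagonal)
        return b, sigma
</preformat>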
      </sec>
      <sec id="sec-3-4">
        <title>Reliability Head (Step D: learning trustworthiness)</title>
        <p>Purpose. Identify failed/drifting anchors and down-weight them in fusion. We aggregate per-anchor
indicators (classification score, IoU score, attention prior, temporal consistency, matching entropy $H(P_i^{*})$,
and neighbor agreement) and predict $r_i^t = \sigma(\mathrm{MLP}(h_i^t))$. Targets are soft labels derived from IoU
to ground truth; we also apply focal reweighting to emphasize ambiguous anchors.</p>
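        <p>A minimal sketch of this loss, combining the clipped-IoU soft targets of Section 3.6 with a focal-style reweighting, is given below; the threshold and gamma values are assumed defaults.</p>
        <preformat>
# Sketch of the reliability loss: soft targets from clipped IoU + focal reweighting.
import torch
import torch.nn.functional as F

def reliability_loss(r_pred, iou_gt, tau_lo=0.3, tau_hi=0.7, gamma=2.0):
    # r_pred: (M,) sigmoid outputs of the reliability head; iou_gt: (M,) IoU to GT.
    target = ((iou_gt - tau_lo) / (tau_hi - tau_lo)).clamp(0.0, 1.0)  # soft labels
    bce = F.binary_cross_entropy(r_pred, target, reduction="none")
    focal_w = (r_pred - target).abs() ** gamma   # emphasize ambiguous anchors
    return (focal_w * bce).mean()
</preformat>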
      </sec>
      <sec id="sec-3-5">
        <title>Consensus Head (Step E: fusing anchors into a robust box)</title>
        <p>Softmax fusion:
$w_i = \mathrm{softmax}\!\left(\lambda_1 s_i + \lambda_2 u_i + \lambda_3 r_i\right)$, $\quad b^{*} = \sum_i w_i\, b_i .$</p>
        <p>Uncertainty-aware variant (optional). If STAT predicts $\Sigma_i$, we can use precision-weighted
averaging: $b^{*} = \left(\sum_i w_i \Sigma_i^{-1}\right)^{-1} \left(\sum_i w_i \Sigma_i^{-1} b_i\right) .$</p>
        <p>The end-to-end training and inference algorithms are shown in Figures 3 and 4.</p>
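        <p>Both fusion variants can be sketched in a few lines; the weight parameterization below follows the $(\lambda_1, \lambda_2, \lambda_3)$ defaults of Table 1, and the diagonal-covariance handling is an assumption.</p>
        <preformat>
# Sketch of consensus fusion: softmax weights, with an optional precision-weighted variant.
import torch

def consensus(boxes, s, u, r, lambdas=(0.5, 0.3, 0.2), sigma=None):
    # boxes: (M, 4); s, u, r: (M,) cls score, IoU score, reliability.
    w = torch.softmax(lambdas[0] * s + lambdas[1] * u + lambdas[2] * r, dim=0)
    if sigma is None:
        return (w[:, None] * boxes).sum(dim=0)            # b* = sum_i w_i b_i
    prec = w[:, None] / sigma                             # w_i * Sigma_i^{-1} (diagonal)
    return (prec * boxes).sum(dim=0) / prec.sum(dim=0)    # precision-weighted average
</preformat>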
      </sec>
      <sec id="sec-3-6">
        <title>Losses and Objectives</title>
        <p>Detection loss combines QFL, GIoU+L1, and IoU-score losses. Reliability uses soft targets
$\tilde{r}_i = \mathrm{clip}\!\left((\mathrm{IoU}(b_i, b^{gt}) - \tau_{lo})/(\tau_{hi} - \tau_{lo}),\, 0,\, 1\right)$ with BCE. The consensus loss penalizes
$\|b^{*} - b^{gt}\|$, and a temporal smoothness term penalizes second-order center differences. AM-GNN uses
assignment cross-entropy on $P$ against ground-truth bipartite labels. The total loss is the weighted sum
$\mathcal{L} = \sum_k \lambda_k \mathcal{L}_k$ over these terms.</p>
        <p>Inference proceeds per frame: (1) the detection heads produce proposals, (2) the Anchor Tokenizer
builds tokens, (3) AM-GNN matches them to the previous frame, (4) a one-step causal STAT update
refines them, (5) reliability and consensus output $b^{*}$, and (6) the template bank is updated.</p>
        <p>Runtime defaults. We use $M \le 64$ anchors, $k = 8$ neighbors, and $T = 8$ frames; AM-GNN uses
2–3 layers and Sinkhorn with 5–7 iterations. On Jetson Orin NX (FP16), the added overhead over the
base tracker is about 2 ms, keeping 60–90 FPS for 256–288 px search inputs (see Table 1 for default
hyperparameters and deploy-time knobs).</p>
        <p>Dynamic template policy. We maintain a short-term EMA template $\bar{z}$ and a keyframe bank
$\mathcal{M} = \{(z_k, t_k)\}_{k=1}^{K}$, with a small distractor bank $\mathcal{D}$ (hard negatives). Template updates are
allowed only when the consensus prediction is reliable and unambiguous (reliability above a threshold and
assignment entropy below a threshold). The EMA template is then updated as
$\bar{z} \leftarrow \eta \bar{z} + (1 - \eta)\, \hat{z}^{*}$, and a new keyframe is added if $\cos(\bar{z}, \hat{z}^{*}) \le \tau_{div}$
(pruning by TTL or redundancy). During occlusion, memory is frozen. For scoring, we use a soft mixture
$\tilde{s} = \beta_0\, s(\bar{z}) + \sum_k \beta_k\, s(z_k)$ with $s(\cdot)$ a cosine-similarity scoring function, and
suppress candidates similar to negatives in $\mathcal{D}$.</p>
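        <p>A condensed sketch of the update gating and EMA step follows; the threshold values and the exact gating conditions are assumptions consistent with the policy described above.</p>
        <preformat>
# Sketch of the dynamic template policy; thresholds are assumed defaults.
import torch

def update_templates(z_bar, z_star, r_star, entropy, bank,
                     eta=0.9, tau_r=0.85, tau_h=1.2, tau_div=0.6, max_keyframes=5):
    # Gate: update only for reliable, unambiguous consensus predictions.
    # (During occlusion the caller skips this call entirely: memory is frozen.)
    if r_star >= tau_r and tau_h >= entropy:
        z_bar = eta * z_bar + (1.0 - eta) * z_star            # EMA template update
        sim = torch.cosine_similarity(z_bar, z_star, dim=0)
        if tau_div >= sim and max_keyframes > len(bank):      # keep diverse keyframes
            bank.append(z_star.detach().clone())
    return z_bar, bank
</preformat>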
        <sec id="sec-3-6-1">
          <title>Symbol</title>
          <p>M</p>
          <p>T
 M

 













 ∆
−
−
( 1,  2,  3)
(0.5, 0.3, 0.2)
Default
64
8
8
6
128
0.2
0.6
1.2
0.4
0.15
0.85
0.9
5
150
12
3 × 10−4(
)
5 × 10−2
3.8.</p>
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>Neural Network Architectures &amp; Variants</title>
        <p>Backbone (feature extractor): we target embedded deployment and propose three interchangeable
families. (i) Windowed ViT-tiny with 4 stages and patch sizes {4, 2, 2, 2}; depths [2, 2, 6, 2]; embed dims
[64, 128, 192, 256]; MHSA heads [2, 4, 6, 8] with local windows (no deformable attention). (ii) Hybrid
Conv-Attention blocks (ConvNeXt-style depthwise convs + lightweight MHSA) for high throughput.
(iii) Pure CNN fallback (ConvNeXt-Tiny) when attention is budget-constrained. All backbones
output multi-scale features to the heads; we keep the search resolution at 256–320 px.</p>
        <p>Heads (dense proposals): the classification head predicts anchor scores $s_i$; the regression head
predicts $(\Delta x, \Delta y, \Delta \log w, \Delta \log h)$; the IoU head predicts $u_i$. Each head is an MLP/conv tower with two
hidden layers of width $d$. An attention-prior map $a$ is derived from the last backbone stage and
pooled over anchor regions.</p>
        <p>Anchor Tokenizer: for each top-M proposal we concatenate $\phi(f_i)$ with geometry, motion deltas,
scores, attention priors, and the IMU embedding. A linear layer projects to $d$ with LayerNorm.</p>
        <p>AM-GNN (matching): two to three layers of edge-aware graph attention on a bipartite
graph $(t-1) \leftrightarrow t$; edge MLP hidden sizes $[d, d]$; node MLP hidden sizes $[d, 2d]$. We use $k$ candidate
edges per node and perform 5–7 Sinkhorn iterations with temperature $\tau \in [0.15, 0.3]$ and a null
column for unmatched anchors.</p>
        <p>STAT (temporal/spatial refinement): a causal temporal transformer (2 layers, 4 heads, FFN size
$2d$) per anchor index, followed by spatial $k$-NN graph attention (2 layers) per frame. A motion head
predicts residuals on top of a constant-velocity prior. Optionally, a covariance head produces a
diagonal $\Sigma$ for uncertainty-aware fusion.</p>
        <sec id="sec-3-7-1">
          <title>Reliability &amp; Consensus: reliability head</title>
          <p>MLP with widths [ , 
2 , 1] and sigmoid; inputs include
 ,  ,  , temporal consistency, matching entropy, neighbor agreement. Consensus converts ( ,  ,  )to
weights via a learned softmax (or precision-weighted).</p>
        <p>Quantization &amp; deployment: use post-training static quantization (INT8) for heads and MLPs;
keep attention in FP16. Export with ONNX ⟶ TensorRT; fuse LayerNorm and linear layers where
possible. Limit $M \le 64$, $k \le 8$, $T \le 8$ for 60 FPS on Jetson-class SoCs.</p>
          <p>Recovery cycle: a low-confidence/high-entropy state triggers a prior-only mode (STAT with IMU
and neighbor flow), then controlled re-acquisition via AM-GNN and final refinement by consensus
before resuming tracking.</p>
        <p>Model Variants: we provide three sizes that share code and differ only by $d$, depth, and window
sizes. Module dimensions (defaults): unless otherwise stated we use $d = 128$, MLP FFNs with
expansion $2\times$, attention heads $h = 4$, 6 Sinkhorn iterations, and temperature $\tau = 0.2$.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>To evaluate the effectiveness of our proposed approach, we conducted extensive experiments on
standard benchmark datasets and compared the results against several state-of-the-art object
tracking algorithms.</p>
      <p>AnchorFormer-UAV model variants are listed in Table 2 (depths refer to temporal/spatial STAT layers;
targets are guidance for embedded deployment). We evaluate on three benchmarks of increasing
difficulty: OTB-100, a classical benchmark for short-term object tracking; LaSOT, a large-scale
long-term tracking dataset with over 1,400 sequences; and GOT-10k, a diverse dataset with unseen object
categories to test generalization.</p>
      <p>As baselines, we selected both traditional and recent deep learning-based models, with emphasis
on transformer-based trackers: STARK, TransT, OSTrack, MixFormer, MixFormerV2, SiamRPN++,
DiMP, and ECO. Performance was evaluated using standard metrics such as Precision, Recall,
F1-score, and mean Intersection-over-Union (mIoU). Our tracker outperformed these baseline
methods across all benchmarks. On OTB-100, our method achieved an mIoU of 0.87, surpassing
OSTrack (0.84) and MixFormer (0.83). On LaSOT, our F1-score reached 0.92, a significant
improvement over MixFormerV2 (0.88). On GOT-10k, we reduced false positives by 17%
relative to DiMP and ECO.</p>
      <p>To quantify the performance gain from IMU integration, we conducted ablation experiments by
systematically removing the IMU stream from our pipeline. Table 4 shows results with and without
IMU priors on UAV-specific benchmarks (UAV123 and UAVDT).</p>
      <p>The results demonstrate that IMU integration provides substantial performance gains: removing
all IMU components reduces AUC by 6% on both benchmarks. The token-level IMU embedding ($m^t$)
contributes a 4% improvement, the IMU-stabilized matching in AM-GNN adds 3%, and IMU-projected
motion biases in STAT provide a 2% gain. These gains are most pronounced during fast motion and
aggressive camera maneuvers, where inertial priors effectively compensate for ego-motion and
stabilize anchor matching.</p>
      <sec id="sec-4-1">
        <title>Discussion</title>
        <p>Treating anchors as sequences and fusing them by learned reliability yields stable boxes under fast
motion and clutter. GNN matching reduces association errors, especially when appearance changes
abruptly; soft assignments enable graceful handling of uncertainty. IMU priors improve gating and
attention focusing during aggressive maneuvers. Design for deployability (bounded  ,  ,  ,
Sinkhorn iters, and no deformable attention) keeps the model fast and stable on embedded hardware.</p>
        <p>Consensus may over-smooth thin/elongated targets. AM-GNN adds roughly 1–2 ms of latency (tunable
via $M$, $k$, $T$). Test-time adaptation must be rate-limited to avoid drift. Reliance on IMU assumes
synchronization; if IMU is unavailable, we fall back to visual motion cues.</p>
        <p>Future directions include multi-modal fusion (RGB + thermal), a shared STAT across multiple objects
for MOT, language-conditioned tracking, and coupling with SLAM (map priors) for long-term stability.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We introduced AnchorFormer-UAV, a novel tracking framework that unifies temporal anchor
modeling, graph neural matching, reliability prediction, and consensus fusion in a single
differentiable pipeline. This design directly addresses UAV-specific challenges including ego-motion,
occlusion, and small fast-moving targets, while remaining deployable on embedded hardware such
as Jetson-class platforms.</p>
      <p>Our key contributions include: treating anchors as temporal entities augmented with appearance,
geometry, motion, attention, and IMU features; AM-GNN for robust inter-frame matching using
Sinkhorn-based soft assignments; STAT for spatio-temporal refinement; and a learned reliability
mechanism that identifies and down-weights failed anchors during consensus fusion.</p>
      <p>Experimental evaluation on standard benchmarks (OTB-100, LaSOT, GOT-10k) and UAV-specific
datasets (UAV123, UAVDT) demonstrates consistent improvements over state-of-the-art trackers.
Our method achieved an mIoU of 0.87 and F1-score of 0.92, outperforming recent transformer-based
approaches. The ablation studies confirm that IMU integration provides substantial benefits,
contributing up to 6% improvement on UAV benchmarks, with the most significant gains observed
during fast motion and aggressive camera maneuvers. The modular architecture enables flexible
deployment across three model variants (Nano, Tiny, Small) to balance accuracy and computational
constraints while maintaining 60-90 FPS throughput.</p>
      <p>This work establishes promising directions for future research, including multi-modal fusion with
thermal and LiDAR sensors, extension to multi-object tracking scenarios where STAT can provide
shared temporal reasoning, language-conditioned tracking for flexible target specification, and
coupling with SLAM systems for long-term stability. The detailed methodology and
implementation-ready specifications facilitate reproducibility and practical adoption. AnchorFormer-UAV provides a
solid foundation for advancing embedded AI-powered UAV tracking systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), (2019). doi:10.1109/CVPR.2019.00441.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] D. Guo, J. Wang, Y. Cui, Z. Wang, S. Chen. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), (2020).</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] Z. Zhang, H. Peng. Ocean: Object-aware anchor-free tracking. Proceedings of the European Conference on Computer Vision (ECCV 2020), (2020).</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, M. Felsberg. Learning discriminative model prediction for tracking. Proceedings of the International Conference on Computer Vision (ICCV 2019), (2019).</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] B. Yan, H. Peng, J. Fu, D. Wang, H. Lu. Learning spatio-temporal transformer for visual tracking. Proceedings of the International Conference on Computer Vision (ICCV 2021), (2021).</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] X. Chen, B. Yan, J. Zhu, D. Wang, H. Lu, X. Yang. Transformer tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), (2021).</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] B. Ye, H. Chang, B. Ma, S. Shan. Joint feature learning and relation modeling for tracking: A one-stream framework. Proceedings of the European Conference on Computer Vision (ECCV 2022), (2022).</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] Y. Cui, C. Jiang, L. Wang, G. Wu. MixFormer: End-to-end tracking with iterative mixed attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), (2022).</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] C. Mayer, M. Danelljan, G. Bhat, L. Van Gool. Learning target candidate association to keep track of what not to track. Proceedings of the International Conference on Computer Vision (ICCV 2021), (2021).</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] C. Mayer, G. Bhat, M. Danelljan, L. Van Gool. Towards learning a unified model for visual tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), (2022).</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] G. Brasó, L. Leal-Taixé. Learning a neural solver for multiple object tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), (2020).</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] Y. Cui, T. Song, G. Wu, L. Wang. MixFormerV2: Efficient fully-transformer tracking. arXiv preprint arXiv:2305.15896 (2023). doi:10.48550/arXiv.2305.15896.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] L. Hong, S. Yan, R. Zhang, W. Li, X. Zhou, P. Guo, K. Jiang, Y. Chen, J. Li, Z. Chen, W. Zhang. OneTracker: Unifying visual object tracking with foundation models and efficient tuning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), pp. 19079–19091 (2024).</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] Z. Wu, J. Zheng, X. Ren, F.-A. Vasluianu, C. Ma, D. P. Paudel, L. Van Gool, R. Timofte. Un-Track: Single-model and any-modality for video object tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), p. 31696 (2024).</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] X. Chen, B. Kang, W. Geng, J. Zhu, Y. Liu, D. Wang, H. Lu. SUTrack: Towards simple and unified single object tracking. Thirty-Ninth AAAI Conference on Artificial Intelligence, p. 32223 (2025).</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] Q. Wei, G. Zeng, B. Zeng. DETRack: An efficient end-to-end transformer for visual object tracking. arXiv preprint arXiv:2309.02676 (2023). doi:10.48550/arXiv.2309.02676.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] H. Liu, D. Huang, M. Lin. FETrack: Feature-enhanced transformer network for visual object tracking. Applied Sciences 14(22), 10589 (2024). doi:10.3390/app142210589.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] Y. Li, X. Liu, D. Yuan, J. Wang, P. Wu, J. Liu. IAC-Tracker: Transformer-based visual object tracker via learning immediate appearance change. Pattern Recognition 155, 110705 (2024). doi:10.1016/j.patcog.2024.110705.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] K. Huang, J. Chu, L. Leng, X. Dong. TATrack: Target-aware transformer for object tracking. Engineering Applications of Artificial Intelligence 127, Part B, 107304 (2024). doi:10.1016/j.engappai.2023.107304.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] L. Peng, J. Gao, X. Liu, W. Li, S. Dong, Z. Zhang, H. Fan, L. Zhang. VastTrack: Vast category visual object tracking. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), p. 97849 (2024). doi:10.52202/079017-4157.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] B. Xie, C. Zhang, F. Wang, P. Liu, F. Lu, C. Zhen, H. Weiming. CST Anti-UAV: A thermal infrared benchmark for tiny UAV single object tracking. arXiv preprint arXiv:2507.23473 (2025). doi:10.48550/arXiv.2507.23473.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] M. Mueller, N. Smith, B. Ghanem. A benchmark and simulator for UAV tracking. Proceedings of the European Conference on Computer Vision (ECCV 2016), (2016).</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] D. Du, Y. Tan, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, Q. Tian. Unmanned aerial vehicle benchmark: Object detection and tracking. Proceedings of the European Conference on Computer Vision (ECCV 2018), pp. 370–386 (2018).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] H. Fan, L. Lin, F. Yang, P. Chu, J. Deng, Y. Yu, H. Huang, P. Liu, H. Xu, G. Bhat, et al. Anti-UAV: A large multi-modal benchmark for UAV tracking. Proceedings of the European Conference on Computer Vision (ECCV 2020), (2020).</mixed-citation></ref>
    </ref-list>
  </back>
</article>