<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Differentiable Temporal Anchor Consensus with Graph Neural Anchor Matching for Robust UAV Object Tracking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vasyl Tereshchenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Embedded AI</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>UAV Tracking, Visual Object Tracking, Transformers, Graph Neural Networks, Test-time Adaptation,</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>Akademika Hlushkova Av. 4d, 03680 Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Unmanned aerial vehicles (UAVs) require robust real-time object tracking in cluttered environments such as forests, roads, and urban areas. Existing transformer-based trackers such as OSTrack and MixFormer achieve strong per-frame accuracy but often fail under occlusion, rapid ego-motion, and distractors because anchors are treated independently across time and sensor signals are ignored. We propose AnchorFormer-UAV, a fully differentiable tracker that treats anchors as temporal entities and unifies: (i) an Anchor Tokenizer that fuses appearance, geometry, motion, attention priors, and IMU cues; (ii) AM-GNN for inter-frame anchor matching with Sinkhorn-based soft assignments; (iii) a STAT spatio-temporal transformer for temporal and spatial refinement; and (iv) a Reliability &amp; Consensus head that down-weights failed anchors and fuses predictions. The system is designed for embedded deployment (Jetson-class), maintaining 60–90 FPS at 256–288 px search inputs while improving robustness on UAV benchmarks.</p>
      </abstract>
      <kwd-group>
        <kwd>UAV Tracking</kwd>
        <kwd>Visual Object Tracking</kwd>
        <kwd>Transformers</kwd>
        <kwd>Graph Neural Networks</kwd>
        <kwd>Test-time Adaptation</kwd>
        <kwd>Embedded AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Trackers with temporal reasoning and memory (KeepTrack [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ToMP [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) introduce memory
and optimization for temporal robustness. However, they neither formulate anchors as temporal
entities nor fuse them by learned consensus.
      </p>
      <p>
        Graph neural networks (GNNs) have improved data association in multi-object tracking [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], learning
to connect detections across frames. We adapt this idea to single-object tracking by matching anchors
across frames via a bipartite GNN (AM-GNN), yielding soft assignments that seed temporal
processing.
      </p>
      <p>
        Recent unified and state-of-the-art trackers and benchmarks (2023–2025) include MixFormerV2 for efficient
fully-transformer tracking [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], OneTracker that leverages foundation models and efficient tuning
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Un-Track for any-modality tracking [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and SUTrack that unifies five SOT tasks in a single
model [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. End-to-end transformer heads such as DETRack [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and design variants like FETrack
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], IAC-Tracker [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and TATrack [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] push accuracy/efficiency. New large-scale or
domain-specific resources (VastTrack [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and CST Anti-UAV [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]) increase category coverage and UAV
difficulty. Our approach differs by explicitly modeling temporal anchor reliability with GNN-based
soft matching and IMU-aware priors inside a single differentiable loop.
      </p>
      <p>
        UAV123 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], UAVDT [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and Anti-UAV [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] expose small objects, motion blur, and occlusions.
Few works integrate UAV IMU/VIO signals into the neural network itself. Our design encodes IMU priors for motion
gating and feature biasing.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Statement</title>
      <p>Robust UAV tracking requires temporal anchor stabilization, learned reliability to down-weight
failed anchors, and motion priors from IMU/VIO. Our goal is therefore to introduce the tracker
AnchorFormer-UAV, which unifies these components in a single differentiable pipeline. To achieve this
goal, we solved the following tasks:
• a temporal anchor representation: anchors become tokens augmented with motion,
attention, and IMU features;
• AM-GNN: a graph neural module for inter-frame anchor matching with Sinkhorn-based soft
assignments;
• STAT: a spatio-temporal transformer that refines matched anchors across time and space;
• Reliability &amp; Consensus: learned per-anchor trust and soft fusion producing robust
predictions under occlusion;
• a practical training recipe with occlusion survival, anchor/frame dropout, and Jetson-friendly
deployment.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our pipeline (Figure 1) comprises a transformer backbone with detection heads, the Anchor Tokenizer,
AM-GNN for inter-frame matching, STAT for temporal/spatial refinement, and the Reliability and Consensus
heads. Final predictions are obtained by reliability-aware consensus of refined anchors.</p>
      <sec id="sec-3-1">
        <title>Anchor Tokenization (Step A: turning proposals into temporal tokens)</title>
        <p>Goal. Convert per-frame anchor proposals into compact tokens that carry (i) appearance, (ii)
geometry, (iii) motion context, (iv) attention priors, and (v) inertial priors.</p>
        <p>Inputs. For each top-M anchor i at frame t from the detection heads we have a feature vector $\phi(f_i^t)$,
a box $b_i^t = (x, y, \log w, \log h)$, a classification score $s_i^t$, an IoU score $u_i^t$, and an attention prior $a_i^t$ obtained
by average pooling the backbone attention weights over the anchor region. IMU/VIO readings in a
small time window around $t$ are encoded into $m^t$ (yaw/pitch/roll deltas and planar velocities) by a
two-layer MLP.
Motion deltas. We compute $\Delta b_i^t = b_i^t - b_{\pi(i)}^{t-1}$, where $\pi(i)$ is the best anchor continuation from frame $t-1$
(initially the nearest center; later replaced by AM-GNN soft matches) (Figure 2). This provides a velocity
proxy without explicit optical flow.</p>
        <p>Token. The final token is $z_i^t = [\phi(f_i^t),\, b_i^t,\, \Delta b_i^t,\, s_i^t,\, u_i^t,\, a_i^t,\, m^t] \in \mathbb{R}^{D}$, passed through
LayerNorm and a linear projection to size d (typically d = 128).</p>
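        <p>As a concrete illustration, a minimal PyTorch sketch of the tokenizer is given below. It is a sketch under stated assumptions: the module name, the feature and IMU dimensions, and the six-channel IMU input are illustrative choices rather than the exact implementation.</p>
        <preformat>
# Illustrative Anchor Tokenizer sketch (names and dimensions are assumptions).
import torch
import torch.nn as nn

class AnchorTokenizer(nn.Module):
    def __init__(self, feat_dim=256, imu_dim=16, d=128):
        super().__init__()
        # Two-layer MLP encoding the IMU/VIO window (yaw/pitch/roll deltas, planar velocities)
        self.imu_mlp = nn.Sequential(nn.Linear(6, imu_dim), nn.ReLU(),
                                     nn.Linear(imu_dim, imu_dim))
        # phi + box + delta-box + cls + IoU + attention prior + IMU embedding
        in_dim = feat_dim + 4 + 4 + 1 + 1 + 1 + imu_dim
        self.proj = nn.Sequential(nn.LayerNorm(in_dim), nn.Linear(in_dim, d))

    def forward(self, phi, box, dbox, s, u, a, imu):
        # phi: (M, feat_dim); box, dbox: (M, 4) as (x, y, log w, log h);
        # s, u, a: (M, 1); imu: (6,) shared per frame
        m = self.imu_mlp(imu).expand(phi.size(0), -1)   # per-frame IMU embedding m_t
        z = torch.cat([phi, box, dbox, s, u, a, m], dim=-1)
        return self.proj(z)                             # (M, d) anchor tokens
</preformat>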
      </sec>
      <sec id="sec-3-2">
        <title>AM-GNN (Step B: inter-frame anchor matching)</title>
        <p>Nearest-neighbor matching by IoU fails under fast motion and occlusion. We learn a bipartite
association between anchors at frames (t-1) and t that combines geometry, appearance, reliability, and IMU
priors.</p>
        <p>Graph construction. We build a bipartite graph with nodes {i} at t-1 and {j} at t. For efficiency, we
keep only k candidates per node using an IMU-stabilized motion gate (e.g., k = 16).</p>
        <p>Edge features:
$e_{ij} = [\Delta x,\, \Delta y,\, \Delta \log w,\, \Delta \log h,\, \cos(f_i, f_j),\, g_{ij},\, r_i^{t-1}]$, (1)
where $\cos(f_i, f_j)$ is the cosine similarity of the head features, $g_{ij}$ is the residual after compensating
rotation/translation using the IMU, and $r_i^{t-1}$ is the previous-frame reliability (bootstrapped as $r_i = 1$ at
$t = 1$).</p>
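        <p>A minimal sketch of assembling the edge features of Eq. (1) is shown below; the function name is illustrative, and how the IMU-compensated residual $g_{ij}$ is computed upstream is assumed rather than specified here.</p>
        <preformat>
# Sketch of AM-GNN edge features, Eq. (1); names are assumptions.
import torch

def edge_features(box_prev, box_cur, f_prev, f_cur, g_imu, r_prev):
    # box_*: (4,) = (x, y, log w, log h); f_*: (D,) head features;
    # g_imu: IMU-compensated motion residual (scalar tensor); r_prev: reliability.
    d = box_cur - box_prev                                  # (dx, dy, dlog w, dlog h)
    cos_f = torch.cosine_similarity(f_prev, f_cur, dim=0)   # appearance similarity
    return torch.cat([d, cos_f.reshape(1), g_imu.reshape(1), r_prev.reshape(1)])
</preformat>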
        <p>Message passing. Two or three layers of edge-aware attention update node embeddings and
produce edge affinities $\alpha_{ij} = \mathrm{MLP}([h_i^{t-1} \,\|\, h_j^{t} \,\|\, e_{ij}])$.</p>
        <p>Sinkhorn assignment with null. We form costs $C_{ij} = -\alpha_{ij}$, append a null column to allow
unmatched anchors, and compute a doubly-stochastic soft assignment $P$. The Sinkhorn temperature $\tau$ is
annealed during training; rows with high entropy are treated as uncertain matches.</p>
        <p>Seeding STAT. Soft-matched seeds are
$\hat{b}_j^{t} = \sum_i P_{ij}\, b_i^{t-1}$ and $\hat{f}_j^{t} = \sum_i P_{ij}\, \phi(f_i^{t-1})$,
which replace naive continuation and reduce ID switches.</p>
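        <p>The log-domain Sinkhorn step with a null column can be sketched as follows (a common formulation; the null cost value and the seeding line are assumptions consistent with the equations above):</p>
        <preformat>
# Sketch of Sinkhorn soft assignment with a null column; defaults are assumptions.
import torch

def sinkhorn_with_null(affinity, n_iters=6, tau=0.2, null_cost=0.0):
    # affinity: (N_prev, N_cur) edge affinities alpha_ij; costs C = -alpha.
    cost = torch.cat([-affinity,
                      torch.full((affinity.size(0), 1), null_cost)], dim=1)
    log_p = -cost / tau                      # temperature tau is annealed in training
    for _ in range(n_iters):                 # alternating row/column normalization
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p.exp()                       # (N_prev, N_cur + 1) soft matches P

# Seeding usage (drop the null column): b_hat_j = sum_i P_ij * b_prev_i
# P = sinkhorn_with_null(alpha); b_hat = P[:, :-1].T @ b_prev
</preformat>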
      </sec>
      <sec id="sec-3-3">
        <title>STAT (Step C: spatio-temporal refinement)</title>
        <p>Inputs. For a window of T frames, STAT receives the matched tokens $\{\hat{z}_i^t, \hat{b}_i^t, \hat{f}_i^t, \hat{s}_i^t, \hat{u}_i^t, m^t\}$
for $i = 1 \ldots M$ and $t = t_0 - T + 1 \ldots t_0$.</p>
        <p>Temporal block (causal). Per anchor index i we process the sequence with a causal
self-attention/GRU. We add relative positional biases in time to prefer smooth motion:
$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + \mathcal{B}\right) V .$</p>
        <p>Spatial block (per frame). For each frame we build a kNN graph among anchors (by center
distance) and run graph attention. Edges are biased by attention similarity and IMU-projected motion
to emphasize scene-consistent movement (e.g., anchors on the same object).</p>
        <p>Neural motion refinement. A small MLP predicts residuals on top of a constant-velocity prior:
$b_i^{t} = b_i^{t-1} + (b_i^{t-1} - b_i^{t-2}) + \Delta\hat{b}_i^{t} .$</p>
        <p>Optionally we predict a diagonal uncertainty $\Sigma_i^t$ from pooled features to quantify confidence.</p>
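        <p>A compact sketch of this refinement step follows; the residual and variance heads are illustrative assumptions on top of the constant-velocity prior stated above.</p>
        <preformat>
# Sketch of STAT motion refinement: constant-velocity prior + learned residual.
import torch
import torch.nn as nn

class MotionRefiner(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))
        self.log_var = nn.Linear(d, 4)        # optional diagonal uncertainty head

    def forward(self, h, b_prev, b_prev2):
        # h: (M, d) refined anchor embeddings; b_prev, b_prev2: boxes at t-1, t-2
        cv = b_prev + (b_prev - b_prev2)      # constant-velocity extrapolation
        b = cv + self.residual(h)             # learned residual correction
        sigma = self.log_var(h).exp()         # per-coordinate variance (diagonal)
        return b, sigma
</preformat>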
      </sec>
      <sec id="sec-3-4">
        <title>Reliability Head (Step D: learning trustworthiness)</title>
        <p>Purpose. Identify failed/drifting anchors and down-weight them in fusion. We aggregate per-anchor
indicators (classification score, IoU score, attention prior, temporal consistency, matching entropy $H(P_i^{*})$,
and neighbor agreement) and predict $r_i^t = \sigma(\mathrm{MLP}(h_i^t))$. Targets are soft labels derived from IoU
to ground truth; we also apply focal reweighting to emphasize ambiguous anchors.</p>
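        <p>A minimal sketch of this loss, combining the clipped-IoU soft targets of Section 3.6 with a focal-style reweighting, is given below; the threshold and gamma values are assumed defaults.</p>
        <preformat>
# Sketch of the reliability loss: soft targets from clipped IoU + focal reweighting.
import torch
import torch.nn.functional as F

def reliability_loss(r_pred, iou_gt, tau_lo=0.3, tau_hi=0.7, gamma=2.0):
    # r_pred: (M,) sigmoid outputs of the reliability head; iou_gt: (M,) IoU to GT.
    target = ((iou_gt - tau_lo) / (tau_hi - tau_lo)).clamp(0.0, 1.0)  # soft labels
    bce = F.binary_cross_entropy(r_pred, target, reduction="none")
    focal_w = (r_pred - target).abs() ** gamma   # emphasize ambiguous anchors
    return (focal_w * bce).mean()
</preformat>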
      </sec>
      <sec id="sec-3-5">
        <title>Consensus Head (Step E: fusing anchors into a robust box)</title>
        <p>Softmax fusion:
$w_i = \mathrm{softmax}\!\left(\lambda_1 s_i + \lambda_2 u_i + \lambda_3 r_i\right)$, $\quad b^{*} = \sum_i w_i\, b_i .$</p>
        <p>Uncertainty-aware variant (optional). If STAT predicts $\Sigma_i$, we can use precision-weighted
averaging: $b^{*} = \left(\sum_i w_i \Sigma_i^{-1}\right)^{-1} \left(\sum_i w_i \Sigma_i^{-1} b_i\right) .$</p>
        <p>The end-to-end training and inference algorithms are shown in Figures 3 and 4.</p>
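        <p>Both fusion variants can be sketched in a few lines; the weight parameterization below follows the $(\lambda_1, \lambda_2, \lambda_3)$ defaults of Table 1, and the diagonal-covariance handling is an assumption.</p>
        <preformat>
# Sketch of consensus fusion: softmax weights, with an optional precision-weighted variant.
import torch

def consensus(boxes, s, u, r, lambdas=(0.5, 0.3, 0.2), sigma=None):
    # boxes: (M, 4); s, u, r: (M,) cls score, IoU score, reliability.
    w = torch.softmax(lambdas[0] * s + lambdas[1] * u + lambdas[2] * r, dim=0)
    if sigma is None:
        return (w[:, None] * boxes).sum(dim=0)            # b* = sum_i w_i b_i
    prec = w[:, None] / sigma                             # w_i * Sigma_i^{-1} (diagonal)
    return (prec * boxes).sum(dim=0) / prec.sum(dim=0)    # precision-weighted average
</preformat>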
      </sec>
      <sec id="sec-3-6">
        <title>Losses and Objectives</title>
        <p>Detection loss combines QFL, GIoU+L1, and IoU-score losses. Reliability uses soft targets
$\tilde{r}_i = \mathrm{clip}\!\left((\mathrm{IoU}(b_i, b^{gt}) - \tau_{lo})/(\tau_{hi} - \tau_{lo}),\, 0,\, 1\right)$ with BCE. The consensus loss penalizes
$\|b^{*} - b^{gt}\|$, and a temporal smoothness term penalizes second-order center differences. AM-GNN uses
assignment cross-entropy on $P$ against ground-truth bipartite labels. The total loss is the weighted sum
$\mathcal{L} = \sum_k \lambda_k \mathcal{L}_k$ over these terms.</p>
        <p>Inference proceeds per frame: (1) the detection heads produce proposals, (2) the Anchor Tokenizer
builds tokens, (3) AM-GNN matches them to the previous frame, (4) a one-step causal STAT update
refines them, (5) reliability and consensus output $b^{*}$, and (6) the template bank is updated.</p>
        <p>Runtime defaults. We use $M \le 64$ anchors, $k = 8$ neighbors, and $T = 8$ frames; AM-GNN uses
2–3 layers and Sinkhorn with 5–7 iterations. On Jetson Orin NX (FP16), the added overhead over the
base tracker is about 2 ms, keeping 60–90 FPS for 256–288 px search inputs (see Table 1 for default
hyperparameters and deploy-time knobs).</p>
        <p>Dynamic template policy. We maintain a short-term EMA template $\bar{z}$ and a keyframe bank
$\mathcal{M} = \{(z_k, t_k)\}_{k=1}^{K}$, with a small distractor bank $\mathcal{D}$ (hard negatives). Template updates are
allowed only when the consensus prediction is reliable and unambiguous (reliability above a threshold and
assignment entropy below a threshold). The EMA template is then updated as
$\bar{z} \leftarrow \eta \bar{z} + (1 - \eta)\, \hat{z}^{*}$, and a new keyframe is added if $\cos(\bar{z}, \hat{z}^{*}) \le \tau_{div}$
(pruning by TTL or redundancy). During occlusion, memory is frozen. For scoring, we use a soft mixture
$\tilde{s} = \beta_0\, s(\bar{z}) + \sum_k \beta_k\, s(z_k)$ with $s(\cdot)$ a cosine-similarity scoring function, and
suppress candidates similar to negatives in $\mathcal{D}$.</p>
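        <p>A condensed sketch of the update gating and EMA step follows; the threshold values and the exact gating conditions are assumptions consistent with the policy described above.</p>
        <preformat>
# Sketch of the dynamic template policy; thresholds are assumed defaults.
import torch

def update_templates(z_bar, z_star, r_star, entropy, bank,
                     eta=0.9, tau_r=0.85, tau_h=1.2, tau_div=0.6, max_keyframes=5):
    # Gate: update only for reliable, unambiguous consensus predictions.
    # (During occlusion the caller skips this call entirely: memory is frozen.)
    if r_star >= tau_r and tau_h >= entropy:
        z_bar = eta * z_bar + (1.0 - eta) * z_star            # EMA template update
        sim = torch.cosine_similarity(z_bar, z_star, dim=0)
        if tau_div >= sim and max_keyframes > len(bank):      # keep diverse keyframes
            bank.append(z_star.detach().clone())
    return z_bar, bank
</preformat>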
        <sec id="sec-3-6-1">
          <title>Symbol</title>
          <p>M</p>
          <p>T
 M

 













 ∆
−
−
( 1,  2,  3)
(0.5, 0.3, 0.2)
Default
64
8
8
6
128
0.2
0.6
1.2
0.4
0.15
0.85
0.9
5
150
12
3 × 10−4(
)
5 × 10−2
3.8.</p>
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>Neural Network Architectures &amp; Variants</title>
        <p>Backbone (feature extractor): we target embedded deployment and propose three interchangeable
families. (i) Windowed ViT-tiny with 4 stages and patch sizes {4, 2, 2, 2}; depths [2, 2, 6, 2]; embed dims
[64, 128, 192, 256]; MHSA heads [2, 4, 6, 8] with local windows (no deformable attention). (ii) Hybrid
Conv-Attention blocks (ConvNeXt-style depthwise convs + lightweight MHSA) for high throughput.
(iii) Pure CNN fallback (ConvNeXt-Tiny) when attention is budget-constrained. All backbones
output multi-scale features to the heads; we keep the search resolution at 256–320 px.</p>
        <p>Heads (dense proposals): the classification head predicts anchor scores $s_i$; the regression head
predicts $(\Delta x, \Delta y, \Delta \log w, \Delta \log h)$; the IoU head predicts $u_i$. Each head is an MLP/conv tower with two
hidden layers of width $d$. An attention-prior map $a$ is derived from the last backbone stage and
pooled over anchor regions.</p>
        <p>Anchor Tokenizer: for each top-M proposal we concatenate $\phi(f_i)$ with geometry, motion deltas,
scores, attention priors, and the IMU embedding. A linear layer projects to $d$ with LayerNorm.</p>
        <p>AM-GNN (matching): two to three layers of edge-aware graph attention on a bipartite
graph $(t-1) \leftrightarrow t$; edge MLP hidden sizes $[d, d]$; node MLP hidden sizes $[d, 2d]$. We use $k$ candidate
edges per node and perform 5–7 Sinkhorn iterations with temperature $\tau \in [0.15, 0.3]$ and a null
column for unmatched anchors.</p>
        <p>STAT (temporal/spatial refinement): a causal temporal transformer (2 layers, 4 heads, FFN size
$2d$) per anchor index, followed by spatial $k$-NN graph attention (2 layers) per frame. A motion head
predicts residuals on top of a constant-velocity prior. Optionally, a covariance head produces a
diagonal $\Sigma$ for uncertainty-aware fusion.</p>
        <sec id="sec-3-7-1">
          <title>Reliability &amp; Consensus: reliability head</title>
          <p>MLP with widths [ , 
2 , 1] and sigmoid; inputs include
 ,  ,  , temporal consistency, matching entropy, neighbor agreement. Consensus converts ( ,  ,  )to
weights via a learned softmax (or precision-weighted).</p>
        <p>Quantization &amp; deployment: use post-training static quantization (INT8) for heads and MLPs;
keep attention in FP16. Export with ONNX ⟶ TensorRT; fuse LayerNorm and linear layers where
possible. Limit $M \le 64$, $k \le 8$, $T \le 8$ for 60 FPS on Jetson-class SoCs.</p>
          <p>Recovery cycle: a low-confidence/high-entropy state triggers a prior-only mode (STAT with IMU
and neighbor flow), then controlled re-acquisition via AM-GNN and final refinement by consensus
before resuming tracking.</p>
        <p>Model Variants: we provide three sizes that share code and differ only by $d$, depth, and window
sizes. Module dimensions (defaults): unless otherwise stated we use $d = 128$, MLP FFNs with
expansion $2\times$, attention heads $h = 4$, 6 Sinkhorn iterations, and temperature $\tau = 0.2$.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>To evaluate the effectiveness of our proposed approach, we conducted extensive experiments on
standard benchmark datasets and compared the results against several state-of-the-art object
tracking algorithms.</p>
      <p>AnchorFormer-UAV model variants are listed in Table 2 (depths refer to temporal/spatial STAT layers;
targets are guidance for embedded deployment). We evaluate on three benchmarks of increasing
difficulty: OTB-100, a classical benchmark for short-term object tracking; LaSOT, a large-scale
long-term tracking dataset with over 1,400 sequences; and GOT-10k, a diverse dataset with unseen object
categories to test generalization.</p>
      <p>As baselines, we selected both traditional and recent deep learning-based models, with emphasis
on transformer-based trackers: STARK, TransT, OSTrack, MixFormer, MixFormerV2, SiamRPN++,
DiMP, and ECO. Performance was evaluated using standard metrics such as Precision, Recall,
F1-score, and mean Intersection-over-Union (mIoU). Our tracker outperformed these baseline
methods across all benchmarks. On OTB-100, our method achieved an mIoU of 0.87, surpassing
OSTrack (0.84) and MixFormer (0.83). On LaSOT, our F1-score reached 0.92, a significant
improvement over MixFormerV2 (0.88). On GOT-10k, we reduced false positives by 17%
relative to DiMP and ECO.</p>
      <p>To quantify the performance gain from IMU integration, we conducted ablation experiments by
systematically removing the IMU stream from our pipeline. Table 4 shows results with and without
IMU priors on UAV-specific benchmarks (UAV123 and UAVDT).</p>
      <p>The results demonstrate that IMU integration provides substantial performance gains: removing
all IMU components reduces AUC by 6% on both benchmarks. The token-level IMU embedding ($m^t$)
contributes a 4% improvement, the IMU-stabilized matching in AM-GNN adds 3%, and IMU-projected
motion biases in STAT provide a 2% gain. These gains are most pronounced during fast motion and
aggressive camera maneuvers, where inertial priors effectively compensate for ego-motion and
stabilize anchor matching.</p>
      <sec id="sec-4-1">
        <title>Discussion</title>
        <p>Treating anchors as sequences and fusing them by learned reliability yields stable boxes under fast
motion and clutter. GNN matching reduces association errors, especially when appearance changes
abruptly; soft assignments enable graceful handling of uncertainty. IMU priors improve gating and
attention focusing during aggressive maneuvers. Design for deployability (bounded  ,  ,  ,
Sinkhorn iters, and no deformable attention) keeps the model fast and stable on embedded hardware.</p>
        <p>Consensus may over-smooth thin/elongated targets. AM-GNN adds roughly 1–2 ms of latency (tunable
via $M$, $k$, $T$). Test-time adaptation must be rate-limited to avoid drift. Reliance on IMU assumes
synchronization; if IMU is unavailable, we fall back to visual motion cues.</p>
        <p>Future directions include multi-modal fusion (RGB + thermal), a shared STAT across multiple objects
for MOT, language-conditioned tracking, and coupling with SLAM (map priors) for long-term stability.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We introduced AnchorFormer-UAV, a novel tracking framework that unifies temporal anchor
modeling, graph neural matching, reliability prediction, and consensus fusion in a single
differentiable pipeline. This design directly addresses UAV-specific challenges including ego-motion,
occlusion, and small fast-moving targets, while remaining deployable on embedded hardware such
as Jetson-class platforms.</p>
      <p>Our key contributions include: treating anchors as temporal entities augmented with appearance,
geometry, motion, attention, and IMU features; AM-GNN for robust inter-frame matching using
Sinkhorn-based soft assignments; STAT for spatio-temporal refinement; and a learned reliability
mechanism that identifies and down-weights failed anchors during consensus fusion.</p>
      <p>Experimental evaluation on standard benchmarks (OTB-100, LaSOT, GOT-10k) and UAV-specific
datasets (UAV123, UAVDT) demonstrates consistent improvements over state-of-the-art trackers.
Our method achieved an mIoU of 0.87 and F1-score of 0.92, outperforming recent transformer-based
approaches. The ablation studies confirm that IMU integration provides substantial benefits,
contributing up to 6% improvement on UAV benchmarks, with the most significant gains observed
during fast motion and aggressive camera maneuvers. The modular architecture enables flexible
deployment across three model variants (Nano, Tiny, Small) to balance accuracy and computational
constraints while maintaining 60-90 FPS throughput.</p>
      <p>This work establishes promising directions for future research, including multi-modal fusion with
thermal and LiDAR sensors, extension to multi-object tracking scenarios where STAT can provide
shared temporal reasoning, language-conditioned tracking for flexible target specification, and
coupling with SLAM systems for long-term stability. The detailed methodology and
implementation-ready specifications facilitate reproducibility and practical adoption. AnchorFormer-UAV provides a
solid foundation for advancing embedded AI-powered UAV tracking systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), (2019). doi:10.1109/CVPR.2019.00441.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] D. Guo, J. Wang, Y. Cui, Z. Wang, S. Chen. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), (2020).</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] Z. Zhang, H. Peng. Ocean: Object-aware anchor-free tracking. Proceedings of the European Conference on Computer Vision (ECCV 2020), (2020).</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, M. Felsberg. Learning discriminative model prediction for tracking. Proceedings of the International Conference on Computer Vision (ICCV 2019), (2019).</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] B. Yan, H. Peng, J. Fu, D. Wang, H. Lu. Learning spatio-temporal transformer for visual tracking. Proceedings of the International Conference on Computer Vision (ICCV 2021), (2021).</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] X. Chen, B. Yan, J. Zhu, D. Wang, H. Lu, X. Yang. Transformer tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), (2021).</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] B. Ye, H. Chang, B. Ma, S. Shan. Joint feature learning and relation modeling for tracking: A one-stream framework. Proceedings of the European Conference on Computer Vision (ECCV 2022), (2022).</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] Y. Cui, C. Jiang, L. Wang, G. Wu. MixFormer: End-to-end tracking with iterative mixed attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), (2022).</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] C. Mayer, M. Danelljan, G. Bhat, L. Van Gool. Learning target candidate association to keep track of what not to track. Proceedings of the International Conference on Computer Vision (ICCV 2021), (2021).</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] C. Mayer, G. Bhat, M. Danelljan, L. Van Gool. Towards learning a unified model for visual tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), (2022).</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] G. Brasó, L. Leal-Taixé. Learning a neural solver for multiple object tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), (2020).</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] Y. Cui, T. Song, G. Wu, L. Wang. MixFormerV2: Efficient fully-transformer tracking. arXiv preprint arXiv:2305.15896 (2023). doi:10.48550/arXiv.2305.15896.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] L. Hong, S. Yan, R. Zhang, W. Li, X. Zhou, P. Guo, K. Jiang, Y. Chen, J. Li, Z. Chen, W. Zhang. OneTracker: Unifying visual object tracking with foundation models and efficient tuning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), pp. 19079–19091 (2024).</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] Z. Wu, J. Zheng, X. Ren, F.-A. Vasluianu, C. Ma, D. P. Paudel, L. Van Gool, R. Timofte. Un-Track: Single-model and any-modality for video object tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), p. 31696 (2024).</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] X. Chen, B. Kang, W. Geng, J. Zhu, Y. Liu, D. Wang, H. Lu. SUTrack: Towards simple and unified single object tracking. Thirty-Ninth AAAI Conference on Artificial Intelligence, p. 32223 (2025).</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] Q. Wei, G. Zeng, B. Zeng. DETRack: An efficient end-to-end transformer for visual object tracking. arXiv preprint arXiv:2309.02676 (2023). doi:10.48550/arXiv.2309.02676.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] H. Liu, D. Huang, M. Lin. FETrack: Feature-enhanced transformer network for visual object tracking. Applied Sciences 14(22), 10589 (2024). doi:10.3390/app142210589.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] Y. Li, X. Liu, D. Yuan, J. Wang, P. Wu, J. Liu. IAC-Tracker: Transformer-based visual object tracker via learning immediate appearance change. Pattern Recognition 155, 110705 (2024). doi:10.1016/j.patcog.2024.110705.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] K. Huang, J. Chu, L. Leng, X. Dong. TATrack: Target-aware transformer for object tracking. Engineering Applications of Artificial Intelligence 127, Part B, 107304 (2024). doi:10.1016/j.engappai.2023.107304.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] L. Peng, J. Gao, X. Liu, W. Li, S. Dong, Z. Zhang, H. Fan, L. Zhang. VastTrack: Vast category visual object tracking. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), p. 97849 (2024). doi:10.52202/079017-4157.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] B. Xie, C. Zhang, F. Wang, P. Liu, F. Lu, C. Zhen, H. Weiming. CST Anti-UAV: A thermal infrared benchmark for tiny UAV single object tracking. arXiv preprint arXiv:2507.23473 (2025). doi:10.48550/arXiv.2507.23473.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] M. Mueller, N. Smith, B. Ghanem. A benchmark and simulator for UAV tracking. Proceedings of the European Conference on Computer Vision (ECCV 2016), (2016).</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] D. Du, Y. Tan, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, Q. Tian. Unmanned aerial vehicle benchmark: Object detection and tracking. Proceedings of the European Conference on Computer Vision (ECCV 2018), pp. 370–386 (2018).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] H. Fan, L. Lin, F. Yang, P. Chu, J. Deng, Y. Yu, H. Huang, P. Liu, H. Xu, G. Bhat, et al. Anti-UAV: A large multi-modal benchmark for UAV tracking. Proceedings of the European Conference on Computer Vision (ECCV 2020), (2020).</mixed-citation></ref>
    </ref-list>
  </back>
</article>