<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Level Pose-Guidance with Cross-Modality Fusion for Long-Term Spatio-Temporal Person Re-Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qingyuan Deng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Keyu Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jindan Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoning Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinxin Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shihai He</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Feng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, Sichuan Normal University</institution>
          ,
          <addr-line>Chengdu 610066</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer and Software, Chengdu Jincheng College</institution>
          ,
          <addr-line>Chengdu, 611731</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sichuan Mineral Electromechanic Technician College</institution>
          ,
          <addr-line>Chengdu, 610503</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Person re-identification (Re-ID) is an important visual task related to surveillance security, aimed at enhancing the tracking of the same individual across spatio-temporal regions. Traditional Re-ID methods predominantly depend on extracting garment-dominated texture features from global appearance representations. This inherent clothing bias leads to performance degradation in long-term spatio-temporal scenarios where appearance consistency cannot be guaranteed (e.g., clothing changes). In recent years, research on clothing changes in long-term scenarios has gained increasing attention. Although most existing clothing-change Re-ID methods attempt to learn distinctive identity features of individuals (e.g., posture features), they are still subject to interference from clothing information. To mitigate this impact, this paper introduces a Multi-Level Pose-Guidance with Cross-Modality Fusion (MPCF) framework for clothing-change person re-identification. The framework consists of three main components: a Shape Embedding (SE) branch, a Cross-Modality Fusion (CMF) branch, and a Multi-Level Feature Guidance (MLFG) branch. The MLFG branch, in conjunction with the SE branch, helps the CMF branch learn more human pose information during the inference stage. We demonstrate the effectiveness of this method through extensive experiments and achieve excellent performance on several clothing-change Re-ID benchmarks.</p>
      </abstract>
      <kwd-group>
        <kwd>Person re-identification</kwd>
        <kwd>Cross-temporal-spatial person tracking</kwd>
        <kwd>Long-Term scenarios</kwd>
        <kwd>Computer vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Person re-identification (Re-ID) is an important automated person retrieval technology in video
surveillance systems. It aims to connect the movement trajectories of individuals across different
spatio-temporal regions, facilitating person tracking across time, locations, and devices. This technology holds
significant research value for public safety. Over the past decade, traditional person
Re-ID has been extensively researched, but few models have been deployed in practical applications.
This is because information in real-world spatio-temporal scenarios is complex and dynamic, and
multiple factors constrain model performance. One of the key factors affecting re-identification performance
is a change in a person's clothing.</p>
      <p>In real-life scenarios, persons may change their clothes for various reasons, such as weather changes,
personal preferences, or specific occasion requirements. These changes not only alter the appearance
of persons but also increase the instability of their identity features, posing a significant challenge
to traditional appearance-based Re-ID systems. Traditional Re-ID methods typically rely on shallow
features such as color, texture, and shape; these features frequently exhibit instability and limited
robustness in long-term scenarios.</p>
      <p>As shown in Fig. 1, the same person wearing different clothes across different spatio-temporal scenarios
exhibits significant appearance feature discrepancies. Conversely, different individuals dressed in
similar clothing show excessively similar texture information. Therefore, relying solely on appearance
information to address long-term problems is infeasible.</p>
      <p>[Fig. 1: illustration of clothing change, similar clothing, and appearance entanglement: texture-feature differences increase the distance between samples of the same identity, while texture-feature similarity decreases the distance between different identities.]</p>
      <p>To address the challenge of clothing changes in long-term scenarios, recent research focuses on
learning clothing-agnostic identity features. While some methods [1, 2] directly decouple identity cues from raw images,
this often results in incomplete feature learning due to the absence of multi-modal guidance. Others
exploit biometric traits (e.g., body shape) as stable identity markers, yet their extraction from RGB
images remains non-trivial. Consequently, auxiliary modalities are widely adopted: pose estimation
[3, 4], gait recognition [2], and human keypoints/sketches [5] have been integrated to reduce clothing
dependency. However, two critical issues persist: (1) clothing interference remains non-negligible even
with multi-modal inputs, and (2) direct fusion of heterogeneous modalities risks information loss due
to feature discrepancies. To mitigate these limitations, we propose MPCF, a multi-level pose-guided
framework with cross-modal fusion for robust LT-ReID.
        </p>
      <p>Specifically, the MPCF framework consists of three main branches: Shape Embedding (SE),
Cross-Modality Fusion (CMF), and Multi-Level Feature Guidance (MLFG). In the first two branches, SE uses a
weight-frozen pose extractor to extract body shape-related features, capturing structured information
related to identity. CMF then reduces information differences between modalities by cross-modal
aggregation of shape features and global appearance features, preserving more clothing-irrelevant
identity cues. To further minimize interference from residual clothing information in the aggregated
features, MLFG aligns the divergence between multi-level person appearance embeddings and SE’s shape
embeddings. This process not only helps extract pose information at different granularities from person
appearances but also guides CMF to focus more on pose information during cross-modal aggregation,
thereby better reducing the impact of clothing information. In summary, the main contributions of this
paper are as follows:
• We obtain clothing-agnostic human shape embeddings through a frozen pose estimator and
a shape encoder and interact these embeddings with pedestrian appearance in a cross-modal
manner to generate more robust fused features. To further reduce clothing-related interference
in appearance and highlight clothing-agnostic information in features, we use pose information
as supervision to extract fine-grained pose details from raw images;
• We propose an MLFG branch that leverages biological information as supervision. This branch
learns multi-granularity pose information from appearance features at three different levels,
guiding the model to focus more on clothing-agnostic information during cross-modal feature
aggregation and reducing clothing-related interference;
• The effectiveness of our method is demonstrated through extensive experiments on several
cloth-changing benchmark datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Person Re-Identification</title>
        <p>Traditional person re-identification methods primarily target scenarios with short-term appearance
consistency, distinguishing individuals via visual feature extraction. These methods typically rely on
the color, texture, and shape of clothing to characterize persons [6, 7, 8]. In recent years, with the
advancement of deep learning, the field of person re-identification has made significant progress. Most
methods now use deep neural networks to extract both global and local features for precise individual
descriptions [9, 10, 11].</p>
        <p>For example, Zheng et al. [10] employed a multi-class classification loss to learn discriminative global
features by treating each identity as a unique category. However, the abstraction of global features
weakens their sensitivity to subtle differences, particularly for visually similar individuals. To mitigate
this, local feature-driven approaches have emerged, enhancing detail capture through localized regions
or key points. For instance, Herzog et al. [11] designed a multi-branch architecture that combines global
features with local body region features, improving recognition performance from multiple aspects.
Wang et al. [12] proposed a Multiple Granularity Network (MGN) to integrate fine-grained local features
with global features. Additionally, some studies have focused on optimizing similarity measurement
functions [13, 14, 15] to reduce the distance between samples of the same class and increase the distance
between different classes, thereby improving recognition accuracy. However, since clothing often
occupies a large portion of person images, these traditional appearance-based methods overly rely on
extracting clothing information, resulting in significant performance degradation in scenarios involving
long-term clothing changes. This has spurred the rise of research in long-term person re-identification
(LT-ReID).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Long-Term Person Re-Identification</title>
        <p>Unlike traditional person Re-ID, LT-ReID concentrates on scenarios where pedestrian appearances
change over long-term spatio-temporal cycles. Clothing, which is the main part of pedestrian appearance,
poses a significant challenge for identity recognition due to its variability. Many studies have attempted
to address the problems caused by clothing changes. They have tried to bring in biometric attributes
that are not related to clothing to enhance the representation of persons and minimize the interference
from clothing. These biometric attributes include body shape, gait information, and facial features. By
incorporating these attributes, they aim to provide a more comprehensive and stable representation of
individuals, which can help improve the accuracy of identity recognition in LT-ReID scenarios.</p>
        <p>Yang et al. [16] demonstrated the superior reliability of body contour curves over color-based
appearance features under clothing variations. Their CC-ReID framework innovatively employs
contour sketches as auxiliary biometric descriptors, translating anatomical silhouettes into
identity-discriminative embeddings. Chen et al. [17] addressed clothing texture interference through 3D shape
reconstruction, leveraging volumetric human models to capture anthropometric invariants like torso
proportions and limb geometry. Wang et al. [18] developed a cross-modal fusion architecture that
synergizes holistic appearance features with kinematic pose embeddings. By aligning spatiotemporal
patterns of body joints with global representations, their method amplifies clothing-agnostic cues while
suppressing transient apparel artifacts. Liu et al. [19] pioneered feature disentanglement via 3D human
mesh estimation, isolating persistent identity markers (e.g., skeletal structure, joint topology) from
transient non-identity variables like garment shape and dynamic postures. Their dual-path learning
architecture enables parallel extraction of identity-sensitive features (from nude mesh models) and
apparel-dependent features (from clothed RGB inputs). Through adversarial training, the model jointly
optimizes both feature streams, achieving cross-apparel invariance by explicitly decoupling biological
signatures from sartorial noise. This bidirectional learning paradigm not only enhances discrimination
under clothing changes but also mitigates pose-induced feature distortions.</p>
        <p>While existing multi-modal approaches have mitigated clothing dependency in traditional person
re-identification (Re-ID), complete elimination of clothing bias remains a persistent challenge. To address
this limitation, we propose a multi-level pose-guided feature learning framework that synergistically
integrates pose estimation with Re-ID feature extraction. Beyond simply employing pose features as
auxiliary inputs, our hierarchical design establishes explicit guidance mechanisms through progressively
refined pose representations. This architecture compels the model to preserve discriminative
non-appearance attributes, including body geometry and motion patterns, thereby achieving enhanced
robustness in long-term scenarios with clothing variations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>In this section, we introduce our proposed MPCF framework in detail. The framework is mainly
composed of three core branches: the Shape-Embedding (SE) branch, the Cross-Modality Fusion (CMF)
branch, and the Multi-Level Feature Guidance (MLFG) branch, as shown in Fig. 2.</p>
        <p>Specifically, given the person image x ∈ ℝ^{h×w×3}, the SE branch extracts pose features from the
original image and generates embedding information to supervise the MLFG branch. We use ResNet-50
[20] as the backbone to extract the person’s global appearance features. These appearance features are
then aligned and aggregated with the pose features from the SE branch via CMF, producing robust
fused features. The MLFG branch extracts intermediate features from stages 3, 4, and 5 of the backbone
network. Through a series of projection operations, it generates multi-level appearance embeddings,
which are then aligned with the pose embeddings from SE. This alignment process helps guide the
CMF branch during training to focus more on clothing-irrelevant identity information. The following
sections provide a detailed explanation of each branch.</p>
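        <p>To make this data flow concrete, the following minimal PyTorch-style sketch outlines how the three branches could interact in one forward pass. It is an illustration under assumptions, not the released implementation; module names such as shape_encoder, cmf, and projectors are hypothetical.</p>
        <preformat>
import torch

def mpcf_forward(x, pose_estimator, backbone, shape_encoder, cmf, projectors):
    """Illustrative forward pass of the three MPCF branches (hypothetical module names)."""
    # SE branch: the frozen pose estimator produces k heatmaps; the shape encoder
    # turns them into a pooled shape embedding f_s and token features F_s.
    with torch.no_grad():
        heatmaps = pose_estimator(x)                # (B, k, h, w)
    f_s, F_s = shape_encoder(heatmaps)

    # Backbone: intermediate appearance features from stages 3, 4 and 5 of ResNet-50.
    f_res3, f_res4, f_res5 = backbone(x)
    F_a = f_res5.flatten(2).transpose(1, 2)         # texture tokens (B, N, D)

    # CMF branch: align and fuse appearance and shape tokens (used for the ID loss).
    F_fused = cmf(F_a, F_s)

    # MLFG branch: project each stage feature into the shape-embedding space;
    # these embeddings are compared with f_s by the guidance loss during training.
    f_levels = [proj(f) for proj, f in zip(projectors, (f_res3, f_res4, f_res5))]
    return F_fused, f_levels, f_s
        </preformat>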
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Shape-Embedding branch</title>
        <p>To learn clothing-invariant discriminative features, we utilize the semantic information of human body
shape, owing to its stable manifestation across spatio-temporal scenarios and minimal susceptibility to
appearance changes. As shown in Fig. 2, the SE branch consists mainly of two modules: a pose estimator
and a shape encoder. For the pose estimator, we adopt the well-established OpenPose [21] framework
to extract pedestrian pose heatmaps.</p>
        <p>For a given input image x, OpenPose generates k pose heatmaps, each highlighting a key
part of the human body (e.g., face, hands, feet). These heatmaps are then fed into the shape encoder
to produce an overall body-posture feature f_s ∈ ℝ^{1×(h/8)×(w/8)} and body semantic features F_s.</p>
        <p>The structure of the Shape Encoder is depicted in Fig. 3 and includes two branches for processing
the input pose heatmaps. The upper branch transforms the human pose heatmaps into a body shape
feature embedding f_s ∈ ℝ^{1×1152} through a global average pooling layer followed by a fully connected
layer. To enable body shape information to interact more effectively with appearance information in
the cross-modal fusion branch, the lower branch employs a method similar to CAMC [18] for shape
embedding. This branch consists of an image patch embedding module and a multi-head self-attention
module based on ViT [22]. The goal is to capture the relationships between different key points of the
human body. The image patch embedding module processes the heatmap of size h × w by cutting it
into a series of overlapping patches using a sliding window. The stride is denoted as S and the patch
size as P (e.g., 4), resulting in an overlap of (P − S) × P between adjacent patches. In this way,
the entire heatmap is divided into N such patches:</p>
        <p>N = N_h × N_w = ⌊(h + S − P)/S⌋ × ⌊(w + S − P)/S⌋ (1)</p>
        <p>Afterwards, through the self-attention mechanism, the patches are correlated with each other, yielding
more robust semantic features of human shape F_s ∈ ℝ^{288×2048}.</p>
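        <p>A minimal PyTorch sketch of this two-branch encoder is given below. The 18 heatmap channels, the single attention layer, and the patch parameters (P = 4, stride S = 2) are illustrative assumptions; only the output dimensions (1×1152 and 288×2048) follow the text.</p>
        <preformat>
import torch
import torch.nn as nn

class ShapeEncoder(nn.Module):
    """Sketch of the two-branch shape encoder; layer sizes are illustrative."""
    def __init__(self, k=18, embed_dim=2048, patch=4, stride=2, pooled_dim=1152):
        super().__init__()
        # Upper branch: global average pooling + fully connected layer -> f_s.
        self.fc = nn.Linear(k, pooled_dim)
        # Lower branch: overlapping patch embedding (kernel P, stride S) + self-attention -> F_s.
        self.patch_embed = nn.Conv2d(k, embed_dim, kernel_size=patch, stride=stride)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, heatmaps):                      # heatmaps: (B, k, h, w)
        f_s = self.fc(heatmaps.mean(dim=(2, 3)))      # (B, 1152)
        tokens = self.patch_embed(heatmaps)           # (B, 2048, N_h, N_w), N as in Eq. (1)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, N, 2048)
        F_s, _ = self.attn(tokens, tokens, tokens)    # patches attend to one another
        return f_s, F_s
        </preformat>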
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Cross-Modality Fusion branch</title>
        <p>In our approach, we utilize ResNet-50 as the backbone network and set the stride of its fifth convolutional
layer to 1. We extract the intermediate outputs from the third, fourth, and final layers to obtain
multi-scale feature representations. Within this branch, we flatten the output features from the fifth layer to
obtain the texture feature representation F_a ∈ ℝ^{N×D}. To prevent information loss when aggregating
texture features F_a and body shape features F_s from different modalities, we first use a feature alignment
module that concatenates the features from both modalities along the channel dimension, resulting in
F_cat = [F_a, F_s] ∈ ℝ^{N×2D}. Based on the channel attention mechanism [23], this module, which
consists of two fully connected layers forming a bottleneck structure, models the inter-channel
relationships within F_cat and outputs as many weights as there are input channels. We
first reduce the feature dimension to one-fourth of the input (from 2D to D/2), pass it through a ReLU
activation and a second fully connected layer that restores the original dimension, and apply a
sigmoid to obtain normalized weight scores s. These weights s are multiplied channel-wise with both
modal features and summed with the original features to obtain the aligned features F_a′ ∈ ℝ^{N×D}
and F_s′ ∈ ℝ^{N×D}. The overall process can be represented as follows:
s = σ(W_2 ReLU(W_1 F_cat + b_1) + b_2) (2)
F_a′ = s[:, 0:D] ⊗ F_a + F_a (3)
F_s′ = s[:, D:2D] ⊗ F_s + F_s (4)
where W_1 is the weight matrix of the first fully connected layer with dimensions ℝ^{2D×D/2}, b_1 is
its bias vector with dimensions ℝ^{D/2}, W_2 ∈ ℝ^{D/2×2D}, and b_2 ∈ ℝ^{2D}. After aligning features from
both modalities, we use a multi-head cross-modal attention module for adaptive fusion of texture and
morphological semantic features. The queries, keys, and values in the attention block are represented
as:</p>
        <p>Q/K/V = W_{Q/K/V}(Reshape_3(F)) (5)
where F represents the features from either modality, W_{Q/K/V} denotes the corresponding query, key, or value
projection, and Reshape_3 indicates reshaping F into a three-dimensional feature map. To integrate
information across different modalities, we use texture features and body shape information as queries,
with the corresponding body shape features and appearance features serving as keys and values:
F_{s→a} = F_a′ + Reshape_2(Attention(Q_a, K_s, V_s)) (6)
F_{a→s} = F_s′ + Reshape_2(Attention(Q_s, K_a, V_a)) (7)</p>
        <p>This bidirectional access helps texture features to enhance shape features that are
clothing-independent, while body shape features incorporate necessary identity traits, minimizing the
information gap between modalities. The concatenated features F will be utilized to compute the identity
recognition loss.</p>
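        <p>The alignment and fusion steps of Eqs. (2)-(7) can be sketched as follows. This is an illustration under assumptions: the query/key/value projections are realized with nn.MultiheadAttention, and the token dimension D = 2048 follows the text.</p>
        <preformat>
import torch
import torch.nn as nn

class FeatureAlignment(nn.Module):
    """Channel-attention bottleneck over concatenated tokens, cf. Eqs. (2)-(4) (sketch)."""
    def __init__(self, dim=2048):
        super().__init__()
        self.fc1 = nn.Linear(2 * dim, dim // 2)       # 2D -> D/2
        self.fc2 = nn.Linear(dim // 2, 2 * dim)       # D/2 -> 2D
        self.dim = dim

    def forward(self, F_a, F_s):                      # both (B, N, D)
        F_cat = torch.cat([F_a, F_s], dim=-1)         # (B, N, 2D)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(F_cat))))
        F_a_aligned = s[..., :self.dim] * F_a + F_a   # Eq. (3)
        F_s_aligned = s[..., self.dim:] * F_s + F_s   # Eq. (4)
        return F_a_aligned, F_s_aligned

class CrossModalityFusion(nn.Module):
    """Bidirectional cross-attention between modalities, cf. Eqs. (6)-(7) (sketch)."""
    def __init__(self, dim=2048, heads=8):
        super().__init__()
        self.align = FeatureAlignment(dim)
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, F_a, F_s):
        F_a2, F_s2 = self.align(F_a, F_s)
        F_s_to_a, _ = self.attn_a(F_a2, F_s2, F_s2)   # appearance queries, shape keys/values
        F_a_to_s, _ = self.attn_s(F_s2, F_a2, F_a2)   # shape queries, appearance keys/values
        # Concatenated fused feature F used by the identity loss.
        return torch.cat([F_a2 + F_s_to_a, F_s2 + F_a_to_s], dim=-1)
        </preformat>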
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Multi-Level Feature Guidance branch</title>
        <p>Furthermore, to fully leverage the body semantic information embedded in a person’s appearance and
reduce the interference of clothing information, we use pose information as guidance on top of the
cross-modal aggregation of appearance features and body semantic features. This approach steers
the model’s focus towards regions closely related to posture, thereby enhancing recognition accuracy.
Specifically, we align the body shape embeddings f_s obtained from the shape embedding branch with
person feature embeddings. Without compromising other essential information, this highlights the
pose information within person representations, allowing more posture-related details to be retained
during cross-modal feature aggregation. To capture richer original body shape information from
images, we extract three levels of person appearance feature maps f_res3, f_res4, f_res5 from intermediate
layers of the backbone network. These feature maps are then passed through a feature projection layer,
which maps them into a feature space identical to that of the body shape embedding f_s without significantly
harming the original information, forming implicit multi-level person feature embeddings f_3, f_4, f_5.
The projection layer is composed of a linear projection, a Transformer encoder, global pooling, and a
normalization layer to ensure effective feature transformation and integration.</p>
        <p>Ultimately, the person feature embeddings f_i (where i = 3, 4, 5) are combined with the
body shape embedding f_s to jointly compute the guidance loss. To ensure the alignment of information
between the two and to emphasize the pose information within the person feature embeddings, we use
the Kullback–Leibler (KL) divergence as the guidance loss L_guide, which measures the similarity
between the appearance embeddings f_i and the body shape embedding f_s. The lower the value of
L_guide, the more semantically consistent information the model has learned, meaning it can better
capture features related to posture. The overall loss function is formulated as:
L = (1 − λ) L_ID + λ L_guide (8)
where L_ID represents the identification loss based on cross-entropy, with inputs being the
cross-modal aggregated features F and the identity labels, and λ is a fixed value. The L_guide function
can be specifically expressed as:
L_guide = (1/2) Σ_{i∈{3,4,5}} w_i · KL(P_i ‖ Q) (9)
KL(P_i ‖ Q) = Σ_j p_{i,j} log(p_{i,j} / q_j) (10)
where w_i are fixed weights assigned to the three levels.</p>
        <p>In the calculation of the KL divergence, P_i and Q represent two probability distributions, where p_{i,j} and q_j
are the probabilities that these distributions assign to the j-th category, respectively. We obtain probability
vectors for the person feature embeddings and the body shape embedding through normalization, and
then compute the difference between them. The divergence value is divided by 2 to balance the scale of
the loss function.</p>
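        <p>A compact sketch of this objective is given below, assuming softmax normalization of the embeddings and the 5:3:2 level weights reported in Section 4.3; the value λ = 0.1 is a placeholder, not taken from the paper.</p>
        <preformat>
import torch
import torch.nn.functional as F

def guidance_loss(f_levels, f_s, weights=(0.5, 0.3, 0.2)):
    """KL guidance between multi-level appearance embeddings and the shape embedding
    (sketch of Eqs. (9)-(10); normalization choice and KL direction are assumptions)."""
    q = F.softmax(f_s, dim=-1)                         # shape-embedding distribution Q
    loss = f_s.new_zeros(())
    for w, f_i in zip(weights, f_levels):              # stages 3, 4, 5
        log_p = F.log_softmax(f_i, dim=-1)             # appearance distribution P_i (log)
        loss = loss + w * F.kl_div(log_p, q, reduction='batchmean') / 2.0
    return loss

def total_loss(logits, labels, f_levels, f_s, lam=0.1):
    """Overall objective L = (1 - lambda) * L_ID + lambda * L_guide, Eq. (8)."""
    l_id = F.cross_entropy(logits, labels)
    return (1.0 - lam) * l_id + lam * guidance_loss(f_levels, f_s)
        </preformat>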
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>Datasets. To evaluate the effectiveness of our proposed MPCF framework,
we conducted assessments primarily on three widely used long-term clothing-change person Re-ID
datasets, summarized in Table 1: LTCC [5], PRCC [16], and Celeb-reID [24]. LTCC comprises 17,119 images
covering 152 distinct identities and 416 different outfits, with an average of 5 varying outfits
per person and the number of outfit changes ranging from 2 to 14. PRCC includes 33,698 images of 221
individuals captured from three camera views. The training set consists of 150 individuals, while the
test set comprises the remaining 71. During training, 25% of the images from the training set are used as
a validation set. Celeb-reID utilizes street photos of celebrities to address long-term clothing changes.
The dataset contains 34,186 images of 1,052 identities, each with unique clothing, thus presenting a
greater challenge in clothing-change scenarios compared to the previous two datasets.</p>
        <p>Implementation details. Our model is built on the PyTorch framework. We utilized a
ResNet-50 pre-trained on ImageNet [31] as the backbone network to extract texture features of
persons. The dimensions of the multi-level features extracted by the backbone network are 512, 1024,
and 2048, respectively. All training was conducted on
a single NVIDIA RTX 3090 GPU. During both training and testing phases, images were resized to
a uniform size of 384x192. Data augmentation included color jittering, random horizontal flipping,
padding, random cropping, and random erasing [36]. We employed the Adam optimizer [37] for model
optimization and performed 150 training epochs, with a warm-up strategy applied in the first 10 epochs,
gradually increasing the learning rate from 3e-5 to 3e-4. The learning rate was reduced by a factor of 10 at
epochs 40 and 80. For the PRCC and Celeb-reID datasets, the batch size was set to 48, while for the
LTCC dataset it was set to 32, with 4 images per identity. For pose estimation, we used
the OpenPose model pre-trained on the COCO dataset [38], generating 18 heatmaps, and we froze its
weights during the training process.</p>
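        <p>The warm-up and step-decay schedule described above corresponds to the following per-epoch rule (a sketch of the stated schedule; how it is wired into the optimizer is left out):</p>
        <preformat>
def learning_rate(epoch, base_lr=3e-4, warmup_start=3e-5, warmup_epochs=10):
    """Linear warm-up from 3e-5 to 3e-4 over the first 10 epochs, then 10x decays
    at epochs 40 and 80, as described in the implementation details."""
    if epoch >= 80:
        return base_lr * 0.01
    if epoch >= 40:
        return base_lr * 0.1
    if epoch >= warmup_epochs:
        return base_lr
    return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
        </preformat>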
        <p>Evaluation metrics. We employed the two standard metrics commonly used in most
clothing-change Re-ID literature: mean Average Precision (mAP) and the Cumulative Matching Characteristic (CMC).
To ensure a fair comparison with existing studies, we evaluated LTCC and PRCC under both the standard
and the clothing-change settings. Under the standard setting, the test set included both consistent and varied
clothing samples. In the clothing-change setting, the test set exclusively contained samples with varied
outfits.</p>
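        <p>For reference, both metrics can be computed from a query-gallery distance matrix roughly as follows. This is a simplified sketch: the same-camera and same-clothing filtering used by the LTCC/PRCC protocols is omitted, and every query is assumed to have at least one true match in the gallery.</p>
        <preformat>
import numpy as np

def cmc_and_map(dist, q_ids, g_ids, max_rank=10):
    """Simplified CMC / mAP from a (num_query, num_gallery) distance matrix."""
    cmc_curves, average_precisions = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                         # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float64)
        # CMC: 1 from the first correct match onward.
        first_hit = matches.cumsum()
        first_hit[first_hit > 1.0] = 1.0
        cmc_curves.append(first_hit[:max_rank])
        # AP: mean precision at the ranks of the correct matches.
        precision_at_k = matches.cumsum() / (np.arange(matches.size) + 1.0)
        average_precisions.append((precision_at_k * matches).sum() / matches.sum())
    return np.mean(cmc_curves, axis=0), float(np.mean(average_precisions))
        </preformat>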
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Comparisons</title>
        <p>Performance on the LTCC dataset. We evaluated our proposed method on the LTCC dataset and
compared it with baseline models and other state-of-the-art clothing-change person re-identification
approaches, as shown in Table 2. Compared to the baseline model, our model achieved improvements
of +2.0% in mAP and +2.2% in R1 under standard settings. In clothing-change settings, compared to
FSAM [26], although our method slightly underperformed in the mAP metric, it achieved a significant
+2.0% improvement in the R1 metric. Moreover, compared to the second-best performing method LDF
[29], our approach performed better in both the mAP and R1 metrics under both settings. It also surpassed the
MBUNet [27] method, which had the second-best R1 performance in the clothing-change setting.</p>
        <p>Performance on the PRCC dataset. We also assessed our proposed method on the PRCC dataset,
with results shown in Table 3. It is noteworthy that the original baseline
model was not evaluated on this dataset. We faithfully reproduced the experimental results by strictly
adhering to the implementation protocols outlined in the original paper. It can be observed that under
the clothing-change setting, our method significantly outperforms the baseline model on both the R1
and mAP metrics, with improvements of +4.1% and +3.7%, respectively. Although the baseline model
integrates clothing-agnostic pose information into the person identity representation and minimizes the
information discrepancy between appearance texture and pose features as much as possible, it is still
inevitably affected by clothing information. Our method, however, with multi-level pose information
supervision, can further reduce clothing noise. Other comparative results indicate that our method
achieves results comparable with other advanced approaches.</p>
        <p>Performance on the Celeb-reID dataset. Compared to the first two datasets, Celeb-reID is larger
and more challenging, with images captured from uncontrolled street snapshots without any clothing
annotations.</p>
        <p>As shown in Table 4, all advanced methods perform relatively poorly. Competitors such as FSAM
[26] and MBUNet [27] have not reported results on this dataset. Our method, MPCF, achieved
62.7%, 77.3%, and 16.1% in the R1, R5, and mAP metrics, respectively. Compared
to the baseline model, our method improved significantly, by +5.2% in R1 and +3.8% in mAP. When
compared to the second-best performing method, 3DInvarReID [19], our method improved by +0.9% in
mAP and +1.5% in R1.</p>
        <p>The performance results across the three datasets demonstrate that our approach helps person
re-identification models prioritize pose information over clothing during training, effectively addressing
the challenge of clothing changes.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>Component Analysis. To demonstrate the effectiveness of our approach, we evaluated the multi-level
pose guidance and the effectiveness of the two branches, SE and CMF, on the LTCC dataset under the
standard setting and compared them with the baseline model. The results are shown in Table 5.</p>
        <p>In single-level guidance, the pose guidance at stage 5 showed the most significant improvement over
the baseline model. The guidance at stages 3 and 4 resulted in slight
increases in mAP, but there was no noticeable improvement in the Rank metrics, and even a decrease
was observed. When combining two levels of guidance, the joint pose guidance at stages 4 and 5
performed the best, while the other two combinations improved mAP but did not perform well on the Rank
metrics. Ultimately, our method integrates guidance across all three levels; after experimentation, the
optimal weight ratio for the three levels in the guidance loss was found to be 5:3:2, achieving the best
overall performance. Compared to other methods, our approach achieved the best results in both the R1 and
mAP metrics. This also confirms the effectiveness of using multi-level guidance for pose information.</p>
        <p>To show that our framework is effective, we conducted ablation studies on its branches. Since all branches use
pedestrian pose features, removing the SE branch leaves only the ResNet-50 backbone working. This
leads to much worse performance on the LTCC dataset, as shown in Table 5. If we remove the CMF
branch, the model loses key information due to the difference between pose and appearance features, harming
performance. The final MPCF results prove that the CMF branch’s cross-modal fusion is necessary.</p>
        <p>Computational Complexity Analysis. We systematically evaluated the impact of adding three
levels of pose guidance components on the model under the PRCC dataset’s cloth-changing setting,
focusing on changes in computational cost and performance improvements. The results are shown in
Table 6.</p>
        <p>Experiments show that the introduction of a single-level pose guidance component leads to a
significant increase in the training parameters (Params) of the model, but the increase in the computational
time complexity (FLOPs) of the model is small. This is mainly because the projection module in the
MLFG branch uses fully connected layers and Transformer encoders, which add parameters but have
relatively low computational complexity. Furthermore, our MPCF framework integrates all three levels
of pose guidance components. Compared to the baseline model, while Params increased by 25%, the
performance metrics showed significant improvements: Rank-1 improved by +4.1%, and mAP improved by
+3.7%. Meanwhile, the increase in FLOPs remained small, indicating that the computational complexity
did not rise significantly.</p>
        <p>This design shows that our method can achieve significant performance improvements with limited
computational cost, proving that these additions are worthwhile.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Visualization of retrieval results</title>
        <p>Our proposed method integrates multi-modal feature fusion and multi-level pose guidance to better
address the challenges of person re-identification in long-term clothing-change scenarios. To visually
demonstrate this, we visualized the top-10 retrieval results of the baseline model CAMC and
our method on the LTCC dataset under the clothing-change setting, as shown in Fig. 4.</p>
        <p>Our proposed model significantly reduces the dependency on clothing information during the
identification process. As shown in the first row of Fig. 4(a), the baseline model’s matching results
mostly display persons with similar clothing but different identities compared to the query image. In
contrast, as depicted in the second row of Fig. 4(a), our method’s matching results still effectively
identify the correct person identities even in clothing-change scenarios where there may be similarities
between samples of different categories. Additionally, as demonstrated in the results of Fig. 4(b), due
to the interference of clothing information, the top retrieval results in the first row are images with
similar clothing textures and colors. However, thanks to the multi-level pose guidance in our approach,
the model focuses more on body shape information that is independent of clothing. Consequently, the
second row of results shows that even when the queried person is wearing different clothing, our
model can still achieve more robust person identity representations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>To mitigate information interference caused by long-term and cross-scenario appearance variations in
persons, this paper proposes a Multi-Level Pose-Guidance with Cross-Modality Fusion framework for Long-Term
Spatio-Temporal Re-ID (MPCF). Specifically, we introduce additional-modality human pose feature
embeddings through an SE branch, supplementing identity information independent of clothing. Then,
a CMF branch reduces the modality gap between person appearance features and pose features,
preventing the loss of key information across modalities when aggregating clothing-independent features.
Furthermore, to further reduce the model’s focus on clothing information during the aggregation
process, we propose an MLFG branch that uses multi-level person pose embeddings as guidance, compelling
the model to concentrate attention on clothing-independent information areas and ensuring that aggregated
features include more clothing-independent, distinctive identity information. Our proposed method
has been extensively tested on multiple datasets, validating its effectiveness.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This paper is supported in part by the National Natural Science Foundation of China under Grant
62376231, the Sichuan Science and Technology Program 24NSFSC1070, and the Sichuan Education
Informatization and Big Data Center (Sichuan Audio-visual Education Hall) 2024KTPSLX001.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[3] M. Liu, Z. Ma, T. Li, Y. Jiang, K. Wang, Long-term person re-identification with dramatic appearance
change: Algorithm and benchmark, in: Proceedings of the 30th ACM International Conference on
Multimedia, 2022, pp. 6406–6415.
[4] Y. Xian, J. Yang, F. Yu, J. Zhang, X. Sun, Graph-based self-learning for robust person re-identification,
in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023,
pp. 4789–4798.
[5] X. Qian, W. Wang, L. Zhang, F. Zhu, Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, Long-term cloth-changing
person re-identification, in: Proceedings of the Asian Conference on Computer Vision, 2020.
[6] W. Li, R. Zhao, T. Xiao, X. Wang, Deepreid: Deep filter pairing neural network for person
re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition,
2014, pp. 152–159.
[7] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, S. Z. Li, Salient color names for person re-identification, in:
Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12,
2014, Proceedings, Part I 13, Springer, 2014, pp. 536–551.
[8] O. Oreifej, R. Mehran, M. Shah, Human identity recognition in aerial images, in: 2010 IEEE
computer society conference on computer vision and pattern recognition, IEEE, 2010, pp. 709–716.
[9] R. R. Varior, B. Shuai, J. Lu, D. Xu, G. Wang, A siamese long short-term memory architecture for
human re-identification, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam,
The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, Springer, 2016, pp. 135–153.
[10] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian, Person re-identification in the wild,
in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.
1367–1376.
[11] F. Herzog, X. Ji, T. Teepe, S. Hörmann, J. Gilg, G. Rigoll, Lightweight multi-branch network for
person re-identification, in: 2021 IEEE international conference on image processing (ICIP), IEEE,
2021, pp. 1129–1133.
[12] G. Wang, Y. Yuan, X. Chen, J. Li, X. Zhou, Learning discriminative features with multiple
granularities for person re-identification, in: Proceedings of the 26th ACM international conference on
Multimedia, 2018, pp. 274–282.
[13] Y. Suh, J. Wang, S. Tang, T. Mei, K. M. Lee, Part-aligned bilinear representations for person
re-identification, in: Proceedings of the European conference on computer vision (ECCV), 2018,
pp. 402–419.
[14] F. Yang, K. Yan, S. Lu, H. Jia, X. Xie, W. Gao, Attention driven person re-identification, Pattern
Recognition 86 (2019) 143–155.
[15] L. Zhao, X. Li, Y. Zhuang, J. Wang, Deeply-learned part-aligned representations for person
re-identification, in: Proceedings of the IEEE international conference on computer vision, 2017, pp.
3219–3228.
[16] Q. Yang, A. Wu, W.-S. Zheng, Person re-identification by contour sketch under moderate clothing
change, IEEE transactions on pattern analysis and machine intelligence 43 (2019) 2029–2046.
[17] J. Chen, X. Jiang, F. Wang, J. Zhang, F. Zheng, X. Sun, W.-S. Zheng, Learning 3d shape feature
for texture-insensitive person re-identification, in: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2021, pp. 8146–8155.
[18] Q. Wang, X. Qian, Y. Fu, X. Xue, Co-attention aligned mutual cross-attention for cloth-changing
person re-identification, in: Proceedings of the Asian Conference on Computer Vision, 2022, pp.
2270–2288.
[19] F. Liu, M. Kim, Z. Gu, A. Jain, X. Liu, Learning clothing and pose invariant 3d shape representation
for long-term person re-identification, in: Proceedings of the IEEE/CVF International Conference
on Computer Vision, 2023, pp. 19617–19626.
[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[21] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2d pose estimation using part affinity
fields, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017,
pp. 7291–7299.
[22] A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv
preprint arXiv: 2010.11929 (2020).
[23] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference
on computer vision and pattern recognition, 2018, pp. 7132–7141.
[24] Y. Huang, Q. Wu, J. Xu, Y. Zhong, Celebrities-reid: A benchmark for clothes variation in long-term
person re-identification, in: 2019 International Joint Conference on Neural Networks (IJCNN),
IEEE, 2019, pp. 1–8.
[25] Y. Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, Beyond part models: Person retrieval with refined
part pooling (and a strong convolutional baseline), in: Proceedings of the European conference on
computer vision (ECCV), 2018, pp. 480–496.
[26] P. Hong, T. Wu, A. Wu, X. Han, W.-S. Zheng, Fine-grained shape-appearance mutual learning for
cloth-changing person re-identification, in: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2021, pp. 10513–10522.
[27] G. Zhang, J. Liu, Y. Chen, Y. Zheng, H. Zhang, Multi-biometric unified network for cloth-changing
person re-identification, IEEE Transactions on Image Processing 32 (2023) 4555–4566.
[28] Z. Yang, X. Zhong, Z. Zhong, H. Liu, Z. Wang, S. Satoh, Win-win by competition:
Auxiliary-free cloth-changing person re-identification, IEEE Transactions on Image Processing 32 (2023)
2985–2999.
[29] P. P. Chan, X. Hu, H. Song, P. Peng, K. Chen, Learning disentangled features for person
re-identification under clothes changing, ACM Transactions on Multimedia Computing,
Communications and Applications 19 (2023) 1–21.
[30] M. Li, S. Cheng, P. Xu, X. Zhu, C.-G. Li, J. Guo, Unsupervised long-term person re-identification
with clothes change, in: 2023 8th IEEE International Conference on Network Intelligence and
Digital Content (IC-NIDC), IEEE, 2023, pp. 167–171.
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of
computer vision 115 (2015) 211–252.
[32] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, X. Chen, Interaction-and-aggregation network for person
re-identification, in: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, 2019, pp. 9317–9326.
[33] C. Yan, G. Pang, J. Jiao, X. Bai, X. Feng, C. Shen, Occluded person re-identification with single-scale
global representations, in: Proceedings of the IEEE/CVF international conference on computer
vision, 2021, pp. 11875–11884.
[34] W. Xu, H. Liu, W. Shi, Z. Miao, Z. Lu, F. Chen, Adversarial feature disentanglement for long-term
person re-identification., in: IJCAI, 2021, pp. 1201–1207.
[35] Y. Yan, H. Yu, S. Li, Z. Lu, J. He, H. Zhang, R. Wang, Weakening the influence of clothing: Universal
clothing attribute disentanglement for person re-identification., in: IJCAI, 2022, pp. 1523–1529.
[36] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in: Proceedings
of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 13001–13008.
[37] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft
coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference,
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.
[39] S. Yang, B. Kang, Y. Lee, Sampling agnostic feature representation for long-term person
re-identification, IEEE Transactions on Image Processing 31 (2022) 6412–6423.
[40] J. Wu, Y. Huang, M. Gao, Z. Gao, J. Zhao, H. Zhang, A. Zhang, A two-stream hybrid
convolution-transformer network architecture for clothing-change person re-identification, IEEE Transactions
on Multimedia (2023).
[41] Y. Huang, Q. Wu, Z. Zhang, C. Shan, Y. Zhong, L. Wang, Meta clothing status calibration for
long-term person re-identification, IEEE Transactions on Image Processing (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Patching your clothes: Semantic-aware learning for cloth-changed person re-identification</article-title>
          , in: International Conference on Multimedia Modeling, Springer,
          <year>2022</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <article-title>Clothchanging person re-identification from a single image with gait prediction and regularization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>14278</fpage>
          -
          <lpage>14287</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>