<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Level Pose-Guidance with Cross-Modality Fusion for Long-Term Spatio-Temporal Person Re-Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qingyuan Deng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Keyu Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jindan Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoning Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinxin Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shihai He</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Feng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, Sichuan Normal University</institution>
          ,
          <addr-line>Chengdu 610066</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer and Software, Chengdu Jincheng College</institution>
          ,
          <addr-line>Chengdu, 611731</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sichuan Mineral Electromechanic Technician College</institution>
          ,
          <addr-line>Chengdu, 610503</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Person re-identification (Re-ID) is an important visual task related to surveillance security, aimed at enhancing the tracking of the same individual across spatio-temporal regions. Traditional Re-ID methods predominantly depend on extracting garment-dominated texture features from global appearance representations. This inherent clothing bias leads to performance degradation in long-term spatio-temporal scenarios where appearance consistency cannot be guaranteed (e.g., clothing changes). In recent years, research on clothing changes in long-term scenarios has gained increasing attention. Although most existing clothing-change Re-ID methods attempt to learn distinctive identity features of individuals (e.g., posture features), they are still subject to interference from clothing information. To mitigate this impact, this paper introduces a Multi-Level Pose-Guidance with Cross-Modality Fusion (MPCF) framework for clothing-change person re-identification. The framework consists of three main components: a Shape Embedding (SE) branch, a Cross-Modality Fusion (CMF) branch, and a Multi-Level Feature Guidance (MLFG) branch. The MLFG branch, in conjunction with the SE branch, helps the CMF branch learn more human pose information during the inference stage. We demonstrate the effectiveness of this method through extensive experiments and achieve excellent performance on several clothing-change Re-ID benchmarks.</p>
      </abstract>
      <kwd-group>
        <kwd>Person re-identification</kwd>
        <kwd>Cross-temporal-spatial person tracking</kwd>
        <kwd>Long-Term scenarios</kwd>
        <kwd>Computer vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Person re-identification (Re-ID) is an important automated person retrieval technology in video
surveillance systems. It aims to connect the movement trajectories of individuals across different
spatio-temporal regions, facilitating person tracking across time, locations, and devices. This technology holds
significant research value for public safety. Over the past decade, traditional person
Re-ID has been extensively researched, but few models have been deployed in practical applications.
This is because information in real-world spatio-temporal scenarios is complex and dynamic, and
multiple factors constrain model performance. One of the key factors affecting re-identification performance
is a change in a person's clothing.</p>
      <p>In real-life scenarios, persons may change their clothes for various reasons, such as weather changes,
personal preferences, or specific occasion requirements. These changes not only alter the appearance
of persons but also increase the instability of their identity features, posing a significant challenge
to traditional appearance-based Re-ID systems. Traditional Re-ID methods typically rely on shallow
features such as color, texture, and shape; these features frequently exhibit instability and limited
robustness in long-term scenarios.</p>
      <p>As shown in Fig. 1, the same person wearing different clothes across different spatio-temporal scenarios
exhibits significant appearance feature discrepancies. Conversely, different individuals dressed in
similar clothing show excessively similar texture information. Therefore, relying solely on appearance
information to address long-term problems is infeasible.</p>
      <p>[Fig. 1: illustration of clothing change, similar clothing, and appearance entanglement: texture-feature differences increase the distance between samples of the same identity, while texture-feature similarity decreases the distance between different identities.]</p>
      <p>To address the challenge of clothing changes in long-term scenarios, recent research focuses on
learning clothing-agnostic identity features. While some methods [1, 2] directly decouple identity cues from raw images,
this often results in incomplete feature learning due to the absence of multi-modal guidance. Others
exploit biometric traits (e.g., body shape) as stable identity markers, yet their extraction from RGB
images remains non-trivial. Consequently, auxiliary modalities are widely adopted: pose estimation
[3, 4], gait recognition [2], and human keypoints/sketches [5] have been integrated to reduce clothing
dependency. However, two critical issues persist: (1) clothing interference remains non-negligible even
with multi-modal inputs, and (2) direct fusion of heterogeneous modalities risks information loss due
to feature discrepancies. To mitigate these limitations, we propose MPCF, a multi-level pose-guided
framework with cross-modal fusion for robust LT-ReID.
        </p>
      <p>Specifically, the MPCF framework consists of three main branches: Shape Embedding (SE),
Cross-Modality Fusion (CMF), and Multi-Level Feature Guidance (MLFG). In the first two branches, SE uses a
weight-frozen pose extractor to extract body shape-related features, capturing structured information
related to identity. CMF then reduces information differences between modalities by cross-modal
aggregation of shape features and global appearance features, preserving more clothing-irrelevant
identity cues. To further minimize interference from residual clothing information in the aggregated
features, MLFG aligns the divergence between multi-level person appearance embeddings and SE’s shape
embeddings. This process not only helps extract pose information at different granularities from person
appearances but also guides CMF to focus more on pose information during cross-modal aggregation,
thereby better reducing the impact of clothing information. In summary, the main contributions of this
paper are as follows:
• We obtain clothing-agnostic human shape embeddings through a frozen pose estimator and
a shape encoder and interact these embeddings with pedestrian appearance in a cross-modal
manner to generate more robust fused features. To further reduce clothing-related interference
in appearance and highlight clothing-agnostic information in features, we use pose information
as supervision to extract fine-grained pose details from raw images;
• We propose an MLFG branch that leverages biological information as supervision. This branch
learns multi-granularity pose information from appearance features at three different levels,
guiding the model to focus more on clothing-agnostic information during cross-modal feature
aggregation and reducing clothing-related interference;
• The effectiveness of our method is demonstrated through extensive experiments on several
cloth-changing benchmark datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Person Re-Identification</title>
        <p>Traditional person re-identification methods primarily target scenarios with short-term appearance
consistency, distinguishing individuals via visual feature extraction. These methods typically rely on
the color, texture, and shape of clothing to characterize persons [6, 7, 8]. In recent years, with the
advancement of deep learning, the field of person re-identification has made significant progress. Most
methods now use deep neural networks to extract both global and local features for precise individual
descriptions [9, 10, 11].</p>
        <p>For example, Zheng et al. [10] employed a multi-class classification loss to learn discriminative global
features by treating each identity as a unique category. However, the abstraction of global features
weakens their sensitivity to subtle differences, particularly for visually similar individuals. To mitigate
this, local feature-driven approaches have emerged, enhancing detail capture through localized regions
or key points. For instance, Herzog et al. [11] designed a multi-branch architecture that combines global
features with local body region features, improving recognition performance from multiple aspects.
Wang et al. [12] proposed a Multiple Granularity Network (MGN) to integrate fine-grained local features
with global features. Additionally, some studies have focused on optimizing similarity measurement
functions [13, 14, 15] to reduce the distance between samples of the same class and increase the distance
between different classes, thereby improving recognition accuracy. However, since clothing often
occupies a large portion of person images, these traditional appearance-based methods overly rely on
extracting clothing information, resulting in significant performance degradation in scenarios involving
long-term clothing changes. This has spurred the rise of research in long-term person re-identification
(LT-ReID).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Long-Term Person Re-Identification</title>
        <p>Unlike traditional person Re-ID, LT-ReID concentrates on scenarios where pedestrian appearances
change over long-term spatio-temporal cycles. Clothing, which is the main part of pedestrian appearance,
poses a significant challenge for identity recognition due to its variability. Many studies have attempted
to address the problems caused by clothing changes. They have tried to bring in biometric attributes
that are not related to clothing to enhance the representation of persons and minimize the interference
from clothing. These biometric attributes include body shape, gait information, and facial features. By
incorporating these attributes, they aim to provide a more comprehensive and stable representation of
individuals, which can help improve the accuracy of identity recognition in LT-ReID scenarios.</p>
        <p>Yang et al. [16] demonstrated the superior reliability of body contour curves over color-based
appearance features under clothing variations. Their CC-ReID framework innovatively employs
contour sketches as auxiliary biometric descriptors, translating anatomical silhouettes into
identity-discriminative embeddings. Chen et al. [17] addressed clothing texture interference through 3D shape
reconstruction, leveraging volumetric human models to capture anthropometric invariants like torso
proportions and limb geometry. Wang et al. [18] developed a cross-modal fusion architecture that
synergizes holistic appearance features with kinematic pose embeddings. By aligning spatiotemporal
patterns of body joints with global representations, their method amplifies clothing-agnostic cues while
suppressing transient apparel artifacts. Liu et al. [19] pioneered feature disentanglement via 3D human
mesh estimation, isolating persistent identity markers (e.g., skeletal structure, joint topology) from
transient non-identity variables like garment shape and dynamic postures. Their dual-path learning
architecture enables parallel extraction of identity-sensitive features (from nude mesh models) and
apparel-dependent features (from clothed RGB inputs). Through adversarial training, the model jointly
optimizes both feature streams, achieving cross-apparel invariance by explicitly decoupling biological
signatures from sartorial noise. This bidirectional learning paradigm not only enhances discrimination
under clothing changes but also mitigates pose-induced feature distortions.</p>
        <p>While existing multi-modal approaches have mitigated clothing dependency in traditional person
re-identification (Re-ID), complete elimination of clothing bias remains a persistent challenge. To address
this limitation, we propose a multi-level pose-guided feature learning framework that synergistically
integrates pose estimation with Re-ID feature extraction. Beyond simply employing pose features as
auxiliary inputs, our hierarchical design establishes explicit guidance mechanisms through progressively
refined pose representations. This architecture compels the model to preserve discriminative
non-appearance attributes, including body geometry and motion patterns, thereby achieving enhanced
robustness in long-term scenarios with clothing variations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>In this section, we introduce our proposed MPCF framework in detail. The framework is mainly
composed of three core branches: the Shape-Embedding (SE) branch, the Cross-Modality Fusion (CMF)
branch, and the Multi-Level Feature Guidance (MLFG) branch, as shown in Fig. 2.</p>
        <p>Specifically, given the person image x ∈ ℝ^{h×w×3}, the SE branch extracts pose features from the
original image and generates embedding information to supervise the MLFG branch. We use ResNet-50
[20] as the backbone to extract the person’s global appearance features. These appearance features are
then aligned and aggregated with the pose features from the SE branch via CMF, producing robust
fused features. The MLFG branch extracts intermediate features from stages 3, 4, and 5 of the backbone
network. Through a series of projection operations, it generates multi-level appearance embeddings,
which are then aligned with the pose embeddings from SE. This alignment process helps guide the
CMF branch during training to focus more on clothing-irrelevant identity information. The following
sections provide a detailed explanation of each branch.</p>
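        <p>To make this data flow concrete, the following minimal PyTorch-style sketch outlines how the three branches could interact in one forward pass. It is an illustration under assumptions, not the released implementation; module names such as shape_encoder, cmf, and projectors are hypothetical.</p>
        <preformat>
import torch

def mpcf_forward(x, pose_estimator, backbone, shape_encoder, cmf, projectors):
    """Illustrative forward pass of the three MPCF branches (hypothetical module names)."""
    # SE branch: the frozen pose estimator produces k heatmaps; the shape encoder
    # turns them into a pooled shape embedding f_s and token features F_s.
    with torch.no_grad():
        heatmaps = pose_estimator(x)                # (B, k, h, w)
    f_s, F_s = shape_encoder(heatmaps)

    # Backbone: intermediate appearance features from stages 3, 4 and 5 of ResNet-50.
    f_res3, f_res4, f_res5 = backbone(x)
    F_a = f_res5.flatten(2).transpose(1, 2)         # texture tokens (B, N, D)

    # CMF branch: align and fuse appearance and shape tokens (used for the ID loss).
    F_fused = cmf(F_a, F_s)

    # MLFG branch: project each stage feature into the shape-embedding space;
    # these embeddings are compared with f_s by the guidance loss during training.
    f_levels = [proj(f) for proj, f in zip(projectors, (f_res3, f_res4, f_res5))]
    return F_fused, f_levels, f_s
        </preformat>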
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Shape-Embedding branch</title>
        <p>To learn clothing-invariant discriminative features, we utilize the semantic information of human body
shape, owing to its stable manifestation across spatio-temporal scenarios and minimal susceptibility to
appearance changes. As shown in Fig. 2, the SE branch consists mainly of two modules: a pose estimator
and a shape encoder. For the pose estimator, we adopt the well-established OpenPose [21] framework
to extract pedestrian pose heatmaps.</p>
        <p>For a given input image x, OpenPose generates k pose heatmaps, each highlighting a key
part of the human body (e.g., face, hands, feet). These heatmaps are then fed into the shape encoder
to produce an overall body-posture feature f_s ∈ ℝ^{1×(h/8)×(w/8)} and body semantic features F_s.</p>
        <p>The structure of the Shape Encoder is depicted in Fig. 3 and includes two branches for processing
the input pose heatmaps. The upper branch transforms the human pose heatmaps into a body shape
feature embedding f_s ∈ ℝ^{1×1152} through a global average pooling layer followed by a fully connected
layer. To enable body shape information to interact more effectively with appearance information in
the cross-modal fusion branch, the lower branch employs a method similar to CAMC [18] for shape
embedding. This branch consists of an image patch embedding module and a multi-head self-attention
module based on ViT [22]. The goal is to capture the relationships between different key points of the
human body. The image patch embedding module processes the heatmap of size h × w by cutting it
into a series of overlapping patches using a sliding window. The stride is denoted as S and the patch
size as P (e.g., 4), resulting in an overlap of (P − S) × P between adjacent patches. In this way,
the entire heatmap is divided into N such patches:</p>
        <p>N = N_h × N_w = ⌊(h + S − P)/S⌋ × ⌊(w + S − P)/S⌋ (1)</p>
        <p>Afterwards, through the self-attention mechanism, the patches are correlated with each other, yielding
more robust semantic features of human shape F_s ∈ ℝ^{288×2048}.</p>
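        <p>A minimal PyTorch sketch of this two-branch encoder is given below. The 18 heatmap channels, the single attention layer, and the patch parameters (P = 4, stride S = 2) are illustrative assumptions; only the output dimensions (1×1152 and 288×2048) follow the text.</p>
        <preformat>
import torch
import torch.nn as nn

class ShapeEncoder(nn.Module):
    """Sketch of the two-branch shape encoder; layer sizes are illustrative."""
    def __init__(self, k=18, embed_dim=2048, patch=4, stride=2, pooled_dim=1152):
        super().__init__()
        # Upper branch: global average pooling + fully connected layer -> f_s.
        self.fc = nn.Linear(k, pooled_dim)
        # Lower branch: overlapping patch embedding (kernel P, stride S) + self-attention -> F_s.
        self.patch_embed = nn.Conv2d(k, embed_dim, kernel_size=patch, stride=stride)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, heatmaps):                      # heatmaps: (B, k, h, w)
        f_s = self.fc(heatmaps.mean(dim=(2, 3)))      # (B, 1152)
        tokens = self.patch_embed(heatmaps)           # (B, 2048, N_h, N_w), N as in Eq. (1)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, N, 2048)
        F_s, _ = self.attn(tokens, tokens, tokens)    # patches attend to one another
        return f_s, F_s
        </preformat>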
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Cross-Modality Fusion branch</title>
        <p>In our approach, we utilize ResNet-50 as the backbone network and set the stride of its fifth convolutional
layer to 1. We extract the intermediate outputs from the third, fourth, and final layers to obtain
multi-scale feature representations. Within this branch, we flatten the output features from the fifth layer to
obtain the texture feature representation F_a ∈ ℝ^{N×D}. To prevent information loss when aggregating
texture features F_a and body shape features F_s from different modalities, we first use a feature alignment
module that concatenates the features from both modalities along the channel dimension, resulting in
F_cat = [F_a, F_s] ∈ ℝ^{N×2D}. Based on the channel attention mechanism [23], this module, which
consists of two fully connected layers forming a bottleneck structure, models the inter-channel
relationships within F_cat and outputs as many weights as there are input channels. We
first reduce the feature dimension to one-fourth of the input (from 2D to D/2), pass it through a ReLU
activation and a second fully connected layer that restores the original dimension, and apply a
sigmoid to obtain normalized weight scores s. These weights s are multiplied channel-wise with both
modal features and summed with the original features to obtain the aligned features F_a′ ∈ ℝ^{N×D}
and F_s′ ∈ ℝ^{N×D}. The overall process can be represented as follows:
s = σ(W_2 ReLU(W_1 F_cat + b_1) + b_2) (2)
F_a′ = s[:, 0:D] ⊗ F_a + F_a (3)
F_s′ = s[:, D:2D] ⊗ F_s + F_s (4)
where W_1 is the weight matrix of the first fully connected layer with dimensions ℝ^{2D×D/2}, b_1 is
its bias vector with dimensions ℝ^{D/2}, W_2 ∈ ℝ^{D/2×2D}, and b_2 ∈ ℝ^{2D}. After aligning features from
both modalities, we use a multi-head cross-modal attention module for adaptive fusion of texture and
morphological semantic features. The queries, keys, and values in the attention block are represented
as:</p>
        <p>Q/K/V = W_{Q/K/V}(Reshape_3(F)) (5)
where F represents the features from either modality, W_{Q/K/V} denotes the corresponding query, key, or value
projection, and Reshape_3 indicates reshaping F into a three-dimensional feature map. To integrate
information across different modalities, we use texture features and body shape information as queries,
with the corresponding body shape features and appearance features serving as keys and values:
F_{s→a} = F_a′ + Reshape_2(Attention(Q_a, K_s, V_s)) (6)
F_{a→s} = F_s′ + Reshape_2(Attention(Q_s, K_a, V_a)) (7)</p>
        <p>This bidirectional access helps texture features to enhance shape features that are
clothing-independent, while body shape features incorporate necessary identity traits, minimizing the
information gap between modalities. The concatenated features F will be utilized to compute the identity
recognition loss.</p>
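        <p>The alignment and fusion steps of Eqs. (2)-(7) can be sketched as follows. This is an illustration under assumptions: the query/key/value projections are realized with nn.MultiheadAttention, and the token dimension D = 2048 follows the text.</p>
        <preformat>
import torch
import torch.nn as nn

class FeatureAlignment(nn.Module):
    """Channel-attention bottleneck over concatenated tokens, cf. Eqs. (2)-(4) (sketch)."""
    def __init__(self, dim=2048):
        super().__init__()
        self.fc1 = nn.Linear(2 * dim, dim // 2)       # 2D -> D/2
        self.fc2 = nn.Linear(dim // 2, 2 * dim)       # D/2 -> 2D
        self.dim = dim

    def forward(self, F_a, F_s):                      # both (B, N, D)
        F_cat = torch.cat([F_a, F_s], dim=-1)         # (B, N, 2D)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(F_cat))))
        F_a_aligned = s[..., :self.dim] * F_a + F_a   # Eq. (3)
        F_s_aligned = s[..., self.dim:] * F_s + F_s   # Eq. (4)
        return F_a_aligned, F_s_aligned

class CrossModalityFusion(nn.Module):
    """Bidirectional cross-attention between modalities, cf. Eqs. (6)-(7) (sketch)."""
    def __init__(self, dim=2048, heads=8):
        super().__init__()
        self.align = FeatureAlignment(dim)
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, F_a, F_s):
        F_a2, F_s2 = self.align(F_a, F_s)
        F_s_to_a, _ = self.attn_a(F_a2, F_s2, F_s2)   # appearance queries, shape keys/values
        F_a_to_s, _ = self.attn_s(F_s2, F_a2, F_a2)   # shape queries, appearance keys/values
        # Concatenated fused feature F used by the identity loss.
        return torch.cat([F_a2 + F_s_to_a, F_s2 + F_a_to_s], dim=-1)
        </preformat>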
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Multi-Level Feature Guidance branch</title>
        <p>Furthermore, to fully leverage the body semantic information embedded in a person’s appearance and
reduce the interference of clothing information, we use pose information as guidance on top of the
cross-modal aggregation of appearance features and body semantic features. This approach steers
the model’s focus towards regions closely related to posture, thereby enhancing recognition accuracy.
Specifically, we align the body shape embeddings f_s obtained from the shape embedding branch with
person feature embeddings. Without compromising other essential information, this highlights the
pose information within person representations, allowing more posture-related details to be retained
during cross-modal feature aggregation. To capture richer original body shape information from
images, we extract three levels of person appearance feature maps f_res3, f_res4, f_res5 from intermediate
layers of the backbone network. These feature maps are then passed through a feature projection layer,
which maps them into a feature space identical to that of the body shape embedding f_s without significantly
harming the original information, forming implicit multi-level person feature embeddings f_3, f_4, f_5.
The projection layer is composed of a linear projection, a Transformer encoder, global pooling, and a
normalization layer to ensure effective feature transformation and integration.</p>
        <p>Ultimately, the person feature embeddings f_i (where i = 3, 4, 5) are combined with the
body shape embedding f_s to jointly compute the guidance loss. To ensure the alignment of information
between the two and to emphasize the pose information within the person feature embeddings, we use
the Kullback–Leibler (KL) divergence as the guidance loss L_guide, which measures the similarity
between the appearance embeddings f_i and the body shape embedding f_s. The lower the value of
L_guide, the more semantically consistent information the model has learned, meaning it can better
capture features related to posture. The overall loss function is formulated as:
L = (1 − λ) L_ID + λ L_guide (8)
where L_ID represents the identification loss based on cross-entropy, with inputs being the
cross-modal aggregated features F and the identity labels, and λ is a fixed value. The L_guide function
can be specifically expressed as:
L_guide = (1/2) Σ_{i∈{3,4,5}} w_i · KL(P_i ‖ Q) (9)
KL(P_i ‖ Q) = Σ_j p_{i,j} log(p_{i,j} / q_j) (10)
where w_i are fixed weights assigned to the three levels.</p>
        <p>In the calculation of the KL divergence, P_i and Q represent two probability distributions, where p_{i,j} and q_j
are the probabilities that these distributions assign to the j-th category, respectively. We obtain probability
vectors for the person feature embeddings and the body shape embedding through normalization, and
then compute the difference between them. The divergence value is divided by 2 to balance the scale of
the loss function.</p>
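        <p>A compact sketch of this objective is given below, assuming softmax normalization of the embeddings and the 5:3:2 level weights reported in Section 4.3; the value λ = 0.1 is a placeholder, not taken from the paper.</p>
        <preformat>
import torch
import torch.nn.functional as F

def guidance_loss(f_levels, f_s, weights=(0.5, 0.3, 0.2)):
    """KL guidance between multi-level appearance embeddings and the shape embedding
    (sketch of Eqs. (9)-(10); normalization choice and KL direction are assumptions)."""
    q = F.softmax(f_s, dim=-1)                         # shape-embedding distribution Q
    loss = f_s.new_zeros(())
    for w, f_i in zip(weights, f_levels):              # stages 3, 4, 5
        log_p = F.log_softmax(f_i, dim=-1)             # appearance distribution P_i (log)
        loss = loss + w * F.kl_div(log_p, q, reduction='batchmean') / 2.0
    return loss

def total_loss(logits, labels, f_levels, f_s, lam=0.1):
    """Overall objective L = (1 - lambda) * L_ID + lambda * L_guide, Eq. (8)."""
    l_id = F.cross_entropy(logits, labels)
    return (1.0 - lam) * l_id + lam * guidance_loss(f_levels, f_s)
        </preformat>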
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>Datasets. To evaluate the effectiveness of our proposed MPCF framework,
we conducted assessments primarily on three widely used long-term clothing-change person Re-ID
datasets, summarized in Table 1: LTCC [5], PRCC [16], and Celeb-reID [24]. LTCC comprises 17,119 images
covering 152 distinct identities and 416 different outfits, with an average of 5 varying outfits
per person and the number of outfit changes ranging from 2 to 14. PRCC includes 33,698 images of 221
individuals captured from three camera views. The training set consists of 150 individuals, while the
test set comprises the remaining 71. During training, 25% of the images from the training set are used as
a validation set. Celeb-reID utilizes street photos of celebrities to address long-term clothing changes.
The dataset contains 34,186 images of 1,052 identities, each with unique clothing, thus presenting a
greater challenge in clothing-change scenarios compared to the previous two datasets.</p>
        <p>Implementation details. Our model is built on the PyTorch framework. We utilized a
ResNet-50 pre-trained on ImageNet [31] as the backbone network to extract texture features of
persons. The dimensions of the multi-level features extracted by the backbone network are 512, 1024,
and 2048, respectively. All training was conducted on
a single NVIDIA RTX 3090 GPU. During both training and testing phases, images were resized to
a uniform size of 384x192. Data augmentation included color jittering, random horizontal flipping,
padding, random cropping, and random erasing [36]. We employed the Adam optimizer [37] for model
optimization and performed 150 training epochs, with a warm-up strategy applied in the first 10 epochs,
gradually increasing the learning rate from 3e-5 to 3e-4. The learning rate was reduced by a factor of 10 at
epochs 40 and 80. For the PRCC and Celeb-reID datasets, the batch size was set to 48, while for the
LTCC dataset it was set to 32, with 4 images per identity. For pose estimation, we used
the OpenPose model pre-trained on the COCO dataset [38], generating 18 heatmaps, and we froze its
weights during the training process.</p>
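        <p>The warm-up and step-decay schedule described above corresponds to the following per-epoch rule (a sketch of the stated schedule; how it is wired into the optimizer is left out):</p>
        <preformat>
def learning_rate(epoch, base_lr=3e-4, warmup_start=3e-5, warmup_epochs=10):
    """Linear warm-up from 3e-5 to 3e-4 over the first 10 epochs, then 10x decays
    at epochs 40 and 80, as described in the implementation details."""
    if epoch >= 80:
        return base_lr * 0.01
    if epoch >= 40:
        return base_lr * 0.1
    if epoch >= warmup_epochs:
        return base_lr
    return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
        </preformat>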
        <p>Evaluation metrics. We employed the two standard metrics commonly used in most
clothing-change Re-ID literature: mean Average Precision (mAP) and the Cumulative Matching Characteristic (CMC).
To ensure a fair comparison with existing studies, we evaluated LTCC and PRCC under both the standard
and the clothing-change settings. Under the standard setting, the test set included both consistent and varied
clothing samples. In the clothing-change setting, the test set exclusively contained samples with varied
outfits.</p>
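        <p>For reference, both metrics can be computed from a query-gallery distance matrix roughly as follows. This is a simplified sketch: the same-camera and same-clothing filtering used by the LTCC/PRCC protocols is omitted, and every query is assumed to have at least one true match in the gallery.</p>
        <preformat>
import numpy as np

def cmc_and_map(dist, q_ids, g_ids, max_rank=10):
    """Simplified CMC / mAP from a (num_query, num_gallery) distance matrix."""
    cmc_curves, average_precisions = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                         # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float64)
        # CMC: 1 from the first correct match onward.
        first_hit = matches.cumsum()
        first_hit[first_hit > 1.0] = 1.0
        cmc_curves.append(first_hit[:max_rank])
        # AP: mean precision at the ranks of the correct matches.
        precision_at_k = matches.cumsum() / (np.arange(matches.size) + 1.0)
        average_precisions.append((precision_at_k * matches).sum() / matches.sum())
    return np.mean(cmc_curves, axis=0), float(np.mean(average_precisions))
        </preformat>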
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Comparisons</title>
        <p>Performance on the LTCC dataset. We evaluated our proposed method on the LTCC dataset and
compared it with baseline models and other state-of-the-art clothing-change person re-identification
approaches, as shown in Table 2. Compared to the baseline model, our model achieved improvements
of +2.0% in mAP and +2.2% in R1 under standard settings. In clothing-change settings, compared to
FSAM [26], although our method slightly underperformed in the mAP metric, it achieved a significant
+2.0% improvement in the R1 metric. Moreover, compared to the second-best performing method LDF
[29], our approach performed better in both the mAP and R1 metrics under both settings. It also surpassed the
MBUNet [27] method, which had the second-best R1 performance in the clothing-change setting.</p>
        <p>Performance on the PRCC dataset. We also assessed our proposed method on the PRCC dataset,
with results shown in Table 3. It is noteworthy that the original baseline
model was not evaluated on this dataset. We faithfully reproduced the experimental results by strictly
adhering to the implementation protocols outlined in the original paper. It can be observed that under
the clothing-change setting, our method significantly outperforms the baseline model on both the R1
and mAP metrics, with improvements of +4.1% and +3.7%, respectively. Although the baseline model
integrates clothing-agnostic pose information into the person identity representation and minimizes the
information discrepancy between appearance texture and pose features as much as possible, it is still
inevitably affected by clothing information. Our method, however, with multi-level pose information
supervision, can further reduce clothing noise. Other comparative results indicate that our method
achieves results comparable with other advanced approaches.</p>
        <p>Performance on the Celeb-reID dataset. Compared to the first two datasets, Celeb-reID is larger
and more challenging, with images captured from uncontrolled street snapshots without any clothing
annotations.</p>
        <p>As shown in Table 4, all advanced methods perform relatively poorly. Competitors such as FSAM
[26] and MBUNet [27] have not reported results on this dataset. Our method, MPCF, achieved
62.7%, 77.3%, and 16.1% in the R1, R5, and mAP metrics, respectively. Compared
to the baseline model, our method improved significantly, by +5.2% in R1 and +3.8% in mAP. When
compared to the second-best performing method, 3DInvarReID [19], our method improved by +0.9% in
mAP and +1.5% in R1.</p>
        <p>The performance results across the three datasets demonstrate that our approach helps person
re-identification models prioritize pose information over clothing during training, effectively addressing
the challenge of clothing changes.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>Component Analysis. To demonstrate the effectiveness of our approach, we evaluated the multi-level
pose guidance and the effectiveness of the two branches, SE and CMF, on the LTCC dataset under the
standard setting and compared them with the baseline model. The results are shown in Table 5.</p>
        <p>In single-level guidance, the pose guidance at stage 5 showed the most significant improvement over
the baseline model. The guidance at stages 3 and 4 resulted in slight
increases in mAP, but there was no noticeable improvement in the Rank metrics, and even a decrease
was observed. When combining two levels of guidance, the joint pose guidance at stages 4 and 5
performed the best, while the other two combinations improved mAP but did not perform well on the Rank
metrics. Ultimately, our method integrates guidance across all three levels; after experimentation, the
optimal weight ratio for the three levels in the guidance loss was found to be 5:3:2, achieving the best
overall performance. Compared to other methods, our approach achieved the best results in both the R1 and
mAP metrics. This also confirms the effectiveness of using multi-level guidance for pose information.</p>
        <p>To show that our framework is effective, we conducted ablation studies on its branches. Since all branches use
pedestrian pose features, removing the SE branch leaves only the ResNet-50 backbone working. This
leads to much worse performance on the LTCC dataset, as shown in Table 5. If we remove the CMF
branch, the model loses key information due to the difference between pose and appearance features, harming
performance. The final MPCF results prove that the CMF branch’s cross-modal fusion is necessary.</p>
        <p>Computational Complexity Analysis. We systematically evaluated the impact of adding three
levels of pose guidance components on the model under the PRCC dataset’s cloth-changing setting,
focusing on changes in computational cost and performance improvements. The results are shown in
Table 6.</p>
        <p>Experiments show that the introduction of a single-level pose guidance component leads to a
significant increase in the training parameters (Params) of the model, but the increase in the computational
time complexity (FLOPs) of the model is small. This is mainly because the projection module in the
MLFG branch uses fully connected layers and Transformer encoders, which add parameters but have
relatively low computational complexity. Furthermore, our MPCF framework integrates all three levels
of pose guidance components. Compared to the baseline model, while Params increased by 25%, the
performance metrics showed significant improvements: Rank-1 improved by +4.1%, and mAP improved by
+3.7%. Meanwhile, the increase in FLOPs remained small, indicating that the computational complexity
did not rise significantly.</p>
        <p>This design shows that our method can achieve significant performance improvements with limited
computational cost, proving that these additions are worthwhile.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Visualization of retrieval results</title>
        <p>Our proposed method integrates multi-modal feature fusion and multi-level pose guidance to better
address the challenges of person re-identification in long-term clothing-change scenarios. To visually
demonstrate this, we visualized the top-10 retrieval results of the baseline model CAMC and
our method on the LTCC dataset under the clothing-change setting, as shown in Fig. 4.</p>
        <p>Our proposed model significantly reduces the dependency on clothing information during the
identification process. As shown in the first row of Fig. 4(a), the baseline model’s matching results
mostly display persons with similar clothing but different identities compared to the query image. In
contrast, as depicted in the second row of Fig. 4(a), our method’s matching results still effectively
identify the correct person identities even in clothing-change scenarios where there may be similarities
between samples of different categories. Additionally, as demonstrated in the results of Fig. 4(b), due
to the interference of clothing information, the top retrieval results in the first row are images with
similar clothing textures and colors. However, thanks to the multi-level pose guidance in our approach,
the model focuses more on body shape information that is independent of clothing. Consequently, the
second row of results shows that even when the queried person is wearing different clothing, our
model can still achieve more robust person identity representations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>To mitigate information interference caused by long-term and cross-scenario appearance variations in
persons, this paper proposes a Multi-Level Pose-Guidance with Cross-Modality Fusion framework for Long-Term
Spatio-Temporal Re-ID (MPCF). Specifically, we introduce additional-modality human pose feature
embeddings through an SE branch, supplementing identity information independent of clothing. Then,
a CMF branch reduces the modality gap between person appearance features and pose features,
preventing the loss of key information across modalities when aggregating clothing-independent features.
Furthermore, to further reduce the model’s focus on clothing information during the aggregation
process, we propose an MLFG branch that uses multi-level person pose embeddings as guidance, compelling
the model to concentrate attention on clothing-independent information areas and ensuring that aggregated
features include more clothing-independent, distinctive identity information. Our proposed method
has been extensively tested on multiple datasets, validating its effectiveness.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This paper is supported in part by the National Natural Science Foundation of China under Grant
62376231, the Sichuan Science and Technology Program 24NSFSC1070, and the Sichuan Education
Informatization and Big Data Center (Sichuan Audio-visual Education Hall) 2024KTPSLX001.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[3] M. Liu, Z. Ma, T. Li, Y. Jiang, K. Wang, Long-term person re-identification with dramatic appearance
change: Algorithm and benchmark, in: Proceedings of the 30th ACM International Conference on
Multimedia, 2022, pp. 6406–6415.
[4] Y. Xian, J. Yang, F. Yu, J. Zhang, X. Sun, Graph-based self-learning for robust person re-identification,
in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023,
pp. 4789–4798.
[5] X. Qian, W. Wang, L. Zhang, F. Zhu, Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, Long-term cloth-changing
person re-identification, in: Proceedings of the Asian Conference on Computer Vision, 2020.
[6] W. Li, R. Zhao, T. Xiao, X. Wang, Deepreid: Deep filter pairing neural network for person
re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition,
2014, pp. 152–159.
[7] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, S. Z. Li, Salient color names for person re-identification, in:
Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12,
2014, Proceedings, Part I 13, Springer, 2014, pp. 536–551.
[8] O. Oreifej, R. Mehran, M. Shah, Human identity recognition in aerial images, in: 2010 IEEE
computer society conference on computer vision and pattern recognition, IEEE, 2010, pp. 709–716.
[9] R. R. Varior, B. Shuai, J. Lu, D. Xu, G. Wang, A siamese long short-term memory architecture for
human re-identification, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam,
The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, Springer, 2016, pp. 135–153.
[10] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian, Person re-identification in the wild,
in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.
1367–1376.
[11] F. Herzog, X. Ji, T. Teepe, S. Hörmann, J. Gilg, G. Rigoll, Lightweight multi-branch network for
person re-identification, in: 2021 IEEE international conference on image processing (ICIP), IEEE,
2021, pp. 1129–1133.
[12] G. Wang, Y. Yuan, X. Chen, J. Li, X. Zhou, Learning discriminative features with multiple
granularities for person re-identification, in: Proceedings of the 26th ACM international conference on
Multimedia, 2018, pp. 274–282.
[13] Y. Suh, J. Wang, S. Tang, T. Mei, K. M. Lee, Part-aligned bilinear representations for person
re-identification, in: Proceedings of the European conference on computer vision (ECCV), 2018,
pp. 402–419.
[14] F. Yang, K. Yan, S. Lu, H. Jia, X. Xie, W. Gao, Attention driven person re-identification, Pattern
Recognition 86 (2019) 143–155.
[15] L. Zhao, X. Li, Y. Zhuang, J. Wang, Deeply-learned part-aligned representations for person
re-identification, in: Proceedings of the IEEE international conference on computer vision, 2017, pp.
3219–3228.
[16] Q. Yang, A. Wu, W.-S. Zheng, Person re-identification by contour sketch under moderate clothing
change, IEEE transactions on pattern analysis and machine intelligence 43 (2019) 2029–2046.
[17] J. Chen, X. Jiang, F. Wang, J. Zhang, F. Zheng, X. Sun, W.-S. Zheng, Learning 3d shape feature
for texture-insensitive person re-identification, in: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2021, pp. 8146–8155.
[18] Q. Wang, X. Qian, Y. Fu, X. Xue, Co-attention aligned mutual cross-attention for cloth-changing
person re-identification, in: Proceedings of the Asian Conference on Computer Vision, 2022, pp.
2270–2288.
[19] F. Liu, M. Kim, Z. Gu, A. Jain, X. Liu, Learning clothing and pose invariant 3d shape representation
for long-term person re-identification, in: Proceedings of the IEEE/CVF International Conference
on Computer Vision, 2023, pp. 19617–19626.
[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[21] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2d pose estimation using part affinity
fields, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017,
pp. 7291–7299.
[22] A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv
preprint arXiv: 2010.11929 (2020).
[23] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference
on computer vision and pattern recognition, 2018, pp. 7132–7141.
[24] Y. Huang, Q. Wu, J. Xu, Y. Zhong, Celebrities-reid: A benchmark for clothes variation in long-term
person re-identification, in: 2019 International Joint Conference on Neural Networks (IJCNN),
IEEE, 2019, pp. 1–8.
[25] Y. Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, Beyond part models: Person retrieval with refined
part pooling (and a strong convolutional baseline), in: Proceedings of the European conference on
computer vision (ECCV), 2018, pp. 480–496.
[26] P. Hong, T. Wu, A. Wu, X. Han, W.-S. Zheng, Fine-grained shape-appearance mutual learning for
cloth-changing person re-identification, in: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2021, pp. 10513–10522.
[27] G. Zhang, J. Liu, Y. Chen, Y. Zheng, H. Zhang, Multi-biometric unified network for cloth-changing
person re-identification, IEEE Transactions on Image Processing 32 (2023) 4555–4566.
[28] Z. Yang, X. Zhong, Z. Zhong, H. Liu, Z. Wang, S. Satoh, Win-win by competition:
Auxiliary-free cloth-changing person re-identification, IEEE Transactions on Image Processing 32 (2023)
2985–2999.
[29] P. P. Chan, X. Hu, H. Song, P. Peng, K. Chen, Learning disentangled features for person
re-identification under clothes changing, ACM Transactions on Multimedia Computing,
Communications and Applications 19 (2023) 1–21.
[30] M. Li, S. Cheng, P. Xu, X. Zhu, C.-G. Li, J. Guo, Unsupervised long-term person re-identification
with clothes change, in: 2023 8th IEEE International Conference on Network Intelligence and
Digital Content (IC-NIDC), IEEE, 2023, pp. 167–171.
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of
computer vision 115 (2015) 211–252.
[32] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, X. Chen, Interaction-and-aggregation network for person
re-identification, in: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, 2019, pp. 9317–9326.
[33] C. Yan, G. Pang, J. Jiao, X. Bai, X. Feng, C. Shen, Occluded person re-identification with single-scale
global representations, in: Proceedings of the IEEE/CVF international conference on computer
vision, 2021, pp. 11875–11884.
[34] W. Xu, H. Liu, W. Shi, Z. Miao, Z. Lu, F. Chen, Adversarial feature disentanglement for long-term
person re-identification., in: IJCAI, 2021, pp. 1201–1207.
[35] Y. Yan, H. Yu, S. Li, Z. Lu, J. He, H. Zhang, R. Wang, Weakening the influence of clothing: Universal
clothing attribute disentanglement for person re-identification., in: IJCAI, 2022, pp. 1523–1529.
[36] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in: Proceedings
of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 13001–13008.
[37] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft
coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference,
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.
[39] S. Yang, B. Kang, Y. Lee, Sampling agnostic feature representation for long-term person
re-identification, IEEE Transactions on Image Processing 31 (2022) 6412–6423.
[40] J. Wu, Y. Huang, M. Gao, Z. Gao, J. Zhao, H. Zhang, A. Zhang, A two-stream hybrid
convolution-transformer network architecture for clothing-change person re-identification, IEEE Transactions
on Multimedia (2023).
[41] Y. Huang, Q. Wu, Z. Zhang, C. Shan, Y. Zhong, L. Wang, Meta clothing status calibration for
long-term person re-identification, IEEE Transactions on Image Processing (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Patching your clothes: Semantic-aware learning for cloth-changed person re-identification</article-title>
          , in: International Conference on Multimedia Modeling, Springer,
          <year>2022</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <article-title>Clothchanging person re-identification from a single image with gait prediction and regularization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>14278</fpage>
          -
          <lpage>14287</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>