<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>One-Shot Video-Based Person Re-Identification by Global-Local Feature and Subsampling Strategy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuang Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Li</string-name>
          <email>liyang162@shu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Shanghai University</institution>
        </aff>
      </contrib-group>
      <fpage>63</fpage>
      <lpage>71</lpage>
      <abstract>
<p>This paper focuses on the video-based person re-identification task under the one-shot setting, and proposes a method based on global-local features and a subsampling strategy. In order to obtain more discriminative person features for pseudo-label estimation, global and local features are considered together, and multiple loss functions are used to optimize the model, which improves its discriminative power. In addition, a more principled method, the subsampling strategy, is proposed for pseudo-label estimation: by dividing the pseudo-labeled dataset into two subsets with proper proportions, the influence of falsely labeled samples on model training is reduced. Extensive experiments have been carried out on MARS and DukeMTMC-VideoReID, two commonly used public video datasets. Experimental results show that our method achieves better performance than state-of-the-art methods, verifying its superiority.</p>
      </abstract>
      <kwd-group>
<kwd>One-shot</kwd>
        <kwd>video-based person re-identification</kwd>
        <kwd>global-local feature</kwd>
        <kwd>subsampling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Person re-identification (Re-ID) is a sub-problem of image retrieval, which mainly uses computer
vision technology to judge whether there is a specific person in an image or video sequence. In recent
years, with the continuous development of intelligent monitoring field, person Re-ID technology has
gradually attracted wide attention [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
]. Current person Re-ID tasks fall into two main
categories: image-based person Re-ID and video-based person Re-ID. Compared with
images, video is arguably the more natural modality, because in real life cameras
usually capture video sequences, and video sequences also provide richer and more detailed features,
making person modeling more accurate.
      </p>
      <p>
        With the rise of deep neural networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
], video-based person Re-ID has also made substantial
progress. However, most existing methods are supervised and therefore rely on large amounts of
labeled data. Annotating such datasets is expensive and time-consuming, especially when person data
must be associated across multiple cameras, and this high annotation cost is one of the main reasons
supervised methods are difficult to extend to real applications.
Therefore, many researchers have begun to study semi-supervised video-based person Re-ID [
        <xref ref-type="bibr" rid="ref4 ref5">4,
5</xref>
        ].
      </p>
      <p>
        Semi-supervised methods mainly use a small amount of labeled data to conduct network learning.
This paper considers a semi-supervised method under the one-shot setting. For one-shot
video-based person Re-ID, only one labeled video tracklet is available for each identity, and the rest are
unlabeled data. In view of the few available labeled samples and the complex information contained in
video sequence, how to effectively use video sequence information for feature extraction and how to
assign pseudo labels to unlabeled data have become key challenges. In recent years, there have been
many works based on the one-shot setting [
<xref ref-type="bibr" rid="ref6 ref7 ref8 ref9 ref10">6, 7, 8, 9, 10</xref>
], but most of these methods rely on a global
feature representation and ignore local feature descriptions. A coarse global representation alone
is prone to incorrect label estimation, whose influence on model training grows as the
iterations proceed, while discriminative information may be hidden in local regions of
the video tracklet. In addition, to better exploit unlabeled data, Wu et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
]
propose a progressive sampling strategy, in which the most reliable pseudo-labeled data are gradually
selected for training. However, in the one-shot task the initial labeled data are scarce, so there are still a
large number of incorrectly labeled samples in the pseudo-labeled data selected by this progressive
sampling strategy, which harms the training of subsequent models.
      </p>
      <p>
        In order to solve the above problems, this paper proposes a method based on global-local feature
and a subsampling strategy. Specifically, we adopt a progressive learning framework. During feature
extraction, global and local features are considered jointly, so that the CNN yields a more
discriminative global-local feature representation of each sample. To improve the discriminability of
both the learned features and the model, we introduce the center loss function
[11] to minimize intra-class variation, and combine it with the cross-entropy loss for identity
classification. In addition, the exclusive loss [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is adopted to push unlabeled samples away from each other in the
feature space, reducing the influence of irrelevant features on the model. Finally, a subsampling
strategy is proposed to select reliable pseudo-labeled data: the selected data are divided into two subsets with
appropriate proportions, and the more reliable pseudo-labeled subset is merged into
the training set to retrain the model, improving the robustness of training.
      </p>
<p>The main contributions of this paper are summarized as follows:
⚫ A method based on global-local feature learning is proposed, which considers global and local
features jointly and adopts multiple loss functions to optimize different kinds of sample data,
effectively improving the discriminative ability of the model.
⚫ A subsampling strategy is proposed: when pseudo-labeled samples are selected dynamically,
they are divided into two subsets with a proper proportion, so
as to reduce the influence of falsely labeled samples on model training and improve the
robustness of the model.
⚫ Extensive experiments on MARS and DukeMTMC-VideoReID, two publicly available
video datasets, show that our proposed method outperforms other state-of-the-art
methods.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Supervised Video-Based Person Re-ID</title>
<p>In recent years, with the rise of deep learning [12] and its wide application to image
recognition, video-based person Re-ID has achieved great success and its performance has been
significantly improved. In 2016, McLaughlin et al. [13] apply deep learning to video-based
person Re-ID for the first time, proposing a new recurrent neural network structure which
captures features from each video frame and then uses a recurrent layer and a temporal pooling layer to extract
video-level features. Since then, video-based person Re-ID based on deep learning has developed
rapidly and achieved remarkable results on various datasets. For the problem of occlusion in the person
Re-ID task, Li et al. [14] design a spatio-temporal attention model with diversity regularization. The model
uses multiple spatial attention models to locate and discriminate image regions, and then collects these
local features across time through temporal attention. Yang et al. [15] propose a novel spatio-temporal
graph convolutional network to model the potential relationships between different parts of the human
body within and across the same frame, providing more discriminative and robust information for
ReID, and effectively overcoming the problem of occlusion and visual ambiguity.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Semi-Supervised Video-Based Person Re-ID</title>
<p>In order to address the high cost of dataset annotation and the difficulty of applying supervised methods to real
scenes, many researchers have also begun to pay attention to semi-supervised video-based person Re-ID.
Compared with supervised video-based person Re-ID, there are relatively few studies on semi-supervised
video-based person Re-ID, but in recent years, with the attention of a large number of researchers, the field of
semi-supervised video-based person Re-ID has made great progress.</p>
      <p>
        Ye et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] propose a dynamic graph matching method in 2017, which iteratively updates image
graph matching and label estimation to learn better features. Liu et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
] update the classifier with
k-reciprocal nearest neighbors (KNN) in the gallery set, and refine the nearest neighbors by applying
negative sample mining with KNN in the query set. It should be mentioned that although the two methods
in literature [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] claim that they are unsupervised, they require at least one labeled tracklet for
each identity in their experiments, so they are regarded as one-shot video-based person Re-ID methods. Both
methods adopt a static sampling strategy to determine the amount of selected pseudo-labeled data for
further training, in which pseudo labels whose confidence score is higher than the predefined threshold
are selected in each step, which will result in a large number of unreliable pseudo labels for training
and hinder the improvement of model performance. To solve this problem, Wu et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] propose a new
progressive learning method. In this method, a dynamic sampling strategy is adopted to allocate pseudo
labels, and the number of pseudo-labeled candidates is gradually increased, which significantly
improves the performance of label estimation. On this basis, the authors of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] propose a progressive method
of joint learning, which divides training data into labeled data, pseudo-labeled data and index-labeled
data. In the iterative process, the CNN model is updated through joint training on these three parts of data,
making full use even of unlabeled data with unreliable pseudo labels. Yin et al. [10] propose a method based
on multi-loss learning and joint distance measurement to optimize different loss functions for different
data and improve the discriminant power of the model. In label estimation, sample distance and nearest
neighbor distance are considered together to further improve the accuracy of pseudo-label prediction.
      </p>
      <p>Although the above methods achieve good performance, they all adopt temporal average pooling to
obtain global feature representation of people during feature extraction, which may lead to incorrect
label estimation and errors will accumulate with iteration. In addition, the existing methods usually use
the cross-entropy loss function to classify identities, but the cross-entropy loss function only ensures
that features of different categories can be separated. Therefore, in order to improve the distinguishing
ability of acquired features, we consider the importance of local features, propose a learning method
based on global-local features, introduce the center loss function into the identity classification task,
and combine multiple loss functions to optimize the model. At the initial stage of model training, due
to the small number of initial labeled samples, the model tends to over-fit. Even if the
dynamic sampling strategy is adopted, there will still be a large number of mislabeled samples.
Therefore, in order to reduce the influence of incorrectly labeled samples on subsequent training, we
propose a subsampling strategy, which further divides the dynamically selected pseudo-labeled sample
data into two subsets with appropriate proportions, and selects the more reliable pseudo-labeled sample
data to be added to the training set for the next training.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Method</title>
      <p>The method framework proposed by us is shown in Figure 1. In this paper, an iterative method is
adopted to train the model. The specific process is as follows: First, the model is trained by using the
method of global-local feature learning for labeled data, pseudo-labeled data I, pseudo-labeled data II
and index-labeled data. It is worth noting that in the initial stage of model training, no label estimation
has been carried out on the unlabeled data, so there are only labeled data and unlabeled data. We
initialize the video-based Re-ID model on the initially labeled samples. Then, a distance-based
method is adopted in the feature space: pseudo labels are
assigned to each unlabeled sample according to its distance to the labeled data, several
reliable pseudo-labeled samples are selected, and index labels are assigned to the unselected data as
index-labeled data. Next, the subsampling strategy divides the selected reliable pseudo-labeled
data into pseudo-labeled data I and pseudo-labeled data II, and pseudo-labeled data I is merged into
the training set for the next round of training. During the iterative process, the size of the selected pseudo-labeled data
is constantly expanded until all the unlabeled data is used for model training.</p>
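The iterative procedure described above can be outlined as follows. This is an illustrative sketch, not the authors' implementation: `train_model`, `estimate_pseudo_labels`, and `subsample` are hypothetical callables standing in for the components described in this section.

```python
def progressive_training(labeled, unlabeled, train_model,
                         estimate_pseudo_labels, subsample, p=0.05, lam=0.4):
    """Iteratively grow the pseudo-labeled training set.

    labeled   -- list of (tracklet, label) pairs (one per identity)
    unlabeled -- list of tracklets without labels
    p         -- enlarging factor controlling how fast the selection grows
    lam       -- subsampling ratio splitting selections into subsets I and II
    """
    total = len(unlabeled)
    # Initialize the model on the one-shot labeled data only; everything
    # else starts out as index-labeled (unselected) data.
    model = train_model(labeled, [], [], unlabeled)
    selected = 0
    while selected < total:
        # Grow the selected set by the enlarging factor p each iteration.
        selected = min(total, selected + max(1, int(p * total)))
        # Rank unlabeled samples by pseudo-label confidence (most reliable first).
        ranked = estimate_pseudo_labels(model, labeled, unlabeled)
        chosen, index_labeled = ranked[:selected], ranked[selected:]
        # Subsampling: split the chosen data into subsets I and II.
        pseudo_I, pseudo_II = subsample(chosen, lam)
        model = train_model(labeled, pseudo_I, pseudo_II, index_labeled)
    return model
```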
    </sec>
    <sec id="sec-6">
      <title>3.1. Problem Description</title>
<p>Under the one-shot setting, for n identities, the labeled
dataset is defined as L = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, and the unlabeled dataset is defined as U = {x_1, x_2, …, x_m},
where x represents a video tracklet and y represents the label corresponding to the video tracklet. Our
goal is to train a feature extractor f(θ; ·) parameterized by θ using these sample data. In the t-th iteration,
the training set contains four parts of data: the labeled data L, pseudo-labeled data I, pseudo-labeled data II,
and index-labeled data.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2. Global-Local Feature Learning</title>
      <p>It is very important to obtain distinguishing feature representation for label estimation. In the
previous work, the global features at the video level are usually obtained through temporal average
pooling, without any fine-grained local information. Therefore, this paper combines global features with
local features and proposes a method based on global-local feature learning. The feature extraction
module is shown in Figure 2.</p>
      <p>For each video tracklet there are two branches: the global branch and the local branch. For global
feature learning, frame-level feature embedding is first obtained through a CNN model, and then the
frame-level feature is represented as the global feature of the video tracklet through global average
pooling. Global features capture the general appearance of a person, but some fine-grained local features are
ignored. In the person Re-ID task, local information is also very effective and highly discriminative.
Therefore, we introduce local features to learn more discriminative information, and extract the frame-level
features from the CNN model as the local feature representation of the video tracklet. Finally, the global-local
feature representation is obtained by combining the features of the two levels.</p>
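A minimal PyTorch sketch of the two-branch extraction described above, assuming a per-frame backbone that maps each frame to a feature vector; fusing the two branches by concatenation is our reading of how the global and local features are combined, not the authors' exact code.

```python
import torch
import torch.nn as nn

class GlobalLocalFeature(nn.Module):
    """Sketch of the global-local feature module described above.

    `backbone` maps each frame to a d-dimensional embedding; the global
    branch averages frame features over time (temporal average pooling),
    the local branch keeps the per-frame features, and the two are
    concatenated into the final representation. Illustrative only.
    """
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, clips):                      # clips: (B, T, C, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)               # (B*T, C, H, W)
        f = self.backbone(frames).view(b, t, -1)   # frame-level features (B, T, d)
        g = f.mean(dim=1)                          # global branch: temporal average pooling
        local = f.flatten(1)                       # local branch: per-frame features kept
        return torch.cat([g, local], dim=1)        # global-local representation
```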
<p>We use the labeled data, pseudo-labeled data I and pseudo-labeled data II for identity classification
learning, and the cross-entropy loss function is as follows:</p>
<p>L_ce = −(1/n) ∑_{i=1}^{n} log(p(y_i | x_i)), (1)</p>
<p>where x_i and y_i are the input samples and their corresponding labels respectively, p(·) is the
probability of correctly predicting the sample label, and n is the total number of samples. It is worth
mentioning that in identity classification, the cross-entropy loss only ensures that features of different
categories can be separated. Therefore, in order to improve the discriminability of the acquired features, we
introduce the center loss function [11] for identity classification; the center loss is formulated as follows:</p>
<p>L_c = (1/2) ∑_{i=1}^{n} ‖f(θ; x_i) − c_{y_i}‖_2^2, (2)</p>
<p>where c_{y_i} represents the class-y_i center of the deep features. The CNN model is trained jointly with
cross-entropy loss and center loss, so the loss function of the identity classification task is expressed as follows:</p>
<p>L_id = L_ce + α L_c, (3)</p>
<p>where α is the hyperparameter balancing the cross-entropy loss and the center loss. For the index-labeled data,
whose pseudo labels are unreliable, we hope to reduce their impact on model performance. Therefore, the
exclusive loss [<xref ref-type="bibr" rid="ref9">9</xref>] is used to keep the index-labeled data away from other data in the feature space, so as to
prevent the model from learning irrelevant features and to further optimize the CNN model. The exclusive loss is
expressed as follows:</p>
<p>L_ex = −log [ exp(v_i^T f(θ; x_i) / τ) / ∑_j exp(v_j^T f(θ; x_i) / τ) ], (4)</p>
<p>where the sum runs over all index-labeled samples, v_i = f(θ; x_i) is the L2-normalized feature embedding of a sample, and the hyperparameter τ
is mainly used to control the sharpness of the distribution. We combine the two loss functions to train the model, and the
final objective function is as follows:</p>
<p>L = min β L_id + (1 − β) L_ex, (5)</p>
<p>where the hyperparameter β is used to adjust the contribution of the identity loss and the exclusive loss.</p>
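Assuming the losses above take the usual forms of cross-entropy, center, and exclusive losses, they can be sketched in PyTorch as follows. The per-batch mean in the center term and the memory bank of L2-normalized index-labeled embeddings are common implementation choices, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def identity_loss(logits, feats, labels, centers, alpha=0.1):
    """Cross-entropy plus alpha-weighted center loss for identity classification.

    `centers` is a (num_classes, d) tensor of per-class feature centers.
    Illustrative sketch, not the authors' code.
    """
    ce = F.cross_entropy(logits, labels)                               # cross-entropy loss
    center = 0.5 * (feats - centers[labels]).pow(2).sum(dim=1).mean()  # center loss
    return ce + alpha * center

def exclusive_loss(feats, memory, index, tau=0.1):
    """Exclusive loss: push index-labeled samples apart in feature space.

    `memory` is assumed to hold L2-normalized embeddings of all index-labeled
    data; each sample should be most similar to its own memory slot.
    """
    feats = F.normalize(feats, dim=1)
    logits = feats @ memory.t() / tau   # similarity to every index-labeled slot
    return F.cross_entropy(logits, index)

def total_loss(logits, feats, labels, centers, memory, index, beta=0.8):
    """Final objective: beta-weighted identity loss plus exclusive loss."""
    return beta * identity_loss(logits, feats, labels, centers) \
        + (1 - beta) * exclusive_loss(feats, memory, index)
```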
    </sec>
    <sec id="sec-8">
      <title>3.3. Subsampling Strategy</title>
<p>
        Label estimation is undoubtedly a big challenge for the one-shot video-based person Re-ID task: how
to assign correct pseudo labels to the abundant unlabeled data, since the accuracy of the pseudo labels
greatly affects the performance of the model. Early works [16] assign pseudo labels to
unlabeled data through a classification prediction model. However, for one-shot tasks, the classifier
easily over-fits and cannot accurately predict the categories of unlabeled data. To this end, we adopt
the distance-based nearest-neighbor method of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for label estimation, as shown below:
      </p>
<p>d(x_u) = min_{x_l ∈ L} ‖f(θ; x_l) − f(θ; x_u)‖, (6)</p>
<p>
        where x_l and x_u represent labeled and unlabeled data respectively, and ‖·‖ denotes the Euclidean
distance. Specifically, the confidence of a label estimate is defined as the Euclidean distance
between the unlabeled sample and the labeled data in the feature space, and a pseudo label is then assigned to each
unlabeled sample according to the labeled sample nearest to it in the feature space. An enlarging factor p ∈
(0, 1) is set to dynamically select pseudo-labeled data with high confidence for the next training round. In
the t-th iteration, the size of the selected pseudo-labeled dataset is expressed as:
      </p>
<p>s_t = s_{t−1} + p · |U|, (7)</p>
<p>where |U| denotes the size of the unlabeled dataset.</p>
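The nearest-neighbor label estimation described above can be sketched as follows, assuming video-level feature matrices have already been extracted (an illustration, not the authors' code):

```python
import torch

def estimate_pseudo_labels(labeled_feats, labeled_ids, unlabeled_feats):
    """Nearest-neighbor pseudo-label assignment.

    Each unlabeled tracklet receives the identity of its closest labeled
    tracklet in feature space; the distance itself serves as the confidence
    score, so a smaller distance means a more reliable pseudo label.
    """
    dists = torch.cdist(unlabeled_feats, labeled_feats)  # pairwise Euclidean distances
    confidence, nearest = dists.min(dim=1)               # closest labeled sample per row
    pseudo = labeled_ids[nearest]                        # inherit its identity
    return pseudo, confidence
```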
<p>Although the dynamic sampling strategy selects pseudo-labeled data step by step for subsequent training,
so as to prevent early training from involving a large number of unreliable
pseudo-labeled samples, in the one-shot task each identity has only one labeled video tracklet; the original
labeled data are scarce, so the model still over-fits easily. Specifically, during label estimation, there are
still a large number of falsely labeled samples in the pseudo-labeled dataset selected through the dynamic
sampling strategy, which inevitably affects subsequent model training, and this influence
gradually grows with the iterations. Therefore, we propose a subsampling strategy on top of the
dynamic sampling strategy, dividing the pseudo-labeled dataset selected by dynamic sampling
into pseudo-labeled set I and pseudo-labeled set II, with |S_I| = λ · s_t and |S_II| = (1 − λ) · s_t.</p>
      <p>The hyperparameter λ is used to control the ratio of two sets. In order to select a suitable λ, we study
the influence of the size of the hyperparameter λ on the model performance in Section 4.5.</p>
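A sketch of the subsampling split, assuming subset I keeps the λ fraction of selected samples with the smallest nearest-neighbor distances; this ranking criterion is our reading of the method (the paper states only that λ controls the ratio of the two sets):

```python
def subsample(pseudo_labeled, lam=0.4):
    """Split dynamically selected pseudo-labeled data into subsets I and II.

    pseudo_labeled -- list of (sample, pseudo_label, distance) triples, where
                      distance is the nearest-neighbor distance used as the
                      confidence score (smaller = more reliable).
    lam            -- ratio controlling the size of subset I.

    Subset I (the lam fraction with the smallest distances) is merged into
    the training set; subset II is held back. Illustrative sketch only.
    """
    ranked = sorted(pseudo_labeled, key=lambda item: item[2])  # most reliable first
    k = int(lam * len(ranked))
    return ranked[:k], ranked[k:]   # pseudo-labeled data I, pseudo-labeled data II
```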
    </sec>
    <sec id="sec-9">
      <title>4. Experiments</title>
    </sec>
    <sec id="sec-10">
      <title>4.1. Datasets and Evaluation Metrics</title>
<p>The MARS dataset [17] is the largest video dataset for video-based person Re-ID so far. It
consists of person video tracklets captured by 6 cameras on a university campus, including 17,503 video
tracklets corresponding to 1,263 identities, plus 3,248 distractor clips. 625 identities are used for
training and 636 for testing.</p>
      <p>
        DukeMTMC-VideoReID dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
] is a subset of DukeMTMC, consisting of 4,832 video tracklets
of 1,812 identities, including 2,196 video tracklets of 702 identities in the training set, 2,636 video
tracklets of 702 identities in the test set, and 408 distractor identities.
      </p>
      <p>In this paper, Cumulative Matching Characteristic (CMC) and Mean Average Precision (mAP) are
used to evaluate the effectiveness of the proposed method.</p>
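For reference, a minimal single-gallery version of the CMC and mAP computation can be sketched as follows; real Re-ID protocols additionally exclude same-camera matches for each query, which is omitted here for brevity.

```python
import numpy as np

def cmc_map(dist, query_ids, gallery_ids):
    """Compute CMC curve and mAP from a (num_query, num_gallery) distance matrix.

    Simplified sketch: assumes every query has at least one correct gallery
    match and performs no camera-based filtering.
    """
    num_q = dist.shape[0]
    cmc = np.zeros(dist.shape[1])
    aps = []
    for q in range(num_q):
        order = np.argsort(dist[q])                         # gallery sorted by distance
        matches = (gallery_ids[order] == query_ids[q]).astype(float)
        first = int(np.argmax(matches))                     # rank of earliest correct match
        cmc[first:] += 1                                    # counts toward rank-(first+1) and up
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())  # average precision
    return cmc / num_q, float(np.mean(aps))
```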
    </sec>
    <sec id="sec-11">
      <title>4.2. Implementation Details</title>
      <p>
        All experiments in this paper are based on PyTorch framework, and we adopt basically the same
experimental setup as literature [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] . We adopt ResNet-50 [18] with the last classification layer removed
as our feature embedding model φ to conduct all the experiments and initialize it by the ImageNet [19]
pre-trained model. α is set to 0.1 during the training phase when the model is initialized, and then to
0.001. The value of the hyperparameter τ in formula (4) is set to 0.1, β is set to 0.8 initially in formula
(5) and 1 in the last 15 epochs. For label estimation, the value of λ is set to 0.4.
      </p>
    </sec>
    <sec id="sec-12">
      <title>4.3. Comparison with the State-of-the-Art Methods</title>
<p>To verify the effectiveness of our method, we conduct experiments on MARS and
DukeMTMC-VideoReID, two commonly used video datasets; the experimental results are shown in Table 1.
Among them, the baseline method does not use additional unlabeled data, but only uses the one-shot labeled
data for training. The proposed method achieves 69.8% rank-1 and 49.5% mAP (when p = 0.05) on the
MARS dataset, 1.3% and 1.7% higher than the most advanced method, respectively. In addition, the proposed method
performs even better on the DukeMTMC-VideoReID dataset, with rank-1 and mAP (when p = 0.05) reaching
87.2% and 84.5%, respectively, an improvement of 10.7% and 15.8% over the most advanced method.</p>
      <sec id="sec-12-1">
        <title>Table 1. Comparison with the state-of-the-art methods on MARS and DukeMTMC-VideoReID datasets. Bold numbers are the best.</title>
        <p>[The table compares MLL+JDM [10] and our method at p = 0.10 and p = 0.05; the numeric entries were not recoverable from the source.]</p>
<p>The experimental results show that the performance of the proposed method is better than the most
advanced one-shot methods, especially on the DukeMTMC-VideoReID dataset. This may be because
previous methods only focus on the global features of samples and ignore the implicit local information,
which often contains distinguishing features. Secondly, the MARS dataset is richer and more complex
than the DukeMTMC-VideoReID dataset and contains more label noise, which affects the
discriminative performance of the model.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>4.4. Ablation Studies</title>
      <p>For the two key parts of global-local feature learning and subsampling strategy proposed in this
paper, we conduct ablation studies on MARS and DukeMTMC-VideoReID datasets, as shown in Table
2. Where "Ours w/o L" means that only global features of the sample are considered while local features
are ignored during feature extraction. "Ours w/o S" means that the subsampling module is removed
during label estimation, that is, only dynamic sampling is adopted. "Ours" represents the complete
model presented in this paper.</p>
<p>Table 2</p>
      <sec id="sec-13-1">
        <title>Ablation studies on MARS and DukeMTMC-VideoReID datasets.</title>
        <p>[The table reports rank-1 and mAP for "Ours w/o L", "Ours w/o S" and "Ours" on both datasets; the numeric entries were not recoverable from the source.]</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>4.5. Algorithm Analysis</title>
<p>In this paper, we propose a subsampling strategy in the label estimation stage, in which the
hyperparameter λ determines the sizes of the two subsets, so choosing an appropriate partition ratio
becomes the key. Therefore, we compare the effects of different λ values on the model
performance on the DukeMTMC-VideoReID dataset (with the enlarging factor p = 0.05); the
experimental results are shown in Table 3. It can be seen that as λ increases, rank-1 and mAP
increase gradually. When λ is 0.4, rank-1 and mAP reach their maximum and the overall performance is
the best. When λ continues to increase, rank-1 and mAP begin to decrease. Therefore, to obtain the best
performance, we set λ to 0.4.</p>
<p>Table 3
Comparison of different λ values on the DukeMTMC-VideoReID dataset (with the enlarging factor p = 0.05). Bold numbers are the best.</p>
    </sec>
    <sec id="sec-15">
<title>5. Conclusion</title>
      <p>In this paper, a method based on global-local features and a subsampling strategy is proposed for
one-shot video-based person Re-ID. In feature extraction, this method not only focuses on the coarse-grained
global features of the video tracklet, but also considers the fine-grained local features, and combines
the features of the two dimensions to obtain a more discriminative feature representation. On the basis
of the cross-entropy loss, joint center loss and exclusive loss further improve discriminative feature learning.
In addition, in order to reduce the influence of falsely labeled samples, a subsampling strategy is proposed
during label estimation, which divides the selected pseudo-labeled samples into two subsets with appropriate
proportions for training. Good performance on the MARS and DukeMTMC-VideoReID datasets proves the
effectiveness of the proposed method.</p>
    </sec>
    <sec id="sec-16">
      <title>6. References</title>
      <p>[10] Y. Yin, et al., "One-shot video-based person re-identification with multi-loss learning and joint
metric," Journal of Computer Applications 42 (2022), pp. 764-769.
[11] Y. Wen, K. Zhang, Z. Li, Y. Qiao, "A discriminative feature learning approach for deep face
recognition," in: European Conference on Computer Vision, Springer, 2016, pp. 499-515.
[12] S. Ioffe, C. Szegedy, "Batch normalization: Accelerating deep network training by
reducing internal covariate shift," arXiv:1502.03167, 2015.
[13] N. McLaughlin, J. Martinez del Rincon, P. Miller, "Recurrent Convolutional Network for
Video-Based Person Re-identification," 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016, pp. 1325-1334, doi: 10.1109/CVPR.2016.148.
[14] S. Li, S. Bak, P. Carr, X. Wang, "Diversity Regularized Spatiotemporal Attention for
Video-Based Person Re-identification," 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2018, pp. 369-378, doi: 10.1109/CVPR.2018.00046.
[15] J. Yang, W.-S. Zheng, Q. Yang, Y.-C. Chen, Q. Tian, "Spatial-Temporal Graph Convolutional
Network for Video-Based Person Re-Identification," 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2020, pp. 3286-3296, doi: 10.1109/CVPR42600.2020.00335.
[16] X. Dong, D. Meng, F. Ma, Y. Yang, "A dual-network progressive approach to weakly
supervised object detection," in: ACM International Conference on Multimedia, 2017, pp. 279-287.
[17] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, Q. Tian, "MARS: A video benchmark for
large-scale person re-identification," in: European Conference on Computer Vision, 2016, pp. 868-884.
[18] K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition," in: Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770-778.
[19] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet classification with deep convolutional
neural networks," in: Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097-1105.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
, Y. Cheng, K. Gu,
          <string-name>
<given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
<given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Jointly attentive spatial-temporal pooling networks for video-based person re-identification.</article-title>
<source>ICCV</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          et al.,
          <article-title>"Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification,"</article-title>
          <source>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2896</fpage>
          -
          <lpage>2905</lpage>
, doi: 10.1109/CVPR42600.2020.00297
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          et al.,
<article-title>"Semi-Supervised Cross-View Projection-Based Dictionary Learning for Video-Based Person Re-Identification,"</article-title>
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>2599</fpage>
          -
          <lpage>2611</lpage>
          , Oct.
          <year>2018</year>
, doi: 10.1109/TCSVT.2017.2718036.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
<string-name>
            <given-names>D.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          et al.
          <article-title>Video-based person re-identification by semi-supervised adaptive stepwise learning</article-title>
          .
          <source>Pattern Anal Applic</source>
          <volume>24</volume>
          ,
          <fpage>1769</fpage>
          -
          <lpage>1776</lpage>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Ma</surname>
          </string-name>
,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
, and
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Yuen</surname>
          </string-name>
          .
<article-title>Dynamic label graph matching for unsupervised video re-identification</article-title>
          .
          <source>ICCV</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          .
<article-title>Stepwise metric promotion for unsupervised video person re-identification</article-title>
          .
          <source>In ICCV</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , “
<article-title>Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning,”</article-title>
          <source>in IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>5177</fpage>
          -
          <lpage>5186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
<string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , “
<article-title>Progressive learning for person re-identification with one example,”</article-title>
          <source>IEEE Trans. Image Process.</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>2872</fpage>
          -
          <lpage>2881</lpage>
          , Jun.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>