<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multimodal Micro-gesture Recognition Based on CLIP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yiwen Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhenyang Dong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pengxia Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujie Liu</string-name>
          <email>liuyujie@upc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <kwd-group>
          <kwd>Micro-gesture</kwd>
          <kwd>action classification</kwd>
          <kwd>CLIP</kwd>
          <kwd>Vision-Language Model</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Computer Science and Technology, Qingdao Software College, China University of Petroleum</institution>
          ,
          <addr-line>Qingdao 266580</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>3</volume>
      <issue>2024</issue>
      <abstract>
        <p>This paper introduces our approach for Track 1 of the 2nd MiGA-IJCAI Challenge, which focuses on micro-gesture recognition. The micro-gesture dataset is characterized by small action amplitude, short duration, and actions concentrated in specific body parts. To address these issues, we propose a multimodal micro-gesture recognition network based on CLIP. In the video modality, we use a frozen CLIP model as the teacher network and train the student model via distillation. For the skeleton modality, we convert the data into 3D heatmaps, reducing the inherent sparsity of skeleton data. Additionally, we apply text features learned from CLIP to the skeleton modality, enabling interaction between the two modalities. Our approach achieved an accuracy of 68.9% in micro-gesture recognition.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Pose recognition refers to the automatic recognition and analysis of human posture and
movements through computer technology. It can involve identifying information such as human
posture, actions, and posture angles to infer the state and intention of the human body.
Micro-gesture classification is a critical research direction in the field of computer vision. In this
domain, most efforts are dedicated to recognizing descriptive gestures. "Descriptive gestures" refer
to purposeful and more prominent body movements, such as drinking water or running, through
which people can clearly express their emotions. However, in certain contexts like interviews
and competitions, individuals may deliberately hide their true feelings, making it difficult for
computers to further analyze their emotions. In contrast, "micro-gestures" are spontaneous,
unconscious subtle movements that can provide valuable insights into an individual's internal
state, revealing hidden emotional conditions. This makes micro-gesture detection significant in
psychology, behavior analysis, and communication studies.</p>
      <p>
        Gesture recognition typically relies on video or skeleton data. Video data usually contain
richer information but require more computational resources and time to process. Skeleton
data provide abstract gesture information and reduce the impact of background noise, but their
sparsity may lead to the loss of some detailed information. Many works therefore pursue
multimodal recognition. DSCNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] utilizes the complementary information between the
RGB and skeleton modalities, using the human skeleton as guidance to crop out key
activity areas of the human body in RGB frames for recognition, largely eliminating background
interference. VPN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] generates feature maps that are more discriminative for subtle actions
through spatial embedding and attention networks. S. Kim et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed a Transformer model
based on 3D deformable attention, which better learns spatiotemporal attention for
cross-modal action recognition. However, these methods all train models from scratch and
have high computational complexity.
      </p>
      <p>
        In the video modality, Transformer-based methods [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] dominate due to their ability to
capture long-range temporal dependencies, which helps them understand temporal action sequences
in videos. However, Transformer models usually require large datasets to fully exploit their
powerful parameterization capabilities. With the advent of CLIP [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in recent years,
large-scale pre-training on image-text pairs has addressed some limitations of Transformers, allowing
more effective use of large-scale data and enabling transfer learning across various tasks.
This has also led to improved performance in gesture recognition tasks [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. In this paper, we
consider two aspects:
1. Micro-gesture recognition requires more attention to detail, and background information
should not be overly emphasized;
2. Training video and skeleton multimodal models is computationally expensive.
      </p>
      <p>Therefore, in order to reduce the high computational complexity caused by using multimodal
data, we build on FROSTER CLIP [9] and propose a token attenuation strategy in the video encoding
module: each time the features pass through a Transformer layer, we delete a portion of tokens
based on attention weights, gradually filtering out unimportant tokens layer by layer. Experimental
results demonstrate the effectiveness of this method.</p>
      <p>For the skeleton modality, to align with the video modality's CLIP model, we apply the text
embeddings learned from the video modality to the skeleton network. Specifically, the PoseConv3D
model is augmented with CLIP text embeddings [10], enabling it to work collaboratively with the
CLIP text encoder. By comprehensively utilizing feature extraction from both the skeleton and
video modalities, the performance of micro-gesture recognition can be enhanced by leveraging both
skeleton sequences and video images.</p>
      <p>Our method's main contributions are as follows:
• In the video modality, we propose a video action recognition network based on CLIP.
Specifically, we enhance the focus on details by implementing a token weight attenuation
strategy.
• In the skeleton modality, we apply the CLIP text embeddings trained in the video model
to a 3D-CNN network, improving the correlation between the models.
• In the micro-gesture classification competition, our method achieved an accuracy of 68.9%
on the iMiGUE dataset. Experimental results demonstrate that this approach effectively
recognizes micro-gestures.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Methodology</title>
      <sec id="sec-3-1">
        <title>2.1. Data Preprocessing</title>
        <p>For video data, the first step is to segment the entire video into smaller clips containing
actions. Each video clip is represented as x ∈ ℝ^{T×H×W×3}, where H and W indicate the resolution
and T represents the number of frames. For skeleton data, the input is represented as
s ∈ ℝ^{C×V×T}, where C = 3 is the coordinate dimension, V = 22 is the number of keypoints, and T
is the number of frames. The skeleton data are then rendered as a heatmap volume of size
K × T × H × W, where H and W represent the height and width of the image, respectively. Each
frame's heatmap is composed of K Gaussian maps, one centered on each keypoint, yielding the
joint heatmap J, where K is the number of joints.</p>
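        <p>The heatmap construction above can be sketched as follows. This is a minimal NumPy sketch; the function names and the Gaussian width sigma are illustrative assumptions, not values from the paper.</p>

```python
import numpy as np

def keypoint_heatmaps(keypoints, img_h, img_w, sigma=0.6):
    """Render one frame's keypoints as K Gaussian heatmaps.

    keypoints: array of shape (K, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (K, img_h, img_w).
    """
    ys = np.arange(img_h, dtype=np.float32)[:, None]   # (H, 1)
    xs = np.arange(img_w, dtype=np.float32)[None, :]   # (1, W)
    maps = np.zeros((len(keypoints), img_h, img_w), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        # Gaussian centered on the k-th joint
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

def clip_heatmap_volume(clip_keypoints, img_h, img_w):
    """Stack per-frame heatmaps into a K x T x H x W volume."""
    frames = [keypoint_heatmaps(f, img_h, img_w) for f in clip_keypoints]
    return np.stack(frames, axis=1)  # (K, T, H, W)
```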
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Video-specific Fine-tuning with Distillation</title>
        <p>
          Our model structure is shown in Figure 1. To apply CLIP to the video action recognition task,
FROSTER [9] introduces distillation into its full fine-tuning method, demonstrating superior
performance. Given a video clip x and textual prompts for all categories, they are processed by
the frozen CLIP vision encoder and text encoder respectively to obtain frame-level visual features
z_v and textual features z_t, denoted as z_v ∈ ℝ^{T×D} and z_t ∈ ℝ^{C×D}. Here, D is the dimension
of the extracted features, C denotes the number of classes and T denotes the number of frames.
Similarly, an improved VCLIP student model (see Section 2.3) converts the visual and textual data
into corresponding embeddings z̃_v and z̃_t, whose shapes match those of the frozen branch. Most
full fine-tuning methods on CLIP directly calculate the similarity between z̃_v and z̃_t and
optimize the tuned model with a cross-entropy loss, defined as:

ℓ_ce = −(1/B) ∑_{i=1}^{B} y_i log softmax( sim(TAP(z̃_v), z̃_t) / τ ),   (1)

where y ∈ ℝ^C represents the ground truth, TAP(·) denotes the temporal average pooling strategy,
sim(·, ·) denotes cosine similarity and τ is a temperature parameter. FROSTER additionally
enhances the model's generalization ability by introducing a residual MLP structure and a
distillation method. Specifically, the tuned features are transformed as:

ẑ = z̃ + α × MLP(z̃),   (2)

where α is a balancing coefficient. For simplicity, z̃_v and z̃_t are uniformly written as z̃, and a
similar simplification is used in the next formula. The distillation process can then be written as:

ℓ_distill = ∑ ‖ z − ẑ ‖₂,   (3)

where z is the corresponding frozen-branch feature. The overall loss function is defined as:

ℒ = ℓ_ce + β (ℓ_distill^v + ℓ_distill^t),   (4)

where β is a balancing coefficient.
        </p>
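        <p>FROSTER's residual transform and feature distillation described above can be sketched as follows. This is a minimal NumPy sketch; the MLP shape, the function names and the alpha value are illustrative assumptions.</p>

```python
import numpy as np

def mlp(z, w1, w2):
    # two-layer perceptron used as the residual branch
    return np.maximum(z @ w1, 0.0) @ w2

def residual_transform(z_tuned, w1, w2, alpha=0.1):
    # residual transform: z_hat = z_tuned + alpha * MLP(z_tuned)
    return z_tuned + alpha * mlp(z_tuned, w1, w2)

def distill_loss(z_frozen, z_hat):
    # sum of L2 distances between the frozen-CLIP features
    # and the transformed tuned features
    return float(np.sum(np.linalg.norm(z_frozen - z_hat, axis=-1)))
```

With zero-initialized MLP weights the transform is the identity, so the distillation loss against identical frozen features vanishes.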
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Improved VCLIP</title>
        <p>
          When tuning a pretrained CLIP for a video downstream task, one question is how to capture
temporal relationships in videos. Open-VCLIP [11] solves this problem by expanding the temporal
attention view. Specifically, the standard self-attention mechanism proposed in [12] operates as
follows:

A(Q, K) = softmax( Q Kᵀ / √d ),   O = A(Q, K) V,   (5)

where A(Q, K) ∈ ℝ^{n×n} is the similarity matrix, Q ∈ ℝ^{n×d}, K ∈ ℝ^{n×d} and V ∈ ℝ^{n×d} are the
query, key and value features, ᵀ is the transpose operation, d is the dimension of Q and n is the
number of tokens. Clearly, self-attention in this form fails to exploit inter-frame interactions.
VCLIP therefore aggregates temporal information by concatenating K_{t−1}, K_t and K_{t+1} along the
token dimension, where these represent the original key features of the previous, current and
following frames respectively; V is converted to its temporal version in the same way. In this
way, VCLIP can model spatial-temporal correlation jointly without extra parameters. However, we
notice that the growing number of tokens greatly increases unnecessary compute cost, since
redundant information in consecutive frames is also aggregated. To overcome this issue, we design
a token-decay strategy based on vector similarity. Given the query feature of the [cls] token,
denoted q_cls, we keep the k tokens in K with the top-k scores in the similarity vector
A(q_cls, K) ∈ ℝ^n and drop the others. Note that the [cls] token is a fixed reserved token and is
not involved in the filtering process.
        </p>
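        <p>The token-decay selection can be sketched as follows. This is a minimal NumPy sketch; the function name and the convention that the [cls] token sits at index 0 are assumptions for illustration.</p>

```python
import numpy as np

def decay_tokens(q_cls, keys, k):
    """Token decay: keep the [cls] token plus the k tokens whose keys
    score highest against the [cls] query, dropping the rest.

    q_cls: (d,) query feature of the [cls] token
    keys:  (n, d) key features; row 0 is assumed to be the [cls] token
    """
    d = q_cls.shape[0]
    sim = keys @ q_cls / np.sqrt(d)        # similarity vector A(q_cls, K)
    order = np.argsort(sim[1:])[::-1] + 1  # rank non-[cls] tokens, best first
    keep = np.concatenate(([0], np.sort(order[:k])))  # [cls] is never dropped
    return keys[keep], keep
```

Applying this after each Transformer layer shrinks the token set layer by layer, which is where the compute saving comes from.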
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Action Modeling and Classification with Heatmaps</title>
        <p>Assume we have obtained the heatmaps of a given modality (joint/limb). Following the
practices in [10], we extract clip-level features via a 3D ResNet-50 network and map these
features into the same dimension as the text embeddings (see Section 2.3). For simplicity, we use
ẑ to represent the mapped embedding. We employ the text encoder trained in Section 2.2 to
generate well-aligned embeddings as auxiliary supervision. As in common classification tasks, ẑ
is fed into a linear layer for classification. Finally, the overall loss function used to optimize
the 3D temporal network is defined as:

ℓ_align = (1/N) ∑_{i=1}^{N} ‖ z_t − ẑ_i ‖₂,   (6)

ℒ = (1/N) ∑_{i=1}^{N} ℓ_ce,i + λ ℓ_align,   (7)

where λ is a balancing coefficient. We separately train a model for each modality and fuse their
classification results in Section 2.5.
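The classification and alignment objectives above can be sketched as follows. This is a NumPy sketch; the exact form of the alignment term, the function names and the value of the coefficient lam are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # mean negative log-likelihood of the ground-truth classes
    p = softmax(logits)
    n = logits.shape[0]
    return float(-np.mean(np.log(p[np.arange(n), labels] + 1e-12)))

def alignment_loss(text_emb, mapped_emb, labels):
    # mean L2 distance between each mapped skeleton feature and the CLIP
    # text embedding of its ground-truth class
    target = text_emb[labels]
    return float(np.mean(np.linalg.norm(target - mapped_emb, axis=-1)))

def total_loss(logits, labels, text_emb, mapped_emb, lam=0.5):
    # classification loss plus lambda-weighted text-alignment supervision
    return cross_entropy(logits, labels) + lam * alignment_loss(text_emb, mapped_emb, labels)
```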
</p>
      </sec>
      <sec id="sec-3-4-1">
        <title>2.5. Ensemble</title>
        <p>We fuse the classification results of the RGB, joint and limb models by a weighted sum of
their predicted class scores; the fusion weights are studied in Section 3.3.</p>
      </sec>
      <sec id="sec-3-4-2">
        <title>3. Experiments</title>
        <sec id="sec-3-4-3">
          <title>3.1. Datasets</title>
          <p>iMiGUE [13] dataset. The iMiGUE dataset is derived from post-match interview videos
with athletes: after an intense competition, a professional athlete is interviewed by reporters.
In these videos, a total of 18,499 micro-gesture samples were annotated, divided into 32 category
labels. On average, each video contains about 51 micro-gesture samples. The duration of these
samples varies from 0.18 seconds to 80.92 seconds, with an average duration of 2.55 seconds.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.2. Comparison to State-of-the-art Methods</title>
        <p>We validated our proposed method through comparative analysis on the iMiGUE dataset. As
shown in Table 1, we compared our approach with state-of-the-art methods, including
skeleton-modality methods (the GCN-based ST-GCN and the Hyperformer), the 3D-CNN-based PoseC3D,
and the Frozen CLIP method for the video modality. Compared to the DSCNet method, which also
employs CNNs in a multimodal approach, our method improved Top-1 accuracy by 6.37%. Moreover,
compared to the Frozen CLIP method, which uses a text encoder, our method significantly enhanced
Top-1 accuracy by 11.11%, demonstrating the effectiveness of our approach.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Comparison with state-of-the-art methods on the iMiGUE dataset.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Methods</th><th>Modality</th></tr>
            </thead>
            <tbody>
              <tr><td>ST-GCN [14]</td><td>Skeleton</td></tr>
              <tr><td>Hyperformer [15]</td><td>Skeleton</td></tr>
              <tr><td>PoseC3D [10]</td><td>Skeleton+Joint</td></tr>
              <tr><td>PoseC3D [10]</td><td>Skeleton+Limb</td></tr>
              <tr><td>TRN [16]</td><td>RGB</td></tr>
              <tr><td>Frozen CLIP [7]</td><td>RGB</td></tr>
              <tr><td>FROSTER CLIP [9]</td><td>RGB</td></tr>
              <tr><td>DSCNet [1]</td><td>RGB+Skeleton</td></tr>
              <tr><td>Ours</td><td>RGB+Skeleton</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-6">
        <title>3.3. Ablation Study</title>
        <p>We investigated the impact of different fusion weights, assigning different weights to the
three models. As shown in Table 2, the highest accuracy was achieved when the weights for the
RGB, Joint, and Limb models were set to 0.55, 0.4, and 0.05, respectively. Under this
configuration, the accuracy on the iMiGUE dataset reached 68.90%.</p>
        <p>We also explored the impact of the attenuation token strategy. As shown in Table 3,
adopting the attenuation token strategy in the Transformer module, where tokens are attenuated
based on attention weights, filters out unimportant tokens and makes the model focus on the
crucial parts, enhancing recognition accuracy. In addition, reducing the number of tokens
decreases computational complexity. On the iMiGUE test set, we improved accuracy by 0.26% while
reducing memory usage to 1.28 GB, enhancing overall performance.</p>
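        <p>The weighted late fusion evaluated above can be sketched as follows. This is a minimal NumPy sketch; the function name is an illustrative assumption, and the default weights are the best configuration from Table 2.</p>

```python
import numpy as np

def fuse_predictions(rgb_probs, joint_probs, limb_probs,
                     weights=(0.55, 0.4, 0.05)):
    """Late fusion: weighted sum of the three models' class scores.

    Each *_probs array has shape (num_samples, num_classes).
    Returns the fused class prediction per sample.
    """
    fused = (weights[0] * rgb_probs
             + weights[1] * joint_probs
             + weights[2] * limb_probs)
    return fused.argmax(axis=-1)
```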
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we presented our solution for the 2nd MiGA Challenge organized at IJCAI 2024.
Our approach is a multimodal model based on CLIP. In the RGB modality, we proposed an attenuation
token strategy built upon FROSTER CLIP as the baseline. In the skeleton modality, we integrated
the text encoder from CLIP into PoseC3D to enhance interaction between the two modalities.
Ultimately, our multimodal approach achieved an accuracy of 68.9%. In the future, we plan to
address the strengths and weaknesses of the video and skeleton modalities through targeted
complementary operations; for example, leveraging the sparsity of skeletons to crop videos could
improve the capture of detailed information in micro-gestures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , J. Cheng,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A dense-sparse complementary network for human action recognition based on rgb and skeleton modalities</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>244</volume>
          (
          <year>2024</year>
          )
          <fpage>123061</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0957417423035637. doi:10.1016/j.eswa.2023.123061.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bremond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thonnat</surname>
          </string-name>
          ,
          <article-title>Vpn: Learning video-pose embedding for activities of daily living</article-title>
          , in:
          <source>Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <article-title>Cross-modal learning with 3d deformable attention for action recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>10265</fpage>
          -
          <lpage>10275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Recurring the transformer for video action recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>14063</fpage>
          -
          <lpage>14073</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <article-title>Star-transformer: A spatio-temporal cross attention transformer for human action recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>3330</fpage>
          -
          <lpage>3339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          , in: M.
          <string-name>
            <surname>Meila</surname>
          </string-name>
          , T. Zhang (Eds.),
          <source>Proceedings of the 38th International Conference on Machine Learning</source>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , P. Gao,
          <article-title>Frozen clip models are efficient video learners</article-title>
          , in: Computer Vision - ECCV 2022, Springer Nature Switzerland, Cham,
          <year>2022</year>
          , pp.
          <fpage>388</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Expanding language-image pretrained models for general video recognition</article-title>
          , in: S. Avidan, G. Brostow, M. Cissé (Eds.),
          <source>Computer Vision – ECCV 2022</source>
          , Springer Nature Switzerland, Cham,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] X. Huang, H. Zhou, K. Yao, K. Han, FROSTER: Frozen CLIP is a strong teacher for open-vocabulary action recognition, 2024. arXiv:2402.03241.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H. Duan, Y. Zhao, K. Chen, D. Lin, B. Dai, Revisiting skeleton-based action recognition, 2022. arXiv:2104.13586.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Z. Weng, X. Yang, A. Li, Z. Wu, Y.-G. Jiang, Open-VCLIP: Transforming CLIP to an open-vocabulary video model via interpolated weight optimization, 2023. arXiv:2302.00624.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, G. Zhao, iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis, 2021. arXiv:2107.00285.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018). URL: https://ojs.aaai.org/index.php/AAAI/article/view/12328. doi:10.1609/aaai.v32i1.12328.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Y. Zhou, Z.-Q. Cheng, C. Li, Y. Fang, Y. Geng, X. Xie, M. Keuper, Hypergraph transformer for skeleton-based action recognition, 2023. arXiv:2211.09590.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] B. Zhou, A. Andonian, A. Oliva, A. Torralba, Temporal relational reasoning in videos, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>