<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiangshuo Qiao</string-name>
          <email>qiaoxs@yeah.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xianxin Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaozhe Qu</string-name>
          <email>xiaozhequ@tencent.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jie Zhang</string-name>
          <email>jeyzzhang@tencent.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Liu</string-name>
          <email>jelmeliu@tencent.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Luo</string-name>
          <email>yamiluo@tencent.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cihang Jin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jin Ma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tencent PCG</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science and Technology of China</institution>
          ,
          <addr-line>Hefei, Anhui</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Vision-Language Models pre-trained on large-scale image-text datasets have shown superior performance in downstream tasks such as image retrieval. Most of the images for pre-training are presented in the form of open-domain common-sense visual elements. Differently, video covers in short video search scenarios are presented as user-originated contents that provide important visual summaries of videos. In addition, a portion of the video covers come with manually designed cover texts that provide semantic complements. In order to fill in the gaps in short video cover data, we establish the first large-scale cover-text benchmark for Chinese short video search scenarios. Specifically, we release two large-scale datasets CBVS-5M/10M to provide short video covers, and the manual fine-labeling dataset CBVS-20K to provide real user queries, which serves as an image-text benchmark test in the Chinese short video search field. To integrate the semantics of cover texts in the case of modality missing, we propose UniCLIP, where cover texts play a guiding role during training but are not relied upon by inference. Extensive evaluation on CBVS-20K demonstrates the excellent performance of our proposal. UniCLIP has been deployed to Tencent online video search systems with hundreds of millions of visits and achieved significant gains. The complete dataset, code and checkpoints are publicly available.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>vision-language models, video search, contrastive learning</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        CLIP [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] demonstrates the promise of performing contrastive learning pre-training on large-scale
image-text data from the web with a data size of 400 million. In this line of work, visual base models represented
by ViT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are aligned with textual base models represented by BERT [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] by learning on large-scale
unsupervised data. These base models can be transferred to downstream tasks such as image search [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
via natural language prompts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In the field of Chinese multi-modal representation learning, previous work [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ] supplements
high-quality Chinese image-text datasets and successfully pre-trains Chinese visual language models.
Most of the data are open-domain images collected from the web or multiplexed from publicly available
English datasets. These images are captured by a camera and presented in the form of common-sense
visual elements, including animals, buildings, activities, etc. with corresponding descriptive text.
      </p>
      <p>
        With the rise of short videos, video search has become a popular topic [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. Previous work
[
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ] creates large-scale datasets for the Chinese short-video search domain and provide publicly
available video frames or video features to support content-based search. Online video search systems
integrate features from several information domains such as video cover, video frame, and video title.
There has been academic work on video search using titles and video frames, but there is a lack of research
on video search based on video covers. During the creation of short videos, creators craft video covers
for short videos with the aim of attracting the interest of the most relevant viewers. Therefore, in short
video search scenarios, short video covers provide direct overviews and serve as crucial visual features
of the videos. Besides, cover-based search has efficiency advantages over content-based search.</p>
      <p>However, there are remarkable morphological differences between short video cover images and open
domain images. As shown in Fig. 1, compared to open-domain visual elements, short video covers, as
user-originated content, are mostly artificial combinations of various visual elements and may undergo
post-processing such as cropping and splicing. On the other hand, many creators craft cover texts for
video covers to complement or emphasize the semantic information of the video. This is a feature that
open domain images do not share. Therefore, short video cover images represent a different form of
data from open domain images, and the availability of a large-scale cover dataset is crucial. However,
available large-scale cover datasets are lacking.</p>
      <p>In this work, we release a large-scale Chinese image-text Benchmark for short Video Search scenarios
(CBVS) to fill the gap of data in real Chinese video search scenarios. CBVS is designed in three versions:
the manually fine-labeled CBVS-20K and the large-scale unsupervised CBVS-5M/10M. Fig. 2 shows their
data examples. Specifically, CBVS-20K contains 20K high-quality &lt;user query-video cover&gt; pairs, which
serves as an image-text benchmark test in the field of Chinese short video search. Well-trained human
experts annotate the relevance of each user query to the video cover and at least two cross-validations
are performed. In addition, Optical Character Recognition (OCR) texts of cover images are provided
after machine extraction and human correction. Due to the constraints of user privacy and platform
rules, the large-scale CBVS-5M/10M contains about 5M/10M &lt;video title-video cover&gt; pairs, where the
text is provided in the form of video titles and OCR texts. These data are available for visual language
models to learn modal alignment in pre-training or fine-tuning tasks. The CBVS dataset includes 32
categories such as Film and animation, Character, Education, Game, Commodity, etc. to avoid data
distribution bias. Tab. 1 shows a detailed comparison of the various versions of CBVS.</p>
      <p>
        In short video search scenarios, cover texts complement the semantics of cover images. On one hand,
CLIP lacks the ability to fuse multi-semantic signals on the visual side. On the other hand, not all cover
images come with cover texts, so the modality missing problem needs to be considered. In order to
effectively integrate the semantics of cover images with cover texts, we propose UniCLIP, inspired by
OCR-free work [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. Cover text signals are unified to guide image-text contrastive learning
in a presence-guided and semantic-guided manner.
      </p>
      <p>[Fig. 2: example &lt;user query-video cover&gt; pairs from CBVS-20K with OCR texts and relevance levels 0-2, and an example &lt;video title-video cover&gt; pair from CBVS-5M/10M ("The Hall of Preserving Harmony in the Forbidden City") with its OCR text.]</p>
      <p>The training of UniCLIP integrates the semantics of cover text and can be generalised to the retrieval
of other image-text mashup data (not limited to video covers), as shown in Fig. 3. Image-text mashup
data refers to images with OCR text, such as e-commerce posters, social media images, and video covers.
It is worth emphasizing that the inference process does not depend on any module related to OCR
and the model is immune to the problem of missing cover text modalities. Extensive experimental
evaluations demonstrate the effectiveness of our proposal.</p>
      <p>Our contributions can be summarized as follows:
• In order to fill the lack of cover data for short video search scenarios, we release the largest
Chinese cover image-text dataset with video title texts and cover texts.
• We build a manual fine-labeling image-text benchmark test for Chinese short video search
scenarios, containing real user queries from browser logs.
• We propose UniCLIP, which introduces an image classification task and an image-text matching
task to guide image-text contrastive learning training. UniCLIP imposes no additional inference
cost and training is immune to the modality missing problem.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <sec id="sec-3-1">
        <title>2.1. Chinese Video/Image-text benchmark</title>
        <p>
          Compared to English multi-modal pre-training, the Chinese community is lagging behind. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] introduces
translated versions of the English multi-modal datasets [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ] to support Chinese multi-modal
pretraining. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] releases a large-scale Chinese dataset Wukong containing 100 million image-text pairs
collected from the web to bridge the language gap. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] further establishes large-scale Chinese
cross-modal benchmarks by releasing two pre-training datasets and five fine-tuning datasets. Besides,
Product1M [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] provides additions to the e-commerce domain. [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
          ] supplement the Chinese
video-text data, and visual modalities are provided in the form of video frames. However, large-scale
video cover data is scarce.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Image-text Matching</title>
        <p>
          The image-text matching task aims to measure the semantic similarity of different modalities in the
same embedding space [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Existing implementations fall into two categories: the first is
embedding-based, i.e., encoding global representations of vision and text separately, and then performing similarity
computation [
          <xref ref-type="bibr" rid="ref21 ref22 ref23">21, 22, 23</xref>
          ]. The second is score-based, i.e., performing cross-modal interactions locally
and calculating cumulative scores [
          <xref ref-type="bibr" rid="ref24 ref25 ref26">24, 25, 26</xref>
          ]. Due to the advantages of performance and efficiency,
embedding-based methods have attracted the attention of researchers [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. In particular, CLIP [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
provides new ideas for multi-modal representation learning. A series of studies following CLIP have
been applied to downstream tasks including image-text matching. For example, GLIP [28],
CLIP-Adapter [29] and SLIP [30] raise the upper performance limit of CLIP, while AltCLIP [31], CN-CLIP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and
Taiyi-CLIP [32] expand the CLIP language domain.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. CBVS Dataset</title>
      <sec id="sec-4-1">
        <title>3.1. Comparison</title>
        <p>
          A comparison with other Chinese image-text/video-text datasets is shown in Tab. 2. On the one hand,
CBVS provides cover images, user queries, and is larger in size compared to publicly available video
datasets such as VATEX [33], BFVD/FFVD [34], and CNVid-3.5M [35]. CREATE [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], Kwai-SVC [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
and ALIVOL-10M [36], which are of similar scale, are access-restricted. On the other hand, compared to image-text
datasets such as Wukong [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], Product1M [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], M6-Corpus [37], ZERO-Corpus/R2D2 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], etc., the biggest
advantage of CBVS lies in the uniqueness of the video cover image and the cover text specific to the
cover image. To the best of our knowledge, CBVS is the largest publicly available Chinese image-text
dataset providing cover images.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Data Collection</title>
        <p>In order to provide real video search data, we capture the user query logs of mainstream mobile browsers
and divide user queries into two parts. We retrieve more than 8M videos from the Chinese video website
BiliBili1 through the first part of user queries. To avoid data distribution bias due to a single platform, we
retrieve more than 5M videos from Tencent Video2 as a supplement in the same way, and finally obtain
more than 13M &lt;cover-title&gt; pairs as a data source for CBVS-5M/10M. In this process, there are videos
with inconsistent cover and content in the original data. Video cover-video frame consistency is
filtered by the video long-click signal and the video completion-rate signal. Slight data noise is allowed.
Besides, we manually select more than 2K high-quality user queries from the second part, and collect
20K high-quality &lt;user query-cover image&gt; pairs in the same way, as the data source of CBVS-20K.
The number of cover images under each user query is controlled in [5, 30).
1https://www.bilibili.com
2https://v.qq.com</p>
        <p>After that, we design the data cleaning procedure from two aspects: data quality and image-text
relevance. First, we filter out video covers with low resolution or disproportionate scale, and eliminate
dead links. Then, we score the relevance of video covers and titles in the 13M data with the open-source
Chinese image-text model QA-CLIP3, and filter the trailing 3M data to obtain CBVS-10M. We randomly
sample 1K data from 13M and 10M data, respectively, and human experts evaluate whether the video
cover is relevant to the title. The evaluation conclusions show that the relevance is improved from
75.6% to 93.0% after data cleaning. Finally, CBVS-5M is obtained by sampling in CBVS-10M.</p>
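        <p>For illustration, the relevance-based filtering step can be sketched as follows. This is a minimal sketch: producing the QA-CLIP image/text embeddings is abstracted away, and the helper name filter_by_relevance is ours, not part of any released pipeline.</p>
        <preformat>
# A sketch of the relevance-based filtering of &lt;cover, title&gt; pairs; the QA-CLIP
# embedding step is assumed to have been run already, and this helper name is ours.
import numpy as np

def filter_by_relevance(image_embs: np.ndarray, text_embs: np.ndarray, keep: int):
    """Keep the `keep` pairs whose cover-title cosine similarity is highest.

    image_embs, text_embs: (N, d) L2-normalized embeddings of the i-th cover and title.
    """
    scores = np.sum(image_embs * text_embs, axis=1)   # cosine similarity per pair
    order = np.argsort(-scores)                       # sort pairs by descending relevance
    return order[:keep], scores

# e.g. keep roughly 10M of the 13M pairs, dropping the trailing lowest-scoring ~3M:
# kept_idx, scores = filter_by_relevance(img_embs, txt_embs, keep=10_000_000)
        </preformat>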
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Data Annotation</title>
        <p>The data annotation of CBVS-20K is performed by trained experts in the field of video search in a
two-stage, cross-validated approach. First, they annotate whether the user query reveals a clear intent
for video, and filter out query terms related to pornography and violence, and finally take the queries
with a need for video as candidates for annotation in the second stage. After that, the annotators mark
the degree of relevance for each &lt;user query-cover image&gt; pair, which is categorized into three grades:
strongly relevant, weakly relevant, and irrelevant. The semantics of cover images and cover texts are
required to be considered together. Meanwhile, the annotators correct the OCR text extracted by the
machine.</p>
        <p>The following criteria are used to determine whether the user query reveals a clear intent for video:
• With video intent: queries that explicitly need to be satisfied by video resources, or better (with
3https://github.com/TencentARC-QQ/QA-CLIP</p>
        <p>We exclude the controversial data, and end up with 20,001 image-text pairs consisting of 2,486 unique
queries and 19,648 unique images. The percentage of strongly relevant, weakly relevant and irrelevant
data is 29.74%, 30.80% and 39.46% respectively. The average length of user queries is 7.0 and the average
length of OCR texts is 14.5. 33.41% of cover images come with OCR texts. Fig. 4 shows the distribution
of categories of these user queries, where the categories are labelled manually.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Methodology</title>
      <sec id="sec-5-1">
        <title>4.1. Image-Text Contrastive Learning</title>
        <p>
          We follow CLIP [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to co-train the image encoder with the text encoder, taking InfoNCE Loss [38] as
the Image-Text Contrastive (ITC) loss $\mathcal{L}_{ITC}$, as shown in Fig. 5. Specifically, we maximize the similarity
scores of the matched image and text embeddings in terms of batch. We adopt ViT [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and RoBERTa
[39] as the visual and textual skeletons, respectively, and introduce the weight initialization of QA-CLIP.
        </p>
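        <p>For concreteness, the symmetric InfoNCE objective used as $\mathcal{L}_{ITC}$ can be sketched as follows. This is a PyTorch-style sketch rather than the released implementation; logit_scale denotes the learnable temperature as in CLIP.</p>
        <preformat>
# A sketch of the symmetric InfoNCE (ITC) loss; not the authors' released code.
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, logit_scale):
    """Symmetric InfoNCE over a batch of matched image/text embeddings.

    image_emb, text_emb: (B, d) tensors; matched pairs share the same row index.
    logit_scale: learnable temperature, as in CLIP.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()   # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)        # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), labels)    # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)
        </preformat>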
        <p>In particular, to bridge the gap between the data morphology of the title and the user query, we
employ a Chinese word-splitting component, Lexical Analysis of Chinese (LAC)4. In order to simulate
the morphological distribution of the user query, the result of the lexical segmentation is composed
into a string using spaces as the separator. For the case of failed word splitting, the original title
is employed. This setting takes effect for all fine-tuning tasks unless otherwise specified.</p>
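        <p>A minimal sketch of this title pre-processing, assuming the open-source LAC package in segmentation mode; the helper name title_to_query_like is ours and the fallback behaviour follows the description above.</p>
        <preformat>
# A sketch of the title pre-processing with Baidu's LAC in segmentation mode;
# `title_to_query_like` is our name for this step, not the released code.
from LAC import LAC

lac = LAC(mode="seg")  # segmentation-only mode

def title_to_query_like(title: str) -> str:
    """Split a video title into words and re-join them with spaces,
    so that its form is closer to that of a user query."""
    try:
        words = lac.run(title)            # e.g. "故宫保和殿" -> ["故宫", "保和殿"]
        joined = " ".join(w for w in words if w.strip())
        return joined if joined else title
    except Exception:
        return title                      # fall back to the original title on failure
        </preformat>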
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Presence-guided Encoder</title>
        <p>Video cover images differ from open domain images in that they partially carry cover texts. One option
is to outsource the cover text understanding task to an external OCR engine and fuse the cover image
with the cover text on the image side in an ALBEF [40] manner. However, the cost of image-text
similarity inference becomes expensive. In addition, the image-text contrastive learning becomes highly
dependent on the accuracy of the OCR engine, which creates an obstacle for model generalization.</p>
        <p>
          Differently, we are inspired by [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ] to design UniCLIP in an OCR-free form. One idea is to guide
the ViT to perceive the cover texts during the training process through agent tasks, so that it relies on
no module related to the OCR function in the inference process. Since the presence of cover text is
uncertain, we propose the presence-guided encoder, where the first agent task of UniCLIP is set as an
Image Classification (IC) task: "To determine whether an image carries cover texts".
        </p>
        <p>[Fig. 5: UniCLIP architecture. The cover image and the text pass through the CLIP image and text encoders for the ITC task; the output tokens of the ViT additionally feed the OCR presence-guided encoder (attention blocks with an MLP head, for the IC task) and the OCR semantic-guided encoder (which also receives a positive/negative OCR-text sample, for the ITM task).]</p>
        <p>
          Specifically, as shown in Fig. 5, the presence-guided encoder takes the output tokens from the last
layer of ViT as input. These tokens go through a 3-layer, 8-head Transformer [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] structure, after
which they are fed into an MLP layer for predicting the presence or absence of cover texts. The loss of
this part is $\mathcal{L}_{IC}$.
        </p>
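        <p>A PyTorch-style sketch of the presence-guided encoder under the dimensions stated above (3 layers, 8 heads, ViT token width 768); the module and layer choices below are our reading of the text, not the released code, and $\mathcal{L}_{IC}$ is the cross-entropy between the predicted logits and the has-cover-text label.</p>
        <preformat>
# Our reading of the presence-guided encoder: 3 Transformer layers, 8 heads,
# over the last-layer ViT tokens, followed by an MLP head for the IC task.
import torch.nn as nn

class PresenceGuidedEncoder(nn.Module):
    """Predicts whether a cover image carries cover text (binary IC task)."""

    def __init__(self, dim=768, heads=8, layers=3):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2))

    def forward(self, vit_tokens):
        # vit_tokens: (B, N, dim) output tokens from the last layer of the ViT
        h = self.encoder(vit_tokens)
        return self.head(h[:, 0])         # classify from the [CLS] position
        </preformat>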
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Semantic-guided Encoder</title>
        <p>The presence-guided encoder directs the cover image encoder to focus on the cover texts, but does
not involve semantic information. We further propose the semantic-guided encoder, which sets the
second agent task as an Image-Text Matching (ITM) task: ”To determine whether the specified text
is consistent with the text on the cover image”, encouraging the ViT to incorporate gains from the
semantics of the cover texts. This design is motivated by the notion that visual tokens from the ViT
contain the semantic information of cover texts, if they are successfully employed to discriminate the
consistency of the cover text with a given text.</p>
        <p>We design negative samples by nearest neighbor lookup. Take the training on CBVS-5M dataset as
an example. First, we adopt RoBERTa-wwm-Base as the encoder and load the checkpoints released by
QA-CLIP to encode all the 2.0M valid OCR texts. The Hierarchical Navigable Small World Algorithm
(HNSW) [41] is applied to retrieve the Top-$K$ OCR texts that are most semantically similar but not
identical to the anchor, and one of them is randomly selected as the negative sample. For covers without
OCR texts, only negative samples are considered. For covers with OCR texts, the percentage of positive
samples is set to 70%.</p>
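        <p>A hedged sketch of this negative mining step using the hnswlib implementation of HNSW; the OCR-text embedding step with RoBERTa-wwm-Base initialized from QA-CLIP is abstracted away, and the function name is ours.</p>
        <preformat>
# A sketch of HNSW-based hard-negative mining with hnswlib; OCR-text embeddings
# are assumed precomputed and L2-comparable under cosine distance.
import numpy as np
import hnswlib

def mine_hard_negatives(ocr_embs: np.ndarray, ocr_texts: list, k: int = 10):
    """For each OCR text, sample one semantically similar but non-identical
    OCR text from its Top-k neighbours (K = 10 in Sec. 5.2)."""
    n, dim = ocr_embs.shape
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=n, ef_construction=200, M=16)
    index.add_items(ocr_embs, np.arange(n))
    index.set_ef(max(64, k + 1))

    labels, _ = index.knn_query(ocr_embs, k=k + 1)    # +1: the query itself comes back
    negatives = []
    for i, row in enumerate(labels):
        candidates = [j for j in row if ocr_texts[int(j)] != ocr_texts[i]][:k]
        if not candidates:                            # degenerate case: all neighbours identical
            candidates = [j for j in row if int(j) != i]
        negatives.append(ocr_texts[int(np.random.choice(candidates))])
    return negatives
        </preformat>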
        <p>The semantic-guided encoder accepts two inputs, i.e., tokens from the last layer of the ViT, and the
embeddings of positive or negative samples. As shown in Fig. 5, the module is a 3-layer structure, where
each layer consists of a self-attention, a cross-attention with an MLP, and includes residual connections.
The embeddings of the samples are updated layer by layer. The loss of this process is $\mathcal{L}_{ITM}$.</p>
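        <p>One layer of the semantic-guided encoder might look as follows. This is our interpretation of the description (self-attention on the sample embedding, cross-attention into the ViT tokens, then an MLP, each with a residual connection); the head count is an assumption.</p>
        <preformat>
# Our interpretation of one of the 3 semantic-guided encoder layers.
import torch.nn as nn

class SemanticGuidedLayer(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, sample_emb, vit_tokens):
        # sample_emb: (B, 1, dim) embedding of the positive/negative OCR text sample
        # vit_tokens: (B, N, dim) tokens from the last layer of the ViT
        x = self.norm1(sample_emb + self.self_attn(sample_emb, sample_emb, sample_emb)[0])
        x = self.norm2(x + self.cross_attn(x, vit_tokens, vit_tokens)[0])
        x = self.norm3(x + self.mlp(x))
        return x   # passed to the next layer; the final state feeds the ITM head
        </preformat>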
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Training and Inference</title>
        <p>Pre-training for UniCLIP starts with the checkpoints released by QA-CLIP. Fine-tuning is then performed
on the CBVS-5M/10M dataset with a total loss of:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{ITC} + \lambda_2 \mathcal{L}_{IC} + \lambda_3 \mathcal{L}_{ITM},$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyperparameters.</p>
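        <p>A one-line sketch of this weighted combination (the 0.8/0.1/0.1 defaults follow Sec. 5.2):</p>
        <preformat>
# Weighted combination of the three objectives used during fine-tuning.
def uniclip_loss(l_itc, l_ic, l_itm, lambdas=(0.8, 0.1, 0.1)):
    l1, l2, l3 = lambdas
    return l1 * l_itc + l2 * l_ic + l3 * l_itm
        </preformat>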
        <p>It is worth noting that the fine-tuning of UniCLIP relies on positive samples, negative samples
and ground truths related to OCR texts, but the inference process does not rely on any OCR-related
components. As shown in Fig. 5, the presence-guided encoder and semantic-guided encoder guide the
training of the image-text alignment task in UniCLIP, but do not participate in the inference process.
Therefore, UniCLIP infers in a manner consistent with CLIP.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experiments</title>
      <sec id="sec-6-1">
        <title>5.1. Evaluation Metrics</title>
        <sec id="sec-6-1-1">
          <title>5.1.1. Recall Metrics</title>
          <p>Recall is a widely adopted retrieval metric. We report on the Recall (R) and Mean Recall (MR) of the models.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>5.1.2. Rank Metrics</title>
          <p>Positive-to-Negative Ratio (PNR) measures the consistency of predicted results with ground truth.
Formally, PNR is defined as:
$$\mathrm{PNR} = \frac{\sum_{i,j \in D_q} \mathbb{1}\{y_i &gt; y_j\} \cdot \mathbb{1}\{\hat{y}_i &gt; \hat{y}_j\}}{\sum_{i,j \in D_q} \mathbb{1}\{y_i &gt; y_j\} \cdot \mathbb{1}\{\hat{y}_i &lt; \hat{y}_j\}},$$
where $\mathbb{1}$ is the indicator function. The result of the indicator function is 1 if the internal expression is
true, and 0 otherwise. $D_q$ represents the set of all documents under query $q$. $y_i$, $\hat{y}_i$ are the true and
predicted labels of &lt;image $i$, text $q$&gt;, respectively. In particular, we compute PNR only for documents
under the same user query.</p>
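          <p>For clarity, a direct (unoptimized) implementation of PNR over the documents of a single query, consistent with the definition above:</p>
          <preformat>
# Positive-to-Negative Ratio: concordant / discordant ordered pairs for one query.
def pnr(y_true, y_pred):
    concordant = discordant = 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:          # pair ordered by ground truth
                if y_pred[i] > y_pred[j]:      # prediction agrees
                    concordant += 1
                elif y_pred[j] > y_pred[i]:    # prediction disagrees
                    discordant += 1
    return concordant / discordant if discordant else float("inf")

# e.g. pnr([2, 1, 0], [0.9, 0.4, 0.1]) is inf: the predicted order is perfect.
          </preformat>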
          <p>Normalized Discounted Cumulative Gain (NDCG) is a widely adopted metric in the field of search
ranking that encourages higher rankings for better matching documents. Formally, DCG is defined as:
$$\mathrm{DCG} = \sum_{i=1}^{N} \frac{2^{y_i} - 1}{\log_2(1 + i)},$$
where $y_i$ is the ground truth label of the document at position $i$. Further, IDCG is the DCG value of
the ideal sort. NDCG is obtained by dividing DCG by IDCG [42]:
$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}.$$</p>
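          <p>A matching sketch of DCG and NDCG as defined above, with graded relevance labels (0, 1, 2 in CBVS-20K):</p>
          <preformat>
# DCG / NDCG as defined above.
import math

def dcg(labels):
    """DCG of documents already in ranked order (position 1 first)."""
    return sum((2 ** y - 1) / math.log2(i + 2) for i, y in enumerate(labels))

def ndcg(y_true, y_pred, k=None):
    """DCG of the predicted ranking divided by DCG of the ideal ranking."""
    order = sorted(range(len(y_true)), key=lambda i: -y_pred[i])
    ranked = [y_true[i] for i in order][:k]
    ideal = sorted(y_true, reverse=True)[:k]
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0
          </preformat>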
          <p>In addition to PNR and NDCG, we also report on the Mean Average Precision (MAP) metric to fully
evaluate our model.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Implementation Details</title>
        <p>For the comparison model, we adopt the hyper-parameters suggested in their open source project.
For UniCLIP, we follow the model architecture setup in OpenAI CLIP with ViT-B/16 as the visual
backbone network. We have tried a number of hyper-parameter combinations and report the best
one. The text encoder is the 12-layer architecture of RoBERTa-wwm-Base. Both are implemented by
12-layer 12-head Transformers with 768 encoding dimensions and eventually mapped linearly to 512
dimensions. The vocabulary size of text tokenizer is consistent with CN-CLIP. The initialized weights
of the visual encoder and text encoder are from QA-CLIP as described in Sec. 4.4. In addition, the
weights of the presence-guided encoder and semantic-guided encoder are randomly initialized with a normal
distribution, and $K$ for the semantic-guided encoder is set to 10. The weights of $\mathcal{L}_{ITC}$, $\mathcal{L}_{IC}$, and $\mathcal{L}_{ITM}$ are
set to 0.8, 0.1, and 0.1, respectively. The data enhancement module LAC is configured by default. Other
hyper-parameter settings that are not detailed are consistent with those of the open-sourced Chinese
CLIP.</p>
        <p>In the fine-tuning stage, we perform random-size cropping and AutoAugment [43] on the input
image. All parameters of the image encoder and text encoder are allowed to be updated. Training is
executed on 8 NVIDIA A100 GPUs for 20 epochs, with a learning rate of 2e-5. The maximum length
of the text encoder is 12 and the weight decay is 1e-3. We leverage all-gather communications across
GPU workers to compute $\mathcal{L}_{ITC}$ on the global batch. The batch size is set to 1,520 due to GPU memory
limitations. Mixed-precision training is activated. We save checkpoints for each epoch and report
results with the highest Positive-to-Negative Ratio (PNR) on CBVS-20K. Due to computational resource
constraints, we report the results of training on the CBVS-5M dataset. We simultaneously release the
CBVS-10M dataset for subsequent studies.</p>
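        <p>For convenience, the fine-tuning settings reported above can be collected into a single configuration sketch; the key names are ours, while the values are taken from this section.</p>
        <preformat>
# Summary of the fine-tuning settings in Sec. 5.2 (key names are ours).
finetune_config = {
    "visual_backbone": "ViT-B/16",
    "text_backbone": "RoBERTa-wwm-Base (12 layers)",
    "projection_dim": 512,
    "init_checkpoint": "QA-CLIP",
    "loss_weights": {"ITC": 0.8, "IC": 0.1, "ITM": 0.1},
    "semantic_guided_top_k": 10,
    "epochs": 20,
    "learning_rate": 2e-5,
    "weight_decay": 1e-3,
    "max_text_length": 12,
    "global_batch_size": 1520,       # 8 x A100, mixed precision, all-gather for ITC
    "augmentation": ["random-size crop", "AutoAugment", "LAC title segmentation"],
}
        </preformat>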
        <p>It needs to be noted that UniCLIP is specifically designed for the short video search domain. Since
CBVS is the first and only large-scale dataset in this field that provides video covers and real queries
(which is one of the contributions of this work), there is a lack of other data benchmarks to compare it with.
In addition, although UniCLIP is not limited to Chinese, there is a lack of large-scale video cover-text
training datasets for other languages. Therefore, in this work, we only evaluate the performance of each
model on the CBVS dataset. We look forward to subsequent work establishing more video cover-text
benchmarks for other languages.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Comparisons</title>
        <p>
          To demonstrate the performance of our proposal and the value of the CBVS dataset, we extensively
evaluate advanced Chinese image-text models on CBVS-20K. The competing models include CN-CLIP
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], R2D2 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Wukong [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], TaiyiCLIP [32], Ernie-ViL2.0 [44] and AltCLIP [31]. The results are shown
in Tab. 3.
        </p>
        <p>The results show that WuKong, for example, although it achieves competitive performance on other
public datasets, lacks the ability of image-text matching on CBVS-20K. The same conclusion appears
for most of the other competitors, e.g., the MR of Taiyi-CLIP is only 0.407, which is much lower
than UniCLIP’s 0.692. The generalization performance of models trained on large-scale open domain
images is generally low on CBVS-20K, which demonstrates the domain uniqueness of video cover
images. Vision-Language Models that perform well in the open domain do not migrate to the video
cover domain as expected.</p>
        <p>It is worth noting that although R2D2-250M’s recall metrics are significantly behind UniCLIP,
its ranking metrics are close to those of UniCLIP. In particular, its NDCG@1 slightly outperforms UniCLIP
with a maximum of 0.789. We infer that this experimental result is due to the fact that R2D2-250M
employs a more powerful visual architecture and enjoys a training corpus size of 250M. We encourage
the incorporation of CBVS-10M into the training corpus on the one hand, and the adoption of ViT-L
as a visual skeleton on the other hand, to facilitate further improvement of UniCLIP performance in
subsequent studies.</p>
        <p>Compared to the pre-trained QA-CLIP, fine-tuning on the CBVS-5M dataset comprehensively
improves the metrics, especially the PNR by 3.67% and R@1 by 18.25%. Performing fine-tuning on the
publicly released CN-CLIP, consistent findings are observed with a 7.21% improvement in PNR and
a 22.66% improvement in R@1. Performing fine-tuning on R2D2-250M, significantly higher
recall metrics are observed, as well as largely comparable rank metrics. These results demonstrate that
fine-tuning on a large-scale cover dataset can improve the performance of the model in the video search
domain. In addition, UniCLIP achieves state-of-the-art performance with the highest metrics compared
to its competitors and does not introduce additional inference cost compared to the simple and efficient
CN-CLIP.</p>
        <p>[Tab. 3: recall and rank metrics (R@1/5, MR, PNR, NDCG@1/5/10, MAP) of zero-shot models (CN-CLIP, WuKong, Taiyi-CLIP, Ernie-ViL2.0, R2D2, AltCLIP, QA-CLIP) and fine-tuned models (CN-CLIP, R2D2-250M, QA-CLIP, ALBEF-CLIP, UniCLIP) on CBVS-20K.]</p>
      </sec>
      <sec id="sec-6-4">
        <title>5.4. Ablation Study</title>
        <p>We implement versions of the model with and without $\mathcal{L}_{IC}$ and $\mathcal{L}_{ITM}$, respectively. The weights of $\mathcal{L}_{ITC}$
and $\mathcal{L}_{IC}$ (or $\mathcal{L}_{ITM}$) are set to 0.8 and 0.2. Tab. 4 shows the results of the ablation study.</p>
        <p>If both the presence-guided encoder and the semantic-guided encoder are removed and the cover
text is discarded, the model degenerates into a fine-tuned version of QA-CLIP. The PNR is reduced from
3.069 to 2.907 and the MR from 0.692 to 0.656. In addition, removing either of the two encoders results in
varying degrees of degradation in model performance. Removing the presence-guided encoder reduces
the PNR of UniCLIP by 2.05%. Besides, removing the semantic-guided encoder reduces the PNR by
2.54%. Interestingly, when only the semantic-guided encoder is employed, the NDCG@1/5/10, MAP,
and R@1/5 of the model are basically the same as the final scheme, which indicates that the gain of
cover text is mainly in the semantic information. If the two encoders are employed at the same time, i.e.,
the proposed two agent tasks are considered at the same time, the highest metrics are achieved in all
aspects. This result is ample proof of the validity of our proposal. We encourage follow-up studies to
further generalise our ideas.</p>
        <p>[Fig. 6: structure of ALBEF-CLIP, the explicit OCR text fusion baseline: the cover image and the OCR text (extracted by an OCR extractor) are encoded separately and fused through a 6-layer attention-based structure before contrastive learning with the query/title text.]</p>
      </sec>
      <sec id="sec-6-5">
        <title>5.5. Cover Text Capability Assessment</title>
        <p>Since the training process relies on the cover text modality, for a fair comparison, we implement an
explicit OCR text fusion scheme, denoted as ALBEF-CLIP, as shown in Fig. 6. Compared to CLIP,
we replace ViT with an ALBEF structure, where the cover image and the cover text go through their
respective encoders before passing through a 6-layer Attention-based fusion structure. For the case of
missing OCR texts, the prompt word is employed. The results of ALBEF-CLIP are shown in Tab. 3. If
the cover text is introduced, but with ALBEF-CLIP rather than our proposal to exploit this modality,
the metrics are much lower than UniCLIP in all aspects. We hypothesize that the reason for this is
that UniCLIP guides the semantic training of ViT and handles the modality missing problem more
consistently, reducing information confusion.</p>
        <p>To further evaluate the cover text capability, we categorize the data in CBVS-20K into two main
categories according to the presence or absence of cover texts.
Tab. 5 demonstrates the PNR metrics for different combinations. Compared to the scheme without
cover texts (QA-CLIP), ALBEF-CLIP significantly improves the matching ability for covers with cover
texts, increasing the PNR from 3.203 to 3.375. However, for covers without cover texts, the scheme
degrades the performance, which may be due to semantic confusions brought about by prompt words.</p>
        <p>In comparison, UniCLIP is basically comparable to ALBEF-CLIP for matching between covers with
cover texts. This is in line with expectations, as we discard the cover text modality in our inference;
however, the very close results are a good indication of the promise of UniCLIP. Besides, UniCLIP
performs much better for both other cases than QA-CLIP and ALBEF-CLIP. For the matching between
covers without cover texts, which is most likely to happen, UniCLIP’s PNR exceeds that of the fusion
scheme by 8.00% and has a lower inference cost. For the hybrid cases, UniCLIP achieves the PNR of
3.194. This suggests that UniCLIP is able to overcome the modality missing problem to some extent and
handle cover images with or without cover texts uniformly. Thanks to this, UniCLIP shows the best
performance on the full range of data.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>As the title of the paper indicates, one of the most significant contributions of this work is to establish the
first large-scale cover-text benchmark for Chinese short video search scenarios, which provides short
video covers and real user queries. We release the largest publicly available Chinese video cover-video
title dataset to fill in the lack of cover data for short video search scenarios. We further build a manual
fine-labeling video cover-user query benchmark test for the short video search domain.</p>
      <p>Based on this, we further propose UniCLIP, which integrates the semantic information of cover-texts
without increasing the inference cost, is uniform with and without cover text, and has the advantage of
online deployment. UniCLIP is proposed to unify cover texts to guide contrastive learning, where the
image classification task and the image-text matching task are performed in an OCR-free manner. We
are the first to integrate the semantics of cover text into CLIP. UniCLIP has demonstrated significant
performance gains and has been deployed in our online system. With more than 700 million daily active
users worldwide for short video products represented by TikTok, our benchmark and model have great
potential for application. We believe in the ability of CBVS-5M/10M to expand the domain of
large-scale Chinese image-text training. Besides, we are pleasantly surprised to observe the model-agnostic
potential of UniCLIP.</p>
      <p>On the one hand, UniCLIP is language-agnostic and can be generalised to other languages, but
relies on the establishment of large-scale open-source cover-text datasets in other languages. We
expect subsequent work to build data benchmarks for more languages and apply UniCLIP. On the
other hand, significant room for exploration remains in balancing the video search domain
with the generalized domain. We look forward to the extension of CBVS to downstream tasks such as
title generation, as well as inspiration from UniCLIP for performing multi-modal fusion in the CLIP
framework.</p>
      <p>[28] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang,
et al., Grounded language-image pre-training, in: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2022, pp. 10965–10975.
[29] R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, H. Li, Tip-adapter: Training-free
clip-adapter for better vision-language modeling, arXiv preprint arXiv:2111.03930 (2021).
[30] N. Mu, A. Kirillov, D. Wagner, S. Xie, Slip: Self-supervision meets language-image pre-training, in:</p>
      <p>European Conference on Computer Vision, Springer, 2022, pp. 529–544.
[31] Z. Chen, G. Liu, B.-W. Zhang, F. Ye, Q. Yang, L. Wu, Altclip: Altering the language encoder in clip
for extended language capabilities, arXiv preprint arXiv:2211.06679 (2022).
[32] J. Zhang, R. Gan, J. Wang, Y. Zhang, L. Zhang, P. Yang, X. Gao, Z. Wu, X. Dong, J. He, et al.,
Fengshenbang 1.0: Being the foundation of chinese cognitive intelligence, arXiv preprint arXiv:2209.02970
(2022).
[33] X. Wang, J. Wu, J. Chen, L. Li, Y.-F. Wang, W. Y. Wang, Vatex: A large-scale, high-quality
multilingual dataset for video-and-language research, in: Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2019, pp. 4581–4591.
[34] S. Zhang, Z. Tan, J. Yu, Z. Zhao, K. Kuang, J. Liu, J. Zhou, H. Yang, F. Wu, Poet: Product-oriented
video captioner for e-commerce, in: Proceedings of the 28th ACM International Conference on
Multimedia, 2020, pp. 1292–1301.
[35] T. Gan, Q. Wang, X. Dong, X. Ren, L. Nie, Q. Guo, Cnvid-3.5 m: Build, filter, and pre-train the
large-scale public chinese video-text dataset, in: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2023, pp. 14815–14824.
[36] C. Lei, S. Luo, Y. Liu, W. He, J. Wang, G. Wang, H. Tang, C. Miao, H. Li, Understanding chinese
video and language via contrastive multimodal pre-training, in: Proceedings of the 29th ACM
International Conference on Multimedia, 2021, pp. 2567–2576.
[37] J. Lin, R. Men, A. Yang, C. Zhou, M. Ding, Y. Zhang, P. Wang, A. Wang, L. Jiang, X. Jia, et al., M6:</p>
      <p>A chinese multimodal pretrainer, arXiv preprint arXiv:2103.00823 (2021).
[38] A. v. d. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv
preprint arXiv:1807.03748 (2018).
[39] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,</p>
      <p>Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[40] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, S. C. H. Hoi, Align before fuse: Vision and
language representation learning with momentum distillation, Advances in neural information
processing systems 34 (2021) 9694–9705.
[41] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using
hierarchical navigable small world graphs, IEEE transactions on pattern analysis and machine
intelligence 42 (2018) 824–836.
[42] C. D. Manning, An introduction to information retrieval, Cambridge university press, 2009.
[43] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q. V. Le, Autoaugment: Learning augmentation
strategies from data, in: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, 2019, pp. 113–123.
[44] B. Shan, W. Yin, Y. Sun, H. Tian, H. Wu, H. Wang, Ernie-vil 2.0: Multi-view contrastive learning
for image-text pre-training, arXiv preprint arXiv:2209.15270 (2022).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .
          <volume>11929</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hendriksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bleeker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vakulenko</surname>
          </string-name>
          , N. van
          <string-name>
            <surname>Noord</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Kuiper</surname>
          </string-name>
          , M. de Rijke,
          <article-title>Extending clip for category-to-image retrieval in e-commerce</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>289</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Loy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Learning to prompt for vision-language models</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>130</volume>
          (
          <year>2022</year>
          )
          <fpage>2337</fpage>
          -
          <lpage>2348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chinese clip: Contrastive vision-language pretraining in chinese</article-title>
          ,
          <source>arXiv preprint arXiv:2211.01335</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Morimitsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Ccmb: A large-scale chinese cross-modal benchmark</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Multimedia</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>4219</fpage>
          -
          <lpage>4227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Minzhe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , et al.,
          <article-title>Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>26418</fpage>
          -
          <lpage>26431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Spolaôr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. S. R.</given-names>
            <surname>Takaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ensina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S. R.</given-names>
            <surname>Coy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>A systematic review on content-based video retrieval</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>90</volume>
          (
          <year>2020</year>
          )
          <fpage>103557</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Doughty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Damen</surname>
          </string-name>
          ,
          <article-title>On semantic similarity in video retrieval</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3650</fpage>
          -
          <lpage>3660</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Create: A benchmark for chinese short video retrieval and title generation</article-title>
          ,
          <source>arXiv preprint arXiv:2203.16763</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Bimbo</surname>
          </string-name>
          ,
          <article-title>Search-oriented micro-video captioning</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3234</fpage>
          -
          <lpage>3243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          , et al.,
          <article-title>Youku-mplug: A 10 million large-scale chinese video-language dataset for pre-training and benchmarks</article-title>
          ,
          <source>arXiv preprint arXiv:2306.04362</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Ocr-free document understanding transformer</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>498</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Morse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tensmeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wigington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Morariu</surname>
          </string-name>
          ,
          <article-title>End-to-end document recognition and understanding with dessurt</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>280</fpage>
          -
          <lpage>296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft coco captions: Data collection and evaluation server</article-title>
          ,
          <source>arXiv preprint arXiv:1504.00325</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          , et al.,
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          ,
          <source>International journal of computer vision</source>
          <volume>123</volume>
          (
          <year>2017</year>
          )
          <fpage>32</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Product1m: Towards weakly supervised instance-level product retrieval via cross-modal pretraining</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>11782</fpage>
          -
          <lpage>11791</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rangarajan</surname>
          </string-name>
          ,
          <article-title>Image-text matching: Methods and challenges</article-title>
          ,
          <source>Inventive Systems and Control: Proceedings of ICISC 2021</source>
          (
          <year>2021</year>
          )
          <fpage>213</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Learning the best pooling strategy for visual semantic embedding</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>15789</fpage>
          -
          <lpage>15798</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Faghri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Fleet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>Vse++: Improving visual-semantic embeddings with hard negatives</article-title>
          ,
          <source>arXiv preprint arXiv:1707.05612</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Context-aware multi-view summarization network for image-text matching</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Multimedia</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1047</fpage>
          -
          <lpage>1055</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>12655</fpage>
          -
          <lpage>12663</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Diao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Similarity reasoning and filtration for image-text matching</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>1218</fpage>
          -
          <lpage>1226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Graph structured network for image-text matching</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>10921</fpage>
          -
          <lpage>10930</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Learning semantic relationship among instances for image-text matching</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>15159</fpage>
          -
          <lpage>15168</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>