<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large Language Model Informed Patent Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hao-Cheng Lo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jung-Mei Chu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jieh Hsiang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chun-Chieh Cho</string-name>
        </contrib>
        <aff>National Taiwan University</aff>
        <aff>JCIPRNET</aff>
      </contrib-group>
      <fpage>51</fpage>
      <lpage>60</lpage>
      <abstract>
        <p>In patent prosecution, image-based retrieval systems for identifying similarities between current patent images and prior art are pivotal to ensuring the novelty and non-obviousness of patent applications. Despite their growing popularity in recent years, existing attempts, while effective at recognizing images within the same patent, fail to deliver practical value due to their limited generalizability in retrieving relevant prior art. Moreover, this task inherently involves the challenges posed by the abstract visual features of patent images, the skewed distribution of image classifications, and the semantic information of image descriptions. Therefore, we propose a language-informed, distribution-aware multimodal approach to patent image feature learning, which enriches the semantic understanding of patent images by integrating large language models and improves the performance of underrepresented classes with our proposed distribution-aware contrastive losses. Extensive experiments on the DeepPatent2 dataset show that our proposed method achieves state-of-the-art or comparable performance in image-based patent retrieval, with mAP +53.3%, Recall@10 +41.8%, and MRR@10 +51.9%. Furthermore, through an in-depth user analysis, we explore how our model aids patent professionals in their image retrieval efforts, highlighting the model's real-world applicability and effectiveness.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Prior art search aims to identify similarities between new inventions and existing technologies, thus ensuring that inventions satisfy novelty and non-obviousness requirements during patent drafting, examination, and infringement analysis [<xref ref-type="bibr" rid="ref62">2</xref>]. Traditionally focused on metadata and textual information [3], researchers have increasingly turned to image-based patent retrieval to overcome the limitations of textual analysis (e.g., the complexity of legal and technical patent language) [4, 5, 6, 7], given that patent images provide a clearer, more intuitive understanding of inventions (e.g., vehicles, designs, and fashion), enabling faster and deeper insights than text alone [8].</p>
      <p>Patent images, designed to convey technical and scientific information, exhibit distinctive features that set them apart from natural and sketch images. Firstly, they often lack the background context, color, texture, and intensity variability found in natural images, and are instead characterized by abstractness and sparseness. Secondly, unlike sketch images, patent images provide detailed, high-quality visualizations from multiple viewpoints. This specificity leaves commercial search engines such as Google struggling to accurately retrieve relevant patent images from drawing queries [5, 7], rendering image-based patent retrieval a significant and ongoing challenge.</p>
      <p>Research on image-based patent retrieval, though limited, can be broadly categorized into two strands: (i) Low-level vision-based methods, which employ basic visual features such as visual words [9], shape and contour [<xref ref-type="bibr" rid="ref11 ref25 ref38 ref44 ref55 ref61 ref64">10, 11</xref>], relational skeletons [12], and adaptive hierarchical density histograms [13] to describe patent images for retrieval. These methods, however, falter in large-scale applications [7]. (ii) Learning-based methods, which have gained traction in recent years. For instance, one early work performs image-based retrieval with an object detection and multi-task framework for patent classification [4]. The emergence of the DeepPatent dataset [5] and the ECCV 2022 DIRA Workshop Image Retrieval Challenge has prompted exploration into various network architectures, loss functions, and Re-ID techniques to improve retrieval systems [6, 7].</p>
      <p>Despite these efforts, past studies have often overlooked the real-world workflow of patent attorneys conducting prior art searches with images. In practice, patent attorneys evaluate not only the visual similarity between current images and those of prior art but also the images’ descriptions and their associated patent classifications. This oversight leads to several critical gaps and our corresponding contributions: (i) Given the importance of textual content, we adopt a visual language model (VLM) [14] without following the pretrain-finetune paradigm. Furthermore, recognizing the limited semantics in patent images’ textual content (i.e., primarily object names and perspectives), and inspired by past prompt engineering [15, 16], we propose using large language models (LLMs) [17] to generate detailed, alias-containing, free-form descriptions. (ii) To incorporate patent classification and address its long-tail distribution (Figure 1), we introduce, beyond the InfoNCE loss, multiple coarse-grained losses with uncertainty factors tailored for long-tail data into our VLM. This strategy aims to ensure that patent image representations capture class information while remaining sensitive to the distribution [18]. (iii) Previous works have treated image-based patent retrieval as a Re-ID task, which does not fully align with industrial needs. Typically, searches are conducted on large databases, and retrieval is carried out both before and after a patent is granted, primarily in two scenarios: novelty detection or prior art search, where a current invention is compared against past inventions to identify similarities; and infringement search, which aims to identify subsequent inventions that might infringe upon the granted patent [3]. Accordingly, we train and validate our model on a larger dataset and ensure that the retrieval metrics align with these temporal concerns. (iv) To further understand the practical value of our image-based patent retrieval system, we conducted blind user studies [19] to directly evaluate and compare the satisfaction, usability, and performance of our method against existing methods in a real-world setting. Hence, we present four distinctive contributions to better meet industrial demands:</p>
      <list list-type="bullet">
        <list-item>
          <p>We introduce a language-informed, distribution-aware multimodal approach to patent image feature learning that is both simple and robust. This method enriches images with corresponding semantic information augmented via LLMs.</p>
        </list-item>
        <list-item>
          <p>We propose losses specifically tailored to the long-tail distribution of patent classifications. This strategy significantly boosts the robustness and accuracy of patent image representations, particularly in scenarios sensitive to class distinctions, leading our method to achieve state-of-the-art results.</p>
        </list-item>
        <list-item>
          <p>Our model is trained and validated on a large dataset, with metrics specifically tailored for novelty detection, ensuring it meets broad industrial needs.</p>
        </list-item>
        <list-item>
          <p>We employ a multi-paradigm approach, validating the system’s effectiveness not only through technical retrieval metrics but also by accentuating its practical utility through user studies.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Existing learning-based works on image-based patent retrieval can be categorized into two approaches. The first approach is intuitive: it starts by identifying objects within patent images, then trains a classifier to associate these identified objects with their respective International Patent Classification (IPC) classes, and extracts vectors from the network for retrieval [4, 9, 20]. However, this method faces two limitations: (i) It relies on objects that the original detector has been pretrained to recognize, excluding unidentifiable patent images and thus limiting its applicability at large scale. (ii) Although it considers IPC, IPC provides a rather coarse classification of the entire patent, which fails to accurately reflect the specific class of a given image.</p>
      <p>The second approach developed with the release of the large-scale DeepPatent dataset [5], where a series of studies have treated learning patent image representations as a Re-ID (i.e., patent ID) problem [21]. Employing various backbones such as EfficientNet [7, 6], ResNet [5], ViT [6], and SwinTransformer [6], these studies embed patent drawings into a common feature space using contrastive loss functions (e.g., triplet loss [5], ArcFace [6, 7]), clustering images with identical IDs closely and separating images with different IDs. Although these studies have shown excellent Re-ID capabilities, they have overlooked several critical aspects: (i) Re-ID primarily focuses on retrieving images within the same patent, which does not align with the patent industry’s workflow (i.e., retrieving images from different patents). (ii) The Re-ID approach is susceptible to overfitting, which reduces its generalizability and accuracy in retrieving similar cases across different patents [22]. (iii) These methods often ignore other patent-specific information, such as image descriptions and Locarno classification, which are crucial in practical patent work.</p>
      <p>With the rise of VLMs and multimodal learning, the retrieval of natural color images has seen significant improvements [<xref ref-type="bibr" rid="ref18">23, 24, 25, 26</xref>]. Likewise, multimodal methods have become increasingly prevalent in retrieval strategies for sketch images, which are similar, if not identical, to patent images. For example, sketch-based image retrieval involves retrieving natural images using sketch representations [<xref ref-type="bibr" rid="ref40">27, 28, 29, 30</xref>]. These methods mainly exploit associations with natural images to achieve their effectiveness. While patent images lack the stroke information found in sketches and are challenging to associate with natural images due to their novelty and multiple perspectives, the textual information and well-defined classification in patents provide a solid foundation for employing a VLM approach. Therefore, we explore the potential of applying VLMs to patent image retrieval, an area currently underappreciated, leveraging the rich auxiliary information in patents [1].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Model Overview</title>
        <p>Our objective is to leverage the powerful capabilities of pre-trained LLMs to facilitate feature learning for patent images, thus achieving semantically rich image representations and efficient patent image retrieval. To this end, we introduce a one-stage framework, distinct from the conventional pretrain-finetune approaches employed by prior studies. As depicted in Figure 2, our model comprises two main components: visual feature extraction and text feature extraction. In the visual component, we preprocess the input patent images using data augmentation techniques tailored for patent images, such as flipping, random cropping [31], random erasing [32], and gridmask [33]. We then employ CNN-based backbones to extract visual features from these augmented images. For cases where the output feature dimensions from certain backbones are too long or too short, we use projectors composed of MLPs to align the dimensions of the visual features with those of the text features for subsequent contrastive training (Figure 2 (c)).</p>
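        <p>To make this pipeline concrete, the following is a minimal PyTorch sketch of the visual branch just described, pairing the augmentations with a backbone and an MLP projector. It is illustrative only: the ResNet50 choice, the projector widths, and the 512-dimensional target space (matching CLIP's text encoder) are assumptions for the sketch, and the gridmask transform is omitted since torchvision has no built-in for it.</p>
        <preformat>
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# Augmentations tailored for line drawings: flip, crop, random erasing.
# (Gridmask [33] would be added as a custom transform.)
augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.ToTensor(),
    T.RandomErasing(p=0.5),
])

class VisualEncoder(nn.Module):
    """Backbone + MLP projector mapping images into the text-feature space."""
    def __init__(self, text_dim: int = 512):  # 512 assumes CLIP's text width
        super().__init__()
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features        # 2048 for ResNet50
        backbone.fc = nn.Identity()               # keep pooled features
        self.backbone = backbone
        self.projector = nn.Sequential(           # MLP aligns 2048 -> 512
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, text_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.projector(self.backbone(x))
        return nn.functional.normalize(v, dim=-1)  # unit norm for cosine sim

images = torch.randn(4, 3, 224, 224)               # dummy batch
print(VisualEncoder()(images).shape)               # torch.Size([4, 512])
</preformat>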
        <p>In the textual component, our process begins with an image captioner which, given a patent image and a predefined prompt, generates a sophisticated description of the image [34, 35, 24]. This generated description is then combined with other pertinent information about the image, such as its Locarno classification, original image description, object name, and perspective. This composite input is fed into LLMs to produce diversified, alias-containing, fine-grained text descriptions, thereby enriching the semantic understanding of the patent image (see Figure 2 (a) and Section 3.3). Following this, we employ a frozen text encoder (i.e., the text encoder in CLIP) to extract textual features of the descriptions (Figure 2 (b)).</p>
        <p>To ensure our contrastive loss is distribution-aware, we introduce three types of loss functions. Firstly, for the conventional VLM contrastive loss ℒclip, we treat the pairing of a fine-grained image with its corresponding description as a positive match. For the coarse-grained approach, inspired by [18, 36], we define two scenarios: class-wise, where an image and a sentence from the same class are considered a positive pair (ℒcls); and category-wise, where an image and a sentence from the same category (e.g., head or tail categories) are seen as a positive pair (ℒcat). These losses are combined to update the visual encoder during the training phase (see Figure 2 (d) and Section 3.2).</p>
        <p>In the query phase (Figure 2 (e)), each patent image is transformed into an embedding with the visual encoder and then stored in a vector database. Retrieval of other images is done by comparing these embeddings via cosine similarity, where embeddings closer in distance are ranked higher, and those farther apart are ranked lower.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Distribution-Aware Contrastive Loss</title>
        <p>As previously mentioned, we formulate our task as a VLM contrastive learning paradigm. The traditional instance-based training objective for a single image can be described as follows (i.e., the InfoNCE loss, Eq. (1)):</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[ \mathcal{L}_{\mathrm{clip}} = -\log \frac{\exp(\mathbf{t}_{+} \cdot \mathbf{v}/\tau)}{\sum_{i=1}^{N} \exp(\mathbf{t}_{i} \cdot \mathbf{v}/\tau)}, ]]></tex-math>
        </disp-formula>
        <p>where (t1, t2, ..., tN) denotes the N text features extracted by the text encoder and v denotes the learned feature of a patent image. The term t · v represents the cosine similarity score between the patent image and the texts, and τ is a learnable temperature coefficient. The objective is to maximize t+ · v, the feature similarity between the patent image and its corresponding textual information.</p>
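        <p>For reference, the following is a minimal sketch of the batched InfoNCE objective of Eq. (1), using in-batch negatives. It assumes L2-normalized features and, for simplicity, a fixed temperature, whereas τ in the model described here is learnable.</p>
        <preformat>
import torch
import torch.nn.functional as F

def info_nce(v: torch.Tensor, t: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Eq. (1) over a batch: v, t are L2-normalized (B, D) image/text features,
    where t[i] is the positive description for image v[i]."""
    logits = v @ t.T / tau                    # (B, B) similarity scores / tau
    targets = torch.arange(v.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)   # -log softmax at the positive

# toy usage with random unit vectors
v = F.normalize(torch.randn(8, 512), dim=-1)
t = F.normalize(torch.randn(8, 512), dim=-1)
print(info_nce(v, t))
</preformat>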
        <p>However, relying solely on an instance-based contrastive loss falls short of capturing class or category information, potentially leading to suboptimal performance on datasets with skewed distributions [36]. Therefore, we propose class-based and category-based coarse-grained losses (i.e., ℒcls and ℒcat; see Figure 2 (d)). These losses can be described in a similar form as follows (Eq. (2)):</p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[ \mathcal{L}_{\mathrm{cls/cat}} = -\frac{1}{|\mathcal{V}^{+}|} \sum_{\mathbf{v} \in \mathcal{V}^{+}} \log \frac{\exp(\mathbf{t} \cdot \mathbf{v}/\tau)}{\sum_{\mathbf{v}' \in \mathcal{V}} \exp(\mathbf{t} \cdot \mathbf{v}'/\tau)} - \frac{1}{|\mathcal{T}^{+}|} \sum_{\mathbf{t} \in \mathcal{T}^{+}} \log \frac{\exp(\mathbf{t} \cdot \mathbf{v}/\tau)}{\sum_{\mathbf{t}' \in \mathcal{T}} \exp(\mathbf{t}' \cdot \mathbf{v}/\tau)}, ]]></tex-math>
        </disp-formula>
        <p>where V represents a batch of patent images and T denotes the corresponding set of text sentences. The subset T+ comprises texts that share the same class (in the case of ℒcls) or category (for ℒcat) with the image v; similarly, V+ includes all images that share the same class or category with the text t. By doing so, our model gains an understanding of class and category information, enabling it to acquire robust representations even for the tail classes. Furthermore, since the text description for each image sample varies with each iteration, and is combined with the class or category loss, the one-to-one pairing relationship between images and texts is less rigid. This variability acts as an additional regularization mechanism, preventing the model from adhering to fixed, trivial correlations within specific image-text pairs.</p>
        <p>Considering that each loss’s stability varies, we move away from a linear loss combination towards a method based on homoscedastic uncertainty [37, 38], learnable through probabilistic deep learning. This type of uncertainty, independent of the input data, reflects the task’s intrinsic uncertainty. The loss includes residual regression and uncertainty regularization components. The implicitly learned variance σ̂ moderates the residual regression, while the regularization prevents the network from predicting infinite uncertainty. Hence, the overall loss can be written as in Eq. (3):</p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math><![CDATA[ \mathcal{L} = \mathcal{L}_{\mathrm{clip}} \exp(-\hat{\sigma}_{\mathrm{clip}}) + \hat{\sigma}_{\mathrm{clip}} + \mathcal{L}_{\mathrm{cls}} \exp(-\hat{\sigma}_{\mathrm{cls}}) + \hat{\sigma}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{cat}} \exp(-\hat{\sigma}_{\mathrm{cat}}) + \hat{\sigma}_{\mathrm{cat}}, ]]></tex-math>
        </disp-formula>
        <p>where σ̂ is the learnable homoscedastic uncertainty of each loss. We find this combined loss to be robust for our task.</p>
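        <p>A compact sketch of how Eqs. (2) and (3) can be realized: positives are selected with a label-equality mask (class labels for ℒcls, head/tail category labels for ℒcat), and the three losses are combined with learnable homoscedastic uncertainties. The symmetric per-sample averaging below is one plausible reading of Eq. (2), not the paper's exact implementation.</p>
        <preformat>
import torch
import torch.nn.functional as F

def coarse_grained_loss(v, t, labels, tau=0.07):
    """Sketch of Eq. (2): images/texts sharing a label (class for L_cls,
    head/tail category for L_cat) are all treated as positives."""
    logits = v @ t.T / tau                           # (B, B)
    pos = (labels[:, None] == labels[None, :]).float()
    log_p_img = F.log_softmax(logits, dim=1)         # texts scored per image
    img_term = -(log_p_img * pos).sum(1) / pos.sum(1)
    log_p_txt = F.log_softmax(logits.T, dim=1)       # images scored per text
    txt_term = -(log_p_txt * pos).sum(1) / pos.sum(1)
    return (img_term + txt_term).mean()

class UncertaintyWeighting(torch.nn.Module):
    """Eq. (3): combine losses via learnable homoscedastic uncertainties."""
    def __init__(self, n_losses=3):
        super().__init__()
        self.sigma = torch.nn.Parameter(torch.zeros(n_losses))  # sigma-hat

    def forward(self, losses):
        total = 0.0
        for loss, s in zip(losses, self.sigma):
            total = total + loss * torch.exp(-s) + s  # residual + regularizer
        return total

# usage: total = UncertaintyWeighting()([l_clip, l_cls, l_cat]),
# with the sigma parameters optimized jointly with the visual encoder.
</preformat>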
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Text Enrichment</title>
        <p>Converting object names, perspectives, class names, and the patent image’s original description into text for input into a text encoder is a plausible way to generate text embeddings for supervisory purposes. However, this method encounters several limitations. Firstly, even the most comprehensive descriptions provided by patents are typically succinct and straightforward, such as FIG. 3 is a front elevational view of the light device, yielding unclear and sparse information about the image. Additionally, the similarity and potential overlap among different classes (e.g., automobiles, motor cars, and toy cars) can obscure the distinction of nuanced concepts. To overcome these challenges and derive more discriminative text features, we employ captioners [34] and LLMs [17]. Specifically, we first have the captioner generate detailed descriptions of the design in each image, and we provide the captioners with instructions that guide the description process. For example, instructions might include: Describe the distinct visual elements present in the design, such as shapes, contours, texture, and the arrangement of various components. The outcomes of this process, merged with pre-existing auxiliary information and predetermined instruction templates, are then fed into LLMs. For example, we employ templates such as This is a photo of {Object Name}, classified as {Class}, This image features {Details}, and {Object Name} can also be referred to as {Synonym} to generate enriched text. Ultimately, this approach yields around 20 detailed text descriptions per image, designed to mine the semantic nuances within the text feature space.</p>
        <p>Additionally, our research revealed that utilizing more specific object names significantly enhances feature learning. For example, the broad class of Emergency equipment encompasses distinct items such as lighting fixture, horticulture grow light, and lighting device. Therefore, in our text generation process, we prioritize these detailed object names over the more generic class names, differing from previous approaches that typically lean on class categorization [39, 40].</p>
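        <p>The enrichment step can be pictured with the following sketch. The prompt and template strings are taken from the examples above; caption_image and paraphrase_with_llm are hypothetical stand-ins for the captioner (e.g., BLIP-2) and the LLM call, which are not specified at this level of detail in the text.</p>
        <preformat>
# Sketch of the text-enrichment step. Helper callables are hypothetical.
CAPTION_PROMPT = ("Describe the distinct visual elements present in the design, "
                  "such as shapes, contours, texture, and the arrangement of "
                  "various components.")

TEMPLATES = [
    "This is a photo of {object_name}, classified as {cls}.",
    "This image features {details}.",
    "{object_name} can also be referred to as {synonym}.",
]

def enrich(image, object_name, cls, synonyms,
           caption_image, paraphrase_with_llm, n_descriptions=20):
    details = caption_image(image, CAPTION_PROMPT)      # captioner output
    seeds = [TEMPLATES[0].format(object_name=object_name, cls=cls),
             TEMPLATES[1].format(details=details)]
    seeds += [TEMPLATES[2].format(object_name=object_name, synonym=s)
              for s in synonyms]
    # The LLM diversifies the seeds into ~20 alias-containing descriptions.
    return paraphrase_with_llm(seeds, n=n_descriptions)
</preformat>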
      </sec>
      <sec id="sec-3-2">
        <title>3.4. Metrics for Patent Retrieval</title>
        <p>As mentioned before, in image-based patent retrieval, particularly for novelty detection and related-work searches, patent professionals explore databases of patents, applications, and scientific literature to determine an invention’s uniqueness. This process, crucial both before and after a patent application is filed, focuses on identifying whether similar prior inventions exist. Accordingly, when evaluating retrieval metrics, it is necessary to account for the temporal factor, ensuring that only prior art, rather than contemporaneous or subsequent inventions, is considered for retrieval [3]. Consider a database D in which each data point is represented by a tuple (v, t), with v being the image embedding and t the granted time of the patent the image belongs to. For each query image v<sub>q</sub> with granted time t<sub>q</sub>, define D′ ⊆ D containing only images granted before t<sub>q</sub>. The following retrieval metrics are then calculated over D′ given v<sub>q</sub>: (i) We use mean Average Precision (mAP), obtained by averaging AP scores across all classes. (ii) Following previous works, the standard evaluation protocol [41] is to report the recall at rank K (Recall@K or R@K) at different ranks (K = 5, 10). (iii) We calculate the Mean Reciprocal Rank @ K with temporal concern (MRR@K or M@K), which averages the reciprocal of the rank of the first correctly retrieved patent image within the top K rankings across all test samples.</p>
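        <p>The temporal constraint and the rank-based metrics can be sketched as follows. The grant-time filter implements D′ ⊆ D; the relevance criterion (here, matching the query’s class) and the array layout are illustrative assumptions for the sketch.</p>
        <preformat>
import numpy as np

def temporal_rank(query_emb, query_time, db_embs, db_times, db_classes, k=10):
    """Rank only prior art: items granted before the query's grant date."""
    prior = db_times &lt; query_time              # temporal filter (D' subset of D)
    sims = db_embs[prior] @ query_emb           # cosine sim for unit-norm embs
    order = np.argsort(-sims)                   # descending similarity
    return db_classes[prior][order][:k]         # top-k retrieved labels

def recall_and_mrr(topk, gold_class):
    hits = topk == gold_class
    if not hits.any():
        return 0.0, 0.0                         # no relevant item in top-k
    return 1.0, 1.0 / (int(np.argmax(hits)) + 1)  # R@k, reciprocal rank

# toy example: 5 database images, one 2020 query
db = np.random.randn(5, 512)
db /= np.linalg.norm(db, axis=1, keepdims=True)
topk = temporal_rank(db[0], query_time=2020,
                     db_embs=db, db_times=np.array([2016, 2017, 2018, 2019, 2020]),
                     db_classes=np.array([3, 1, 3, 2, 3]), k=10)
print(recall_and_mrr(topk, gold_class=3))
</preformat>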
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Implementation Details</title>
        <sec id="sec-4-1-1">
          <title>In our experiments, we employed PyTorch [42] and uti</title>
          <p>lized clusters of NVIDIA A100 GPUs. For the VLM, we
explored various ViT variants [31], ResNet50 [43],
EficientNetB0 [44], and SwinV2-B [45, 46] as backbones for the visual
encoder. The text encoder was adapted from the original
CLIP model [14], remaining fixed throughout the
experiments. For captioner, we leveraged open-source BLIP-2
[35] and GPT-4V [47]. Regarding LLMs, our focus was
on GPT-4 [47], though we also experimented with other
LLMs like GPT-3.5-Turbo [17] and LLaMA-2 [48].</p>
        <p>For our experiments, we employed the DeepPatent2 dataset. Previous research has mainly adopted the original DeepPatent dataset [5]; however, that dataset suffers from a narrow collection span, lacks image-related metadata, and does not segment sub-images. These limitations can introduce substantial noise into inter-image relationships. Fortunately, the DeepPatent2 dataset [1] addresses these issues effectively. To maintain comparability with previous methods, we utilized DeepPatent2 data from 2016 to 2019, consisting of 822,792 records with 407 Locarno classes. Of these, 90% were used for the training set and 10% for the validation set. For our query dataset, we used 252,296 records from the year 2020.</p>
        <p>Our baseline models are our replications of the state of the art on the same dataset to ensure comparability: PatentNet [5], SwinV2-B+ArcFace [6], and EfficientNet+ArcFace [7]. Our primary experiments focused on a variety of visual encoder backbones, including ResNet50, EfficientNetB0, ViT-B-32, and SwinV2-B. Due to space constraints, detailed results from experiments involving the captioner and LLMs will be presented in the full manuscript; for preliminary insights, we utilized GPT-4 for both the captioner and the LLM. In our ablation study, we explore the effects of the distribution-awareness losses (i.e., ℒcls and ℒcat), the text generation component, and the captioner.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>Table 1 presents the quantitative results on the DeepPatent2 query set. Overall, our approach significantly outperforms the state of the art, achieving up to a 53% improvement in mAP, 38% in Recall@5, 41.8% in Recall@10, and 51.9% in MRR@10. Notably, both ViT-B-32 and SwinV2-B show comparable performance, with each excelling under different metrics or scenarios. For instance, ViT-B-32 performs better on tail classes, while SwinV2-B shows strength on head classes, suggesting an interaction between data distribution and model architecture. With these results, we have achieved the state of the art in this task. Given the strong performance across all classes with ViT as the backbone, we base our ablation study on this model to further investigate the impact of various model components on performance.</p>
        <p>Table 2 presents the evaluation results of an ablation study on four components, starting with a baseline model, a standard CLIP model; the subsequent rows represent cumulative enhancements to this baseline. Firstly, by incorporating the distribution-aware contrastive losses, we observe significant performance gains in both head and tail classes, with tail classes experiencing more substantial improvements (approximately 20% in mAP and Recall@10) and head classes seeing roughly a 10% increase in these metrics. Next, by incorporating the text generation functionality, which is guided by the LLM, the model learns richer semantic relationships between images; the addition of this feature leads to a 10-20% improvement across various metrics. Finally, by integrating the captioner module, the key details of the design inventions in the images are directly expressed by the captioner, further highlighting the semantic focal points of the images and enhancing retrieval performance.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Evaluation results of an ablation study on various components within the model architecture.</p></caption>
          <table>
            <thead>
              <tr><th/><th colspan="2">Head Classes</th><th colspan="2">Tail Classes</th><th colspan="2">All Classes</th></tr>
              <tr><th/><th>mAP</th><th>R@10</th><th>mAP</th><th>R@10</th><th>mAP</th><th>R@10</th></tr>
            </thead>
            <tbody>
              <tr><td>Baseline</td><td>32.3</td><td>34.7</td><td>21.9</td><td>30.2</td><td>26.0</td><td>32.8</td></tr>
              <tr><td>+ ℒcls and ℒcat</td><td>47.4</td><td>49.2</td><td>47.0</td><td>48.4</td><td>47.9</td><td>49.7</td></tr>
              <tr><td>+ Text Generation</td><td>70.2</td><td>59.7</td><td>52.4</td><td>53.6</td><td>58.7</td><td>56.3</td></tr>
              <tr><td>+ Captioner</td><td>78.0</td><td>65.3</td><td>61.6</td><td>53.5</td><td>69.1</td><td>58.6</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Qualitative Results</title>
        <p>Qualitative results (see Figure 3) indicate that our approach retrieves better results than the previous state of the art, evident in both head and tail classes. Focusing solely on our model, it is apparent that it underperforms on tail classes. Additionally, our system not only retrieves the correct class given an image but also finds images that are visually similar to the query image. Furthermore, we delve into the errors made by the previous state of the art. For example, when presented with an image labeled shoe, the previous model might retrieve a shoe image that actually belongs to the category of shoelaces. This reveals that the previous approach did not align semantic information within the images, often leading to the retrieval of visually similar but categorically different images. Similar issues occur with categories like vehicles &amp; toy cars or flashlights &amp; chargers, where the images look alike but differ semantically. Our model mitigates these errors by guiding the image’s embedding space with linguistic information, which enhances semantic alignment.</p>
        <p>Based on the t-SNE results shown in Figure 4, our method yields more clustered embeddings, suggesting that the model effectively captures the inherent structures or classes within the data. Each cluster represents a group of similar images, closely corresponding to predefined classes, indicating that the model has learned meaningful and discriminative features for each class. This clustering enables the model to effectively distinguish between different classes.</p>
        <p>Conversely, the previous approach results in a t-SNE visualization with less clustered and more continuous embeddings, making it difficult to identify distinct clusters or their correspondence to predefined classes. This indicates that the model’s learned representations are less discriminative, potentially capturing more generalized features shared across multiple classes. Although some clustering is visible, it may relate not to actual classes but rather to visually similar images, blurring the boundaries between different classes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. User Study</title>
      <p>To further ensure the practical value of our system, we conducted a user study following rigorous psychological procedures. As for participants, we recruited 15 patent agents (48% female; average age: 33.4 years) to perform tasks related to design patent image retrieval.</p>
      <p>As for the procedure, we employed a double-blind test, in which participants were unaware of the underlying retrieval system during their tasks. They could encounter either our retrieval system or a system based on the previous approach. Each patent agent handled 30 retrieval tasks, randomly assigned to one of the two systems: 15 tasks with our system and 15 with the previous approach. After each task, participants rated their satisfaction with the retrieval results on a scale from 1 to 5 and recorded the time taken to complete the task (in hours). To minimize randomness, we averaged the scores across systems for each participant, resulting in four scores per person (two systems × two measures).</p>
      <p>The results of the paired t-test revealed significant differences in satisfaction levels, with patent agents showing higher satisfaction with our system than with the previous approach, t(14) = 3.30, p &lt; 0.01. Regarding task completion time, agents completed tasks faster using our system, t(14) = -4.30, p &lt; 0.001. These results indicate that our system is more efficient and better meets the practical needs of professionals in the field.</p>
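      <p>For reference, the per-participant paired comparison reported above corresponds to a standard paired t-test with 14 degrees of freedom, e.g., via SciPy. The numbers below are simulated placeholders, not the study’s data.</p>
      <preformat>
import numpy as np
from scipy import stats

# Simulated per-participant mean satisfaction scores (15 participants);
# the real study averaged each participant's 15 task ratings per system.
rng = np.random.default_rng(0)
ours = rng.normal(4.2, 0.4, 15)
prev = rng.normal(3.6, 0.4, 15)

t, p = stats.ttest_rel(ours, prev)   # paired t-test, df = 14
print(f"t(14) = {t:.2f}, p = {p:.3g}")
</preformat>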
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Our method has achieved new state-of-the-art results in the quantitative evaluation of mAP, Recall@K, and MRR@K, as well as high-quality image retrieval in the qualitative evaluation. For many years, commercial design patent retrieval systems have had significant shortcomings. For example, traditional text-based
searches can be limiting due to the subjective interpretation of design features and the difficulty of describing visual details with text. Although learning-based image retrieval systems have started to emerge in the last two years, their practical value remains limited.</p>
      <p>To address these issues, our proposal can effectively solve this problem. Firstly, we proposed a new learning-based architecture capable of learning image features with practical value. These representations not only contain visual information but are also aligned with corresponding (augmented) semantic text and classification data. This has substantial practical value because specific graphic semantic features such as curvature, edges, and geometric details are considered; focusing solely on the image itself might overlook these critical visual elements. Secondly, we utilize a larger and better-defined dataset, which encompasses a broader collection span, image-related metadata, and segmented sub-images. This makes the model more robust and enhances its accuracy. Thirdly, addressing the long-tail distribution in classification, our study is the first to propose distribution-aware losses, which have proven to be effective. Lastly, we conducted a user study to demonstrate the practical value of our system in the field, showing that it is more efficient, accurate, and time-saving.</p>
      <p>In the future, we have several directions for further expansion: (i) Identifying similarities between an invention and prior art, which involves not only using existing models for visualization through explainable AI [49] but also leveraging data from examiners’ reports (Office Actions) to guide further exploration. (ii) While our research currently focuses on prior art searches, future work could also explore other temporal dimensions, such as infringement searches. Additionally, image domain adaptation [50] could be used to enhance the effectiveness of searches across different domains, such as retrieving E-commerce images using patent drawings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>view knowledge for 3d visual grounding with</article-title>
          [26]
          <string-name>
            <given-names>N.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Meirom</surname>
          </string-name>
          , G. Chechik, Y. Atz-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>arXiv:2303.16894</source>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv. frozen
          <article-title>vision-language representations</article-title>
          , in: Euro-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          2303.16894. pean Conference on
          <source>Computer Vision</source>
          , Springer, [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <year>2022</year>
          , pp.
          <fpage>558</fpage>
          -
          <lpage>577</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2204.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , Prompt, generate, then cache: Cas-
          <volume>01694</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>cade of foundation models makes strong few-shot</article-title>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          , Pro-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>15211</fpage>
          -
          <lpage>15222</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <source>actions on Image Processing</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>8892</fpage>
          -
          <lpage>8902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          2303.02151. doi:
          <volume>10</volume>
          .1109/TIP.
          <year>2020</year>
          .
          <volume>3020383</volume>
          . [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          , J. D. [28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sangkloy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Burnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          , The
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models bunnies</article-title>
          ,
          <source>ACM Transactions on Graphics (TOG) 35</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>are few-shot learners</article-title>
          ,
          <source>Advances in neural infor-</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:
          <volume>10</volume>
          .1145/2897824.2925954.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>mation processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Hospedales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yin</surname>
          </string-name>
          , Y.-
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          , T. Xi-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.
          <year>2005</year>
          .
          <volume>14165</volume>
          .
          <string-name>
            <surname>ang</surname>
          </string-name>
          , L. Wang,
          <article-title>Deep learning for free-hand sketch</article-title>
          : [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <article-title>Vl-ltr: A survey, IEEE transactions on pattern analy-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>Learning class-wise visual-linguistic representation sis</article-title>
          and
          <source>machine intelligence</source>
          <volume>45</volume>
          (
          <year>2022</year>
          )
          <fpage>285</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>for long-tailed visual recognition</article-title>
          , in: European doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2001</year>
          .
          <volume>02600</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp. [30]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          , T. Liu,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          , Transferable
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          73-
          <fpage>91</fpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2111.13579.
          <article-title>coupled network for zero-shot sketch-based im</article-title>
          [19]
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Christensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , L. A. Turner, age retrieval,
          <source>IEEE Transactions on Pattern Analy-</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Research</given-names>
            <surname>Methods</surname>
          </string-name>
          , Design, and
          <string-name>
            <surname>Analysis</surname>
          </string-name>
          , 12 ed.,
          <source>sis and Machine Intelligence</source>
          <volume>44</volume>
          (
          <year>2021</year>
          )
          <fpage>9181</fpage>
          -
          <lpage>9194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Global</given-names>
            <surname>Edition</surname>
          </string-name>
          ,
          <year>2014</year>
          . Page count:
          <volume>542</volume>
          ; Dimensions: doi:10.1109/TPAMI.
          <year>2021</year>
          .
          <volume>3123315</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>23 x 18</source>
          .
          <article-title>6 cm; Book number</article-title>
          :
          <fpage>00106962</fpage>
          . [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , D. Weis[20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhattarai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Castorena</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Yang, senborn, X. Zhai,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          , M. Dehghani,
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>based deep learning and transfer learning, in: Pro- worth 16x16 words: Transformers for image recog-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>ceedings of the IEEE/CVF conference on computer nition at scale</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .11929
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>vision and pattern recognition workshops</source>
          ,
          <year>2020</year>
          , pp. (
          <year>2020</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2010</year>
          .
          <volume>11929</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          174-
          <fpage>175</fpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2004</year>
          .
          <volume>10780</volume>
          . [32]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , G. Kang,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Ran[21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shao</surname>
          </string-name>
          , S. C.
          <article-title>Hoi, dom erasing data augmentation</article-title>
          , in: Proceedings
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>and outlook</article-title>
          ,
          <source>IEEE transactions on pattern analy-</source>
          volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>13001</fpage>
          -
          <lpage>13008</lpage>
          . doi:
          <volume>10</volume>
          .48550/
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>sis and machine intelligence</source>
          <volume>44</volume>
          (
          <year>2021</year>
          )
          <fpage>2872</fpage>
          -
          <lpage>2893</lpage>
          . arXiv.
          <volume>1708</volume>
          .04896.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>doi:10</source>
          .1109/TPAMI.
          <year>2021</year>
          .
          <volume>3054775</volume>
          . [33]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          , Gridmask data aug[22]
          <string-name>
            <given-names>X.-Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          , C.-
          <string-name>
            <surname>C. Yu</surname>
            ,
            <given-names>X.-X.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zhou</surname>
          </string-name>
          , Large- mentation, arXiv preprint arXiv:
          <year>2001</year>
          .
          <volume>04086</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>scale person re-identification based on deep hash doi:10</source>
          .48550/arXiv.
          <year>2001</year>
          .
          <volume>04086</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>learning</surname>
          </string-name>
          ,
          <source>Entropy</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <article-title>449</article-title>
          . doi:
          <volume>10</volume>
          .1109/TIP. [34]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Visual instruc-
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <year>2017</year>
          .2695101. tion tuning,
          <source>Advances in neural information pro</source>
          [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rodriguez-Opazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teney</surname>
          </string-name>
          , S. Gould,
          <source>cessing systems 36</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>Image retrieval on real-life images with pre-trained 2304</article-title>
          .08485.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>vision-</surname>
          </string-name>
          and
          <article-title>-language models</article-title>
          ,
          <source>in: Proceedings of the [35] J</source>
          .
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Savarese</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hoi</surname>
          </string-name>
          , Blip-2: Bootstrap-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Vision</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>2125</fpage>
          -
          <lpage>2134</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <article-title>age encoders and large language models</article-title>
          , arXiv
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          2108.04024. preprint arXiv:
          <volume>2301</volume>
          .12597 (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/ [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Karthik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mancini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Akata</surname>
          </string-name>
          ,
          <string-name>
            <surname>Vision-</surname>
          </string-name>
          by-
          <source>arXiv.2301</source>
          .12597.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>language for training-free compositional image re-</article-title>
          [36]
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Lo</surname>
          </string-name>
          , C.
          <article-title>-S. Fuh, Enhancing long-tailed 3d se-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>trieval</surname>
          </string-name>
          ,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2310.09291.
          <article-title>mantic segmentation with category-wise linguistic-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>arXiv:2310</source>
          .09291. visual representation,
          <source>in: The 36th IPPR Conference</source>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baldrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Agnolucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Bimbo</surname>
          </string-name>
          ,
          <source>on Computer Vision</source>
          , Graphics, and
          <string-name>
            <surname>Image</surname>
          </string-name>
          Process-
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>Zero-shot composed image retrieval with textual ing (CVGIP), Kinmen</article-title>
          , Taiwan,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>inversion</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:2303.15247</source>
          (
          <year>2023</year>
          ). [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kendall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <article-title>What uncertainties do we need</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2303.15247.
          <article-title>in bayesian deep learning for computer vision</article-title>
          ?, Ad-
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>vances in neural information processing systems</source>
          [48]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Alma-
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <volume>30</volume>
          (
          <year>2017</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.1703.04977. hairi, Y. Babaei,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          , P. Bhargava, [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kendall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cipolla</surname>
          </string-name>
          ,
          <article-title>Geometric loss functions S. Bhosale</article-title>
          , et al.,
          <source>Llama</source>
          <volume>2</volume>
          : Open foundation and fine-
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <article-title>for camera pose regression with deep learning, in: tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <source>Proceedings of the IEEE conference on computer</source>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <source>vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5974</fpage>
          -
          <lpage>5983</lpage>
          . [49]
          <string-name>
            <given-names>F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.1704.00390.
          <article-title>Explainable ai: A brief survey on history</article-title>
          , research [39]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rozenberszki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Litany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dai</surname>
          </string-name>
          , Language- areas, approaches and challenges, in: Natural Lan-
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <article-title>grounded indoor 3d semantic segmentation in the guage Processing and Chinese Computing: 8th</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>wild, in: European Conference on Computer Vi- CCF International Conference, NLPCC 2019, Dun-</mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          sion, Springer,
          <year>2022</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>141</lpage>
          . doi:
          <volume>10</volume>
          .48550/ huang, China, October 9-
          <issue>14</issue>
          ,
          <year>2019</year>
          , Proceedings,
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <source>arXiv.2204.07761. Part II 8</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>563</fpage>
          -
          <lpage>574</lpage>
          . doi:
          <volume>10</volume>
          .1007/ [40]
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          , C. Wu,
          <volume>978</volume>
          -3-
          <fpage>030</fpage>
          -32236-6_
          <fpage>51</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          , G. Huang, Joint representation learn- [50]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          , Deep visual domain adaptation:
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <article-title>ing for text and 3d point cloud, Pattern Recog- A survey</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>312</volume>
          (
          <year>2018</year>
          )
          <fpage>135</fpage>
          -
          <lpage>153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <source>nition 147</source>
          (
          <year>2024</year>
          )
          <article-title>110086</article-title>
          . doi:
          <volume>10</volume>
          .48550/arXiv. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1802</year>
          .
          <volume>03601</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          2301.
          <fpage>07584</fpage>
          . [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baldrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Uricchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Bimbo</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <source>pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>21466</fpage>
          -
          <lpage>21474</lpage>
          . [42]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          , J. Brad-
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <source>neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.
          <year>1912</year>
          .
          <volume>01703</volume>
          . [43]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Deep residual
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>tern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi:
          <volume>10</volume>
          .48550/
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          arXiv.
          <volume>1512</volume>
          .03385. [44]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , Eficientnet: Rethinking model scal-
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          <year>2019</year>
          , pp.
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1905</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          11946. [45]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          <source>ence on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2103.14030. [46]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          , J. Ning,
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <article-title>v2: Scaling up capacity and resolution</article-title>
          , in: Proceed-
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          <source>sion and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>12009</fpage>
          -
          <lpage>12019</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2111.09883. [47]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          , I. Akkaya,
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          <string-name>
            <surname>man</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <source>Gpt-4 technical re-</source>
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          <string-name>
            <surname>port</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ). doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>48550/arXiv.2303.08774.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>