<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large Language Model Informed Patent Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hao-Cheng Lo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jung-Mei Chu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jieh Hsiang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chun-Chieh Cho</string-name>
        </contrib>
        <aff>National Taiwan University</aff>
        <aff>JCIPRNET</aff>
      </contrib-group>
      <fpage>51</fpage>
      <lpage>60</lpage>
      <abstract>
        <p>In patent prosecution, image-based retrieval systems for identifying similarities between current patent images and prior art are pivotal to ensuring the novelty and non-obviousness of patent applications. Despite their growing popularity in recent years, existing attempts, while effective at recognizing images within the same patent, fail to deliver practical value due to their limited generalizability in retrieving relevant prior art. Moreover, this task inherently involves the challenges posed by the abstract visual features of patent images, the skewed distribution of image classifications, and the semantic information of image descriptions. Therefore, we propose a language-informed, distribution-aware multimodal approach to patent image feature learning, which enriches the semantic understanding of patent images by integrating large language models and improves the performance of underrepresented classes with our proposed distribution-aware contrastive losses. Extensive experiments on the DeepPatent2 dataset show that our proposed method achieves state-of-the-art or comparable performance in image-based patent retrieval, with mAP +53.3%, Recall@10 +41.8%, and MRR@10 +51.9%. Furthermore, through an in-depth user analysis, we explore how our model aids patent professionals in their image retrieval efforts, highlighting the model's real-world applicability and effectiveness.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Prior art search aims to identify similarities between new inventions and existing technologies, thus ensuring that inventions satisfy novelty and non-obviousness requirements during patent drafting, examination, and infringement analysis [<xref ref-type="bibr" rid="ref62">2</xref>]. Traditionally focused on metadata and textual information [3], researchers have increasingly turned to image-based patent retrieval to overcome the limitations of textual analysis (e.g., the complexity of legal and technical patent language) [4, 5, 6, 7], given that patent images provide a clearer, more intuitive understanding of inventions (e.g., vehicles, designs, and fashion), enabling faster and deeper insights than text alone [8].</p>
      <p>Patent images, designed to convey technical and scientific information, exhibit distinctive features that set them apart from natural and sketch images. Firstly, they often lack the background context, color, texture, and intensity variability found in natural images, and are instead characterized by abstractness and sparseness. Secondly, unlike sketch images, patent images provide detailed, high-quality visualizations from multiple viewpoints. This specificity leaves commercial search engines such as Google struggling to accurately retrieve relevant patent images from drawing queries [5, 7], rendering image-based patent retrieval a significant and ongoing challenge.</p>
      <p>Research on image-based patent retrieval, though limited, can be broadly categorized into two strands: (i) Low-level vision-based methods, which employ basic visual features such as visual words [9], shape and contour [<xref ref-type="bibr" rid="ref11 ref25 ref38 ref44 ref55 ref61 ref64">10, 11</xref>], relational skeletons [12], and adaptive hierarchical density histograms [13] to describe patent images for retrieval. These methods, however, falter in large-scale applications [7]. (ii) Learning-based methods, which have gained traction in recent years. For instance, one early work performs image-based retrieval with an object detection and multi-task framework for patent classification [4]. The emergence of the DeepPatent dataset [5] and the ECCV 2022 DIRA Workshop Image Retrieval Challenge has prompted exploration into various network architectures, loss functions, and Re-ID techniques to improve retrieval systems [6, 7].</p>
      <p>Despite these efforts, past studies have often overlooked the real-world workflow of patent attorneys conducting prior art searches with images. In practice, patent attorneys evaluate not only the visual similarity between current images and those of prior art but also the images’ descriptions and their associated patent classifications. This oversight leads to several critical gaps and our corresponding contributions: (i) Given the importance of textual content, we adopt a visual language model (VLM) [14] without following the pretrain-finetune paradigm. Furthermore, recognizing the limited semantics in patent images’ textual content (i.e., primarily object names and perspectives), and inspired by past prompt engineering [15, 16], we propose using large language models (LLMs) [17] to generate detailed, alias-containing, free-form descriptions. (ii) To incorporate patent classification and address its long-tail distribution (Figure 1), we introduce, beyond the InfoNCE loss, multiple coarse-grained losses with uncertainty factors tailored for long-tail data into our VLM. This strategy aims to ensure that patent image representations capture class information while remaining sensitive to the distribution [18]. (iii) Previous works have treated image-based patent retrieval as a Re-ID task, which does not fully align with industrial needs. Typically, searches are conducted on large databases, and retrieval is carried out both before and after a patent is granted, primarily in two scenarios: novelty detection or prior art search, where a current invention is compared against past inventions to identify similarities; and infringement search, which aims to identify subsequent inventions that might infringe upon the granted patent [3]. Accordingly, we train and validate our model on a larger dataset and ensure that the retrieval metrics align with these temporal concerns. (iv) To further understand the practical value of our image-based patent retrieval system, we conducted blind user studies [19] to directly evaluate and compare the satisfaction, usability, and performance of our method against existing methods in a real-world setting. Hence, we present four distinctive contributions to better meet industrial demands:</p>
      <list list-type="bullet">
        <list-item>
          <p>We introduce a language-informed, distribution-aware multimodal approach to patent image feature learning that is both simple and robust. This method enriches images with corresponding semantic information augmented via LLMs.</p>
        </list-item>
        <list-item>
          <p>We propose losses specifically tailored to the long-tail distribution of patent classifications. This strategy significantly boosts the robustness and accuracy of patent image representations, particularly in scenarios sensitive to class distinctions, leading our method to achieve state-of-the-art results.</p>
        </list-item>
        <list-item>
          <p>Our model is trained and validated on a large dataset, with metrics specifically tailored for novelty detection, ensuring it meets broad industrial needs.</p>
        </list-item>
        <list-item>
          <p>We employ a multi-paradigm approach, validating the system’s effectiveness not only through technical retrieval metrics but also by accentuating its practical utility through user studies.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Existing learning-based works on image-based patent retrieval can be categorized into two approaches. The first approach is intuitive: it starts by identifying objects within patent images, then trains a classifier to associate these identified objects with their respective International Patent Classification (IPC) classes, and extracts vectors from the network for retrieval [4, 9, 20]. However, this method faces two limitations: (i) It relies on objects that the original detector has been pretrained to recognize, excluding unidentifiable patent images and thus limiting its applicability at large scale. (ii) Although it considers IPC, IPC provides a rather coarse classification of the entire patent, which fails to accurately reflect the specific class of a given image.</p>
      <p>The second approach developed with the release of the large-scale DeepPatent dataset [5], where a series of studies have treated learning patent image representations as a Re-ID (i.e., patent ID) problem [21]. Employing various backbones such as EfficientNet [7, 6], ResNet [5], ViT [6], and SwinTransformer [6], these studies embed patent drawings into a common feature space using contrastive loss functions (e.g., triplet loss [5], ArcFace [6, 7]), clustering images with identical IDs closely and separating images with different IDs. Although these studies have shown excellent Re-ID capabilities, they have overlooked several critical aspects: (i) Re-ID primarily focuses on retrieving images within the same patent, which does not align with the patent industry’s workflow (i.e., retrieving images from different patents). (ii) The Re-ID approach is susceptible to overfitting, which reduces its generalizability and accuracy in retrieving similar cases across different patents [22]. (iii) These methods often ignore other patent-specific information, such as image descriptions and Locarno classification, which are crucial in practical patent work.</p>
      <p>With the rise of VLMs and multimodal learning, the retrieval of natural color images has seen significant improvements [<xref ref-type="bibr" rid="ref18">23, 24, 25, 26</xref>]. Likewise, multimodal methods have become increasingly prevalent in retrieval strategies for sketch images, which are similar, if not identical, to patent images. For example, sketch-based image retrieval involves retrieving natural images using sketch representations [<xref ref-type="bibr" rid="ref40">27, 28, 29, 30</xref>]. These methods mainly exploit associations with natural images to achieve their effectiveness. While patent images lack the stroke information found in sketches and are challenging to associate with natural images due to their novelty and multiple perspectives, the textual information and well-defined classification in patents provide a solid foundation for employing a VLM approach. Therefore, we explore the potential of applying VLMs to patent image retrieval, an area currently underappreciated, leveraging the rich auxiliary information in patents [1].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Model Overview</title>
        <p>Our objective is to leverage the powerful capabilities of pre-trained LLMs to facilitate feature learning for patent images, thus achieving semantically rich image representations and efficient patent image retrieval. To this end, we introduce a one-stage framework, distinct from the conventional pretrain-finetune approaches employed by prior studies. As depicted in Figure 2, our model comprises two main components: visual feature extraction and text feature extraction. In the visual component, we preprocess the input patent images using data augmentation techniques tailored for patent images, such as flipping, random cropping [31], random erasing [32], and gridmask [33]. We then employ CNN-based backbones to extract visual features from these augmented images. For cases where the output feature dimensions from certain backbones are too long or too short, we use projectors composed of MLPs to align the dimensions of the visual features with those of the text features for subsequent contrastive training (Figure 2 (c)).</p>
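        <p>To make this pipeline concrete, the following is a minimal PyTorch sketch of the visual branch just described, pairing the augmentations with a backbone and an MLP projector. It is illustrative only: the ResNet50 choice, the projector widths, and the 512-dimensional target space (matching CLIP's text encoder) are assumptions for the sketch, and the gridmask transform is omitted since torchvision has no built-in for it.</p>
        <preformat>
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# Augmentations tailored for line drawings: flip, crop, random erasing.
# (Gridmask [33] would be added as a custom transform.)
augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.ToTensor(),
    T.RandomErasing(p=0.5),
])

class VisualEncoder(nn.Module):
    """Backbone + MLP projector mapping images into the text-feature space."""
    def __init__(self, text_dim: int = 512):  # 512 assumes CLIP's text width
        super().__init__()
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features        # 2048 for ResNet50
        backbone.fc = nn.Identity()               # keep pooled features
        self.backbone = backbone
        self.projector = nn.Sequential(           # MLP aligns 2048 -> 512
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, text_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.projector(self.backbone(x))
        return nn.functional.normalize(v, dim=-1)  # unit norm for cosine sim

images = torch.randn(4, 3, 224, 224)               # dummy batch
print(VisualEncoder()(images).shape)               # torch.Size([4, 512])
</preformat>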
        <p>In the textual component, our process begins with an image captioner which, given a patent image and a predefined prompt, generates a sophisticated description of the image [34, 35, 24]. This generated description is then combined with other pertinent information about the image, such as its Locarno classification, original image description, object name, and perspective. This composite input is fed into LLMs to produce diversified, alias-containing, fine-grained text descriptions, thereby enriching the semantic understanding of the patent image (see Figure 2 (a) and Section 3.3). Following this, we employ a frozen text encoder (i.e., the text encoder in CLIP) to extract textual features of the descriptions (Figure 2 (b)).</p>
        <p>To ensure our contrastive loss is distribution-aware, we introduce three types of loss functions. Firstly, for the conventional VLM contrastive loss ℒclip, we treat the pairing of a fine-grained image with its corresponding description as a positive match. For the coarse-grained approach, inspired by [18, 36], we define two scenarios: class-wise, where an image and a sentence from the same class are considered a positive pair (ℒcls); and category-wise, where an image and a sentence from the same category (e.g., head or tail categories) are seen as a positive pair (ℒcat). These losses are combined to update the visual encoder during the training phase (see Figure 2 (d) and Section 3.2).</p>
        <p>In the query phase (Figure 2 (e)), each patent image is transformed into an embedding with the visual encoder and then stored in a vector database. Retrieval of other images is done by comparing these embeddings via cosine similarity, where embeddings closer in distance are ranked higher, and those farther apart are ranked lower.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Distribution-Aware Contrastive Loss</title>
        <p>As previously mentioned, we formulate our task as a VLM contrastive learning paradigm. The traditional instance-based training objective for a single image can be described as follows (i.e., the InfoNCE loss, Eq. (1)):</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[ \mathcal{L}_{\mathrm{clip}} = -\log \frac{\exp(\mathbf{t}_{+} \cdot \mathbf{v}/\tau)}{\sum_{i=1}^{N} \exp(\mathbf{t}_{i} \cdot \mathbf{v}/\tau)}, ]]></tex-math>
        </disp-formula>
        <p>where (t1, t2, ..., tN) denotes the N text features extracted by the text encoder and v denotes the learned feature of a patent image. The term t · v represents the cosine similarity score between the patent image and the texts, and τ is a learnable temperature coefficient. The objective is to maximize t+ · v, the feature similarity between the patent image and its corresponding textual information.</p>
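        <p>For reference, the following is a minimal sketch of the batched InfoNCE objective of Eq. (1), using in-batch negatives. It assumes L2-normalized features and, for simplicity, a fixed temperature, whereas τ in the model described here is learnable.</p>
        <preformat>
import torch
import torch.nn.functional as F

def info_nce(v: torch.Tensor, t: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Eq. (1) over a batch: v, t are L2-normalized (B, D) image/text features,
    where t[i] is the positive description for image v[i]."""
    logits = v @ t.T / tau                    # (B, B) similarity scores / tau
    targets = torch.arange(v.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)   # -log softmax at the positive

# toy usage with random unit vectors
v = F.normalize(torch.randn(8, 512), dim=-1)
t = F.normalize(torch.randn(8, 512), dim=-1)
print(info_nce(v, t))
</preformat>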
        <p>However, relying solely on an instance-based contrastive loss falls short of capturing class or category information, potentially leading to suboptimal performance on datasets with skewed distributions [36]. Therefore, we propose class-based and category-based coarse-grained losses (i.e., ℒcls and ℒcat; see Figure 2 (d)). These losses can be described in a similar form as follows (Eq. (2)):</p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[ \mathcal{L}_{\mathrm{cls/cat}} = -\frac{1}{|\mathcal{V}^{+}|} \sum_{\mathbf{v} \in \mathcal{V}^{+}} \log \frac{\exp(\mathbf{t} \cdot \mathbf{v}/\tau)}{\sum_{\mathbf{v}' \in \mathcal{V}} \exp(\mathbf{t} \cdot \mathbf{v}'/\tau)} - \frac{1}{|\mathcal{T}^{+}|} \sum_{\mathbf{t} \in \mathcal{T}^{+}} \log \frac{\exp(\mathbf{t} \cdot \mathbf{v}/\tau)}{\sum_{\mathbf{t}' \in \mathcal{T}} \exp(\mathbf{t}' \cdot \mathbf{v}/\tau)}, ]]></tex-math>
        </disp-formula>
        <p>where V represents a batch of patent images and T denotes the corresponding set of text sentences. The subset T+ comprises texts that share the same class (in the case of ℒcls) or category (for ℒcat) with the image v; similarly, V+ includes all images that share the same class or category with the text t. By doing so, our model gains an understanding of class and category information, enabling it to acquire robust representations even for the tail classes. Furthermore, since the text description for each image sample varies with each iteration, and is combined with the class or category loss, the one-to-one pairing relationship between images and texts is less rigid. This variability acts as an additional regularization mechanism, preventing the model from adhering to fixed, trivial correlations within specific image-text pairs.</p>
        <p>Considering that each loss’s stability varies, we move away from a linear loss combination towards a method based on homoscedastic uncertainty [37, 38], learnable through probabilistic deep learning. This type of uncertainty, independent of the input data, reflects the task’s intrinsic uncertainty. The loss includes residual regression and uncertainty regularization components. The implicitly learned variance σ̂ moderates the residual regression, while the regularization prevents the network from predicting infinite uncertainty. Hence, the overall loss can be written as in Eq. (3):</p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math><![CDATA[ \mathcal{L} = \mathcal{L}_{\mathrm{clip}} \exp(-\hat{\sigma}_{\mathrm{clip}}) + \hat{\sigma}_{\mathrm{clip}} + \mathcal{L}_{\mathrm{cls}} \exp(-\hat{\sigma}_{\mathrm{cls}}) + \hat{\sigma}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{cat}} \exp(-\hat{\sigma}_{\mathrm{cat}}) + \hat{\sigma}_{\mathrm{cat}}, ]]></tex-math>
        </disp-formula>
        <p>where σ̂ is the learnable homoscedastic uncertainty of each loss. We find this combined loss to be robust for our task.</p>
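        <p>A compact sketch of how Eqs. (2) and (3) can be realized: positives are selected with a label-equality mask (class labels for ℒcls, head/tail category labels for ℒcat), and the three losses are combined with learnable homoscedastic uncertainties. The symmetric per-sample averaging below is one plausible reading of Eq. (2), not the paper's exact implementation.</p>
        <preformat>
import torch
import torch.nn.functional as F

def coarse_grained_loss(v, t, labels, tau=0.07):
    """Sketch of Eq. (2): images/texts sharing a label (class for L_cls,
    head/tail category for L_cat) are all treated as positives."""
    logits = v @ t.T / tau                           # (B, B)
    pos = (labels[:, None] == labels[None, :]).float()
    log_p_img = F.log_softmax(logits, dim=1)         # texts scored per image
    img_term = -(log_p_img * pos).sum(1) / pos.sum(1)
    log_p_txt = F.log_softmax(logits.T, dim=1)       # images scored per text
    txt_term = -(log_p_txt * pos).sum(1) / pos.sum(1)
    return (img_term + txt_term).mean()

class UncertaintyWeighting(torch.nn.Module):
    """Eq. (3): combine losses via learnable homoscedastic uncertainties."""
    def __init__(self, n_losses=3):
        super().__init__()
        self.sigma = torch.nn.Parameter(torch.zeros(n_losses))  # sigma-hat

    def forward(self, losses):
        total = 0.0
        for loss, s in zip(losses, self.sigma):
            total = total + loss * torch.exp(-s) + s  # residual + regularizer
        return total

# usage: total = UncertaintyWeighting()([l_clip, l_cls, l_cat]),
# with the sigma parameters optimized jointly with the visual encoder.
</preformat>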
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Text Enrichment</title>
        <p>Converting object names, perspectives, class names, and the patent image’s original description into text for input into a text encoder is a plausible way to generate text embeddings for supervisory purposes. However, this method encounters several limitations. Firstly, even the most comprehensive descriptions provided by patents are typically succinct and straightforward, such as FIG. 3 is a front elevational view of the light device, yielding unclear and sparse information about the image. Additionally, the similarity and potential overlap among different classes (e.g., automobiles, motor cars, and toy cars) can obscure the distinction of nuanced concepts. To overcome these challenges and derive more discriminative text features, we employ captioners [34] and LLMs [17]. Specifically, we first have the captioner generate detailed descriptions of the design in each image, and we provide the captioners with instructions that guide the description process. For example, instructions might include: Describe the distinct visual elements present in the design, such as shapes, contours, texture, and the arrangement of various components. The outcomes of this process, merged with pre-existing auxiliary information and predetermined instruction templates, are then fed into LLMs. For example, we employ templates such as This is a photo of {Object Name}, classified as {Class}, This image features {Details}, and {Object Name} can also be referred to as {Synonym} to generate enriched text. Ultimately, this approach yields around 20 detailed text descriptions per image, designed to mine the semantic nuances within the text feature space.</p>
        <p>Additionally, our research revealed that utilizing more specific object names significantly enhances feature learning. For example, the broad class of Emergency equipment encompasses distinct items such as lighting fixture, horticulture grow light, and lighting device. Therefore, in our text generation process, we prioritize these detailed object names over the more generic class names, differing from previous approaches that typically lean on class categorization [39, 40].</p>
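        <p>The enrichment step can be pictured with the following sketch. The prompt and template strings are taken from the examples above; caption_image and paraphrase_with_llm are hypothetical stand-ins for the captioner (e.g., BLIP-2) and the LLM call, which are not specified at this level of detail in the text.</p>
        <preformat>
# Sketch of the text-enrichment step. Helper callables are hypothetical.
CAPTION_PROMPT = ("Describe the distinct visual elements present in the design, "
                  "such as shapes, contours, texture, and the arrangement of "
                  "various components.")

TEMPLATES = [
    "This is a photo of {object_name}, classified as {cls}.",
    "This image features {details}.",
    "{object_name} can also be referred to as {synonym}.",
]

def enrich(image, object_name, cls, synonyms,
           caption_image, paraphrase_with_llm, n_descriptions=20):
    details = caption_image(image, CAPTION_PROMPT)      # captioner output
    seeds = [TEMPLATES[0].format(object_name=object_name, cls=cls),
             TEMPLATES[1].format(details=details)]
    seeds += [TEMPLATES[2].format(object_name=object_name, synonym=s)
              for s in synonyms]
    # The LLM diversifies the seeds into ~20 alias-containing descriptions.
    return paraphrase_with_llm(seeds, n=n_descriptions)
</preformat>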
      </sec>
      <sec id="sec-3-2">
        <title>3.4. Metrics for Patent Retrieval</title>
        <p>As mentioned before, in image-based patent retrieval, particularly for novelty detection and related-work searches, patent professionals explore databases of patents, applications, and scientific literature to determine an invention’s uniqueness. This process, crucial both before and after a patent application is filed, focuses on identifying whether similar prior inventions exist. Accordingly, when evaluating retrieval metrics, it is necessary to account for the temporal factor, ensuring that only prior art, rather than contemporaneous or subsequent inventions, is considered for retrieval [3]. Consider a database D in which each data point is represented by a tuple (v, t), with v being the image embedding and t the granted time of the patent the image belongs to. For each query image v<sub>q</sub> with granted time t<sub>q</sub>, define D′ ⊆ D containing only images granted before t<sub>q</sub>. The following retrieval metrics are then calculated over D′ given v<sub>q</sub>: (i) We use mean Average Precision (mAP), obtained by averaging AP scores across all classes. (ii) Following previous works, the standard evaluation protocol [41] is to report the recall at rank K (Recall@K or R@K) at different ranks (K = 5, 10). (iii) We calculate the Mean Reciprocal Rank @ K with temporal concern (MRR@K or M@K), which averages the reciprocal of the rank of the first correctly retrieved patent image within the top K rankings across all test samples.</p>
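        <p>The temporal constraint and the rank-based metrics can be sketched as follows. The grant-time filter implements D′ ⊆ D; the relevance criterion (here, matching the query’s class) and the array layout are illustrative assumptions for the sketch.</p>
        <preformat>
import numpy as np

def temporal_rank(query_emb, query_time, db_embs, db_times, db_classes, k=10):
    """Rank only prior art: items granted before the query's grant date."""
    prior = db_times &lt; query_time              # temporal filter (D' subset of D)
    sims = db_embs[prior] @ query_emb           # cosine sim for unit-norm embs
    order = np.argsort(-sims)                   # descending similarity
    return db_classes[prior][order][:k]         # top-k retrieved labels

def recall_and_mrr(topk, gold_class):
    hits = topk == gold_class
    if not hits.any():
        return 0.0, 0.0                         # no relevant item in top-k
    return 1.0, 1.0 / (int(np.argmax(hits)) + 1)  # R@k, reciprocal rank

# toy example: 5 database images, one 2020 query
db = np.random.randn(5, 512)
db /= np.linalg.norm(db, axis=1, keepdims=True)
topk = temporal_rank(db[0], query_time=2020,
                     db_embs=db, db_times=np.array([2016, 2017, 2018, 2019, 2020]),
                     db_classes=np.array([3, 1, 3, 2, 3]), k=10)
print(recall_and_mrr(topk, gold_class=3))
</preformat>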
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Implementation Details</title>
        <sec id="sec-4-1-1">
          <title>In our experiments, we employed PyTorch [42] and uti</title>
          <p>lized clusters of NVIDIA A100 GPUs. For the VLM, we
explored various ViT variants [31], ResNet50 [43],
EficientNetB0 [44], and SwinV2-B [45, 46] as backbones for the visual
encoder. The text encoder was adapted from the original
CLIP model [14], remaining fixed throughout the
experiments. For captioner, we leveraged open-source BLIP-2
[35] and GPT-4V [47]. Regarding LLMs, our focus was
on GPT-4 [47], though we also experimented with other
LLMs like GPT-3.5-Turbo [17] and LLaMA-2 [48].</p>
        <p>For our experiments, we employed the DeepPatent2 dataset. Previous research has mainly adopted the original DeepPatent dataset [5]; however, that dataset suffers from a narrow collection span, lacks image-related metadata, and does not segment sub-images. These limitations can introduce substantial noise into inter-image relationships. Fortunately, the DeepPatent2 dataset [1] addresses these issues effectively. To maintain comparability with previous methods, we utilized DeepPatent2 data from 2016 to 2019, consisting of 822,792 records with 407 Locarno classes. Of these, 90% were used for the training set and 10% for the validation set. For our query dataset, we used 252,296 records from the year 2020.</p>
        <p>Our baseline models are our replications of the state of the art on the same dataset to ensure comparability: PatentNet [5], SwinV2-B+ArcFace [6], and EfficientNet+ArcFace [7]. Our primary experiments focused on a variety of visual encoder backbones, including ResNet50, EfficientNetB0, ViT-B-32, and SwinV2-B. Due to space constraints, detailed results from experiments involving the captioner and LLMs will be presented in the full manuscript; for preliminary insights, we utilized GPT-4 for both the captioner and the LLM. In our ablation study, we explore the effects of the distribution-awareness losses (i.e., ℒcls and ℒcat), the text generation component, and the captioner.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>Table 1 presents the quantitative results on the DeepPatent2 query set. Overall, our approach significantly outperforms the state of the art, achieving up to a 53% improvement in mAP, 38% in Recall@5, 41.8% in Recall@10, and 51.9% in MRR@10. Notably, both ViT-B-32 and SwinV2-B show comparable performance, with each excelling under different metrics or scenarios. For instance, ViT-B-32 performs better on tail classes, while SwinV2-B shows strength on head classes, suggesting an interaction between data distribution and model architecture. With these results, we have achieved the state of the art in this task. Given the strong performance across all classes with ViT as the backbone, we base our ablation study on this model to further investigate the impact of various model components on performance.</p>
        <p>Table 2 presents the evaluation results of an ablation study on four components, starting with a baseline model, a standard CLIP model; the subsequent rows represent cumulative enhancements to this baseline. Firstly, by incorporating the distribution-aware contrastive losses, we observe significant performance gains in both head and tail classes, with tail classes experiencing more substantial improvements (approximately 20% in mAP and Recall@10) and head classes seeing roughly a 10% increase in these metrics. Next, by incorporating the text generation functionality, which is guided by the LLM, the model learns richer semantic relationships between images; the addition of this feature leads to a 10-20% improvement across various metrics. Finally, by integrating the captioner module, the key details of the design inventions in the images are directly expressed by the captioner, further highlighting the semantic focal points of the images and enhancing retrieval performance.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Evaluation results of an ablation study on various components within the model architecture.</p></caption>
          <table>
            <thead>
              <tr><th/><th colspan="2">Head Classes</th><th colspan="2">Tail Classes</th><th colspan="2">All Classes</th></tr>
              <tr><th/><th>mAP</th><th>R@10</th><th>mAP</th><th>R@10</th><th>mAP</th><th>R@10</th></tr>
            </thead>
            <tbody>
              <tr><td>Baseline</td><td>32.3</td><td>34.7</td><td>21.9</td><td>30.2</td><td>26.0</td><td>32.8</td></tr>
              <tr><td>+ ℒcls and ℒcat</td><td>47.4</td><td>49.2</td><td>47.0</td><td>48.4</td><td>47.9</td><td>49.7</td></tr>
              <tr><td>+ Text Generation</td><td>70.2</td><td>59.7</td><td>52.4</td><td>53.6</td><td>58.7</td><td>56.3</td></tr>
              <tr><td>+ Captioner</td><td>78.0</td><td>65.3</td><td>61.6</td><td>53.5</td><td>69.1</td><td>58.6</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Qualitative Results</title>
        <p>Qualitative results (see Figure 3) indicate that our approach retrieves better results than the previous state of the art, evident in both head and tail classes. Focusing solely on our model, it is apparent that it underperforms on tail classes. Additionally, our system not only retrieves the correct class given an image but also finds images that are visually similar to the query image. Furthermore, we delve into the errors made by the previous state of the art. For example, when presented with an image labeled shoe, the previous model might retrieve a shoe image that actually belongs to the category of shoelaces. This reveals that the previous approach did not align semantic information within the images, often leading to the retrieval of visually similar but categorically different images. Similar issues occur with categories like vehicles &amp; toy cars or flashlights &amp; chargers, where the images look alike but differ semantically. Our model mitigates these errors by guiding the image’s embedding space with linguistic information, which enhances semantic alignment.</p>
        <p>Based on the t-SNE results shown in Figure 4, our method yields more clustered embeddings, suggesting that the model effectively captures the inherent structures or classes within the data. Each cluster represents a group of similar images, closely corresponding to predefined classes, indicating that the model has learned meaningful and discriminative features for each class. This clustering enables the model to effectively distinguish between different classes.</p>
        <p>Conversely, the previous approach results in a t-SNE visualization with less clustered and more continuous embeddings, making it difficult to identify distinct clusters or their correspondence to predefined classes. This indicates that the model’s learned representations are less discriminative, potentially capturing more generalized features shared across multiple classes. Although some clustering is visible, it may relate not to actual classes but rather to visually similar images, blurring the boundaries between different classes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. User Study</title>
      <p>To further ensure the practical value of our system, we conducted a user study following rigorous psychological procedures. As for participants, we recruited 15 patent agents (48% female; average age: 33.4 years) to perform tasks related to design patent image retrieval.</p>
      <p>As for the procedure, we employed a double-blind test, in which participants were unaware of the underlying retrieval system during their tasks. They could encounter either our retrieval system or a system based on the previous approach. Each patent agent handled 30 retrieval tasks, randomly assigned to one of the two systems: 15 tasks with our system and 15 with the previous approach. After each task, participants rated their satisfaction with the retrieval results on a scale from 1 to 5 and recorded the time taken to complete the task (in hours). To minimize randomness, we averaged the scores across systems for each participant, resulting in four scores per person (two systems × two measures).</p>
      <p>The results of the paired t-test revealed significant differences in satisfaction levels, with patent agents showing higher satisfaction with our system than with the previous approach, t(14) = 3.30, p &lt; 0.01. Regarding task completion time, agents completed tasks faster using our system, t(14) = -4.30, p &lt; 0.001. These results indicate that our system is more efficient and better meets the practical needs of professionals in the field.</p>
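      <p>For reference, the per-participant paired comparison reported above corresponds to a standard paired t-test with 14 degrees of freedom, e.g., via SciPy. The numbers below are simulated placeholders, not the study’s data.</p>
      <preformat>
import numpy as np
from scipy import stats

# Simulated per-participant mean satisfaction scores (15 participants);
# the real study averaged each participant's 15 task ratings per system.
rng = np.random.default_rng(0)
ours = rng.normal(4.2, 0.4, 15)
prev = rng.normal(3.6, 0.4, 15)

t, p = stats.ttest_rel(ours, prev)   # paired t-test, df = 14
print(f"t(14) = {t:.2f}, p = {p:.3g}")
</preformat>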
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Our method has achieved new state-of-the-art results in the quantitative evaluation of mAP, Recall@K, and MRR@K, as well as high-quality image retrieval in the qualitative evaluation. For many years, commercial design patent retrieval systems have had significant shortcomings. For example, traditional text-based
searches can be limiting due to the subjective interpretation of design features and the difficulty of describing visual details with text. Although learning-based image retrieval systems have started to emerge in the last two years, their practical value remains limited.</p>
      <p>To address these issues, our proposal can effectively solve this problem. Firstly, we proposed a new learning-based architecture capable of learning image features with practical value. These representations not only contain visual information but are also aligned with corresponding (augmented) semantic text and classification data. This has substantial practical value because specific graphic semantic features such as curvature, edges, and geometric details are considered; focusing solely on the image itself might overlook these critical visual elements. Secondly, we utilize a larger and better-defined dataset, which encompasses a broader collection span, image-related metadata, and segmented sub-images. This makes the model more robust and enhances its accuracy. Thirdly, addressing the long-tail distribution in classification, our study is the first to propose distribution-aware losses, which have proven to be effective. Lastly, we conducted a user study to demonstrate the practical value of our system in the field, showing that it is more efficient, accurate, and time-saving.</p>
      <p>In the future, we have several directions for further expansion: (i) Identifying similarities between an invention and prior art, which involves not only using existing models for visualization through explainable AI [49] but also leveraging data from examiners’ reports (Office Actions) to guide further exploration. (ii) While our research currently focuses on prior art searches, future work could also explore other temporal dimensions, such as infringement searches. Additionally, image domain adaptation [50] could be used to enhance the effectiveness of searches across different domains, such as retrieving E-commerce images using patent drawings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>view knowledge for 3d visual grounding with</article-title>
          [26]
          <string-name>
            <given-names>N.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Meirom</surname>
          </string-name>
          , G. Chechik, Y. Atz-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>arXiv:2303.16894</source>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv. frozen
          <article-title>vision-language representations</article-title>
          , in: Euro-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          2303.16894. pean Conference on
          <source>Computer Vision</source>
          , Springer, [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <year>2022</year>
          , pp.
          <fpage>558</fpage>
          -
          <lpage>577</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2204.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , Prompt, generate, then cache: Cas-
          <volume>01694</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>cade of foundation models makes strong few-shot</article-title>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          , Pro-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>15211</fpage>
          -
          <lpage>15222</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <source>actions on Image Processing</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>8892</fpage>
          -
          <lpage>8902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          2303.02151. doi:
          <volume>10</volume>
          .1109/TIP.
          <year>2020</year>
          .
          <volume>3020383</volume>
          . [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          , J. D. [28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sangkloy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Burnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          , The
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models bunnies</article-title>
          ,
          <source>ACM Transactions on Graphics (TOG) 35</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>are few-shot learners</article-title>
          ,
          <source>Advances in neural infor-</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:
          <volume>10</volume>
          .1145/2897824.2925954.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>mation processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Hospedales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yin</surname>
          </string-name>
          , Y.-
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          , T. Xi-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.
          <year>2005</year>
          .
          <volume>14165</volume>
          .
          <string-name>
            <surname>ang</surname>
          </string-name>
          , L. Wang,
          <article-title>Deep learning for free-hand sketch</article-title>
          : [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <article-title>Vl-ltr: A survey, IEEE transactions on pattern analy-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>Learning class-wise visual-linguistic representation sis</article-title>
          and
          <source>machine intelligence</source>
          <volume>45</volume>
          (
          <year>2022</year>
          )
          <fpage>285</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>for long-tailed visual recognition</article-title>
          , in: European doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2001</year>
          .
          <volume>02600</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp. [30]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          , T. Liu,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          , Transferable
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          73-
          <fpage>91</fpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2111.13579.
          <article-title>coupled network for zero-shot sketch-based im</article-title>
          [19]
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Christensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , L. A. Turner, age retrieval,
          <source>IEEE Transactions on Pattern Analy-</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Research</given-names>
            <surname>Methods</surname>
          </string-name>
          , Design, and
          <string-name>
            <surname>Analysis</surname>
          </string-name>
          , 12 ed.,
          <source>sis and Machine Intelligence</source>
          <volume>44</volume>
          (
          <year>2021</year>
          )
          <fpage>9181</fpage>
          -
          <lpage>9194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Global</given-names>
            <surname>Edition</surname>
          </string-name>
          ,
          <year>2014</year>
          . Page count:
          <volume>542</volume>
          ; Dimensions: doi:10.1109/TPAMI.
          <year>2021</year>
          .
          <volume>3123315</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>23 x 18</source>
          .
          <article-title>6 cm; Book number</article-title>
          :
          <fpage>00106962</fpage>
          . [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , D. Weis[20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhattarai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Castorena</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Yang, senborn, X. Zhai,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          , M. Dehghani,
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>based deep learning and transfer learning, in: Pro- worth 16x16 words: Transformers for image recog-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>ceedings of the IEEE/CVF conference on computer nition at scale</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .11929
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>vision and pattern recognition workshops</source>
          ,
          <year>2020</year>
          , pp. (
          <year>2020</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2010</year>
          .
          <volume>11929</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          174-
          <fpage>175</fpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2004</year>
          .
          <volume>10780</volume>
          . [32]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , G. Kang,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Ran[21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shao</surname>
          </string-name>
          , S. C.
          <article-title>Hoi, dom erasing data augmentation</article-title>
          , in: Proceedings
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>and outlook</article-title>
          ,
          <source>IEEE transactions on pattern analy-</source>
          volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>13001</fpage>
          -
          <lpage>13008</lpage>
          . doi:
          <volume>10</volume>
          .48550/
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>sis and machine intelligence</source>
          <volume>44</volume>
          (
          <year>2021</year>
          )
          <fpage>2872</fpage>
          -
          <lpage>2893</lpage>
          . arXiv.
          <volume>1708</volume>
          .04896.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>doi:10</source>
          .1109/TPAMI.
          <year>2021</year>
          .
          <volume>3054775</volume>
          . [33]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          , Gridmask data aug[22]
          <string-name>
            <given-names>X.-Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          , C.-
          <string-name>
            <surname>C. Yu</surname>
            ,
            <given-names>X.-X.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zhou</surname>
          </string-name>
          , Large- mentation, arXiv preprint arXiv:
          <year>2001</year>
          .
          <volume>04086</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>scale person re-identification based on deep hash doi:10</source>
          .48550/arXiv.
          <year>2001</year>
          .
          <volume>04086</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>learning</surname>
          </string-name>
          ,
          <source>Entropy</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <article-title>449</article-title>
          . doi:
          <volume>10</volume>
          .1109/TIP. [34]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Visual instruc-
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <year>2017</year>
          .2695101. tion tuning,
          <source>Advances in neural information pro</source>
          [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rodriguez-Opazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teney</surname>
          </string-name>
          , S. Gould,
          <source>cessing systems 36</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>Image retrieval on real-life images with pre-trained 2304</article-title>
          .08485.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>vision-</surname>
          </string-name>
          and
          <article-title>-language models</article-title>
          ,
          <source>in: Proceedings of the [35] J</source>
          .
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Savarese</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hoi</surname>
          </string-name>
          , Blip-2: Bootstrap-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Vision</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>2125</fpage>
          -
          <lpage>2134</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <article-title>age encoders and large language models</article-title>
          , arXiv
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          2108.04024. preprint arXiv:
          <volume>2301</volume>
          .12597 (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/ [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Karthik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mancini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Akata</surname>
          </string-name>
          ,
          <string-name>
            <surname>Vision-</surname>
          </string-name>
          by-
          <source>arXiv.2301</source>
          .12597.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>language for training-free compositional image re-</article-title>
          [36]
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Lo</surname>
          </string-name>
          , C.
          <article-title>-S. Fuh, Enhancing long-tailed 3d se-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>trieval</surname>
          </string-name>
          ,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2310.09291.
          <article-title>mantic segmentation with category-wise linguistic-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>arXiv:2310</source>
          .09291. visual representation,
          <source>in: The 36th IPPR Conference</source>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baldrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Agnolucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Bimbo</surname>
          </string-name>
          ,
          <source>on Computer Vision</source>
          , Graphics, and
          <string-name>
            <surname>Image</surname>
          </string-name>
          Process-
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>Zero-shot composed image retrieval with textual ing (CVGIP), Kinmen</article-title>
          , Taiwan,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>inversion</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:2303.15247</source>
          (
          <year>2023</year>
          ). [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kendall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <article-title>What uncertainties do we need</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2303.15247.
          <article-title>in bayesian deep learning for computer vision</article-title>
          ?, Ad-
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>vances in neural information processing systems</source>
          [48]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Alma-
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <volume>30</volume>
          (
          <year>2017</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.1703.04977. hairi, Y. Babaei,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          , P. Bhargava, [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kendall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cipolla</surname>
          </string-name>
          ,
          <article-title>Geometric loss functions S. Bhosale</article-title>
          , et al.,
          <source>Llama</source>
          <volume>2</volume>
          : Open foundation and fine-
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <article-title>for camera pose regression with deep learning, in: tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <source>Proceedings of the IEEE conference on computer</source>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <source>vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5974</fpage>
          -
          <lpage>5983</lpage>
          . [49]
          <string-name>
            <given-names>F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.1704.00390.
          <article-title>Explainable ai: A brief survey on history</article-title>
          , research [39]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rozenberszki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Litany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dai</surname>
          </string-name>
          , Language- areas, approaches and challenges, in: Natural Lan-
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <article-title>grounded indoor 3d semantic segmentation in the guage Processing and Chinese Computing: 8th</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>wild, in: European Conference on Computer Vi- CCF International Conference, NLPCC 2019, Dun-</mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          sion, Springer,
          <year>2022</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>141</lpage>
          . doi:
          <volume>10</volume>
          .48550/ huang, China, October 9-
          <issue>14</issue>
          ,
          <year>2019</year>
          , Proceedings,
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <source>arXiv.2204.07761. Part II 8</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>563</fpage>
          -
          <lpage>574</lpage>
          . doi:
          <volume>10</volume>
          .1007/ [40]
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          , C. Wu,
          <volume>978</volume>
          -3-
          <fpage>030</fpage>
          -32236-6_
          <fpage>51</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          , G. Huang, Joint representation learn- [50]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          , Deep visual domain adaptation:
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <article-title>ing for text and 3d point cloud, Pattern Recog- A survey</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>312</volume>
          (
          <year>2018</year>
          )
          <fpage>135</fpage>
          -
          <lpage>153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <source>nition 147</source>
          (
          <year>2024</year>
          )
          <article-title>110086</article-title>
          . doi:
          <volume>10</volume>
          .48550/arXiv. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1802</year>
          .
          <volume>03601</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          2301.
          <fpage>07584</fpage>
          . [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baldrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Uricchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Bimbo</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <source>pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>21466</fpage>
          -
          <lpage>21474</lpage>
          . [42]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          , J. Brad-
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <source>neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.
          <year>1912</year>
          .
          <volume>01703</volume>
          . [43]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Deep residual
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>tern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi:
          <volume>10</volume>
          .48550/
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          arXiv.
          <volume>1512</volume>
          .03385. [44]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , Eficientnet: Rethinking model scal-
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          <year>2019</year>
          , pp.
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1905</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          11946. [45]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          <source>ence on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2103.14030. [46]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          , J. Ning,
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <article-title>v2: Scaling up capacity and resolution</article-title>
          , in: Proceed-
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          <source>sion and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>12009</fpage>
          -
          <lpage>12019</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2111.09883. [47]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          , I. Akkaya,
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          <string-name>
            <surname>man</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <source>Gpt-4 technical re-</source>
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          <string-name>
            <surname>port</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ). doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>48550/arXiv.2303.08774.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>