<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Hierarchical Multi-Positive Contrastive Learning for Patent Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kshitij Kavimandan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelos Nalmpantis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Beauxis-Aussalet</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert-Jan Sips</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TKH AI</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vrije Universiteit Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Patent images are technical drawings that convey information about a patent's innovation. Patent image retrieval systems aim to search vast collections and retrieve the most relevant images. Despite recent advances in information retrieval, patent images still pose significant challenges due to their technical intricacies and complex semantic information, requiring efficient fine-tuning for domain adaptation. Current methods neglect patents' hierarchical relationships, such as those defined by the Locarno International Classification (LIC) system, which groups broad categories (e.g., “furnishing”) into subclasses (e.g., “seats” and “beds”) and further into specific patent designs. In this work, we introduce a hierarchical multi-positive contrastive loss that leverages the LIC's taxonomy to induce such relations in the retrieval process. Our approach assigns multiple positive pairs to each patent image within a batch, with varying similarity scores based on the hierarchical taxonomy. Our experimental analysis with various vision and multimodal models on the DeepPatent2 dataset shows that the proposed method enhances the retrieval results. Notably, our method is effective with low-parameter models, which require fewer computational resources and can be deployed in environments with limited hardware.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Patent Image Retrieval</kwd>
        <kwd>Hierarchical Multi-Positive Contrastive Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Patent images are technical drawings that illustrate the novelty of a patent, often conveying its
details more effectively than natural language text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Consequently, technical patent reports
are typically accompanied by multiple images capturing different aspects of the invention. With the
rapidly growing volume of patents, efficient patent image retrieval systems are becoming an essential
component for searching these vast collections.
      </p>
      <p>
        Many advances in information retrieval have been largely driven by the power of attention-based
models [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and the knowledge acquired during extensive pretraining phases, mainly focused on the
language domain. While similar models, such as Vision Transformer (ViT) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and ResNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], have
provided remarkable results on a plethora of vision tasks, they still fall short when processing technical
drawings since their pretraining mainly involves natural images [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. To address this domain
shift, researchers have released specialized sketch datasets [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] that facilitate model fine-tuning on
such images. Similarly, large-scale datasets containing patent images have emerged to address their
unique intricacies and enable the development of efficient patent image retrieval methods.
      </p>
      <p>
        DeepPatent [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was the first large-scale dataset designed for training and evaluating patent image
retrieval systems, comprising over 350,000 images across 45,000 patents, enabling the development of
PatentNet, which exhibited significant improvements in patent image retrieval. Additionally, several
studies investigated the generation of synthetic text descriptions by leveraging the zero-shot capabilities
of (vision) language models [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], allowing the application of multimodal models, such as CLIP
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], on patent image retrieval. Inspired by DeepPatent, DeepPatent2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] provided an extension of
the dataset, scaling to more than 2.7 million images with patents spanning from 2007 to 2020 while
also incorporating additional metadata like the object’s name. Despite the advances in patent image
retrieval [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], many methodologies determine the relevance of images based on their association
with the same patent. This criterion neglects the rich hierarchical taxonomies of patents that are defined
by standardized classification systems. Such hierarchical similarities could potentially enhance the
effectiveness of patent image retrieval systems.
      </p>
      <p>
        In this paper, we aim to address this limitation by leveraging the hierarchical taxonomy of patents as
defined by the Locarno International Classification (LIC) system [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which organizes industrial designs
into a structured taxonomy consisting of 32 main classes, each further divided into various subclasses.
Figure 1 provides an example of how patents are organized within this hierarchical taxonomy. For
brevity, we omit illustrating all classes entailed in the LIC taxonomy. While many studies aim to capture
the inherent hierarchical information of data, ranging from representation learning methods [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]
to specialized architectures [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], it remains unclear how to properly leverage patents’ hierarchical
relations for improving patent image retrieval.
      </p>
      <p>
        To this end, we propose a hierarchical multi-positive contrastive learning method that explicitly
integrates these hierarchical relations of patents into the training process. Our method extends upon
previous works on patent image retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and contrastive learning approaches [
        <xref ref-type="bibr" rid="ref11 ref19">11, 19</xref>
        ] by treating
patent images of the same hierarchical main class, subclass and patent ID as positive with varying
degrees of similarity. Figure 1 compares conventional contrastive learning methods with the proposed
approach. With the conventional method shown in Figure 1(b), each image is associated only with
one positive pair that belongs to the same patent ID. In contrast, the proposed approach in Figure 1(c)
respects the hierarchical taxonomy, assigning higher positive scores to images with finer taxonomic
relationships. For example, two images from the same patent receive the highest positive score, reflecting
their direct relationship. Images that belong only to the same Locarno subclass are assigned a slightly
lower positive score, while those that share only the same Locarno main class receive an even lower
score.
      </p>
      <p>In our experimental analysis with various architectures, we demonstrate that our approach enhances
retrieval performance. Notably, the proposed method shows great effectiveness with low-parameter
models, which can be deployed in resource-constrained environments where computational efficiency is
crucial.</p>
      <p>The rest of the paper is structured as follows. First, in Section 2, we formulate the proposed hierarchical
multi-positive contrastive learning method for patents. Then, in Section 3, we provide the details of the
experimental setup, facilitating the reproducibility of our results. In Section 4, we report our findings
and demonstrate the effectiveness of our approach. Finally, in Section 5, we draw the conclusions of
this study and discuss future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>To induce hierarchical relations among patents, we propose a Hierarchical Multi-Positive Contrastive
Learning approach that leverages the hierarchical taxonomy provided by the LIC system. Our approach
enables the model to align patent images of the same main class, subclass and patent ID incrementally
closer in the embedding space.</p>
      <p>
        Let $\mathcal{X}$ be a collection of patent images, $x_i \in \mathcal{X}$ a sample image from the dataset and $z_i \in \mathbb{R}^d$ the
corresponding image embedding provided by an image encoder. Considering a batch of $2N$ images
that form $N$ positive pairs $(x_i, \tilde{x}_i)$, with $z_i$ and $\tilde{z}_i$ the embeddings of the anchor image $x_i$ and its positive pair $\tilde{x}_i$,
the contrastive loss [
        <xref ref-type="bibr" rid="ref11 ref20">11, 20</xref>
        ] is defined as:
$$\mathcal{L}_i = -\log \frac{\exp(\mathrm{sim}(z_i, \tilde{z}_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(z_i, \tilde{z}_j)/\tau)} \qquad (1)$$
where $\mathrm{sim}(z_i, \tilde{z}_j)$ indicates the cosine similarity between the two vector embeddings $z_i$ and $\tilde{z}_j$, and $\tau$ is the temperature hyperparameter.
      </p>
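      <p>
        For illustration, the following is a minimal PyTorch sketch of the batch-wise contrastive loss in Equation 1. It is our own sketch under the notation above, not the authors’ released code; the inputs are assumed to be the stacked anchor and positive embeddings.
      </p>
      <preformat>
import torch
import torch.nn.functional as F

def contrastive_loss(z, z_pos, tau=0.1):
    """Equation 1 over a batch of N positive pairs.

    z, z_pos: (N, d) embeddings of the anchors and their positives.
    Anchor i treats z_pos[i] as its positive and every other z_pos[j] as a negative.
    """
    z = F.normalize(z, dim=-1)            # cosine similarity via dot products
    z_pos = F.normalize(z_pos, dim=-1)
    logits = z @ z_pos.T / tau            # (N, N) matrix of sim(z_i, z_j~) / tau
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)  # mean of -log softmax at the diagonal
      </preformat>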
      <p>[Figure 1: (a) The Locarno International Classification system organizes patents into main classes (e.g., “Furnishing”, “Foodstuffs”), subclasses (e.g., “Seats”, “Beds”, “Fruits”), and patent IDs, each with multiple images. (b) Conventional contrastive learning versus (c) the proposed hierarchical multi-positive approach, which reflects the hierarchical relationship from (a). $I_{i,j}$ denotes the $j$-th image that belongs to patent $i$.]</p>
      <p>The loss defined in Equation 1, as well as similar losses employed in prior work, such as in PatentNet,
is unable to properly capture the hierarchical relations of patents within the batch. In contrast, $\mathcal{L}_i$
should accommodate multiple positive pairs for the anchor image $x_i$ and assign a different relevance
score to each pair depending on their hierarchical relations within the LIC taxonomy.</p>
      <p>Let $h(x_i, \tilde{x}_j)$ define the relevance score between two images $x_i$ and $\tilde{x}_j$:
$$h(x_i, \tilde{x}_j) = \begin{cases} \alpha \quad \text{if } x_i \text{ and } \tilde{x}_j \text{ belong to the same patent ID} \\ \beta \quad \text{if } x_i \text{ and } \tilde{x}_j \text{ belong to the same subclass} \\ \gamma \quad \text{if } x_i \text{ and } \tilde{x}_j \text{ belong to the same main class} \\ 0 \quad \text{otherwise} \end{cases} \qquad (2)$$
where $\alpha &gt; \beta &gt; \gamma$ are positive scalar values that reflect the importance of matching at different
hierarchical levels. The function $h$ assigns the highest relevance score to the most specific case
(same patent ID), with progressively lower scores for broader relationships. Additionally, let $H_i$ be the
normalization factor for the patent image $x_i$:
$$H_i = \sum_{j=1}^{N} h(x_i, \tilde{x}_j) \qquad (3)$$
Then, the hierarchical multi-positive contrastive loss is defined as:
$$\mathcal{L}_i^{h} = -\sum_{j=1}^{N} \frac{h(x_i, \tilde{x}_j)}{H_i} \log \frac{\exp(\mathrm{sim}(z_i, \tilde{z}_j)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(z_i, \tilde{z}_k)/\tau)} \qquad (4)$$
This formulation enables the model to learn representations that align each image $x_i$ with multiple
other images from the batch based on their hierarchical proximity within the LIC taxonomy.</p>
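      <p>
        A possible PyTorch implementation of Equations 2–4 is sketched below. This is an illustrative reading of the method, not the official code: the label tensors (patent_id, subclass, main_class) and the helper names are our assumptions.
      </p>
      <preformat>
import torch
import torch.nn.functional as F

def relevance_scores(patent_id, subclass, main_class, alpha=1.0, beta=0.35, gamma=0.2):
    """Equation 2: pairwise relevance h(x_i, x_j~) from the LIC labels.

    patent_id, subclass, main_class: (N,) integer labels of the anchors;
    each positive x_j~ is assumed to carry the same labels as anchor j.
    """
    h = torch.zeros(len(patent_id), len(patent_id))
    h[main_class[:, None] == main_class[None, :]] = gamma   # coarsest level first ...
    h[subclass[:, None] == subclass[None, :]] = beta
    h[patent_id[:, None] == patent_id[None, :]] = alpha     # ... overwritten by finer matches
    return h

def hierarchical_contrastive_loss(z, z_pos, h, tau=0.1):
    """Equation 4: multi-positive contrastive loss with weights h / H_i."""
    z, z_pos = F.normalize(z, dim=-1), F.normalize(z_pos, dim=-1)
    log_p = F.log_softmax(z @ z_pos.T / tau, dim=1)   # log of the softmax in Eq. 1
    w = h / h.sum(dim=1, keepdim=True)                # Equation 3 normalization
    return -(w * log_p).sum(dim=1).mean()
      </preformat>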
      <p>In the case where the text description $t_i$ of the patent image $x_i$ is available, we can incorporate
language supervision by adding an additional term to $\mathcal{L}_i^{h}$:
$$- \lambda \sum_{j=1}^{N} \frac{h(x_i, \tilde{x}_j)}{H_i} \log \frac{\exp(\mathrm{sim}(z_i, s_j)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(z_i, s_k)/\tau)}$$
where $s_j$ denotes the embedding of the text description $t_j$ provided by a language encoder. The
hyperparameter $\lambda$ is a weighting factor controlling the language supervision.</p>
      <p>Note that Equation 1 is a special case of Equation 4. The two equations are equivalent when only a
single positive pair exists with a score of 1, and all other pairs are assigned a score of 0.</p>
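      <p>
        The language-supervision term can be added analogously, reusing the helpers sketched above. In this hedged sketch, t holds the text embeddings produced by the language encoder and lam corresponds to the weighting factor $\lambda$.
      </p>
      <preformat>
import torch.nn.functional as F

def multimodal_loss(z, z_pos, t, h, tau=0.1, lam=0.2):
    """Image-image loss (Equation 4) plus the lambda-weighted image-text term."""
    loss_img = hierarchical_contrastive_loss(z, z_pos, h, tau)
    log_p = F.log_softmax(F.normalize(z, dim=-1) @ F.normalize(t, dim=-1).T / tau, dim=1)
    w = h / h.sum(dim=1, keepdim=True)       # same hierarchical weights as Eq. 4
    loss_txt = -(w * log_p).sum(dim=1).mean()
    return loss_img + lam * loss_txt
      </preformat>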
      <p>While our implementation leverages the LIC system, this approach generalizes to other hierarchical
classification systems, such as the Cooperative Patent Classification system. Alternative taxonomies
can be seamlessly integrated by appropriately defining the scoring function ℎ to reflect their specific
hierarchical structures.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>
        For conducting the experiments, we use the DeepPatent2 dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for the year 2007, which contains
multiple images per patent along with the patent’s code from the LIC system and a short textual
description of the depicted object. The experimental setup is similar to Kucer et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We split the data
using a 72.25/12.75/15 ratio for training, validation and testing, respectively. In training, we sample 64
patents, and for each we randomly pick 2 images that form a positive pair based on the patent ID. For
testing, we sample 2 images from each patent, with each image being used individually as a query.
The rest of the patent images from the test set form the database used for searching. All images are
resized to a resolution of 224 × 224. During training, we use the following augmentation techniques
to avoid overfitting: horizontal flipping with probability 0.3, rotation by a maximum of
10 degrees with probability 0.5, and Gaussian noise with probability 0.2. For testing, no
augmentation methods are applied.
      </p>
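      <p>
        For reproducibility, the described preprocessing and augmentation pipeline could be expressed with torchvision roughly as follows; the noise standard deviation is not stated in the paper and is an assumption here.
      </p>
      <preformat>
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.05):                # std is our assumption
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.RandomApply([transforms.RandomRotation(degrees=10)], p=0.5),
    transforms.ToTensor(),
    transforms.RandomApply([transforms.Lambda(add_gaussian_noise)], p=0.2),
])

test_transform = transforms.Compose([                 # no augmentation at test time
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
      </preformat>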
      <p>
        We conduct experiments with the ViT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], CLIP [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and ResNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] architectures of different sizes.
The vision models, ViT and ResNet, are initialized from versions pretrained on ImageNet, while the CLIP
models are pretrained on the dataset from [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        We use the AdamW optimizer [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] with a learning rate of 0.0001 and weight decay of 0.01. All
models are trained for 20 epochs until convergence, with early stopping based on the validation set.
Each experiment is repeated for multiple random seeds. For the ViT and ResNet models, we repeat
the experiments for 5 different seeds, while for the CLIP models, which require more computational
resources, we use 3 different seeds. The temperature $\tau$ and the hyperparameter $\lambda$ are set to 0.1 and
0.2, respectively. For the scoring function $h$, we set $\alpha = 1$, $\beta = 0.35$ and $\gamma = 0.2$, emphasizing the
patent ID level while still incorporating information from higher levels. These values offer a balanced
performance across all levels and a fair comparison with the baselines that mainly focus on the patent
ID level. Note that a different scoring function could be used depending on the significance of each
hierarchical level for the use case at hand.
      </p>
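      <p>
        A minimal sketch of the stated optimization setup is given below; the backbone is a placeholder standing in for any of the ViT, ResNet, or CLIP encoders.
      </p>
      <preformat>
import torch
import torch.nn as nn

backbone = nn.Linear(512, 128)   # placeholder for a ViT / ResNet / CLIP encoder
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-4, weight_decay=0.01)
# Training runs for up to 20 epochs, with early stopping on the validation set.
      </preformat>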
      <p>The models are evaluated using the mean Average Precision (mAP), the normalized Discounted
Cumulative Gain (nDCG), the Top-K Mean Reciprocal Rank (MRR@K) and the Top-K Accuracy (Acc@K).</p>
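      <p>
        As a reference, the ranking metrics MRR@K and Acc@K can be computed as sketched below, where relevance at each hierarchical level is defined by a matching patent ID, subclass, or main class. This is a standard formulation, not code from the paper.
      </p>
      <preformat>
import numpy as np

def mrr_at_k(relevance, k):
    """MRR@K: mean reciprocal rank of the first relevant item in the top K.

    relevance: (num_queries, list_len) binary array, sorted by retrieval score.
    """
    rr = []
    for row in relevance[:, :k]:
        hits = np.flatnonzero(row)
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))

def acc_at_k(relevance, k):
    """Acc@K: fraction of queries with at least one relevant item in the top K."""
    return float(relevance[:, :k].any(axis=1).mean())
      </preformat>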
      <p>
        The experiments are conducted using PyTorch [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], PyTorch Lightning [23], and the transformers
library from Hugging Face [24]. The training process of a model takes approximately 2.5 hours on a
single NVIDIA A100 GPU.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>Overall, the hierarchical multi-positive contrastive loss enhances retrieval performance across all
hierarchical levels. Notably, the proposed approach provides significant improvements with the ResNet
architecture and lower-parameter models such as ViT Tiny. While with larger ViT models we notice
improved performance at the Subclass and Main Class levels, we observe a slight deterioration at the
Patent ID level. This trade-off is expected, as images from higher hierarchical levels have a higher
similarity score and rank higher in the retrieved list. We also calculate the standard deviation between
the runs, but we do not observe any significant difference between the methods. For the Patent ID level,
the standard deviation is approximately ±0.005, for the Subclass level it is ±0.002, and for the Main
Class level it is ±0.001, for both methods and metrics.</p>
      <p>Table 2 reports the results with the CLIP model. First, we evaluate only the ViT component from a
pretrained CLIP, in isolation from the language encoder. Additionally, we experiment in a multimodal
setting with minimal language supervision where the textual descriptions are defined using the following
format:</p>
      <p>“This is a patent image of a [OBJECT_NAME].”
where [OBJECT_NAME] represents the object’s description provided by DeepPatent2. These models
provide significant improvements compared to the ViT and ResNet models from Table 1. This can
potentially be attributed to the extensive and contextualized pretraining phase of CLIP. Additionally,
language supervision further improves performance. Finally, we observe a similar performance trade-off
between the Patent ID and the higher hierarchical levels, as previously shown in Table 1 for ViT Base and
ViT Large. In the case of the CLIP models, the deterioration in performance at the Patent ID level is
more pronounced, resulting from greater improvements at the Subclass and Main Class levels.</p>
      <p>Figure 2 reports the results with ResNet-18 and ResNet-50 using the metrics MRR@K and Acc@K
for $K \in \{1, 5, 10, 20\}$, providing a more comprehensive overview of the retrieved list. For all levels
(Patent ID, Subclass, and Main Class), the proposed approach outperforms the conventional contrastive
learning method, with more relevant items being found at higher ranks in the retrieved list.</p>
      <p>Finally, we project the embeddings of ViT Base into 2 dimensions using PCA. Figure 3 illustrates the
samples from 5 subclasses (where 2 subclasses belong to the same main class). We notice that without
any hierarchical information induced during training, the classes have a higher overlap and are less
distinctly separated. In contrast, the proposed approach leads to more coherent clustering, with samples
from the same subclass positioned closer together and subclasses of the same main class being closer in
the embedding space.</p>
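      <p>
        The 2-D projection in Figure 3 can be reproduced along these lines; the embeddings array below is a random placeholder for the ViT Base image embeddings.
      </p>
      <preformat>
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(500, 768)                  # placeholder for ViT Base embeddings
coords = PCA(n_components=2).fit_transform(embeddings)  # 2-D points for the scatter plot
      </preformat>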
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we presented a hierarchical multi-positive contrastive learning approach to improve patent
image retrieval. We integrated the hierarchical relationships of patents defined by the LIC system into
the training process, allowing the models to capture this rich information in the embedding space.
Our approach considers multiple positive pairs within a batch for an anchor image, with each pair
being assigned a different relevance score, which reflects how closely their patents are classified within
the chosen hierarchical taxonomy (e.g., LIC). Experimental results demonstrated that our approach
enhanced performance at all hierarchical levels, exhibiting notable improvements with low-parameter
models.</p>
      <p>
        Our findings suggest that incorporating the hierarchical information of patents can improve patent
image retrieval, opening several promising avenues for future research. One direction could be to
explore hyperbolic embeddings, which are inherently more suitable for capturing hierarchical structures
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Finally, our study was specifically focused on the LIC taxonomy. Future directions could investigate
alternative taxonomies, for example the Cooperative Patent Classification system, which provides a
more granular hierarchical structure with additional levels.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kucer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Castorena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Deeppatent: Large scale patent drawing recognition and retrieval</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2309</fpage>
          -
          <lpage>2318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , in: International Conference on Learning Representations,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=YicbFdNTTy.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Geirhos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rubisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Michaelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bethge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Wichmann</surname>
          </string-name>
          , W. Brendel,
          <article-title>Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness</article-title>
          , in: International conference on learning representations,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Learning robust global representations by penalizing local predictive power</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sangkloy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Burnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ham</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Hays,</surname>
          </string-name>
          <article-title>The sketchy database: learning to retrieve badly drawn bunnies</article-title>
          ,
          <source>ACM Transactions on Graphics (TOG) 35</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Aubakirova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gerdes</surname>
          </string-name>
          , L. Liu, Patfig:
          <article-title>Generating short and long captions for patent figures</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>2843</fpage>
          -
          <lpage>2849</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Large language model informed patent image retrieval</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.19360. arXiv:2404.19360.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          , in: International conference on machine learning,
          <source>PmLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ajayi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shields</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kucer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oyen</surname>
          </string-name>
          ,
          <article-title>Deeppatent2: A large-scale benchmarking corpus for technical drawing understanding</article-title>
          ,
          <source>Scientific Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>772</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Learning eficient representations for image-based patent retrieval</article-title>
          ,
          <source>in: Chinese Conference on Pattern Recognition and Computer Vision</source>
          (PRCV), Springer,
          <year>2023</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Higuchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yanai</surname>
          </string-name>
          ,
          <article-title>Patent image retrieval using transformer-based deep metric learning</article-title>
          ,
          <source>World Patent Information</source>
          <volume>74</volume>
          (
          <year>2023</year>
          )
          <fpage>102217</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>World Intellectual Property Organization</surname>
          </string-name>
          , Locarno classification, https://www.wipo.int/classifications/locarno/,
          <year>2025</year>
          . Accessed: April
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mettes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Ghadimi</given-names>
            <surname>Atigh</surname>
          </string-name>
          , M. Keller-Ressel,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <article-title>Hyperbolic deep learning in computer vision: A survey</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>132</volume>
          (
          <year>2024</year>
          )
          <fpage>3484</fpage>
          -
          <lpage>3508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nalmpantis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lippe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Magliacane</surname>
          </string-name>
          ,
          <article-title>Hierarchical causal representation learning</article-title>
          ,
          <source>in: Causal Representation Learning Workshop at NeurIPS</source>
          <year>2023</year>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          , Stablerep:
          <article-title>Synthetic images from text-to-image models make strong visual representation learners</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>48382</fpage>
          -
          <lpage>48402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>A. van den Oord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Representation learning with contrastive predictive coding</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1807.03748. arXiv:1807.03748.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          . URL: https://openreview.net/forum?id=Bkg6RiCqY7.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <article-title>Pytorch: An imperative style, high-performance deep learning library</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1912.01703. arXiv:1912.01703.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Falcon</surname>
          </string-name>
          , The PyTorch Lightning team, PyTorch Lightning,
          <year>2019</year>
          . URL: https://github.com/Lightning-AI/lightning. doi:10.5281/zenodo.3828935.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          , L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush,
          <article-title>Transformers: State-of-the-art natural language processing</article-title>
          , in:
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https://aclanthology.org/2020.emnlp-demos.6/. doi:10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>