<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Patents Visually: An Interactive Search System for Multimodal Patent Image Search and Interpretation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sushil Awale</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Müller-Budack</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rahim Delaviz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Ewerth</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Patent Office</institution>
          ,
          <addr-line>The Hague</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>L3S Research Center, Leibniz University Hannover</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>TIB - Leibniz Information Centre for Science and Technology</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Marburg and hessian.AI - Hessian Center for Artificial Intelligence</institution>
          ,
          <addr-line>Marburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>76</fpage>
      <lpage>83</lpage>
      <abstract>
        <p>Most patent retrieval systems are text-based, which underutilizes the multimodal nature of patent documents. Although a few multimodal patent retrieval systems exist, they fall short in providing efficient and informative visualizations and in facilitating the interpretation of retrieval results. To address these shortcomings, this paper presents iPatent, a novel web-based multimodal patent image retrieval system. Unlike previous solutions, iPatent integrates state-of-the-art deep learning models for fine-grained unimodal, cross-modal, and multimodal patent image retrieval. Additionally, it employs both traditional machine learning techniques and modern generative methods for interactive visual exploration and insightful interpretation of retrieval results. iPatent leverages modern web technologies to provide an interactive interface that enables users to explore large patent databases efficiently and in a visually informative way. Source code and demo are publicly available at: https://service.tib.eu/ipatent/.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal patent retrieval</kwd>
        <kwd>interactive visualization systems</kwd>
        <kwd>generative AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Patents protect intellectual property by combining detailed textual descriptions with visual
representations. However, most patent search and analysis tools primarily depend on text-based methods to
identify similar patents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This approach can be limiting, as visual elements, such as technical drawings
and illustrations, often convey crucial supplementary information that text alone may not fully capture.
These visual elements can also bridge linguistic and domain-specific gaps that are prevalent in patent
documents. Integrating patent images could enhance efficiency in the patent search process [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is
especially valuable as the volume of patent applications continues to rise1. By leveraging both textual and
visual information, search systems can identify a broader range of similar patents, thereby improving
recall and supporting more comprehensive decision-making for both applicants and examiners [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Several systems have been developed to address the unique challenges of patent image retrieval. Early
patent image retrieval systems relied on low-level visual features for content-based image retrieval (e.g.,
PatMedia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], PatSeek [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). However, they struggled with scalability and semantic search. In recent
years, vision-language models (VLMs) such as CLIP (Contrastive Language-Image Pre-training, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ])
have enabled semantic search in patent image retrieval [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ] by mapping both images and text
into a shared embedding space. For example, VisPat2 leverages CLIP to support semantic and
cross-modal retrieval. However, existing systems lack support for fine-grained query formulation, such as
text-conditioned image-to-image retrieval or sub-region (component-level) search. Furthermore, these
systems lack mechanisms for organizing the retrieved images into meaningful groups or providing
insightful keywords or descriptions for the results, which help in efficient exploration and interpretation
of retrieval results in large databases.
6th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech) 2025
* Corresponding author.
$ sushil.awale@tib.eu (S. Awale)
0000-0003-2575-0134 (S. Awale); 0000-0002-6802-1241 (E. Müller-Budack); 0000-0003-0918-6297 (R. Ewerth)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://report-archive.epo.org/files/babylon/epo_patent_index_2022_infographic_en.pdf
2 https://service.tib.eu/vispat
      </p>
      <p>In this paper, we present iPatent, an open-source, interactive web-based search system for multimodal
patent image retrieval and analysis. The system enables semantic patent image search through
imageto-image, text-to-image, and text-conditioned image-to-image retrieval approaches. Users can also search
for specific sub-regions or components within a query image, supporting fine-grained exploration.
Additionally, the system organizes retrieval results into visually and semantically coherent clusters,
each accompanied by synthetic descriptions, enhancing exploration and interpretability to facilitate
further analysis of the results.</p>
      <p>
        The remainder of this paper is organized as follows. Section 2 describes the system architecture and
technical implementation details of iPatent. The multimodal image retrieval, image clustering, and
interpretation approaches are described in Section 3. Section 4 presents practical use cases of iPatent,
and the current functionalities and user interface. In Section 5, we provide a quantitative evaluation of
CLIP’s performance on the patent retrieval task. Section 6 summarizes the paper and outlines future work.
2. System Architecture of iPatent
iPatent facilitates patent image retrieval and analysis through a web-based application system. As
illustrated in Figure 1, iPatent comprises three primary modules: the user interface and backend
module, the index and retrieval module, and the analysis and generative module. The user interface and
backend module is implemented using Streamlit [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Python. The backend sub-module orchestrates
the data flow among all the other modules. The index and retrieval module implements the indexing and
retrieval mechanism using the Qdrant [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] vector database and stores the raw image bytes in the MinIO [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
object store database. The analysis and generative module provides a clustering service implemented
in scikit-learn [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a feature extraction service utilizing a VLM, and a generation service using a large
vision-language model (LVLM) deployed with Ollama [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
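The orchestration performed by the backend sub-module can be sketched as a simple pipeline. The function names and stub services below are illustrative stand-ins for the real modules (feature extractor, Qdrant search, scikit-learn clustering, Ollama-served LVLM), not the actual iPatent API.

```python
from typing import Callable

def make_pipeline(encode: Callable, search: Callable,
                  cluster: Callable, describe: Callable) -> Callable:
    """Compose the four services the backend coordinates:
    feature extraction -> vector search -> clustering -> description."""
    def run(query, top_k: int = 100, n_clusters: int = 2):
        vector = encode(query)                     # VLM feature extractor
        hits = search(vector, top_k)               # vector search (Qdrant)
        groups = cluster(hits, n_clusters)         # clustering (scikit-learn)
        return [(g, describe(g)) for g in groups]  # description (LVLM via Ollama)
    return run

# Stub services standing in for the real modules:
pipeline = make_pipeline(
    encode=lambda q: [float(len(q)), 1.0],
    search=lambda v, k: list(range(k)),
    cluster=lambda hits, n: [hits[i::n] for i in range(n)],
    describe=lambda g: f"cluster of {len(g)} images",
)
results = pipeline("folding chair", top_k=6, n_clusters=2)
```

With the stubs above, the pipeline yields two groups of three hits each, each paired with a generated description.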
      <p>Figure 1: The iPatent user interface, with retrieval controls such as top-k selection and text weight.</p>
      <p>
3. Multimodal Image Retrieval
iPatent operates through a sequence of steps that combine deep feature extraction, flexible and
fine-grained retrieval, and post-retrieval visualizations. Offline, a large collection of patent images is
processed using a deep learning model to extract high-dimensional feature representations (Section 3.1),
which are stored in a vector database. The vector-based retrieval process (Section 3.2) then allows
unimodal, cross-modal, and multimodal retrieval of patent images for exploration and discovery. After
retrieval, the system further organizes the results through clustering, grouping similar images based on
their feature representations (Section 3.3). This organization enhances result navigation and exploration,
and together with synthetic descriptions (Section 3.4) enables interpretation and analysis.</p>
      <sec id="sec-1-1">
        <title>3.1. Feature Extraction and Indexing</title>
        <p>
          Effective patent image search requires retrieving not only visually similar images but also those that
are semantically related. To achieve this, we employ CLIP [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] embeddings, which map both images and
text into a shared semantic space through a language-supervised training paradigm. The CLIP model
variant employed is ViT-B-16 trained on laion400m_e32, implemented via the open_clip library [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
CLIP ensures that semantically similar image-text pairs are positioned close together in the embedding
space. We encode a large collection of utility patent images using CLIP and index them offline in the
Qdrant vector database along with their metadata, such as terms associated with the components depicted
in the images.
        </p>
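A minimal sketch of the extraction-and-indexing step. The model and pretraining tags (ViT-B-16, laion400m_e32 via open_clip) are those named above, but the open_clip calls are kept in comments and random vectors stand in for real embeddings; the dict-based index is an illustrative stand-in for Qdrant. Embeddings are L2-normalized so that cosine similarity reduces to a dot product.

```python
import numpy as np

# In iPatent the embeddings come from open_clip, roughly:
#   model, _, preprocess = open_clip.create_model_and_transforms(
#       "ViT-B-16", pretrained="laion400m_e32")
#   emb = model.encode_image(preprocess(img).unsqueeze(0))
# Random vectors of CLIP's 512-d output size stand in here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512)).astype(np.float32)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize rows to unit length so cosine similarity == dot product."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

index = {
    "vectors": l2_normalize(embeddings),
    # Metadata payload mirroring what is stored alongside each vector
    # (patent identifier and component terms); values are placeholders.
    "payload": [{"patent_id": f"DOC{i}", "components": []} for i in range(1000)],
}
```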
      </sec>
      <sec id="sec-1-2">
        <title>3.2. Multimodal Image Retrieval</title>
        <p>
          The core image retrieval mechanism in the system is based on a k-nearest neighbor search within a
high-dimensional semantic space, enabling both visual and semantic similarity matching. Central to its
efficient and scalable performance is the use of the HNSW (Hierarchical Navigable Small World) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
algorithm for indexing, which organizes vectors in a multi-layered graph structure. Users can submit
queries in the form of images, text, or a combination of both. The query is then passed through the
feature extractor (discussed in Section 3.1), which encodes it into the semantic space. The system then
performs a vector search with Qdrant using cosine distance to efficiently retrieve the most similar
images from the index.
        </p>
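The principle behind the vector search can be illustrated with exact brute-force cosine k-nearest-neighbor search (HNSW provides an approximate, scalable version of the same operation); the toy data below is illustrative.

```python
import numpy as np

def knn_cosine(query: np.ndarray, index: np.ndarray, k: int) -> np.ndarray:
    """Exact k-NN under cosine similarity. Qdrant's HNSW index
    approximates this with a multi-layered graph for scalability."""
    q = query / np.linalg.norm(query)
    x = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = x @ q                   # cosine similarity to every indexed vector
    return np.argsort(-sims)[:k]   # highest similarity = smallest cosine distance

rng = np.random.default_rng(1)
index = rng.normal(size=(500, 64))
query = index[42] + 0.01 * rng.normal(size=64)  # near-duplicate of item 42
hits = knn_cosine(query, index, k=5)
```

As expected, the near-duplicate query retrieves item 42 as the top hit.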
        <p>iPatent implements three image retrieval approaches by leveraging the shared semantic space of
CLIP: (1) image-to-image (unimodal), (2) text-to-image (cross-modal), and (3) text-conditioned
image-to-image (multimodal).</p>
        <p>The image-to-image retrieval approach allows transcending domain and language barriers during
patent search. A user, for example, can upload a technical drawing of a folding chair mechanism
to search for images that depict visually and structurally similar folding mechanisms, regardless of
differences in drawing style, and without requiring domain knowledge. In this approach, iPatent takes
a single patent image as query, which is then encoded using CLIP, and then performs vector search to
retrieve similar images.</p>
        <p>The text-to-image retrieval approach facilitates searching for relevant images using descriptive
language, which bridges the gap between textual and visual information. For example, a user can enter
a descriptive textual query such as "portable solar panel with integrated battery storage" encapsulating
concepts that images alone sometimes cannot. In this cross-modal retrieval approach, iPatent supports
text-based natural language query, which is also embedded into the same semantic space as the images
using CLIP. The system, similar to the image-to-image approach, performs vector search to identify images
whose embeddings are most similar to the text query.</p>
        <p>The text-conditioned image-to-image retrieval is a multimodal approach that leverages the
complementary strengths of both modalities, enabling more nuanced and precise queries. For example, users
can upload an image of a bicycle frame and refine their search with textual attributes or modifications
such as "with built-in suspension system" to find images with additional functional or structural
requirements. Here, iPatent encodes both the query image and the query text into the same semantic space
using CLIP, and performs image-to-image and text-to-image retrieval. The retrieved results are then
fused together using weighted averaging, i.e., late fusion. The modality weights are also configurable,
providing the user with further fine-grained query formulation options.</p>
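Late fusion by weighted averaging can be sketched as follows; the `text_weight` parameter mirrors the configurable modality weight in the interface, while the score values and the exact fusion form are assumptions for illustration.

```python
import numpy as np

def late_fusion(img_sims: np.ndarray, txt_sims: np.ndarray,
                text_weight: float = 0.5) -> np.ndarray:
    """Fuse per-image similarity scores from the image and text queries
    by weighted averaging, then return indices ranked best-first."""
    fused = (1.0 - text_weight) * img_sims + text_weight * txt_sims
    return np.argsort(-fused)

img_sims = np.array([0.9, 0.1, 0.5])   # image-to-image similarities
txt_sims = np.array([0.2, 0.8, 0.7])   # text-to-image similarities
ranked = late_fusion(img_sims, txt_sims, text_weight=0.5)
```

Setting `text_weight` to 0 or 1 recovers pure image-to-image or pure text-to-image ranking, respectively.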
      </sec>
      <sec id="sec-1-3">
        <title>3.3. Image Clustering</title>
        <p>Most retrieval systems present results as an ordered list ranked by similarity scores, typically in a list or
grid layout. However, such linear layouts often overlook duplicate or near-duplicate results, do not
reveal relationships among retrieved items, and often fail to provide meaningful grouping or clustering
of similar results. Moreover, fixed grid layouts can become overwhelming when displaying large
volumes of results, potentially decreasing efficiency in locating specific images of interest.</p>
        <p>
          Displaying retrieved images in visually and semantically coherent cluster groups significantly reduces
information overload and enhances users’ ability to efficiently navigate and explore the results. To
achieve this, we apply the k-means [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] clustering algorithm to the CLIP embeddings of the retrieved
images. By leveraging CLIP’s high-dimensional semantic representations, the clustering process groups
together images that share visual and conceptual similarities, making it easier for users to identify
patterns and thematic groupings within the search results. Both the number of clusters and the subset
of retrieved images to be clustered are user-configurable through the frontend. Figure 2 shows the
retrieval results grouped into two clusters paired with keywords and synthetic titles and descriptions.
        </p>
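A minimal sketch of the clustering step with scikit-learn's KMeans, using synthetic stand-ins for the CLIP embeddings of retrieved images; the cluster count corresponds to the user-configurable parameter in the frontend.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Stand-in for CLIP embeddings of retrieved images: two separated blobs.
emb = np.vstack([rng.normal(0.0, 0.05, size=(10, 8)),
                 rng.normal(1.0, 0.05, size=(10, 8))])

# n_clusters is user-configurable in the frontend.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)

# Group retrieved image indices by cluster label for display.
clusters = {c: np.flatnonzero(labels == c).tolist() for c in set(labels)}
```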
      </sec>
      <sec id="sec-1-4">
        <title>3.4. Cluster Interpretation</title>
        <p>Providing rich context for each cluster group—both in relation to its member images and the original
query—greatly enhances the interpretability and navigability of retrieval results. To achieve this, each
cluster is paired with a set of descriptive keywords as well as synthetic titles and summaries. The
keywords consist of component terms associated with the patent images in the cluster, which are
extracted directly from the corresponding patent texts. This gives users immediate insight into the
technical content of each group.</p>
        <p>
          Beyond keywords, each cluster is further provided with a synthetic title and a description, both
generated by an LVLM. For this purpose, we use the LLaVA-1.6 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] model with Vicuna [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] as the
language backbone, quantized to 4-bit precision for efficient inference and resource utilization. To
mitigate the significant computational cost associated with LVLM inference, we limit the input to a
random selection of 5 images for each cluster. The LVLM then synthesizes human-readable titles and
context-aware descriptions that capture the main characteristics and relevance of each cluster. Figure 2
shows a cluster view in iPatent with two clusters containing three images each, along with synthetic
titles and descriptions, and associated keywords.
        </p>
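Preparing the LVLM input can be sketched as below: sample at most 5 images per cluster (the limit stated above) and build a prompt. The prompt wording, the seed, and any Ollama request details are assumptions, so the actual call is left as a comment.

```python
import random

def build_cluster_prompt(image_paths, keywords, sample_size=5, seed=0):
    """Sample at most `sample_size` images from a cluster and build a
    prompt asking the LVLM for a title and a short description."""
    rng = random.Random(seed)
    sample = rng.sample(image_paths, min(sample_size, len(image_paths)))
    prompt = (
        "These patent drawings belong to one cluster "
        f"(keywords: {', '.join(keywords)}). "
        "Provide a short title and a one-paragraph description "
        "of what the cluster depicts."
    )
    # The actual request would go to a local Ollama server running a
    # 4-bit-quantized LLaVA-1.6 model, with `sample` attached as images;
    # the request format is implementation-specific and omitted here.
    return sample, prompt

sample, prompt = build_cluster_prompt(
    [f"img_{i}.png" for i in range(12)], ["gear", "housing"])
```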
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Use Cases and Demonstration</title>
      <p>iPatent has a range of impactful use cases across the patent lifecycle. We discuss some use cases of
iPatent in Section 4.1, and highlight the user interface and key features of iPatent in Section 4.2.</p>
      <sec id="sec-2-1">
        <title>4.1. Use Cases</title>
        <p>The multimodal image retrieval capabilities of iPatent allow for various use cases across the patent
lifecycle:</p>
        <p>Prior Art Search: iPatent’s multiple search approaches allow for quick and focused searches,
helping patent examiners and applicants efficiently identify existing inventions that may affect the
novelty or non-obviousness of a new patent application.</p>
        <p>Cross-Domain and Multimodal Search: The cross-modal and multimodal search in iPatent allow
for cross-domain searching where users lack precise technical vocabulary or when inventions are best
described visually. This feature is especially helpful in fields where technical drawings and schematics
are central to the invention, such as mechanical engineering and design domains.</p>
        <p>Infringement Detection and Freedom-to-Operate Analysis: The multimodal search in iPatent
can potentially identify a broader range of relevant patents, enhancing the recall of the search process.
This allows companies and legal teams to avoid potential infringement or ensure freedom to operate,
reducing the risk of litigation post-publication.
4.2. Demonstrator
iPatent is available at https://service.tib.eu/ipatent/. Users can perform three distinct types of image
retrieval: image-to-image retrieval (Figure 1; a), text-to-image retrieval (Figure 1; b), and text-conditioned
image-to-image retrieval (Figure 1; a and b). In image-to-image retrieval, users can also crop a specific
region of the image to perform fine-grained searches, such as searching for similar sub-figures or
components. By adjusting the dimensions of the blue square (shown in Figure 1; a1), the users can
acquire a cropped query image (shown in Figure 1; a2).</p>
        <p>For each query, the top k (adjustable by the user) most relevant images are displayed in a grid
view (Figure 1; c1). The layout is designed so that the most relevant image appears in the top-left corner,
with images arranged from left to right across each row. Relevance decreases progressively rightward
along the rows and downward through the grid, making the bottom-right image the least relevant
among the displayed results.</p>
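The row-major relevance ordering described above maps a result's rank to a grid cell as follows (a trivial sketch; the column count is illustrative).

```python
def rank_to_cell(rank: int, n_cols: int):
    """Map a 0-based relevance rank to (row, col) in a row-major grid:
    rank 0 is top-left, relevance decreases rightward, then downward."""
    return divmod(rank, n_cols)

cells = [rank_to_cell(r, n_cols=4) for r in range(6)]
# rank 0 -> top-left cell; rank 4 wraps to the start of the second row
```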
        <p>Users are also provided with a cluster view (Figure 2; c2). Here, the retrieved images are organized
into groups, each representing a cluster of visually and semantically similar images. Each cluster is
displayed as a horizontally scrollable row, which reduces visual information overload by helping users
focus on conceptually coherent subsets. The users also have the option to select the number of desired
clusters and the number of retrieved images to cluster (Figure 1; d). To further enhance interpretability
and efficient navigation, each cluster is also paired with a set of keywords, providing quick insights
about the cluster. Beyond the keywords, each cluster is also enriched with a synthetic title and description,
which are generated using an LVLM.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Evaluation</title>
      <p>
        For evaluation of the CLIP model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we run the image-to-image retrieval task on the publicly available
benchmark DeepPatent [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This benchmark focuses on the image-to-image retrieval task and uses Re-ID
(Re-identification) to judge relevance, i.e., two patent images belonging to the same patent are considered
to be relevant. The dataset consists of 13,133 patent drawing images as queries and more than 38,000
index images (from 6,927 patents). The CLIP model demonstrates strong performance, achieving
an mAP (mean Average Precision) of 0.57 at rank k = 1 without any fine-tuning or domain adaptation.
This result highlights CLIP’s robust generalization for image retrieval, though further optimization or
adaptation could yield even higher accuracy [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ]. However, relevance based solely on Re-ID does not
fully capture the complexity of real-world patent search scenarios. One major challenge in evaluating
patent retrieval systems is the lack of reliable, high-quality relevance data, especially for image-based,
multimodal, and cross-modal search. To circumvent the labor-intensive process of annotating large
datasets, pseudo ground-truth labels such as patent citations [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] are used, which are mainly based on
text-based retrieval results. As a result, a misalignment arises between the ground truth and the
multimodal retrieval tasks, with relevant patents often labeled as non-relevant or left without any
relevance judgment.
      </p>
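Under the Re-ID protocol, a retrieved image counts as relevant iff it belongs to the same patent as the query; mean Average Precision over such ranked lists can be computed as sketched below (the data is a toy example, not the DeepPatent benchmark).

```python
def average_precision(ranked_patents, query_patent):
    """AP for one query: a retrieved image is relevant iff its patent
    ID matches the query's patent ID (the Re-ID criterion)."""
    hits, precision_sum = 0, 0.0
    for i, pid in enumerate(ranked_patents, start=1):
        if pid == query_patent:
            hits += 1
            precision_sum += hits / i   # precision at each relevant rank
    return precision_sum / hits if hits else 0.0

def mean_ap(runs):
    """Mean of per-query APs over (query_patent, ranked_list) pairs."""
    return sum(average_precision(r, q) for q, r in runs) / len(runs)

# Toy example: two queries with ranked result lists of patent IDs.
runs = [
    ("P1", ["P1", "P3", "P1", "P4"]),   # AP = (1/1 + 2/3) / 2 = 5/6
    ("P2", ["P5", "P2"]),               # AP = (1/2) / 1 = 1/2
]
score = mean_ap(runs)                   # = (5/6 + 1/2) / 2 = 2/3
```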
      <p>In addition to the quantitative metrics, we present qualitative results to demonstrate the model’s
capability to retrieve visually and semantically similar patent images. In Figure 3, for each query image,
the top-three retrieved images exhibit strong visual similarity and consistent semantic content. For
example, the query image from patent US2020340208A1 retrieves images of similar machinery with
comparable structural details. Similarly, the queries of patents US2020339341A1 and US2022267022A1,
yield closely related designs and visualizations. These qualitative examples highlight the effectiveness
of CLIP in capturing both fine-grained visual features and relevant semantics, confirming its practical
utility in multimodal patent image retrieval.</p>
      <p>Figure 3: Top-three retrieval results for three example queries: US2020340208A1 (EP3670764A1, WO2020022454A1, EP3670764A1); US2020339341A1 (EP3738925A1, US2021205184A1, WO2018051449A1); US2022267022A1 (EP3640143A1, US2012207401A1, US2011095912A1).</p>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions and Future Work</title>
      <p>In this paper, we have introduced iPatent, a novel web-based platform for patent image retrieval and
analysis. Unlike existing systems, iPatent leverages state-of-the-art models for unimodal, cross-modal,
and multimodal retrieval of patent images with semantic search capability. The platform also supports
fine-grained searching for sub-figures or components through an interactive query image cropping
feature. For organizing search results, iPatent uses k-means clustering, while LVLMs automatically
generate informative cluster titles and descriptions. Leveraging modern web technologies, iPatent
provides an interactive and visually informative interface, supporting eficient exploration of large
patent databases.</p>
      <p>In future work, we aim to enhance iPatent with additional ranking and interpretation features such
as LVLM-based re-ranking and explanation of ranking results. We plan to improve cluster visualizations
with interactive 3D projections for more intuitive exploration. Furthermore, we plan to incorporate
faceted search options such as filtering by patent class or figure type. Additionally, LVLMs may introduce
biases in the content they generate. A systematic study of the synthetic content and its potential biases
is planned as an important direction for future research.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This article has been funded by the Academic Research Programme of the European Patent Office
(project "ViP@Scale: Visual and multimodal patent search at scale"). We would like to thank Wolfgang
Gritz and Matthias Springstein (both TIB - Leibniz Information Centre for Science and Technology) for
their feedback and help in developing iPatent.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Grammarly in order to: Grammar and spelling
check, Paraphrase and reword. After using these tool(s)/service(s), the author(s) reviewed and edited
the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krestel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chikkamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hewel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Risch</surname>
          </string-name>
          ,
          <article-title>A survey on deep learning for patent analysis</article-title>
          ,
          <source>World Patent Information</source>
          <volume>65</volume>
          (
          <year>2021</year>
          )
          <article-title>102035</article-title>
          . doi:10.1016/j.wpi.2021.102035.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zellmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elbeshausen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Womser-Hacker</surname>
          </string-name>
          ,
          <article-title>Elicitation of requirements for innovative visual patent retrieval based on interviews with experts</article-title>
          ,
          <source>in: The Information Behaviour Conference</source>
          ,
          <string-name>
            <surname>ISIC</surname>
          </string-name>
          <year>2022</year>
          , Berlin, Germany,
          <source>September 26-29</source>
          ,
          <year>2022</year>
          ,
          <year>2022</year>
          , p.
          isic2234. doi:10.47989/irisic2234.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sidiropoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kompatsiaris</surname>
          </string-name>
          ,
          <article-title>Content-based binary image retrieval using the adaptive hierarchical density histogram</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>44</volume>
          (
          <year>2011</year>
          )
          <fpage>739</fpage>
          -
          <lpage>750</lpage>
          . doi:10.1016/j.patcog.2010.09.014.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>PATSEEK: content based image retrieval system for patent database</article-title>
          , in: International Conference on Electronic Business,
          <string-name>
            <surname>ICEB</surname>
          </string-name>
          <year>2004</year>
          , Beijing, China, December 5-
          <issue>9</issue>
          ,
          <year>2004</year>
          , Academic Publishers/World Publishing Corporation,
          <year>2004</year>
          , pp.
          <fpage>1167</fpage>
          -
          <lpage>1171</lpage>
          . URL: https://aisel.aisnet.org/iceb2004/199.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Hallacy</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Goh</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Mishkin</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <article-title>Learning transferable visual models from natural language supervision</article-title>,
          <source>in: International Conference on Machine Learning, ICML 2021, Virtual Event, July 18-24, 2021</source>,
          PMLR, <year>2021</year>, pp. <fpage>8748</fpage>-<lpage>8763</lpage>.
          URL: http://proceedings.mlr.press/v139/radford21a.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>K.</given-names> <surname>Pustu-Iren</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Bruns</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Ewerth</surname></string-name>,
          <article-title>A multimodal approach for semantic patent image retrieval</article-title>,
          <source>in: Patent Text Mining and Semantic Technologies co-located with the ACM SIGIR Conference on Research and Development in Information Retrieval, PatentSemTech@SIGIR 2021, Aachen, Germany, July 15, 2021</source>,
          CEUR-WS.org, <year>2021</year>, pp. <fpage>45</fpage>-<lpage>49</lpage>.
          URL: https://ceur-ws.org/Vol-2909/paper6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>H.</given-names> <surname>Lo</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Chu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Hsiang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Cho</surname></string-name>,
          <article-title>Large language model informed patent image retrieval</article-title>,
          <source>in: Patent Text Mining and Semantic Technologies co-located with the ACM SIGIR Conference on Research and Development in Information Retrieval, PatentSemTech@SIGIR 2024, Washington D.C., USA, July 28, 2024</source>,
          CEUR-WS.org, <year>2024</year>, pp. <fpage>51</fpage>-<lpage>60</lpage>.
          URL: https://ceur-ws.org/Vol-3775/paper11.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>H. H.</given-names> <surname>Shomee</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Medya</surname></string-name>,
          <string-name><given-names>S. N.</given-names> <surname>Ravi</surname></string-name>,
          <article-title>IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents</article-title>,
          <source>in: Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024</source>,
          Curran Associates, Inc., <year>2024</year>, pp. <fpage>125520</fpage>-<lpage>125546</lpage>.
          URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/e3301977b92f28e32639ec99eb08f4a1-Paper-Datasets_and_Benchmarks_Track.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Streamlit, Streamlit, https://streamlit.io/, <year>2025</year>. Accessed: 2025-04-22.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Qdrant, Qdrant documentation, https://qdrant.tech/documentation/concepts/search/, <year>2025</year>. Accessed: 2025-04-22.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] MinIO, MinIO documentation, https://min.io/docs/minio/linux/index.html, <year>2025</year>. Accessed: 2025-04-22.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>F.</given-names> <surname>Pedregosa</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Varoquaux</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gramfort</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Michel</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Thirion</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Grisel</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Blondel</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Prettenhofer</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Weiss</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Dubourg</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Vanderplas</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Passos</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Cournapeau</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Brucher</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Perrot</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Duchesnay</surname></string-name>,
          <article-title>Scikit-learn: Machine learning in Python</article-title>,
          <source>Journal of Machine Learning Research</source> <volume>12</volume> (<year>2011</year>) <fpage>2825</fpage>-<lpage>2830</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] Ollama, Ollama documentation, https://docs.llamaindex.ai/en/stable/examples/llm/ollama/, <year>2025</year>. Accessed: 2025-04-22.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>G.</given-names> <surname>Ilharco</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Wortsman</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Wightman</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Gordon</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Carlini</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Taori</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Dave</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Shankar</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Namkoong</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Miller</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hajishirzi</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Farhadi</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Schmidt</surname></string-name>,
          <source>OpenCLIP</source>, <year>2021</year>. doi:10.5281/zenodo.5143773.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>Y. A.</given-names> <surname>Malkov</surname></string-name>,
          <string-name><given-names>D. A.</given-names> <surname>Yashunin</surname></string-name>,
          <article-title>Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs</article-title>,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source> <volume>42</volume> (<year>2020</year>) <fpage>824</fpage>-<lpage>836</lpage>.
          doi:10.1109/TPAMI.2018.2889473.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>S. P.</given-names> <surname>Lloyd</surname></string-name>,
          <article-title>Least squares quantization in PCM</article-title>,
          <source>IEEE Transactions on Information Theory</source> <volume>28</volume> (<year>1982</year>) <fpage>129</fpage>-<lpage>137</lpage>.
          doi:10.1109/TIT.1982.1056489.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>H.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Y. J.</given-names> <surname>Lee</surname></string-name>,
          <article-title>Visual instruction tuning</article-title>,
          <source>in: Conference on Neural Information Processing Systems, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023</source>,
          Curran Associates, Inc., <year>2023</year>, pp. <fpage>34892</fpage>-<lpage>34916</lpage>.
          URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>W.-L.</given-names> <surname>Chiang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Sheng</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zheng</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zhuang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhuang</surname></string-name>,
          <string-name><given-names>J. E.</given-names> <surname>Gonzalez</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Stoica</surname></string-name>,
          <string-name><given-names>E. P.</given-names> <surname>Xing</surname></string-name>,
          <source>Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality</source>,
          <year>2023</year>. URL: https://lmsys.org/blog/2023-03-30-vicuna/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>M.</given-names> <surname>Kucer</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Oyen</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Castorena</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <article-title>DeepPatent: Large scale patent drawing recognition and retrieval</article-title>,
          <source>in: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022</source>,
          IEEE, <year>2022</year>, pp. <fpage>557</fpage>-<lpage>566</lpage>.
          doi:10.1109/WACV51458.2022.00063.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>K.</given-names> <surname>Higuchi</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Yanai</surname></string-name>,
          <article-title>Patent image retrieval using transformer-based deep metric learning</article-title>,
          <source>World Patent Information</source> <volume>74</volume> (<year>2023</year>) 102217.
          doi:10.1016/j.wpi.2023.102217.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><given-names>Y.-H.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>M.-C.</given-names> <surname>Hung</surname></string-name>,
          <string-name><given-names>C.-F.</given-names> <surname>Lee</surname></string-name>,
          <article-title>Density-refine: Patent image retrieval by density-based region extraction and feature fusion</article-title>,
          <source>Journal of Mechanical Design</source> <volume>147</volume> (<year>2025</year>) 081703.
          doi:10.1115/1.4067749.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><given-names>F.</given-names> <surname>Piroi</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lupu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Hanbury</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Zenz</surname></string-name>,
          <article-title>CLEF-IP 2011: Retrieval in the intellectual property domain</article-title>,
          <source>in: CLEF-IP Workshop co-located with the Conference and Labs of the Evaluation Forum, CLEF-IP@CLEF 2011, Amsterdam, The Netherlands, September 19-22, 2011</source>,
          CEUR-WS.org, <year>2011</year>.
          URL: https://ceur-ws.org/Vol-1177/CLEF2011wn-CLEF-IP-PiroiEt2011.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>