<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway's Digitised Book Collection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marie Roald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Magnus Breder Birkenes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars Gunnarsønn Bagøien Johnsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research and Special Collections, The National Library of Norway</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <fpage>892</fpage>
      <lpage>905</lpage>
      <abstract>
        <p>Digital tools for text analysis have long been essential for the searchability and accessibility of digitised library collections. Recent computer vision advances have introduced similar capabilities for visual materials, with deep learning-based embeddings showing promise for analysing visual heritage. Given that many books feature visuals in addition to text, taking advantage of these breakthroughs is critical to making library collections open and accessible. In this work, we present a proof-of-concept image search application for exploring images in the National Library of Norway's pre-1900 books, comparing Vision Transformer (ViT), Contrastive Language-Image Pre-training (CLIP), and Sigmoid Loss for Language Image Pre-Training (SigLIP) embeddings for image retrieval and classification. Our results show that the application performs well for exact image retrieval, with SigLIP embeddings slightly outperforming CLIP and ViT in both retrieval and classification tasks. Additionally, SigLIP-based image classification can aid in cleaning image datasets from a digitisation pipeline.</p>
      </abstract>
      <kwd-group>
        <kwd>image retrieval</kwd>
        <kwd>computer vision</kwd>
        <kwd>embeddings</kwd>
        <kwd>vector search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[4] and corresponding webapps2 which provide tools based on text aggregates (e.g. n-grams,
collocations and concordances) to facilitate automated and reproducible analysis of the text.</p>
      <p>Currently, these tools have largely been based on text extracted from Analysed Layout and
Text Object-Extensible Markup Language (ALTO-XML) files 3 generated by optical character
recognition (OCR) models during digitisation [5]. However, the output XML also contains
coordinates for graphical elements. These graphical elements represent non-textual elements
in the books, e.g. illustrations or decorations. While such elements are an important part of
the books, they have been cumbersome to explore, requiring manual inspection. Therefore, an
essential missing step for making NLN’s digitised collection more accessible is making these
graphical elements easier to explore and analyse.</p>
      <p>
        An approach to making such elements explorable is creating tools for image search, either in
the form of exact image retrieval (i.e. recovering a specific image) or semantic image retrieval
(i.e. recovering an image with similar contents) or both. While text-based search engines are
commonplace, image search is more complicated [
        <xref ref-type="bibr" rid="ref14 ref27">16, 28</xref>
        ]. Early methods matched images
using surrounding text [
        <xref ref-type="bibr" rid="ref14">16</xref>
        ], but this approach demands high-quality textual descriptions, which
can be lacking. Alternatively, exact image retrieval traditionally relies on handcrafted image
features for comparison [
        <xref ref-type="bibr" rid="ref14 ref27">16, 28</xref>
        ]. Handcrafting such features can be challenging, and the features typically
form a dense vector, which can hinder efficient lookups.
      </p>
      <p>
        However, recent technological advancements have simplified the implementation of image
search engines. Various tools now implement efficient search indices for dense vectors, such
as the hierarchical navigable small worlds (HNSW) index [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Moreover, convolutional
neural networks (CNNs) and vision transformers (ViTs) have alleviated the need for handcrafted
image features for computer vision [
        <xref ref-type="bibr" rid="ref8">8, 7</xref>
        ]. Furthermore, there has been an influx of
multimodal models, like Contrastive Language-Image Pre-training (CLIP) [
        <xref ref-type="bibr" rid="ref17">19</xref>
        ] and Sigmoid Loss for
Language Image Pre-Training (SigLIP) [
        <xref ref-type="bibr" rid="ref26">27</xref>
        ]. The recent advances in computer vision and
proliferation of advanced pre-trained computer vision models have empowered the development of
new research and tools for exploring and analysing image-based data in the digital humanities
[
        <xref ref-type="bibr" rid="ref2 ref20 ref21 ref9">2, 25, 21, 9, 22, 20</xref>
        ].
      </p>
      <p>
        Previous work on machine learning-driven computer vision-based image search tools for
digital humanities mainly focuses on cleanly digitised materials such as collections of videos,
photographs, lantern slides and medieval illuminations [
        <xref ref-type="bibr" rid="ref2 ref15 ref20 ref21">2, 21, 22, 17</xref>
        ]. However, there is limited
work applying such tools to images extracted from the output of automatic layout detection
of scanned media, e.g. books and newspapers. Such image collections pose unique challenges.
First, the magnitude of data is often larger than for collections of photographs. Second, such
data can contain artefacts not found in cleanly digitised materials. For example, detected
bounding boxes might be inaccurate. False positives can occur, where the automatic layout detection
mistakenly marks, e.g. tables or blank pages, as graphical elements. Avoiding such artefacts
can be infeasible, as redoing layout analysis for a collection of sizeable magnitude can be
cost-prohibitive and not guaranteed to succeed. Therefore, a natural next step is exploring machine
learning-based image retrieval in the context of NLN’s collection of scanned, automatically
processed media.
2https://www.nb.no/dh-lab/apper/
3https://www.loc.gov/standards/alto/
This short paper details ongoing work on these challenges, with three primary contributions:
1. Developing a proof-of-concept image search application for NLN’s pre-1900 books.
2. Comparing modern image embeddings for image retrieval in NLN’s digitised books.
3. Evaluating pre-trained models for fine-tuned classification of image categories.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and related work</title>
      <p>
        Two traditional approaches for image retrieval are context-based full-text search — querying
the images’ textual context — and hashing-based approaches for exact image retrieval. The
former typically works by using an inverted index to efficiently retrieve relevant images via
e.g. term frequency-inverse document frequency (TF-IDF) weighting [
        <xref ref-type="bibr" rid="ref23">24</xref>
        ], before potentially
reranking them based on image features [
        <xref ref-type="bibr" rid="ref14">16</xref>
        ]. The hashing-based alternative works by computing
a compact hash, or “fingerprint”, that can be used for efficient exact image retrieval [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
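      <p>To make the hashing-based alternative concrete, the sketch below (our own illustration, not part of the system described in this paper) computes a simple difference hash with Pillow and NumPy; exact or near-exact duplicates can then be matched by the Hamming distance between fingerprints.</p>
      <preformat>
# Minimal difference-hash ("dHash") sketch for exact image retrieval.
import numpy as np
from PIL import Image

def dhash(path, hash_size=8):
    # Downscale to (hash_size + 1) x hash_size greyscale pixels and record
    # whether each pixel is brighter than its left neighbour.
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    pixels = np.asarray(img, dtype=np.int16)
    bits = pixels[:, 1:] > pixels[:, :-1]
    return np.packbits(bits.flatten()).tobytes().hex()  # 64-bit fingerprint as hex

def hamming(fp_a, fp_b):
    # Number of differing bits; small distances indicate (near-)identical scans.
    return bin(int(fp_a, 16) ^ int(fp_b, 16)).count("1")
      </preformat>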
      <p>
        More recent image retrieval approaches compute image similarities using deep
learning-based image classification models such as ViTs [7] or CNNs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These models first transform
an image into an embedding, which is used as input for a logistic regression model. The key
insight in using these models for image retrieval is that we can compute image similarities by
comparing the embeddings, e.g. with the cosine similarity.
      </p>
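      <p>As an illustrative sketch of this idea (embedding extraction itself is described in Section 3.2), retrieval then reduces to ranking stored embedding vectors by their cosine similarity to the query embedding:</p>
      <preformat>
# Rank a collection of image embeddings by cosine similarity to a query embedding.
import numpy as np

def cosine_similarity(query, embeddings):
    # query: (d,) vector; embeddings: (n, d) matrix of stored image embeddings.
    query = query / np.linalg.norm(query)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings @ query            # (n,) similarity scores

def top_k(query, embeddings, k=5):
    scores = cosine_similarity(query, embeddings)
    return np.argsort(-scores)[:k]       # indices of the k most similar images
      </preformat>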
      <p>
        However, by using classification models, we assume that embeddings learned by training
on image-label combinations are informative enough to group images semantically, which can
hinder generalisation to out-of-sample images [15]. Another approach is multimodal models
like CLIP and SigLIP. In short, these models work by combining an image transformer and
a text transformer to compute image and text embeddings – aligning them to ensure strong
cosine similarity for matching pairs. This approach has been successfully applied to e.g. image
retrieval and zero-shot classification [
        <xref ref-type="bibr" rid="ref17">19</xref>
        ], and generalises better to out-of-sample images [
        <xref ref-type="bibr" rid="ref17">19,
15</xref>
        ].
      </p>
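      <p>For illustration, a minimal zero-shot classification sketch with Huggingface Transformers could look as follows (the checkpoint matches the CLIP model used in Section 3.2; the input file and the candidate labels are hypothetical examples):</p>
      <preformat>
# Zero-shot image classification with CLIP: score an image against text prompts.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                     # hypothetical input image
labels = ["a map", "a portrait", "sheet music", "a decorative border"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # image-text match probabilities
print(labels[probs.argmax().item()])
      </preformat>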
      <p>
        During CLIP and SigLIP training, models receive shuffled image-caption pairs and compute
probabilities for matches. Such training demands extensive data and computational resources.
To circumvent this, it is common to use pre-trained models, and the popularity of model
repositories, like Huggingface Hub [
        <xref ref-type="bibr" rid="ref25">26</xref>
        ] and Torch Hub [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], has made using models trained on massive
datasets accessible.
      </p>
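      <p>Schematically, the difference between the two objectives can be written in a few lines of NumPy. This is a sketch of the loss formulations only, assuming L2-normalised embeddings; no training is performed in this work, since pre-trained checkpoints are used.</p>
      <preformat>
# Schematic contrastive losses over a batch of matching image/text embeddings.
import numpy as np

def softmax_contrastive_loss(img, txt, temperature=0.07):
    # CLIP-style (image-to-text direction only; CLIP averages both directions):
    # each image must pick out its own caption among the batch.
    logits = (img @ txt.T) / temperature              # (n, n) similarity matrix
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))

def sigmoid_contrastive_loss(img, txt, t=10.0, b=-10.0):
    # SigLIP-style: every image-text pair is an independent binary decision.
    logits = t * (img @ txt.T) + b
    labels = 2 * np.eye(len(img)) - 1                 # +1 on the diagonal, -1 elsewhere
    return np.mean(np.log1p(np.exp(-labels * logits)))
      </preformat>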
      <p>
        While methods for efficient sparse vector queries have existed for decades [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], querying
based on image embeddings requires dense vector queries, which is still a research topic.
However, the recently proposed HNSW index for approximate nearest neighbour search [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] has
gained traction for accuracy and efficiency. The index consists of a hierarchy of navigable small
world graphs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], each built from different data subsets, and querying consists of iteratively
traversing the hierarchy, enabling efficient navigation through large datasets.
      </p>
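      <p>As a generic illustration of how such an index is used (the application itself relies on the built-in HNSW index of Qdrant, described in Section 3.2), a dense-vector search with the hnswlib library might look like this; the data here are random stand-ins for image embeddings:</p>
      <preformat>
# Approximate nearest-neighbour search over dense embeddings with an HNSW index.
import numpy as np
import hnswlib

dim = 512
vectors = np.random.rand(10_000, dim).astype(np.float32)   # stand-in for image embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(vectors)))
index.set_ef(64)                                            # query-time accuracy/speed trade-off

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)            # ids and cosine distances
      </preformat>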
      <p>
        Applying modern computer vision to problems in digital humanities has recently gained
traction. The term distant viewing is introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which demonstrates how computer
vision methods for clustering and object detection can be applied to image- and video-data.
Building on this, [25] shows how CNN-based semantic image retrieval can be used to explore
trends in newspaper advertisements and illustrations extracted from Delpher — a digitised
materials search engine by the Dutch national library. Moreover, [
        <xref ref-type="bibr" rid="ref15">17</xref>
        ] demonstrate how a
combination of monomodal image- and language-models can be used to combine and enrich
two manually annotated collections of medieval illuminations, and [
        <xref ref-type="bibr" rid="ref20 ref21">21, 22</xref>
        ] show how a CLIP
model can be used to explore and label magic lantern slides efficiently and that it can struggle
with zero-shot classification of old illustrations. Using CLIP embeddings, [20] clusters news
videos and employs a graph-based approach for efficient exploration. Machine learning-driven
image retrieval tools for libraries and museums, like Maken4, Bildsök5 and Nasjonalmuseet
Beta6 have also emerged. These previous works highlight computer vision’s potential in digital
humanities, and thus, evaluating and comparing such models in the context of NLN’s digitised
book collection is a relevant next step.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Extracting images</title>
        <p>To search the images, they must first be extracted from the digitised book collection.
During NLN’s digitisation, books are scanned and processed through a pipeline including
layout detection and OCR, producing ALTO-XML files 7 named after Uniform Resource Names
(URNs). These files contain page information, describing the page in terms of four block types:
TextBlock, Illustration, GraphicalElement and CompositeBlock (blocks containing other
blocks)8. In the ALTO-XML files parsed for this work, all illustrations and graphical elements
are tagged as GraphicalElement. Parsing these files, we extracted the page URN, coordinates,
and size for each graphical element in addition to the textual context of each image in the
digitised books. For this work, we processed pre-1900 books, creating a sufficiently large, yet
manageable subset for testing.</p>
        <p>For each graphical element, we used NLN’s IIIF API9 to download images from URLs
following the format in Table 1, discarding images with aspect ratio ≥ 50. By integrating ALTO-XML
files with the IIIF endpoint — both technologies already utilised by NLN — we obtained images
from digitised Norwegian books before 1900.</p>
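        <p>A minimal parsing sketch is given below. It is illustrative only: the ALTO namespace and the IIIF base URL are assumptions (the URL format actually used follows Table 1), and the region string follows the generic IIIF Image API pattern identifier/region/size/rotation/quality.</p>
        <preformat>
# Sketch: extract GraphicalElement regions from an ALTO-XML file and build IIIF image URLs.
from lxml import etree

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}   # assumed ALTO version
IIIF_BASE = "https://example.org/iiif"                            # hypothetical endpoint

def graphical_elements(alto_path, page_urn):
    tree = etree.parse(alto_path)
    for block in tree.iterfind(".//alto:GraphicalElement", ALTO_NS):
        x, y = int(block.get("HPOS")), int(block.get("VPOS"))
        w, h = int(block.get("WIDTH")), int(block.get("HEIGHT"))
        region = f"{x},{y},{w},{h}"
        # Generic IIIF Image API request: identifier/region/size/rotation/quality.format
        yield f"{IIIF_BASE}/{page_urn}/{region}/full/0/default.jpg"
        </preformat>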
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Creating the vector search application</title>
        <p>
          We computed image embeddings using Huggingface Transformers [
          <xref ref-type="bibr" rid="ref25">26</xref>
          ] with three models: ViT
(google/vit-base-patch16-22410), CLIP (openai/clip-vit-base-patch3211) and SigLIP
(google/siglip-base-patch16-256-multilingual12). Each pre-trained model’s
preprocessing pipeline involved resizing images to the input shapes (224 for ViT and CLIP, and 256 for
SigLIP) and scaling the pixel values. For ViT and SigLIP, images were resized to 224 × 224 and
4https://www.nb.no/maken/
5https://lab.kb.se/bildsok/
6https://beta.nasjonalmuseet.no/collection/
7https://digitalpreservation-blog.nb.no/docs/formats/preferred-formats-en/
8https://www.loc.gov/standards/alto/techcenter/layout.html
9https://iiif.io/api/image/2.0/
10Commit hash: 3f49326eb077187dfe1c2a2bb15fbd74e6ab91e3
11Commit hash: 3d74acf9a28c67741b2f4f2ea7635f0aaf6f0268
12Commit hash: a66c5982c8c396206b96060e2bf837d6731a326f
256 × 256 pixels, altering the aspect ratio. CLIP resized the smallest dimension to 224,
preserving the aspect ratio, then center-cropped to 224 × 224 pixels. Next, we used the corresponding
image transformer and obtained embeddings of sizes 768 (ViT and SigLIP) and 512 (CLIP).
        </p>
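        <p>A condensed sketch of this step is given below, assuming a recent version of Transformers that includes SigLIP. The checkpoints are the ones listed above, but the input file is hypothetical and the exact pooling of the ViT output is an assumption (the pooled output is used here); CLIP and SigLIP expose image embeddings directly through get_image_features.</p>
        <preformat>
# Sketch: compute image embeddings with the three pre-trained models.
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoProcessor, CLIPModel, SiglipModel

image = Image.open("example.jpg")    # hypothetical input image

# ViT: use the pooled transformer output as the embedding (768-dimensional).
vit_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit_model = AutoModel.from_pretrained("google/vit-base-patch16-224")
vit_emb = vit_model(**vit_processor(images=image, return_tensors="pt")).pooler_output

# CLIP: 512-dimensional image embedding.
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_emb = clip_model.get_image_features(**clip_processor(images=image, return_tensors="pt"))

# SigLIP: 768-dimensional image embedding.
siglip_processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")
siglip_model = SiglipModel.from_pretrained("google/siglip-base-patch16-256-multilingual")
siglip_emb = siglip_model.get_image_features(**siglip_processor(images=image, return_tensors="pt"))
        </preformat>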
        <p>After computing embeddings, we ingested them into a Qdrant database and used FastAPI to
create an application programming interface (API) for efficient querying by images, embedding
vectors, image IDs, or context-based text search. Qdrant supports fast K-nearest neighbour
search for both dense and sparse vectors. For image-based queries, we used a cosine-similarity-based
HNSW index, and for context-based full-text queries, we used a dot-product-based
inverted index for TF-IDF (details in supplement on GitHub13). We used default parameters for
all search indices. The vector database and the API are hosted on-premise, exposing only the
API to the Internet. The application also includes a frontend, implemented using Flask and
HTMX, hosted using Google Cloud Run with 512 MiB RAM and one vCPU.</p>
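        <p>The following sketch shows the core of the ingestion and querying logic with the Qdrant client; the collection name, payload fields and stand-in vectors are illustrative, and the full configuration is in the supplement on GitHub.</p>
        <preformat>
# Sketch: ingest image embeddings into Qdrant and query them by vector similarity.
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")       # assumed local Qdrant instance

client.create_collection(
    collection_name="book_images",                        # illustrative collection name
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

embedding = np.random.rand(768)                           # stand-in for a SigLIP embedding
client.upsert(
    collection_name="book_images",
    points=[PointStruct(id=0, vector=embedding.tolist(),
                        payload={"urn": "URN:NBN:example", "page": 1})],
)

hits = client.search(collection_name="book_images",
                     query_vector=embedding.tolist(), limit=5)
for hit in hits:
    print(hit.id, hit.score, hit.payload)
        </preformat>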
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Classifying based on embedding vectors</title>
        <p>As the graphical elements stem from NLN’s digitisation process, many segmentation anomalies
are also tagged as graphical elements. Common examples are blank pages, parts of tables,
and text. To estimate the fraction of such regions, we used HumanSignal Label Studio and
manually labelled a dataset containing 2000 images as either Blank page, Segmentation anomaly,
Illustration or photograph, Musical notation, Map, Mathematical chart or Graphical element (e.g.
initial, decorative border, etc.).</p>
        <p>
          After labelling the data, we fitted regularised logistic regression models (using scikit-learn
v1.5.0 [
          <xref ref-type="bibr" rid="ref16">18</xref>
          ]) to classify images based on their embedding vectors. This can be interpreted as a
form of transfer learning, fine-tuning the last layer of the transformer model. The embedding
vector type (i.e. ViT, CLIP or SigLIP) and the complexity parameter (inverse ridge parameter)
were selected using nested cross-validation with 20 outer folds and ten inner folds. Models were
selected based on a micro-averaged F1-score (the harmonic mean of micro-averaged precision
and sensitivity). We selected the complexity parameter from ten logarithmically spaced values
13https://github.com/Sprakbanken/CHR24-image-retrieval
between 10<sup>−4</sup> and 10<sup>4</sup>. Finally, we computed the confusion matrix in the outer cross-validation
loop (the evaluation loop). The supplement describes the overall cross-validation algorithm in
Algorithms 1 and 2.
        </p>
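        <p>A condensed scikit-learn sketch of this evaluation loop follows. For readability it fixes one embedding type at a time (in our experiments the embedding type is selected in the inner loop together with the complexity parameter, as described above), and the file names and fold seeds are hypothetical.</p>
        <preformat>
# Sketch: nested cross-validation of a logistic regression on embedding vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

# X: (n_images, embedding_dim) embedding matrix, y: integer class labels.
X = np.load("embeddings.npy")          # hypothetical file names
y = np.load("labels.npy")

inner_model = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": np.logspace(-4, 4, 10)},   # ten log-spaced complexity parameters
    scoring="f1_micro",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
outer_cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)

# Each outer fold is predicted by a model tuned only on the remaining folds.
y_pred = cross_val_predict(inner_model, X, y, cv=outer_cv)
print(confusion_matrix(y, y_pred))
        </preformat>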
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluating searches</title>
        <p>To evaluate the search, we first manually inspected some example queries before performing
a systematic evaluation on exact image retrieval. To simulate exact image retrieval scenarios,
we selected the 684 images labelled as Illustration or photograph, Map or Mathematical chart
as target images, and applied random cropping (≤ 15 %, independently on all sides), rotation
(±0–10°) and scaling (±0–20 %, independently for width and height). Then, querying the
database with these transformed images, we evaluated the Top-k accuracy, measuring whether
our application retrieved the target image as the first result (Top 1), in the first row (Top 5), in the first two
rows (Top 10) or in the results at all (Top 50).</p>
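        <p>The transformations were generated along the following lines; this is a simplified Pillow sketch with the stated parameter ranges, and the retrieve function stands in for a query against the application.</p>
        <preformat>
# Sketch: perturb a target image and check whether it is recovered among the top-k results.
import random
from PIL import Image

def perturb(img, max_crop=0.15, max_rot=10, max_scale=0.20):
    w, h = img.size
    # Crop up to max_crop of the width/height, independently on each side.
    left, right = (random.uniform(0, max_crop) * w for _ in range(2))
    top, bottom = (random.uniform(0, max_crop) * h for _ in range(2))
    img = img.crop((int(left), int(top), int(w - right), int(h - bottom)))
    # Rotate by up to max_rot degrees and rescale width/height independently.
    img = img.rotate(random.uniform(-max_rot, max_rot), expand=True)
    sw, sh = (1 + random.uniform(-max_scale, max_scale) for _ in range(2))
    return img.resize((max(1, int(img.width * sw)), max(1, int(img.height * sh))))

def top_k_accuracy(targets, retrieve, k=10):
    # targets: list of (image_id, PIL image); retrieve: function returning ranked image ids.
    hits = sum(1 for image_id, image in targets if image_id in retrieve(perturb(image))[:k])
    return hits / len(targets)
        </preformat>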
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>14https://dh.nb.no/run/bildesok/
15The labels and analysis code are available on GitHub</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and conclusion</title>
      <p>These promising results demonstrate that pre-trained computer vision models provide
meaningful embeddings. This is notable as our data consists of pre-1900 book images and differs
vastly from the training sets of such models, which are typically scraped from the internet.
Furthermore, the results indicate that SigLIP embeddings slightly outperform CLIP and ViT for
all tasks — even for image classification, which ViT was trained for — in line with prior results
showing that multimodal models are more robust to out-of-sample data [15].</p>
      <p>While all models perform well for retrieval, CLIP sometimes struggled, particularly if the
object of interest was off-centre. In such cases, the object is cropped out during preprocessing
and matches will be based on the remaining image. Furthermore, the application performs
well for exact image retrieval, even with up to 30 % cropping in both directions and up to ±10°
rotation. These results are promising, but more work is still needed to evaluate performance
for other degradations (e.g. simulated print and scanning artefacts). Finally, the encouraging
image classification results indicate advantages of adding this methodology to the data
ingestion pipeline. Filtering out irrelevant elements can save up to 40 % storage and improve the
search results.</p>
      <p>In conclusion, we found that by combining tagged graphical elements of the book digitisation
process, NLN’s IIIF endpoint and recent advances in artificial intelligence, we can create an
efficient image search application that facilitates exploring the library’s collection in a new
way.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future work</title>
      <p>As the current prototype image-search app only supports books pre-1900, a natural extension is
including illustration objects from all NLN’s digitised books and newspapers. Moreover, as one
use case we consider is exact image retrieval, an obvious next step is more thorough analysis of
the application’s accuracy on this task, e.g. using additional evaluation measurements for
recall, and including domain-specific degradation (e.g. simulated halftone and scanning
artefacts). Another avenue for future work is comparing deep learning-based similarity measures
with simpler, less computation- and storage-intensive approaches like hashing-based methods.
Additionally, we want to make the software more adaptable, ultimately creating open-source
infrastructure to further these methods’ accessibility for other ALTO-XML and IIIF collections.</p>
      <p>
        Future work should explore the embeddings further, e.g. using CLIP and SigLIP for text-based
image retrieval. Additionally, performance could improve by fine-tuning the embeddings
on domain-relevant data. Moreover, we have so far only used the embeddings for image
retrieval and classification. Using the embeddings as the base to discover clusters, automatically
tag the images or create image descriptions are, therefore, interesting potential steps. Another
important direction is digging deeper into what the models consider “similar” through
visualisations and empirical experiments. Finally, because deep learning-based embeddings are
trained on datasets with known biases [
        <xref ref-type="bibr" rid="ref21">3, 22, 14</xref>
        ], examining biases in these embeddings is
crucial.
      </p>
      <p>[14] A. Mandal, S. Little, and S. Leavy. “Multimodal bias: Assessing gender bias in computer
vision models with NLP techniques”. In: Proceedings of the 25th International Conference
on Multimodal Interaction (ICMI ’23). Paris, France, 2023, pp. 416–424.</p>
      <p>[15] D. Mayo, J. Cummings, X. Lin, D. Gutfreund, B. Katz, and A. Barbu. “How hard are
computer vision datasets? Calibrating dataset difficulty to viewing time”. In: Proceedings of
the 37th International Conference on Neural Information Processing Systems. New Orleans,
LA, USA, 2023, pp. 11008–11036.</p>
      <sec id="sec-6-1">
        <title>Query image Model Pos. 1 Pos. 2</title>
      </sec>
      <sec id="sec-6-2">
        <title>Continued on next page Query image Model Pos. 1</title>
      </sec>
      <sec id="sec-6-3">
        <title>SigLIP</title>
      </sec>
      <sec id="sec-6-4">
        <title>CLIP ViT</title>
      </sec>
      <sec id="sec-6-5">
        <title>SigLIP</title>
      </sec>
      <sec id="sec-6-6">
        <title>CLIP ViT</title>
        <p>Map</p>
      </sec>
      <sec id="sec-6-7">
        <title>Mathematical chart</title>
      </sec>
      <sec id="sec-6-8">
        <title>Musical</title>
        <p>notation</p>
      </sec>
      <sec id="sec-6-9">
        <title>Graphical</title>
        <p>element</p>
      </sec>
      <sec id="sec-6-10">
        <title>Blank</title>
        <p>page</p>
      </sec>
      <sec id="sec-6-11">
        <title>Segmentation anomaly</title>
      </sec>
      <sec id="sec-6-12">
        <title>Illustration or photograph</title>
      </sec>
      <sec id="sec-6-13">
        <title>In total</title>
      </sec>
      <sec id="sec-6-14">
        <title>A perfect classifier will only have nonzero entries on the diagonal.</title>
        <p>Musical
notation
lca ion
i a
s t
uM ton
tteahm trcah</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ansel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Voznesensky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Berard</surname>
          </string-name>
          , E. Burovski,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chauhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chourdia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Constable</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          , E. Ellison,
          <string-name>
            <given-names>W.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gschwind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hirsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kalambarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kirsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lazos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lezcano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Luk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Puhrsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saroufim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Siraichi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Suk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , M. Suo,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tillet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wen</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          . “
          <article-title>PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation”</article-title>
          .
          <source>In:Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems</source>
          , Volume
          <volume>2</volume>
          . La Jolla, CA, USA,
          <year>2024</year>
          , pp.
          <fpage>929</fpage>
          -
          <lpage>947</lpage>
          . doi:
          <volume>10</volume>
          .1145/3620665.3640366.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tilton</surname>
          </string-name>
          . “Distant Viewing:
          <article-title>Analyzing Large Visual Corpora”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 34.Supplement_1</source>
          (
          <issue>2019</issue>
          ), pp.
          <fpage>i3</fpage>
          -
          <lpage>i16</lpage>
          . doi:
          <volume>10</volume>
          .1093/llc/fqz013.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Birhane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. U.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Kahembwe</surname>
          </string-name>
          .
          <article-title>Multimodal datasets: misogyny, pornography, and malignant stereotypes</article-title>
          . https://arxiv.org/abs/2110.01963.
          <year>2021</year>
          . arXiv: 2110.01963.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>M. B. Birkenes</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Johnsen</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A. Kåsen. “NB</given-names>
            <surname>DH-LAB</surname>
          </string-name>
          :
          <article-title>A Corpus Infrastructure for Social Sciences and Humanities Computing”</article-title>
          .
          <source>In:CLARIN Annual Conference Proceedings 2023. Leuven, Belgium</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>M. B. Birkenes</surname>
            ,
            <given-names>L. G.</given-names>
          </string-name>
          <string-name>
            <surname>Johnsen</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Lindstad</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ostad</surname>
          </string-name>
          . “
          <article-title>From Digital Library to N-Grams: NB N-gram”</article-title>
          .
          <source>In: Proceedings of the 20th Nordic Conference of Computational Linguistics</source>
          . Vilnius, Lithuania,
          <year>2015</year>
          , pp.
          <fpage>293</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Biswas</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Blanco-Medina</surname>
          </string-name>
          .
          <article-title>State of the Art: Image Hashing</article-title>
          . https://arxiv.org/abs/2108.11794.
          <year>2021</year>
          . arXiv:
          <volume>2108</volume>
          .
          <fpage>11794</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Dehghani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Minderer</surname>
            , G. Heigold,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gelly</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Uszkoreit</surname>
            , and
            <given-names>N. Houlsby. “</given-names>
          </string-name>
          <article-title>An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”</article-title>
          .
          <source>In:International Conference on Learning Representations</source>
          . Vienna, Austria,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2010.11929.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          . “
          <article-title>Deep Residual Learning for Image Recognition”</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas</source>
          ,
          <string-name>
            <surname>NV</surname>
          </string-name>
          , USA,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C. S.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beelen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonough</surname>
          </string-name>
          . “
          <article-title>MapReader: A Computer Vision Pipeline for the Semantic Exploration of Maps at Scale”</article-title>
          .
          <source>In: Proceedings of the 6th ACM SIGSPATIAL International Workshop on Geospatial Humanities</source>
          . Seattle, WA, USA,
          <year>2022</year>
          , pp.
          <fpage>8</fpage>
          -
          <lpage>19</lpage>
          . doi:
          <volume>10</volume>
          .1145/3557919.3565812.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>D. E. Knuth.</surname>
          </string-name>
          <article-title>The Art of Computer Programming</article-title>
          . Vol.
          <volume>3</volume>
          , Sorting and Searching (2nd Ed.) 2nd ed. Reading, MA, USA:
          <string-name>
            <surname>Addison-Wesley</surname>
          </string-name>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kopinor.</surname>
          </string-name>
          Bokhylla-avtalen (fra
          <year>2024</year>
          ). https://www.kopinor.no/avtaletekster/bokhylla-avtalen-fra-2024.
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y. A.</given-names>
            <surname>Malkov</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Yashunin</surname>
          </string-name>
          . “
          <article-title>Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs”</article-title>
          .
          <source>In:IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>42</volume>
          .4 (
          <issue>2020</issue>
          ), pp.
          <fpage>824</fpage>
          -
          <lpage>836</lpage>
          . doi:
          <volume>10</volume>
          .1109/tpami.
          <year>2018</year>
          .
          <volume>2889473</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Malkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ponomarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Logvinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Krylov</surname>
          </string-name>
          . “
          <source>Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs”. In: Information Systems 45</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>61</fpage>
          -
          <lpage>68</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.is.
          <year>2013</year>
          .
          <volume>10</volume>
          .006.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          . “
          <article-title>Multimedia Search Reranking: A Literature Survey”</article-title>
          .
          <source>In: ACM Computing Surveys 46.3</source>
          (
          <issue>2014</issue>
          ),
          <volume>38</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          :
          <fpage>38</fpage>
          . doi:
          <volume>10</volume>
          .1145/2536798.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Meinecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Guéville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Wrisley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Jänicke</surname>
          </string-name>
          . “Is Medieval Distant Viewing Possible?
          <article-title>: Extending and Enriching Annotation of Legacy Image Collections Using Visual Analytics”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 39.2</source>
          (
          <issue>2024</issue>
          ), pp.
          <fpage>638</fpage>
          -
          <lpage>656</lpage>
          . doi:
          <volume>10</volume>
          .1093/llc/fqae020.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and É. Duchesnay. “
          <string-name>
            <surname>Scikit-Learn</surname>
          </string-name>
          :
          <article-title>Machine Learning in Python”</article-title>
          .
          <source>In: Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ), pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <surname>and I. Sutskever.</surname>
          </string-name>
          “
          <article-title>Learning Transferable Visual Models From Natural Language Supervision”</article-title>
          .
          <source>In:Proceedings of the 38th International Conference on Machine Learning. Online</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ruth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Burghardt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Liebl</surname>
          </string-name>
          . “
          <article-title>From Clusters to Graphs - Toward a Scalable Viewing of News Videos”</article-title>
          .
          <source>In:Computational Humanities Research Conference</source>
          <year>2023</year>
          (
          <article-title>CHR2023)</article-title>
          . Paris, France,
          <year>2023</year>
          , pp.
          <fpage>167</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Smits</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          . “
          <article-title>Towards Multimodal Computational Humanities. Using CLIP to Analyze Late-Nineteenth Century Magic Lantern Slides</article-title>
          .”
          <source>In:Computational Humanities Research Conference</source>
          <year>2021</year>
          (
          <article-title>CHR2021)</article-title>
          . Online,
          <year>2021</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Smits</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wevers</surname>
          </string-name>
          .
          <article-title>“A Multimodal Turn in Digital Humanities. Using Contrastive Machine Learning Models to Explore, Enrich, and Analyze Digital Visual Historical Collections”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 38.3</source>
          (
          <issue>2023</issue>
          ), pp.
          <fpage>1267</fpage>
          -
          <lpage>1280</lpage>
          . doi:
          <volume>10</volume>
          .1093/llc/fqad008.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Snydman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Cramer</surname>
          </string-name>
          . “
          <article-title>The International Image Interoperability Framework (IIIF): A Community &amp; Technology Approach for Web-Based Images”</article-title>
          . In: Archiving Conference. Los Angeles, CA, USA.,
          <year>2015</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K. Spärck</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>“A Statistical Interpretation of Term Specificity and Its Application in Retrieval”</article-title>
          .
          <source>In: Journal of Documentation 28.1</source>
          (
          <issue>1972</issue>
          ), pp.
          <fpage>11</fpage>
          -
          <lpage>21</lpage>
          . doi:
          <volume>10</volume>
          .1108/eb026526.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Wevers</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Smits</surname>
          </string-name>
          . “
          <article-title>The Visual Digital Turn: Using Neural Networks to Study Historical Images”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 35.1</source>
          (
          <issue>2019</issue>
          ), pp.
          <fpage>194</fpage>
          -
          <lpage>207</lpage>
          . doi:
          <volume>10</volume>
          .1093/llc/fqy085.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Le</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          . “Transformers:
          <article-title>State-of-the-Art Natural Language Processing”</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .emnlp-demos.
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          . “
          <article-title>Sigmoid Loss for Language Image Pre-Training”</article-title>
          .
          <source>In: Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision</source>
          . Paris, France,
          <year>2023</year>
          , pp.
          <fpage>11975</fpage>
          -
          <lpage>11986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          .
          <article-title>Recent Advance in Content-based Image Retrieval: A Literature Survey</article-title>
          . https://arxiv.org/abs/1706.06064.
          <year>2017</year>
          . arXiv:
          <volume>1706</volume>
          .
          <fpage>06064</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>