<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Geometries for Prototype-Based Image Classification: a Reality Check</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samuele Fonio</string-name>
          <email>samuele.fonio@unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Grosso</string-name>
          <email>silvia.grosso@insa-lyon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Bouchenak</string-name>
          <email>sara.bouchenak@insa-lyon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Prototype Learning, Metric Learning, Non-Euclidean Geometries</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INSA Lyon - LIRIS</institution>
          ,
          <addr-line>Lyon</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Turin - Dept. of Computer Science</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Identifying the most appropriate metric to capture similarities between embeddings remains a longstanding challenge in Machine Learning (ML). Recent research has highlighted the potential of non-Euclidean geometries for modeling embedding distances, particularly in datasets with latent hierarchical structures. However, selecting an optimal geometry is non-trivial, and the literature lacks a comprehensive analysis of the advantages and limitations associated with different metric spaces. In this paper, we aim to address this gap by focusing on a setting where the choice of geometry plays a crucial role in the optimization process: Prototype Learning (PL). Through extensive comparisons across a diverse range of datasets, we uncover key insights into the behavior of non-Euclidean spaces, showcasing their limitations and possible future developments.</p>
      </abstract>
      <kwd-group>
        <kwd>Prototype Learning</kwd>
        <kwd>Metric Learning</kwd>
        <kwd>Non-Euclidean Geometries</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Deep Learning (DL) has emerged as the main tool for detecting complex patterns and solving very
challenging tasks in Machine Learning (ML), achieving its best results in Computer Vision (CV) and
Natural Language Processing (NLP). It relies on Deep Neural Networks (DNNs), which are known
to be very effective, but are also considered black boxes, since they usually lack interpretability and
explainability by design.</p>
      <p>In particular, DL relies on compressing data representations in a so-called latent space, and then using
these representations to accomplish specific tasks. Especially for classification tasks, it is crucial to find
the best way to detect similarities between embeddings, since drawing decision boundaries heavily
depends on it.</p>
      <p>Recently, non-Euclidean geometries have garnered much interest in the research community.
Specifically, the key point of leveraging non-Euclidean geometries is to embed data onto specific manifolds
equipped with different metric spaces, to better shape the underlying relationships between the data.
Among these, the hyperspherical and hyperbolic geometries have shown promising research directions.
      <p>
        Hyperspherical geometries, which mainly involve normalization of the embeddings and the use
of cosine similarity, have been well studied in contrastive learning and face recognition [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. This
geometry has the main benefit of setting a bound on the distance magnitude (since cosine similarity is
bounded), possibly reflecting the underlying geometry of the data.
      </p>
      <p>
        Hyperbolic geometries were introduced to explicitly handle hierarchical data, such as text [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
graphs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In these cases, the underlying geometry of the data is the hyperbolic one, since the data are
explicitly hierarchical and this geometry is able to embed hierarchical structures with arbitrarily low
distortion [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, recent advancements have shown the benefits of hyperbolic geometries also in
CV tasks [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ], where the hierarchy is somewhat hidden. While this information is crucial to justify
the use of these geometries, it is often overlooked, and works show the leading performance of non-Euclidean
geometries without a clear explanation. On the other hand, it is possible to find CV tasks where a
hierarchy is present but implicit. For example, in fine-grained image classification the labels are usually
arranged according to a hierarchy, making hyperbolic spaces particularly suitable. Another example is
remote sensing, where data can show specific hierarchical structures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
        <p>
          Many works have shown the discrepancy between Euclidean and hyperbolic performances, especially
in few-shot learning [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and also raised some concerns about the numerical stability of these spaces [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
As a consequence, we aim to thoroughly investigate the impact of non-Euclidean geometries in a context
where metric elements are crucial: Prototype Learning [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (PL). PL was originally proposed as a simple
yet efficient method for few-shot learning. The main idea behind PL is to learn a metric rather than a
distribution by minimizing some metric elements (e.g., a distance) between embeddings and their true
class representation, i.e., the prototypes. This approach led to good generalization properties, while
leveraging pure metrics in the embedding space.
        </p>
        <p>
          In this context, the geometry of the embedding space plays a crucial role. The employment of PL in
image classification is often overlooked, but recent advancements introduce non-Euclidean geometries
for image classification [
          <xref ref-type="bibr" rid="ref13 ref14 ref7">13, 7, 14</xref>
          ], highlighting the potential for this paradigm to play an important
role in this domain. In fact, PL allows for smooth integration of regularization terms directly in the
embedding space, and is known to be more robust to Out-Of-Distribution (OOD) datasets [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. However,
the impact of the embedding geometry in this context has not been properly studied.
        </p>
        <p>
          In this work, we want to shed light on this aspect, by considering PL for image classification in both a
full-data setting (with a standard amount of data) and a few-data setting (with a small amount of data).
To do so, we leverage a framework in which the prototypes are parametric (see Figure 1) and updated
directly on the manifold. We thoroughly define the mathematical elements behind such an optimization
process and show some possible drawbacks of the Hyperspherical and Poincaré geometries. In particular,
since hyperbolic spaces are claimed to be more effective in terms of robustness, we also evaluate the different
geometries with Out-Of-Distribution Detection (common evaluation in PL for image classification [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ])
and under Projected Gradient Descent (PGD) attacks. To provide a complete comparison also between
hyperbolic geometries, we leverage for the first time the Lorentz geometry for image classification in a
PL setting. This is the first study comparing in a broad and fair manner all the geometries for PL in
image classification.
        </p>
      <p>To summarize, our contribution is threefold: i) we provide a comprehensive comparison of different
geometries in PL for image classification; ii) we introduce Lorentzian prototypical networks for image
classification; iii) we conduct extensive experiments to validate whether or not the non-Euclidean
geometries can benefit the learning process.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Prototype Learning</title>
        <p>
          Prototypical networks are the deep generalization of Learning Vector Quantization machines [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and
nearest centroid classifiers [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>
          In most of the approaches, the prototypes are defined as centroids of the representations [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
positioned a priori [
          <xref ref-type="bibr" rid="ref13 ref18 ref8">13, 18, 8</xref>
          ] or used as parameters and updated alongside the training [
          <xref ref-type="bibr" rid="ref15 ref19">15, 19</xref>
          ]. In
particular, [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] highlights the importance and benefit of the latter choice, providing interesting results
of parametric prototypes in terms of several metrics (e.g., robustness and out-of-distribution detection).
An additional benefit of using parametric prototypes is the possibility to add a regularization on the
prototypes, rather than on the embeddings. For example, [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] shows the benefit of using parametric
prototypes and adding hierarchical information to the learning process, provided a priori through a
hierarchy of the labels.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Non-Euclidean Prototype Learning</title>
        <p>
          Hyperspherical PL The work presented in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] introduces a non-Euclidean geometry in PL, i.e., the
hyperspherical prototypical networks, keeping the prototypes fixed on the hypersphere and maximizing
the cosine separation between them. The addition of hierarchical information is also investigated, by
using a triplet-loss while positioning the prototypes. The importance of incorporating hierarchical
information in this non-Euclidean context was further explored by [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. It is worth mentioning
that in these frameworks the prototypes are kept fixed throughout the training, while our approach
keeps updating the prototypes, treating them as learnable parameters of the network.
Similarly to our approach, [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] tackles the specific challenge of face recognition and exploits the utility
of a normalization layer for the embeddings, leveraging the benefits of using the cosine similarity.
Differently, we tackle a broader task, image classification, and enrich the comparison with other
non-Euclidean geometries.
        </p>
        <p>
          Hyperbolic PL Hyperbolic manifolds are claimed to represent hierarchical data with arbitrarily
low distortion [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Regarding hyperbolic models, five different ones have been defined: the Poincaré
ball, the Lorentz model (Hyperboloid), the Poincaré half-plane model, the hemisphere model, and the
Beltrami-Klein disk. The two most popular are the Poincaré ball and the Lorentz model.
        </p>
        <p>
          Among the works that operate mainly with the Poincaré ball there are [
          <xref ref-type="bibr" rid="ref20 ref21 ref6 ref8">20, 6, 21, 8</xref>
          ], covering
various computer vision tasks such as few-shot learning, image classification, and action recognition.
More specifically, some works highlighted that models built on hyperbolic spaces can outperform the
state-of-the-art Euclidean counterparts [
          <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
          ]. Among these, [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] gained particular relevance in the
recent literature, as it presented the first hyperbolic prototypical framework. However, [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] showed
that hyperbolic representations do not outperform well-structured Euclidean methods, raising some
questions about the supposed advantages typically attributed to these spaces.
        </p>
        <p>
          Our work can be considered a further exploration in this direction, setting up a benchmark for various
non-Euclidean geometries in the context of image classification. In fact, few works have tackled the
problem of exploiting hyperbolic geometries for image classification. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] pioneered this approach,
while [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] explored the impact of changing the temperature parameter in a contrastive loss when
exploring a hyperbolic space. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] introduces a method with fixed ideal prototypes positioned at the
boundary of the Poincaré ball, which is conceptually at an infinite distance from the center of the
hyperbolic space. A diferent approach is taken by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which showed that using hyperbolic entailment
cones as metric elements is particularly beneficial, especially in scenarios with a large number of classes.
However, in both these scenarios, the prototypes are fixed.
        </p>
        <p>
          Several works have been based on the Lorentz model [
          <xref ref-type="bibr" rid="ref3">23, 3</xref>
          ], but Computer Vision is usually
overlooked in the main literature. For example, [24] proposes a fully hyperbolic CNN based on Lorentz
operations, achieving better results than the Euclidean version and the Poincaré version [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Some
studies comparing the Poincaré ball and Lorentz models [
          <xref ref-type="bibr" rid="ref11">11, 25</xref>
          ] report that Lorentz operations are
numerically more stable. Our work includes this comparison, but focuses more on the metrics on
which hyperbolic spaces are claimed to outperform the Euclidean ones: accuracy, robustness and OOD
detection.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>In this section, we provide the background related to each geometry: Euclidean,
Hyperspherical, Lorentz and Poincaré ball. Comparing different geometries implies comparing different
manifolds, equipped with their own metric elements and operations. However, each manifold has its
own definition, impacting both the metric elements and their usage. In the following, we define the
key elements to be used, and we provide their formulation for each geometry.</p>
      <p>Definition 1. A manifold ℳ of dimension n is a topological space such that each point’s neighborhood
can be locally approximated by the Euclidean space ℝⁿ.</p>
      <p>Definition 2. Given a point x ∈ ℳ, the tangent space Tₓℳ of ℳ at x is the n-dimensional vector space,
homeomorphic to ℝⁿ, built as the first-order approximation of ℳ around x.</p>
      <p>Definition 3. The Riemannian metric is the metric tensor that gives a local notion of angle, length of
curves, surface and volume. For a manifold ℳ, the Riemannian metric gₓ is a smooth collection of inner
products on the associated tangent space: gₓ : Tₓℳ × Tₓℳ → ℝ. A Riemannian manifold is defined
as a manifold equipped with a Riemannian metric g, and is written (ℳ, g).</p>
      <p>Definition 4. A geodesic γ is the shortest path between two points on the manifold. It can be seen as the
generalization of the straight line in Euclidean spaces. Given x, y ∈ ℳ, the distance d(x, y) is defined by
measuring the length of the geodesic segment connecting the two points.</p>
      <p>Definition 5. Given a point x ∈ ℳ and a vector v ∈ Tₓℳ, the exponential map expₓ(v) : Tₓℳ → ℳ
projects v onto the manifold ℳ. The projection is obtained by moving the point along the geodesic
γ : [0, 1] → ℳ uniquely defined by γ(0) = x and γ′(0) = v; the projection is defined to be expₓ(v) = γ(1).
The precise definition of the exponential map depends on the manifold; its inverse function is called the
logarithmic map, logₓ(⋅). See Figure 2 to visualize the exponential map.</p>
      <p>Definition 6. Given a point x ∈ ℝⁿ in the ambient space, the projection onto the manifold ℳ is the
operation that maps x to the closest point on ℳ with respect to a given metric (typically the Euclidean
metric), denoted by Π_ℳ(x) : ℝⁿ → ℳ.</p>
      <p>Given these definitions, we can now describe the respective elements for each geometry.</p>
      <p>Euclidean The Euclidean setting is the standard one, as the embedding space of a Neural Network is
usually assumed to be equipped with this geometry. In this setting, there is no need for a projection onto
the manifold, as we assume the embeddings to lie in a Euclidean space. The distance used is the standard
Euclidean distance:</p>
      <p>d(x, y) = ‖x − y‖.</p>
      <p>It is worth mentioning that we did not use the squared norm, as it yields low performances [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].</p>
      <p>Hyperspherical The manifold in this case is defined as the hypersphere in d dimensions:
𝕊^{d−1} = {x ∈ ℝ^d : ‖x‖₂ = 1}. This manifold is usually equipped with the standard cosine distance:</p>
      <p>d(x, y) = 1 − ⟨x, y⟩ / (‖x‖ ‖y‖).</p>
      <p>Let p ∈ 𝕊^{d−1} ⊂ ℝ^d and v ∈ Tₚ𝕊^{d−1} = {v ∈ ℝ^d : ⟨p, v⟩ = 0}. The exponential map at p is given by:</p>
      <p>expₚ(v) = cos(‖v‖) p + sin(‖v‖) v/‖v‖ if v ≠ 0, and expₚ(0) = p.</p>
      <p>Given a point x ∈ ℝ^d ∖ {0}, the projection onto the unit hypersphere 𝕊^{d−1} is:</p>
      <p>Π_{𝕊^{d−1}}(x) = x / ‖x‖.   (1)</p>
      <p>Lorentz The d-dimensional Lorentz model of hyperbolic space, denoted by ℍ^d, is defined as:</p>
      <p>ℍ^d = {x ∈ ℝ^{d+1} : ⟨x, x⟩_ℒ = −1, x₀ &gt; 0},   (2)</p>
      <p>where ⟨⋅, ⋅⟩_ℒ is the Lorentzian inner product on ℝ^{d+1}, defined as:</p>
      <p>⟨x, y⟩_ℒ = −x₀ y₀ + ∑_{i=1}^{d} xᵢ yᵢ,   (3)</p>
      <p>for any x, y ∈ ℝ^{d+1}. The tangent space at a point p ∈ ℍ^d is the subspace of ℝ^{d+1} given by:</p>
      <p>Tₚℍ^d = {v ∈ ℝ^{d+1} : ⟨p, v⟩_ℒ = 0}.</p>
      <p>The Riemannian metric on ℍ^d is induced by the Lorentzian inner product restricted to the tangent
space. For tangent vectors u, v ∈ Tₚℍ^d, it is given by:</p>
      <p>gₚ(u, v) = ⟨u, v⟩_ℒ.</p>
      <p>The distance between two points x, y ∈ ℍ^d is given by:</p>
      <p>d_ℍ(x, y) = arccosh(−⟨x, y⟩_ℒ).</p>
      <p>Given a point p ∈ ℍ^d and a tangent vector v ∈ Tₚℍ^d, the exponential map is defined as:</p>
      <p>expₚ(v) = cosh(‖v‖_ℒ) p + sinh(‖v‖_ℒ) v/‖v‖_ℒ, where ‖v‖_ℒ = √⟨v, v⟩_ℒ.</p>
      <p>Given a point x ∈ ℝ^{d+1} such that ⟨x, x⟩_ℒ &lt; 0 and x₀ &gt; 0, the projection onto the hyperboloid is:</p>
      <p>Π_{ℍ^d}(x) = x / √(−⟨x, x⟩_ℒ).   (4)</p>
      <p>Poincaré ball The d-dimensional Poincaré ball model of hyperbolic space with curvature c, denoted
by 𝔹^d_c, is the open unit ball in ℝ^d:</p>
      <p>𝔹^d_c = {x ∈ ℝ^d : ‖x‖ &lt; 1/√c}.</p>
      <p>If not otherwise stated, we assume c = 1. At a point x ∈ 𝔹^d_c, the tangent space is identified with ℝ^d:
Tₓ𝔹^d_c ≅ ℝ^d. The Riemannian metric on the Poincaré ball is conformally equivalent to the Euclidean
metric, which means that it preserves angles but not lengths. It is defined as:</p>
      <p>gₓ(u, v) = λₓ² ⟨u, v⟩, where λₓ = 2 / (1 − ‖x‖²),</p>
      <p>and ⟨⋅, ⋅⟩ is the standard Euclidean inner product. The distance between two points x, y ∈ 𝔹^d is given by:</p>
      <p>d_𝔹(x, y) = arccosh( 1 + 2 ‖x − y‖² / ((1 − ‖x‖²)(1 − ‖y‖²)) ).</p>
      <p>Given a point x ∈ 𝔹^d and a tangent vector v ∈ Tₓ𝔹^d ≅ ℝ^d, the exponential map is defined as:</p>
      <p>expₓ(v) = x ⊕ ( tanh(λₓ ‖v‖ / 2) v/‖v‖ ), for v ≠ 0,   (5)</p>
      <p>while for v = 0 we define expₓ(0) = x. In particular, at the origin expₓ reduces to
exp₀(v) = tanh(√c ‖v‖) v / (√c ‖v‖). Here ⊕ denotes the Möbius addition of x, y ∈ 𝔹^d:</p>
      <p>x ⊕ y = ( (1 + 2⟨x, y⟩ + ‖y‖²) x + (1 − ‖x‖²) y ) / ( 1 + 2⟨x, y⟩ + ‖x‖² ‖y‖² ).</p>
      <p>Given a point x ∈ ℝ^d ∖ 𝔹^d, the projection onto the Poincaré ball is given by:</p>
      <p>Π_{𝔹^d}(x) = (1 − ε) x / ‖x‖,</p>
      <p>where ε &gt; 0 is a small numerical tolerance to avoid projecting exactly onto the boundary ‖x‖ = 1.</p>
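      <p>To make the comparison concrete, the four distances above can be summarized in a few lines of
code. The following is a minimal, illustrative PyTorch sketch (the function names and the numerical
clamping are our own choices, not taken from any specific library); it assumes 1-D tensors, with the
points of the Lorentz model living in ℝ^{d+1}.</p>
      <preformat>
import torch

def dist_euclidean(x, y):
    # Standard Euclidean distance: norm of the difference
    return torch.norm(x - y)

def dist_hypersphere(x, y):
    # Cosine distance for unit-norm x, y on the hypersphere
    return 1.0 - torch.dot(x, y)

def lorentz_inner(x, y):
    # Lorentzian inner product, Eq. (3): -x_0 y_0 + sum_i x_i y_i
    return -x[0] * y[0] + torch.dot(x[1:], y[1:])

def dist_lorentz(x, y, eps=1e-7):
    # d_H(x, y) = arccosh of minus the Lorentzian inner product;
    # clamping guards against arguments slightly below 1 due to rounding
    return torch.acosh(torch.clamp(-lorentz_inner(x, y), min=1.0 + eps))

def dist_poincare(x, y):
    # Poincare distance for points with norm strictly below 1 (c = 1)
    sq = torch.sum((x - y) ** 2)
    denom = (1 - torch.sum(x ** 2)) * (1 - torch.sum(y ** 2))
    return torch.acosh(1 + 2 * sq / denom)
      </preformat>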
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Metric Learning</title>
        <p>In this section, we explain the methodology we employed.</p>
        <p>
          Standard PL relies on centroid prototypes, which means that the prototypes are defined as centroids of
the representations of each class. Alternatively, [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] introduced the idea of extending deep
networks to learn prototypes by embedding the prototype representations as network parameters.
        </p>
        <p>
          Our methodology involves extracting the output from a backbone network, such as a ResNet18, and
projecting it onto the manifold ℳ. Subsequently, distances from class prototypes are computed. Each
geometry has its own distance, which may or may not effectively represent the distances
between embeddings. These distances are interpreted as a probability distribution using a softmax
activation, which is then employed in a cross-entropy loss function for learning purposes.
Formally, we assume to be given a dataset D = {(x_i, y_i)}_{i=1}^{N}, with x_i taking values in a sample
space X, y_i ∈ Y = {1, …, K}, and |D| = N. A backbone network f(⋅, θ) : X → ℝᵐ is augmented with
parameters Π = {p_c, c ∈ Y} representing the prototypes, with p_c ∈ ℳ. We attach to the backbone a
projection layer h : ℝᵐ → ℳ, which maps the embeddings onto the manifold, enabling the use of
different metric elements, obtaining z = g(x) = h ∘ f(x, θ). Learning then amounts to finding the
parameters θ and Π minimizing over the training set the distance-based cross-entropy loss [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]:
        </p>
        <p>ℒ(θ, Π; D) = (1/N) ∑_{(x_i, y_i) ∈ D} − log [ exp(−d(z_i, p_{y_i})/τ) / ∑_{c ∈ Y} exp(−d(z_i, p_c)/τ) ],   (6)</p>
        <p>where τ is the temperature parameter [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and d(⋅, ⋅) is the distance defined on the manifold.
        </p>
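        <p>As a concrete illustration of Eq. (6), a minimal PyTorch sketch of the distance-based cross-entropy
follows; the names (prototype_loss, dist, protos) are ours, dist stands for any of the manifold distances
of Section 3, and the double loop is deliberately naive for readability.</p>
        <preformat>
import torch
import torch.nn.functional as F

def prototype_loss(z, labels, protos, dist, tau=0.1):
    # z: [B, m] embeddings already projected onto the manifold
    # protos: [K, m] prototypes on the manifold; labels: [B] class indices
    d = torch.stack([torch.stack([dist(zi, p) for p in protos]) for zi in z])  # [B, K]
    # Softmax over negative scaled distances, then negative log-likelihood: Eq. (6)
    return F.cross_entropy(-d / tau, labels)
        </preformat>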
        <p>The temperature is used to modulate the sharpness or smoothness of distances. Lower temperatures
tend to sharpen distributions, leading to more confident, peaked outputs, while higher temperatures
flatten distributions, allowing for greater exploration or uncertainty. This tuning plays a crucial role in
model behavior, especially in non-Euclidean geometries such as hyperbolic or spherical spaces, where
distances and similarity measures differ fundamentally from flat Euclidean space. In these geometries,
the curvature affects how features are clustered and how separation boundaries form, making them more
sensitive to the scale at which similarities are interpreted. Improper temperature tuning can distort
learning dynamics by over- or under-emphasizing these curved-distance relationships, potentially
degrading the quality of embeddings or classification performance. Therefore, understanding and
adjusting the temperature parameter becomes even more critical when models operate in non-Euclidean
latent spaces.</p>
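        <p>A toy numerical example (with hypothetical distance values) makes the effect of the temperature visible:</p>
        <preformat>
import torch

d = torch.tensor([1.0, 1.5, 3.0])           # distances to three prototypes
for tau in (1.0, 0.1):
    print(tau, torch.softmax(-d / tau, dim=0))
# tau = 1.0 yields a relatively flat distribution over the classes,
# tau = 0.1 concentrates almost all the mass on the closest prototype.
        </preformat>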
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Projection layer</title>
        <p>A key module of our methodology is the projection onto the manifold ℳ, i.e., h : ℝᵐ → ℳ. For the
Euclidean geometry, we assume the embedding space to be already equipped with the Euclidean metric
and operations; as a consequence, there is no need for an explicit projection.</p>
        <p>For non-Euclidean geometries, we need to consider each case separately. In particular, for the
Poincaré ball, our projection is the exponential map at the origin (5), which is the standard choice
adopted by the main works using the Poincaré ball [
          <xref ref-type="bibr" rid="ref14 ref7 ref8">7, 8, 14</xref>
          ]. However, this operation has an inherent assumption: f(x, θ) ∈ T₀𝔹^d_c. This
assumption is plausible for this geometry, while being incoherent for the hyperspherical and the Lorentz
cases, for different reasons.</p>
        <p>For the hypersphere, this is due to the role of the origin, for which 0 ∈ 𝔹^d_c while 0 ∉ 𝕊^{d−1}, which
does not allow us to use the exponential map based at 0 to project the points onto the hypersphere. It is
however reasonable to use the projection (1).</p>
        <p>For what concerns the Lorentz geometry, we still cannot use the exponential map based at 0 since,
again, the origin 0 ∉ ℍ^d according to (2). As a consequence, we cannot consider the tangent space at a
point that is not on the manifold. For this geometry, as a projection layer we need to use (4).</p>
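        <p>A minimal sketch of the three non-Euclidean projection layers discussed above follows (our own
naming; the Lorentz case assumes an ambient vector in ℝ^{d+1} with negative Lorentzian self-product,
and the Poincaré case composes the feature clipping of subsection 4.3 with the exponential map at the
origin for c = 1):</p>
        <preformat>
import torch

def proj_hypersphere(x, eps=1e-7):
    # Projection (1): rescale to unit norm
    return x / torch.clamp(torch.norm(x), min=eps)

def proj_lorentz(x):
    # Projection (4): rescale so that the Lorentzian self-product equals -1
    inner = -x[0] ** 2 + torch.sum(x[1:] ** 2)
    return x / torch.sqrt(-inner)

def proj_poincare(x, max_norm=1.0, eps=1e-7):
    # Clip the feature norm to at most 1, then apply the exponential map at 0 (Eq. 5)
    n = torch.clamp(torch.norm(x), min=eps)
    x = x * torch.clamp(max_norm / n, max=1.0)
    n = torch.clamp(torch.norm(x), min=eps)
    return torch.tanh(n) * x / n
        </preformat>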
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Prototypes’ update</title>
        <p>It is important to notice that prototypes within this context possess a dual nature: they function as
parameters of the architecture, while existing as entities within the embedding space. Computing
the gradient involves operating within the parameter space, but it is instead crucial to maneuver the
prototypes within the embedding space, for which the update rule must be coherent with the chosen
geometry. Specifically, an operation that avoids moving the prototypes off the manifold is needed.
It is worth mentioning that this operation involves only the prototypes, while the parameters of the
backbone θ can be updated by standard SGD.</p>
        <p>The standard SGD update rule implicitly assumes a Euclidean geometry:</p>
        <p>p ← p − η ⋅ ∇_p ℒ,   (7)</p>
        <p>where ∇_p is the standard Euclidean gradient with respect to the prototypes and η is the learning rate.</p>
        <p>On the other hand, if we operate on a non-Euclidean manifold, the traditional SGD rule is no longer
valid. In particular, using Eq. (7) is not consistent with the manifold, as it does not ensure that p ∈ ℳ
after the update.</p>
        <p>One possible solution for this is to apply SGD regardless of the geometry and then apply the projection
layer on the obtained prototypes, ensuring that the prototypes lie on the manifold. However, this
procedure violates the geometry of the embedding space.</p>
        <p>In a more elegant way, it is possible to use a Riemannian version of SGD [26] (shown in Figure 3),
which depends on the manifold. In particular, the Euclidean gradient is not constrained to the tangent
space of the manifold. As a consequence, the Euclidean gradient (which is efficient to compute) needs
to be projected onto the tangent space of the manifold. Denoting ∇ℒ = ∇_p ℒ if not stated otherwise,
for the Hypersphere we have the following formulation of the Riemannian gradient:</p>
        <p>∇^𝕊 ℒ = ∇ℒ − ⟨∇ℒ, p⟩ p.   (8)</p>
        <p>For the Lorentz model:</p>
        <p>∇^ℍ ℒ = ∇ℒ + ⟨∇ℒ, p⟩_ℒ ⋅ p.   (9)</p>
        <p>For the Poincaré ball:</p>
        <p>∇^𝔹 ℒ = (1/λ_p²) ∇ℒ.   (10)</p>
        <p>Once the gradient is projected onto the right space, we still need a proper update rule to actually move
the prototypes. To accomplish this, we use RSGD [26] in its most general form:</p>
        <p>p ← exp_p(−η ⋅ ∇^ℳ ℒ),   (11)</p>
        <p>where ∇^ℳ is the Riemannian gradient relative to the manifold ℳ. The intuition behind this update rule
is that the exponential map folds the gradient vector on the tangent space onto the manifold,
moving the prototype accordingly. In recent research, Riemannian versions of other optimizers
(e.g., ADAM) have also been studied [27] but, to the best of our knowledge, never in a PL setting, which
actually represents a nice study case for this type of optimization.</p>
        <p>
          Poincaré ball In the case of the Poincaré ball, it is important to notice that embeddings close to
the boundary negatively impact the learning, causing possible vanishing-gradient behaviors [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
As a consequence, we use two techniques to prevent this scenario. First, prototypes p_c are initialized by
sampling randomly from [−0.1, 0.1] and then projecting the sampled points onto the Poincaré ball via
the exponential map. Secondly, we clip the norm of the features to be at most 1 before applying the
exponential map.
        </p>
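        <p>In practice, both the Riemannian gradient projection and the update rule (11) are provided by
geoopt [32]. The following sketch (layer sizes and hyperparameters are illustrative, mirroring the shrink
initialization described above) shows how parametric prototypes on the Poincaré ball can be registered
and optimized:</p>
        <preformat>
import torch
import geoopt

K, m = 100, 100                              # number of classes, embedding dimension
ball = geoopt.PoincareBall(c=1.0)
# Initialize in [-0.1, 0.1]^m, then map onto the ball via the exponential map at 0
init = ball.expmap0(0.1 * (2 * torch.rand(K, m) - 1))
protos = geoopt.ManifoldParameter(init, manifold=ball)
# RiemannianSGD rescales the gradient (Eq. 10) and follows the manifold update (Eq. 11)
opt = geoopt.optim.RiemannianSGD([protos], lr=1e-3, weight_decay=1e-4)
        </preformat>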
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental setting</title>
        <p>
          Datasets We compare the performances of the algorithms on four public datasets: CIFAR-10 [28]
10 classes, 50000/10000 examples (train/test); CIFAR-100 [28] 100 classes, 50000/10000 examples
(train/test); CUB [29] 200 classes, 5994/5794 examples (train/test); Aircraft [30] 100 classes, 6667/3333
examples (train/test). The datasets were selected because they provide a fine-grained hierarchy over
the classes (Aircraft, CUB) or because they show low δ-hyperbolicity (CIFAR-100, CIFAR-10) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>We consider two experimental settings: the full-data setting, where the full training set is used for
model training, and the few-data setting, where we simulate low-data regimes by limiting the training
set to only n examples per class, with n ∈ {5, 15, 30}, allowing us to analyze the scaling behavior of the
diferent geometries under data scarcity.</p>
        <p>Baselines We compared the embedding spaces equipped with the different geometries: Euclidean
(ECL), Hyperspherical (HPS), Lorentz (LOR), Poincaré (POI). Each method is equipped with a
ResNet18 [31] backbone network. It is worth mentioning that, due to the small sizes of the CIFAR
datasets (images are 32 × 32 pixels) a standard procedure is to replace the first layer of the ResNet with a
3 × 3 kernel, rather than a 7 × 7, and we employed this technique in both the full-data and few-data
settings. The training setup for the traditional scenario was SGD with a learning rate of 0.1, momentum
of 0.9, and weight decay of 0.0005, for 200 epochs. Additionally, we used a learning rate scheduler which
divided the learning rate by 5 at epochs 60, 120, and 160. In the case of non-Euclidean geometries (except for
Hyperspherical), we used RSGD [26], with the same hyperparameter setting. We used geoopt [32] to
implement the non-Euclidean operations. To test the best model, we kept a validation set extracted from
the training set (10%) and picked the best model according to the validation accuracy. We employed
early stopping with a patience of 10 steps, activated after the third learning rate scheduler step (epoch
160). Further details about the optimization are provided in the discussion.</p>
        <p>In the few-data setting, due to the lack of data, we employed a finetuning setting on a pretrained
ResNet18 backbone, using RADAM [27] with learning rate 0.0003, weight decay 0.0003 and generally
small batch sizes for the different datasets, since the number of examples is impacted by the number
of classes. Specifically, we picked batch sizes 2, 2, 4 for CIFAR-10 for n equal to 5, 15, 30 respectively;
batch sizes 8, 16, 16 for CIFAR-100 and Aircraft, and batch sizes 16 and 32 for CUB with n equal to 5 and
15, while 30 was not possible due to the lack of examples.</p>
        <p>For the temperature, we performed a hyperparameter tuning, which is discussed in the Results
subsection 5.2. In particular, if not otherwise stated, we picked τ = 1 for the Euclidean geometry, except for
CIFAR-10, for which we used τ = 0.1. For the Hyperspherical geometry we used τ = 0.1. For Lorentz
we employed τ = 0.1 for all the datasets except CIFAR-10, for which we used τ = 1. Finally, for Poincaré,
we used τ = 0.1 for CIFAR-10 and CIFAR-100, and τ = 0.02 for Aircraft and CUB.</p>
        <p>
          The embedding dimension was set equal to the number of classes for each dataset. Hyperbolic
embeddings have shown good performance in low-dimensional learning [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], but we leave this investigation
for future work.
        </p>
      <p>The experiments were run on HPC4AI [33] on a cluster with 4 nodes, each having 2 CPUs (2x Intel®
Xeon® Processor E5-2680 v3, 12 cores, 2.1 GHz), 128 GB DDR4-2133 RAM (8 x 16 GB), 1 x 800 GB SAS
SSD (6 Gbps, 2.5"), IB 56 Gb + 2x10 Gb networking, and 2 x NVIDIA T4 (Tesla) GPUs on PCI-E Gen3 x16.</p>
        <p>The code is publicly available at this link.</p>
        <p>
          Metrics Our evaluation employed a range of metrics, including accuracy, robustness, and
out-of-distribution (OOD) detection. Notably, all datasets used are balanced in class distribution, making
standard accuracy a reliable and meaningful metric for performance assessment. Concerning
robustness, we provide the results under PGD attacks [34] with ε ∈ {0, 0.8/255, 1.6/255, 3.2/255},
similar to [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The OOD detection metric is calculated as the gap between the confidence on the
trained dataset and the confidence on an OOD dataset, similarly to [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. In particular, if we denote with
ĉ(x) = max_{j ∈ {1, …, C}} σ_j(g(x, θ)), where C is the number of classes, g is the network as previously
defined and σ is the SoftMax function, the OOD score is calculated as:
        </p>
        <p>OOD(D_in, D_out) = (1/|D_in|) ∑_{x ∈ D_in} ĉ(x) − (1/|D_out|) ∑_{x ∈ D_out} ĉ(x),   (12)</p>
        <p>where D_in is the in-distribution dataset (the one on which the model was trained) and D_out is the OOD
dataset.</p>
        <p>Ideally, we would like a model to have high confidence on its training data, while showing low
confidence on OOD samples. Due to a minor modification in the first convolutional layer of the network,
necessary to accommodate the smaller image size of 32 × 32 used in CIFAR-10 and CIFAR-100, we
restrict OOD evaluations to compatible architectures. Specifically, we use CIFAR-10 as the OOD dataset
for models trained on CIFAR-100 (and vice versa), and CUB as the OOD dataset for models trained on
Aircraft (and vice versa), since the latter pair uses the standard ResNet18 without architectural changes.</p>
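        <p>A sketch of the confidence-gap computation of Eq. (12) is given below; model is assumed to output
the distance-based logits −d(z, p_c)/τ, and the loader names are illustrative.</p>
        <preformat>
import torch

@torch.no_grad()
def mean_confidence(model, loader, device="cuda"):
    confs = []
    for x, _ in loader:
        probs = torch.softmax(model(x.to(device)), dim=1)
        confs.append(probs.max(dim=1).values)   # per-sample max softmax confidence
    return torch.cat(confs).mean()

def ood_gap(model, in_loader, out_loader):
    # Eq. (12): mean confidence in-distribution minus mean confidence on OOD data
    return mean_confidence(model, in_loader) - mean_confidence(model, out_loader)
        </preformat>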
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>
          In this section, we are going to comment on the results we obtained by comparing the diferent
geometries.
Full-Data Setting Our comparisons cover a diverse set of datasets, providing insights across diferent
number of classes and types of data. In particular, CIFAR-10 and CIFAR-100 are well-studied benchmark
datasets in CV tasks. They have also been used in works that leverage hierarchical information [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ],
but their latent hierarchy lacks formal structure [24]. CIFAR-10 is primarily used for experiments
involving a small number of classes, rather than for its hierarchical structure. Table 1 shows the image
classification performance across the different geometries in the full-data setting. We can observe that
in the full-data setting, the Euclidean geometry achieves slightly better performances for CIFAR-10,
while the Poincaré geometry achieves better performances for CIFAR-100, in terms of accuracy. This is
in line with other works that show hyperbolic spaces achieving good results with a large number of
classes [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>On the other hand, Aircraft and CUB are usually employed to benchmark fine-grained image
classification tasks, as their labels follow an explicit hierarchy. In particular, CUB serves as an interesting testbed
due to its large number of classes (200). We can see from Table 1 that the Euclidean geometry achieves
remarkable performances on both datasets. This likely suggests that, without explicit hierarchical
regularization, non-Euclidean geometries fail to leverage the underlying structure, resulting in lower
overall accuracy. As for the comparison between non-Euclidean geometries, Poincaré achieves good
results on CIFAR-100 and CUB, while Lorentz geometry demonstrates promising performance in OOD
detection.</p>
        <p>Ablation studies Since the prototypes play a crucial role, we believe it is important to take a deeper
look into their optimization process. In particular, for the Hyperspherical geometry, we conducted
experiments using both the Riemannian optimizer RSGD (11), which performs gradient steps via the
exponential map on the manifold, and the standard SGD optimizer. In the latter case, we adopt a common
alternative that replaces the exponential map with a retraction, a smooth mapping R : T_p ℳ → ℳ
that provides a first-order approximation of the exponential map. In this case, the retraction is the
projection (1). As shown in Figure 4, in our experiments, the retraction-based SGD consistently
outperforms RSGD across all four datasets, achieving up to a 10% improvement on Aircraft. This
performance gap offers an insightful comparison between the two optimization strategies, particularly
highlighting the importance of properly updating the prototypes, as discussed in detail in subsection 4.3.
Although the two methods share the same step size, the key difference lies in the operation used to
project the gradient back onto the manifold. While this may appear to be a minor variation, we observed
that the prototypes tended to move more when using the retraction, potentially enabling a more flexible
adaptation to the data distribution in the embedding space. In Figure 5a we show the average distance
between prototypes at consecutive epochs. We clearly see a decreasing trend, common to every
geometry, but we also observe that with RSGD the prototypes converge faster, while SGD shows higher
values until the end, suggesting that the prototypes are still moving while the model has converged.
Our evaluation provides interesting insights on the practical impact of this choice; however, further
investigation is required to draw more definitive conclusions.</p>
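        <p>For clarity, the two update rules compared in this ablation can be sketched for the hypersphere as
follows (our own minimal formulation; rgrad denotes the Riemannian gradient of Eq. (8)):</p>
        <preformat>
import torch

def rsgd_step(p, rgrad, lr):
    # Exact RSGD step, Eq. (11): move along the geodesic via the exponential map
    v = -lr * rgrad
    n = torch.clamp(torch.norm(v), min=1e-12)
    return torch.cos(n) * p + torch.sin(n) * v / n

def retraction_step(p, rgrad, lr):
    # Retraction-based step: Euclidean move followed by projection (1)
    q = p - lr * rgrad
    return q / torch.norm(q)
        </preformat>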
        <p>As a further investigation, we examined more closely the efect of using a separate optimizer for
the prototypes across all geometries. In particular, in our experiments, we usually employed RSGD
with the same parameters used for the neural network (e.g., learning rate 0.1). However, the prototypes
may require more careful updates. Therefore, we also tried a separate optimizer (RSGD) with a smaller
update magnitude (learning rate 0.001, weight decay 0.0001). This investigation resulted in
better performance only for the Euclidean geometry applied to the CUB dataset and for the Poincaré
geometry across all datasets, suggesting that the optimization in the case of the Poincaré ball requires
careful design and tuning.</p>
        <p>Figure 5: (a) cosine distance between prototypes at consecutive epochs; (b) impact of shrink
initialization and separate prototype optimization in the Poincaré geometry.</p>
        <p>For this reason, as mentioned in subsection 4.3, we also evaluated the impact of shrinking the
initialization of the prototypes for the Poincaré ball. In Figure 5b, we present our ablation study,
showing the effectiveness of shrinking the prototypes and using a separate optimizer in the case of
the Poincaré geometry. Interestingly, the shrinkage initialization affects the datasets differently: it has
a positive impact on CIFAR-10 and CIFAR-100, but a negative one on Aircraft and CUB. In contrast,
the additional separate optimization step (denoted with an asterisk ∗) consistently yields the best
performance across all datasets. This further highlights the sensitive and crucial role of prototype
optimization in non-Euclidean geometries. The results shown in Table 1 for Poincaré include shrinking
initialization and a separate optimizer for the prototypes.</p>
        <p>
          To conclude our discussion on the key factors affecting the learning of Euclidean and non-Euclidean
geometries, we analyzed the impact of temperature scaling across the four geometries and the different
datasets. This parameter is particularly crucial for non-Euclidean geometries, where distances can
grow exponentially [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. We performed a fine-tuning procedure over four temperature values, τ ∈
{1, 0.1, 0.05, 0.01}, evaluating the behavior of each geometry across all datasets.
        </p>
        <p>As an illustrative example, we report in Figure 7 the distinct impact of temperature on the four
geometries for the CUB dataset. Overall, we observe significant instability at τ = 0.01, and an interesting
opposite behavior between Euclidean and non-Euclidean geometries, especially pronounced for the
Hyperspherical geometry. Except for CIFAR-10, where results remain relatively robust across diverse
temperatures, for all other datasets, lowering the temperature leads to worse performance in Euclidean
space, while Lorentz, Poincaré, and Hyperspherical geometries tend to benefit from sharper softmax
outputs.</p>
        <p>Robustness As previously stated, we have also tested the geometries in terms of robustness, with
respect to OOD and adversarial robustness. In Table 1, we report the global average performance in
terms of OOD detection.</p>
        <p>On CIFAR-10 and CUB the results in terms of OOD detection show limited performance, despite the
fact that they represent very diferent multiclass classification tasks, with 10 and 200 classes respectively.
Interestingly, when comparing datasets with the same number of classes (e.g., Aircraft and CIFAR-100,
both with 100 classes), we can notice that the highest OOD performance is on Aircraft, showcasing a
good confidence gap on CUB. This suggests that the quality of OOD detection does not depend solely
on the number of classes, and thus on the overall complexity of the classification task, but is also
strongly influenced by the intrinsic nature of the data, such as the type of images, visual characteristics,
intra-class complexity, and other structural factors.</p>
        <p>Figure 6: robustness under PGD attacks on (a) CIFAR-10 and (b) CUB.</p>
        <p>
          Regarding the comparison between geometries, Hyperbolic geometries are usually indicated as strong
OOD detectors [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. However, this does not seem to be the case for CIFAR-100 and Aircraft, while
the leading performance of Lorentz on CIFAR-10 and CUB, as well as generally among non-Euclidean
geometries, is particularly interesting. Since this is the first study introducing the Lorentz geometry as
a competitor, this represents a noteworthy result. The magnitude of Lorentzian distances, combined
with the higher numerical stability of the Lorentz geometry, likely enables better performance in terms
of OOD detection.
        </p>
        <p>An additional metric used to compare the different geometries is adversarial robustness, i.e., a model’s
ability to maintain its performance even when subjected to adversarial examples. Adversarial examples
are inputs that have been slightly perturbed in ways that are often imperceptible to humans but can
lead the model to make incorrect predictions. The Projected Gradient Descent (PGD) attack is one
of the most common methods for generating adversarial examples and evaluating the robustness of
machine learning models. PGD is widely recognized as a strong attack; if a model is robust against
PGD, it is generally considered robust against other types of adversarial attacks as well [34]. Figure 6a
and Figure 6b illustrate the varying degrees of robustness across geometries on CIFAR-10 and CUB.
The different values of ε on the x-axis represent the increasing magnitude of the perturbation applied to
the input data. The smaller the impact on model accuracy as the perturbation increases, the greater the
robustness of the model. We can see from Figure 6a that the Lorentz geometry clearly outperforms all
other geometries on CIFAR-10, while achieving comparable results on CUB (see Figure 6b). This shows
that the Lorentz geometry is a possibly robust competitor to the Euclidean geometry when
the number of classes is low. Further investigations will be conducted in future works to assess the
properties offered by this geometry.</p>
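        <p>For reference, a minimal L∞ PGD sketch in the spirit of [34] is shown below; the step size and the
number of steps are illustrative assumptions, not the exact values used in our experiments.</p>
        <preformat>
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=1.6 / 255, alpha=0.4 / 255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the eps-ball around x
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv
        </preformat>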
        <p>Few-Data Setting Table 2 presents the classification performance across different geometries
in the few-data setting, considering 5, 15, and 30 training examples per class. For consistency, we
retained the optimal temperature values determined in the full-data setting for each geometry and
dataset. Additionally, we preserved the shrink initialization toward the origin for the experiments using
the Poincaré geometry. Given the increased difficulty of the few-data scenario, we employed a ResNet18
model pretrained on ImageNet, fine-tuning the entire network.</p>
        <p>Even under this challenging setup, the Euclidean geometry continues to outperform the other
geometries in nearly all scenarios. Overall, CIFAR-10 and CIFAR-100 show comparable performance
across geometries for all three few-data settings. In contrast, for the more fine-grained datasets
Aircraft and CUB, especially in the most constrained settings with only 5 or 15 examples per class,
non-Euclidean geometries, particularly Lorentz and Poincaré, suffer from a more pronounced drop in
accuracy.</p>
        <p>
          Training with a small amount of data is, of course, a more challenging scenario. In this context, the
magnitude of the non-Euclidean geometries might help by sharpening the data distributions, adding
stronger penalties, and potentially improving generalization. It is worth mentioning that our setting is
different from few-shot learning, although it leads to similar conclusions. In particular, we adopted the
setting from [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], and we can observe that the Euclidean geometry outperforms the other geometries.
As shown in Table 2, performance increases with the number of examples, as expected. In this context,
the number of classes is particularly crucial, as it directly affects the total number of examples. In
fact, performance on CUB remains comparable to the full dataset, and even improves when using 15
examples per class, thanks to the fine-tuning of the network on ImageNet. The only case where the
best geometry is not the Euclidean one is with CIFAR-10, with 15 examples per class. However, the
gain in performance is modest.
        </p>
        <p>Regarding OOD detection, we can see that on CIFAR-10 all the geometries show poor performances,
with very high confidence observed in both cases. This suggests that, since the model has been trained
on few data and the number of classes is small, it tends to predict OOD
samples with high confidence. However, for this dataset the Hyperspherical geometry shows leading
performances. For the other datasets, with 5 examples per class the leading method is usually the
Euclidean one, paired with a non-Euclidean one. In particular, for CIFAR-100 Euclidean and Poincaré
perform the same, while on Aircraft Euclidean, Hyperspherical and Lorentz perform the best. Finally, on
CUB, the Lorentz geometry achieves the best results. This is an interesting behavior, but the performance
gains are always very modest. Among the non-Euclidean geometries, Lorentz consistently performs
the best in this setting.</p>
        <p>Figure 7: temperature ablation with τ ∈ {1, 0.1, 0.05, 0.01} on CUB, reporting accuracy (%) for ECL,
HPS, LOR and POI in the full-data setting and in the few-data setting.</p>
        <p>Table 2: average test accuracy and OOD detection over 3 runs with their standard deviation in the
few-data setting. In bold we indicate the best result.</p>
        <p>We also analyzed the robustness of models trained in the few-data setting with only 5 examples per
class. No significant diferences in robustness were observed across the four geometries. However,
an interesting pattern emerges when comparing the robustness of models on CIFAR-100 and Aircraft
datasets, as shown in Figure 8a and Figure 8b. Despite the fact that all four geometries achieve
significantly higher classification accuracy on Aircraft compared to CIFAR-100, the models trained on
CIFAR-100 are noticeably more robust. In contrast, the accuracy on Aircraft drops below 10% under
adversarial perturbations for all geometries. This is likely due to the nature of PGD attacks, which
operate at the pixel level. On small, low-resolution images like those in CIFAR-100, even relatively
strong perturbations affect fewer global visual features. On high-resolution images like those in Aircraft,
the same perturbations are more spatially concentrated and thus more visually disruptive, leading to a
much greater degradation in performance.</p>
        <p>Finally, we analyzed the impact of diferent temperature values in the few-data setting, focusing on
the most challenging scenario with only 5 examples per class. As shown in Figure 7, we observe a similar
trend as in the full-data setting: Euclidean geometry behaves inversely compared to non-Euclidean
ones, with the Hyperspherical geometry showing the most pronounced contrast. However, differently
from the full-data setting, Lorentz and Poincaré show a decreasing trend as the temperature decreases.
This further highlights the importance of properly fine-tuning this parameter.</p>
        <p>Figure 8: robustness under PGD attacks in the few-data setting on (a) CIFAR-100 and (b) Aircraft.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this study, we conducted a thorough evaluation of four different geometries within a PL framework for
image classification, considering both standard and few-data settings. While non-Euclidean geometries
demonstrated competitive performance in a few scenarios, our results suggest that standard optimization
techniques leveraging Euclidean geometry continue to represent the state-of-the-art in the settings we
explored.</p>
      <p>
        These findings align with [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], who observed that Poincaré embeddings offered only modest gains
in the few-shot learning setting, and that strong results could still be achieved using conventional
Euclidean approaches.
      </p>
      <p>Our work extends this conclusion by evaluating a broader range of geometries and experimental
settings. We find that non-Euclidean methods often require carefully tailored algorithms and data
with specific latent structures to outperform Euclidean methods. This study enriches the comparative
landscape of geometric methods and highlights the limitations and potential of each approach. In doing
so, it provides a solid testbed for developing future non-Euclidean techniques that can surpass current
Euclidean-based models.</p>
      <p>We believe that uncovering and leveraging the latent structure of data is crucial to achieving high
performance with minimal supervision. Continued exploration in this direction is essential, and we
hope our results stimulate further interest in the research community.</p>
      <p>As future work, there are several promising directions. We are considering adding a hierarchical
prior to investigate whether an explicitly hierarchical arrangement of the embedding space allows
hyperbolic methods to achieve better performance. Another direction we plan to explore is expanding
our evaluation to few-shot learning and to other prototypical methods such as ProtoPNets, which have
shown promise not only in terms of performance but also in model interpretability.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly in order to: grammar
and spelling check, paraphrase and reword. After using these tools/services, the author(s) reviewed and
edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>[22] … sampling, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, 2024, pp. 1891–1903.</p>
      <p>[23] M. Law, R. Liao, J. Snell, R. Zemel, Lorentzian distance learning for hyperbolic representations, in:
International Conference on Machine Learning, PMLR, 2019, pp. 3672–3681.</p>
      <p>[24] A. Bdeir, K. Schwethelm, N. Landwehr, Fully hyperbolic convolutional neural networks for
computer vision, arXiv preprint arXiv:2303.15919 (2023).</p>
      <p>[25] B. Wilson, M. Leimeister, Gradient descent in hyperbolic space, arXiv preprint arXiv:1805.08207
(2018).</p>
      <p>[26] S. Bonnabel, Stochastic gradient descent on Riemannian manifolds, IEEE Transactions on
Automatic Control 58 (2013) 2217–2229.</p>
      <p>[27] G. Bécigneul, O.-E. Ganea, Riemannian adaptive optimization methods, arXiv preprint
arXiv:1810.00760 (2018).</p>
      <p>[28] A. Krizhevsky, Learning multiple layers of features from tiny images,
https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (2009).</p>
      <p>[29] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset,
Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.</p>
      <p>[30] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, A. Vedaldi, Fine-Grained Visual Classification of Aircraft,
Technical Report, 2013. arXiv:1306.5151.</p>
      <p>[31] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.</p>
      <p>[32] M. Kochurov, R. Karimov, S. Kozlukov, Geoopt: Riemannian optimization in PyTorch, 2020.
arXiv:2005.02819.</p>
      <p>[33] M. Aldinucci, S. Rabellino, M. Pironti, F. Spiga, P. Viviani, M. Drocco, M. Guerzoni, G. Boella,
M. Mellia, P. Margara, et al., HPC4AI: an AI-on-demand federated platform endeavour, in: Proceedings
of the 15th ACM International Conference on Computing Frontiers, 2018, pp. 279–286.</p>
      <p>[34] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant
to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiang</surname>
          </string-name>
          , J. Cheng,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          ,
          <article-title>Normface: L2 hypersphere embedding for face verification</article-title>
          ,
          <source>in: Proceedings of the 25th ACM international conference on Multimedia</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1041</fpage>
          -
          <lpage>1049</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          ,
          <article-title>ArcFace: Additive angular margin loss for deep face recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4690</fpage>
          -
          <lpage>4699</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Learning continuous hierarchies in the Lorentz model of hyperbolic geometry</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3779</fpage>
          -
          <lpage>3788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ganea</surname>
          </string-name>
          , G. Bécigneul, T. Hofmann,
          <article-title>Hyperbolic neural networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>De Sa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <article-title>Representation tradeoffs for hyperbolic embeddings</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4460</fpage>
          -
          <lpage>4469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Khrulkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mirvakhabova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ustinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Oseledets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lempitsky</surname>
          </string-name>
          ,
          <article-title>Hyperbolic image embeddings</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6418</fpage>
          -
          <lpage>6428</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghadimi Atigh</surname>
          </string-name>
          , M. Keller-Ressel,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mettes</surname>
          </string-name>
          ,
          <article-title>Hyperbolic busemann learning with ideal prototypes</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>103</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mettes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <article-title>Searching for actions on the hyperbole</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1141</fpage>
          -
          <lpage>1150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamzaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chapel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-T.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lefèvre</surname>
          </string-name>
          ,
          <article-title>Hyperbolic prototypical network for few shot remote sensing scene classification</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          <volume>177</volume>
          (
          <year>2024</year>
          )
          <fpage>151</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marques</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Costeira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hauptmann</surname>
          </string-name>
          ,
          <article-title>Hyperbolic vs Euclidean embeddings in few-shot learning: Two sides of the same coin</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>2082</fpage>
          -
          <lpage>2090</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>The numerical stability of hyperbolic representation learning</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>24925</fpage>
          -
          <lpage>24949</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>Prototypical networks for few-shot learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mettes</surname>
          </string-name>
          , E. Van der Pol, C. Snoek,
          <article-title>Hyperspherical prototype networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fonio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Esposito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aldinucci</surname>
          </string-name>
          ,
          <article-title>Hyperbolic prototypical entailment cones for image classification</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Khan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of The 28th International Conference on Artificial Intelligence and Statistics</source>
          , volume
          <volume>258</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>3358</fpage>
          -
          <lpage>3366</lpage>
          . URL: https://proceedings.mlr.press/v258/fonio25a.html.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.-M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Robust classification with convolutional prototype learning</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3474</fpage>
          -
          <lpage>3482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Somervuo</surname>
          </string-name>
          , T. Kohonen,
          <article-title>Self-organizing maps and learning vector quantization for feature sequences</article-title>
          ,
          <source>Neural Processing Letters</source>
          <volume>10</volume>
          (
          <year>1999</year>
          )
          <fpage>151</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          , G. Chu,
          <article-title>Diagnosis of multiple cancer types by shrunken centroids of gene expression</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>99</volume>
          (
          <year>2002</year>
          )
          <fpage>6567</fpage>
          -
          <lpage>6572</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fonio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Paletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cerrato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ienco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Esposito</surname>
          </string-name>
          , et al.,
          <article-title>Hierarchical priors for hyperspherical prototypical networks</article-title>
          ,
          <source>in: ESANN 2023-Proceedings, ESANN</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>459</fpage>
          -
          <lpage>464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Landrieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S. F.</given-names>
            <surname>Garnot</surname>
          </string-name>
          ,
          <article-title>Leveraging class hierarchies with metric-guided prototype learning</article-title>
          ,
          <source>in: British Machine Vision Conference (BMVC)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Clipped hyperbolic classifiers are super-hyperbolic classifiers</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>van Spengler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Berkhout</surname>
          </string-name>
          , P. Mettes,
          <article-title>Poincaré ResNet</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>5419</fpage>
          -
          <lpage>5428</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Understanding hyperbolic metric learning through hard negative sampling</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1891</fpage>
          -
          <lpage>1903</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Law, R. Liao, J. Snell, R. Zemel, <article-title>Lorentzian distance learning for hyperbolic representations</article-title>, <source>in: International Conference on Machine Learning, PMLR</source>, <year>2019</year>, pp. <fpage>3672</fpage>-<lpage>3681</lpage>.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Bdeir, K. Schwethelm, N. Landwehr, <article-title>Fully hyperbolic convolutional neural networks for computer vision</article-title>, arXiv preprint arXiv:2303.15919 (<year>2023</year>).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] B. Wilson, M. Leimeister, <article-title>Gradient descent in hyperbolic space</article-title>, arXiv preprint arXiv:1805.08207 (<year>2018</year>).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] S. Bonnabel, <article-title>Stochastic gradient descent on Riemannian manifolds</article-title>, <source>IEEE Transactions on Automatic Control</source> <volume>58</volume> (<year>2013</year>) <fpage>2217</fpage>-<lpage>2229</lpage>.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] G. Bécigneul, O.-E. Ganea, <article-title>Riemannian adaptive optimization methods</article-title>, arXiv preprint arXiv:1810.00760 (<year>2018</year>).</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] A. Krizhevsky, <article-title>Learning multiple layers of features from tiny images</article-title>, https://www.cs.toronto.edu/kriz/learning-features-2009-TR.pdf (<year>2009</year>).</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, <article-title>The Caltech-UCSD Birds-200-2011 Dataset</article-title>, Technical Report CNS-TR-2011-001, California Institute of Technology, <year>2011</year>.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, A. Vedaldi, <article-title>Fine-Grained Visual Classification of Aircraft</article-title>, Technical Report, <year>2013</year>. arXiv:1306.5151.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] K. He, X. Zhang, S. Ren, J. Sun, <article-title>Deep residual learning for image recognition</article-title>, <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>, <year>2016</year>, pp. <fpage>770</fpage>-<lpage>778</lpage>.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] M. Kochurov, R. Karimov, S. Kozlukov, <article-title>Geoopt: Riemannian optimization in PyTorch</article-title>, <year>2020</year>. arXiv:2005.02819.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] M. Aldinucci, S. Rabellino, M. Pironti, F. Spiga, P. Viviani, M. Drocco, M. Guerzoni, G. Boella, M. Mellia, P. Margara, et al., <article-title>HPC4AI: an AI-on-demand federated platform endeavour</article-title>, <source>in: Proceedings of the 15th ACM International Conference on Computing Frontiers</source>, <year>2018</year>, pp. <fpage>279</fpage>-<lpage>286</lpage>.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, <article-title>Towards deep learning models resistant to adversarial attacks</article-title>, arXiv preprint arXiv:1706.06083 (<year>2017</year>).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>