<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>with Embedded Lookup Tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marius Aasan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adín Ramírez Rivera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <addr-line>Problemveien 11, 0313 Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SFI Visual Intelligence</institution>
          ,
          <addr-line>P.O. Box 6050 Langnes, 9037 Tromsø</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Pixel-level prediction tasks inherently face constraints imposed by image resolution and the number of prediction classes. As image resolutions and dimensionalities increase, these constraints lead to significant memory bottlenecks when explicitly modeling high-dimensional embeddings for each individual pixel. Addressing these bottlenecks requires the development of alternative, more efficient representations to facilitate continued progress in the field. In this work, we discuss Embedded Lookup Tables (ELUTs) with indexed segmentation maps as an alternative data structure for more memory-efficient representations in image processing. We show that ELUTs are inherently compatible with cost functionals and metrics for pixel-level prediction tasks with a significant reduction in memory overhead.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer Vision</kwd>
        <kwd>Segmentation</kwd>
        <kwd>Representation Learning</kwd>
        <kwd>Data Structures</kwd>
        <kwd>Image Compression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Figure 1: An H × W × d dense representation (a, float32), compared to an H × W indexed segmented representation (b, int64) with an n × d embedded lookup table (c). Our proposition utilizes (b) and (c) to form a sparse representation of (a) with significantly lower memory overhead.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Dense raster tensors are natural representations for convolutional networks (CNNs), whose design
natively consumes and emits features in grids. As discussed in Sec. 1, the demands of higher resolution
images in conjunction with higher embedding dimensions indicate that the 𝒪(Nd) overhead is a
bottleneck. Current state-of-the-art foundational models [14] make use of over 131k pseudo-classes for
instance- and patch-level predictions. Extending this to pixel-level predictions remains computationally
intractable without alternative representations. While open-vocabulary approaches [15, 16] alleviate
the issue by aligning with natural language embeddings instead of fixed classes, they necessitate
high-dimensional embeddings and induce the same bottleneck.</p>
      <sec id="sec-2-1">
        <title>Frame Buffers and Palettes.</title>
        <p>In the formative stages of computer graphics, Kajiya et al. [12]
introduced the modern frame buffer for overcoming limitations of memory and compute. Early renderers
were designed such that each pixel stores an index into a color lookup table (CLUT) with RGB color
representations to enable compact frame buffers and real-time hardware color mapping. Later work
expanded on quantization and dithering for optimizing color representations in imaging [13], paving
the way for further developments with palette-based image quantization and lookups for hardware and
software rendering. Color quantization and CLUTs are actively used for effective photometric transforms
in modern GPU rendering [17] and compression in standard image formats [18, 19].</p>
      </sec>
      <sec id="sec-2-2">
        <title>Superpixels and Objecthood.</title>
        <p>One remedy to the dense bottleneck is to decouple where a feature
lives from what the feature is. If a color lookup table can recover an RGB triplet from an index, the
same pattern generalizes: store a d-dimensional feature vector in the table and let neighboring pixels
that share semantics reuse the index. Superpixels [20] were introduced for this exact purpose, and serve
as a strong indicator for pre-segmentation by partitioning an image into n ≪ N
connected regions.</p>
        <p>By associating each region with an index and each index with a feature embedding, we can construct
image representations that align with semantic content in an image.</p>
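        <p>The palette analogy is direct in code. As a minimal sketch (PyTorch, with hypothetical shapes and variable names of our choosing), a dense feature map is recovered from an index map and a lookup table by a single gather:</p>
        <preformat>
import torch

H, W, n, d = 64, 64, 128, 16          # hypothetical sizes
S = torch.randint(0, n, (H, W))       # index map: each pixel points to a region
E = torch.randn(n, d)                 # lookup table: one d-dim embedding per region

# Recover the dense H x W x d map by indexing the table with the map.
dense = E[S]                          # shape (H, W, d)
assert dense.shape == (H, W, d)
        </preformat>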
      </sec>
      <sec id="sec-2-3">
        <title>Token-based Architectures.</title>
        <p>Research has seen widespread adoption of Vision Transformers [21]
(ViTs), representing a move away from local receptive fields towards attention operators to model
global interaction between discrete tokens. Tokenization has largely been driven by a simplified
patchification procedure; square patches are extracted as tokens with positional embeddings to encode
spatial information. ViTs for pixel-level predictions have focused on patch upsampling and concatenation
of multiple layer activations [22, 23] to yield full resolution predictions. While studies have shown
that smaller patches are beneficial for predictive power [24, 25], the computational complexity of more
granular tokens is quadratic [26, 27] due to the attention operator.</p>
        <p>Recent work has looked to improve on the patch-based tokenization process in a move towards
more adaptive tokenization methods [28, 29]. Adaptive superpixel tokenization has been successfully
applied to allow ViTs to work with pixel-level granularity [30, 31], and has been shown to improve
state-of-the-art open-vocabulary segmentation models [32]. As every region carries an integer index,
the image is now an index map; the per-region embeddings form a lookup table, giving exactly the
palette-style pairs illustrated in Fig. 1.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Convolutions on Discrete Topologies.</title>
        <p>Although CNNs are used for feature extraction and
superpixel tokenization [33], they are not inherently suited to discrete embedding spaces in the way
transformers are. By design, a CNN natively consumes and emits features in grids, necessitating a
fully realized dense representation Z ∈ ℝ^{d×H×W}. However, the fundamental neighborhood operations
performed by a CNN can be extended to discrete topologies via graph neural networks [34, 35, 36]
(GNNs). Superpixel-based GNNs have been successfully applied in medical imaging [37, 38], and the
recent long range graph benchmark [39] (LRGB) includes two superpixel graph datasets for segmentation,
encouraging further research in vision-based GNN approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Embedded Lookup Tables with Indexed Segmentation Maps</title>
      <p>
        Adaptive tokenization in ViTs with semantic edge detection provides an opportunity to rethink
underlying data structures for pixel-level prediction. Because the multi-head self-attention operator is
inherently permutation-invariant, spatial relationships must be injected through positional encodings
rather than being implicit via grid topology. This observation suggests that we no longer need to
commit to a fixed, dense lattice as the primitive representation. Instead, we encode each image as a pair
(S ∈ {1, …, n}^{H×W}, E ∈ ℝ^{n×d}),  (1)
where S(p) assigns each pixel p to one of n superpixel tokens, and E_τ ∈ ℝ^d is the d-dimensional feature
for token τ. In training, a batch of B images {(S_b, E_b)}_{b=1}^{B} is combined by simple concatenation of tables,
S̃_b(p) = S_b(p) + ∑_{a&lt;b} n_a,  Ẽ = [E_1, E_2, …, E_B].  (2)
      </p>
      <p>No padding is required – each region's index range is shifted by the cumulative token counts of preceding
samples. This is the same implementation used in GNNs; variable-size graphs are merged into a single
supergraph by offsetting node indices and merging edge lists, preserving each subgraph's topology
without forcing a common size [40].</p>
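      <p>A minimal sketch of this batching scheme (PyTorch; the helper name and shapes are ours, not a published API) shifts each sample's indices by the cumulative token counts of the preceding tables before concatenating, mirroring the supergraph batching of PyTorch Geometric [40]:</p>
      <preformat>
import torch

def batch_eluts(pairs):
    """Merge [(S_1, E_1), ..., (S_B, E_B)] into one flat (S, E) pair.

    Each S_b holds indices in [0, n_b); tables are stacked, and each
    index map is shifted by the token counts of preceding samples.
    """
    offset, maps, tables = 0, [], []
    for S_b, E_b in pairs:
        maps.append(S_b.flatten() + offset)   # shift into the global index range
        tables.append(E_b)
        offset += E_b.shape[0]                # cumulative token count
    return torch.cat(maps), torch.cat(tables)

# Heterogeneous resolutions and token counts require no padding.
pairs = [(torch.randint(0, 10, (32, 32)), torch.randn(10, 8)),
         (torch.randint(0, 25, (48, 40)), torch.randn(25, 8))]
S, E = batch_eluts(pairs)
assert E.shape == (35, 8) and S.max() &lt; 35
      </preformat>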
      <p>Conceptually, ELUTs can be viewed as an extension of indexed color representations in early frame
buffers; S is a two-dimensional index map and E is a lookup table of learned embeddings. The memory
savings follow immediately; instead of storing H × W × d dense features, we store only n × d entries plus
an H × W integer map. As long as the segmentation captures spatial redundancy, this yields compression
of pixel-level features, with support for batching of heterogeneous resolutions and token counts.</p>
      <sec id="sec-3-1">
        <title>3.1. Implementation Details</title>
        <p>An ELUT representation is straightforward to apply in ViTs, since token representations correspond
directly to the embeddings in the lookup table. Any tokenization process that partitions the image
into connected regions can be applied to extract segmentation maps S^(i). Different approaches can be
applied for feature extraction; interpolation to a fixed feature size via bounding-boxes in addition to a
joint histogram for positional embeddings provides commensurability with existing ViT approaches [30].
Alternatively, a vision encoder such as a lightweight CNN can be applied for feature extraction to
construct a dense feature map Z^(i) ∈ ℝ^{d×H′×W′} [31, 32], and features can be extracted by taking
E^(i)_τ = ⨁_{S^(i)(p)=τ} Z^(i)_p, where ⨁ denotes a permutation-invariant aggregation over the pixels in
region τ. Superpixel tokenizers such as SLIC [41] and watershed methods [42] partition images into regions that
align with semantic boundaries. SPiT [30] applies ground-up community detection mechanisms using a
graph-based hierarchical region merging [43].</p>
        <p>Parallel computations. Interestingly, watershed and graph region merging approaches give two
different approaches to flood filling [44, 45]. Watershed applies filling based on a set of seeds given a
gradient image, while graph region merging proceeds by processing all pixels uniformly in parallel while updating
the graph representation. Specifically, both approaches can be viewed as optimal spanning forests [46,
47] and are equivalent to optimizing a random walk over the graph [48]. Efficient implementations in
PyTorch [49] can be constructed without external dependencies or custom CUDA kernels [30].</p>
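        <p>To make graph region merging concrete, the following is a deliberately simplified stand-in (NumPy union-find on a 4-connected grid; the merge criterion and all names are ours, not the SPiT procedure), merging neighboring pixels whose intensity difference falls below a threshold:</p>
        <preformat>
import numpy as np

def merge_regions(img, tau):
    """Toy graph region merging on a 4-connected grid via union-find."""
    H, W = img.shape
    parent = np.arange(H * W)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for y in range(H):
        for x in range(W):
            i = y * W + x
            for j in (i + 1 if x + 1 &lt; W else None,     # right neighbor
                      i + W if y + 1 &lt; H else None):    # bottom neighbor
                if j is not None and abs(float(img.flat[i]) - float(img.flat[j])) &lt; tau:
                    parent[find(i)] = find(j)           # union the two regions

    roots = np.array([find(i) for i in range(H * W)])
    _, S = np.unique(roots, return_inverse=True)        # relabel regions to 0..n-1
    return S.reshape(H, W)

S = merge_regions(np.random.rand(16, 16), tau=0.1)      # an indexed segmentation map
        </preformat>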
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Space Complexity Analysis</title>
        <p>Let N = H × W denote the number of pixels in an image, d the feature dimension, and n ≤ N the number
of regions. We compare the memory footprint of a standard dense feature map
M_dense = N × d,  (3)
against that of an ELUT representation, which is given by
M_ELUT = N + n × d,  (4)
where the first term accounts for the index map S ∈ {1, …, n}^{H×W} and the second for the table E ∈ ℝ^{n×d}.
We can express the relative memory usage as
M_ELUT / M_dense = (N + nd) / (Nd) = 1/d + n/N = 1/d + ρ,  ρ ≡ n/N ∈ (0, 1].  (5)
Given this, ELUTs provide an overall memory reduction factor
γ = M_dense / M_ELUT = 1 / (ρ + 1/d).  (6)</p>
        <p>Sufficiently high-dimensional embeddings (d ≫ 1) give a "rule-of-thumb" estimate
γ ≈ ρ^{−1}.  (7)
The savings can be illustrated by the following example; if the regions compress to ρ = 0.05 of the pixel count, ELUTs
use only ∼5% of the dense footprint – a ∼20× savings. For an extended empirical analysis, see Sec. 4.1.</p>
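        <p>The reduction factor of Eq. (6) is easy to verify numerically; a small sketch counting stored entries as in Eqs. (3) and (4), with the example ratio above and d = 768 as used in Sec. 4.1:</p>
        <preformat>
def reduction_factor(N, n, d):
    """gamma = M_dense / M_ELUT = 1 / (rho + 1/d), counting stored entries."""
    return (N * d) / (N + n * d)

# Example from the text: regions compress to 5% of the pixel count.
N, d = 224 * 224, 768                     # hypothetical resolution and dimension
gamma = reduction_factor(N, n=int(0.05 * N), d=d)
print(f"gamma = {gamma:.1f}x")            # ~19.5x, near the 1/rho = 20x rule of thumb
        </preformat>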
        <p>Worst-case vs. best-case. For n = N (no compression), ρ = 1 and M_ELUT / M_dense = 1 + 1/d,
hence ELUTs incur negligible extra overhead in high-dimensional regimes. In the ideal limit n ≪ N,
memory scales as 𝒪(N + nd) ≪ 𝒪(Nd). This demonstrates how ELUTs improve on the 𝒪(Nd)
bottleneck of dense tensor representations, trading off a linear-in-pixels index map for a much smaller,
segmentation-driven embedding table.</p>
        <p>Batch Processing. Eq. (2) shows that ELUT batching can be implemented by table concatenation
without overhead. For a batch of B images with pixel counts N_b and token counts n_b,
M^(B)_dense = d ∑_b N_b,  M^(B)_ELUT = ∑_b N_b + d ∑_b n_b,  (8)
and the same ratio analysis applies, giving
ρ_B = ∑_b n_b / ∑_b N_b.  (9)
Thus an ELUT incurs 𝒪(N + nd + nd) = 𝒪(N + 2nd) = 𝒪(N + nd)
work per image, whereas a dense raster evaluates the same objective in 𝒪(Nd). As with the space
complexity, for ρ = 1 (no compression) a dense representation outperforms ELUTs, but since
𝒪(N + Nd) = 𝒪(N(d + 1)) = 𝒪(Nd), the representations technically have the same complexity.
However, for the expected case ρ ≪ 1 we see a significant reduction in compute for ELUT
representations, tied to the reduction factor γ. In a practical modeling setting, the sparsity of the
contingency table C (Sec. 3.3) also plays a role, conditional on how well S aligns with the ground
truth Y.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Equivalence of Cost functionals and Metrics</title>
        <p>An ELUT representation (S, E) allows for standard computation of cost metrics – such as MSE, cross-entropy,
or focal loss – as well as metrics such as the Jaccard index (IoU) or Dice coefficients. Given a set of predicted
regions S : Ω → {1, …, n} as well as ground-truth labels Y : Ω → {1, …, K}, we define region sizes and
the region–class contingency counts
ν_r = ∑_{p∈Ω} 1[S(p)=r],  C_rk = ∑_{p∈Ω} 1[S(p)=r, Y(p)=k],  (10)
where 1[⋅] denotes the Iverson bracket. Any loss or metric expressible as a sum over pixels
ℒ = ∑_{p∈Ω} ℓ(E_{S(p)}, Y(p))
can be rewritten exactly as a weighted sum over the contingency table
ℒ = ∑_{r=1}^{n} ∑_{k=1}^{K} C_rk ℓ(E_r, k),  (11)
recalling that E_r is the embedding or logits emitted by token r and k is a target embedding or class
index. Denote by 𝒮 = {(r, k) : C_rk &gt; 0} the set of non-empty region–class overlaps. We can then take
(11) as an identity; it trades N pixel visits for the at most m = |𝒮| non-empty overlaps.</p>
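        <p>A minimal sketch of identity (11) for cross-entropy (PyTorch; all shapes and names are hypothetical): the contingency table is built with a single bincount over fused indices, after which the pixel-mean loss is evaluated over region–class overlaps instead of pixels:</p>
        <preformat>
import torch
import torch.nn.functional as F

n, K, N = 50, 10, 64 * 64            # regions, classes, pixels
S = torch.randint(0, n, (N,))        # predicted region per pixel
Y = torch.randint(0, K, (N,))        # ground-truth class per pixel
E = torch.randn(n, K)                # per-region logits (the lookup table)

# One O(N) pass: C[r, k] = number of pixels with S(p) = r and Y(p) = k.
C = torch.bincount(S * K + Y, minlength=n * K).reshape(n, K).float()

# Dense evaluation: gather logits per pixel, average the pixel losses.
dense = F.cross_entropy(E[S], Y)

# ELUT evaluation: weight the per-(region, class) losses by the counts.
logp = F.log_softmax(E, dim=-1)      # (n, K) log-probabilities
elut = -(C * logp).sum() / N         # Eq. (11) with mean reduction

assert torch.allclose(dense, elut, atol=1e-5)
        </preformat>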
        <p>Batch Processing. Computing cost functionals and metrics in a mini-batch regime requires only
a slight modification for computing contingency tables, where each ground-truth segment needs to
be unique within each sample. This can be achieved by concatenating Ỹ = [Y_b + (b − 1)K]_{b=1}^{B}, such
that the original targets can be recovered using a simple modulo operation to determine the global
frequency for the final contingency table.</p>
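        <p>A sketch of the batched contingency computation (PyTorch; the function and argument names are ours): labels are shifted into disjoint per-sample ranges before fusing, and a modulo recovers the class index when accumulating the global table:</p>
        <preformat>
import torch

def batched_contingency(S, Y, sample_id, n, K, B):
    """Single-pass contingency counts over a merged batch of B samples.

    S: global region indices, already offset as in Eq. (2).
    Y: per-pixel class labels in {0, ..., K-1}.
    sample_id: index of the sample each pixel belongs to.
    """
    Y_shift = Y + sample_id * K                   # unique label range per sample
    fused = S * (B * K) + Y_shift                 # fuse (region, shifted label)
    counts = torch.bincount(fused, minlength=n * B * K)
    idx = counts.nonzero().squeeze(-1)            # visit only non-empty overlaps
    r = idx // (B * K)
    k = (idx % (B * K)) % K                       # modulo recovers the class index
    C = torch.zeros(n, K)
    C.index_put_((r, k), counts[idx].float(), accumulate=True)
    return C

# Two samples, already merged as in Eq. (2): regions 0-1 and 2-3 respectively.
S = torch.tensor([0, 0, 1, 2, 2, 3]); Y = torch.tensor([1, 1, 0, 1, 0, 0])
sid = torch.tensor([0, 0, 0, 1, 1, 1])
C = batched_contingency(S, Y, sid, n=4, K=2, B=2)
        </preformat>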
        <sec id="sec-3-3-1">
          <title>Complexity reduction.</title>
          <p>
            Building the contingency table C requires a single 𝒪(N) scan of the pixels;
each visit increments the counter (r, k) ↦ C_rk + 1. After that pass, any additive loss or metric of the form
(11) touches only the m non-zero overlaps, plus the n region marginals. However, the sparsity of C is
contingent on how well S matches Y. Given that K &lt; n &lt; N, the worst case yields
m = nK,  (12)
i.e., every region overlaps every class.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>Our experiments are designed to demonstrate feasibility for ELUT representations. In Sec. 4.1 we
perform an empirical study on complexity, showing effective reduction in memory and computational
costs, verifying derivations from Sections 3.2 and 3.3. We then demonstrate the effectiveness of
hierarchical image representations in a simple lossless compression scheme.</p>
      <sec id="sec-4-1">
        <title>4.1. Empirical Complexity Analysis</title>
        <p>We compute empirical estimates of expected region counts over ImageNet [50] using an adaptive
tokenization framework with m ∈ {1, 2, 3, 4} merge iterations [30]. Table 1 shows that even at conservative
levels (m = 1), ELUTs quarter the memory footprint, while deeper merges further compress the feature
map without altering pixel-wise losses or metrics – cf. Sec. 3.3.</p>
        <p>Next, we compute memory overhead and wall-clock throughput for variable sized batches for images
sampled from COCO [51]. We compare a dense feature representation extracted via upscaling [52] with
d = 768 and H = W = 224 to an ELUT representation. Features are extracted via the same backbone
over the same batch of images, so the only difference is their representation.</p>
        <p>We compute equivalent formulations of cross-entropy for dense and ELUT representations, and
report results in Tab. 2. While our results do not verify that ELUTs provide significant empirical
computational benefits in terms of throughput, the reduction in memory overhead remains significant. The
loss is evaluated by gathering the n region logits from one contiguous lookup table and weighting them
with the m sparse overlaps, so the only overhead comes from scatter/gather operations.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Information Content and Compression</title>
        <p>For a data structure to be effective, it should ideally result in low entropy representations. Empirically
validating the informational content of general ELUTs is non-trivial, as the representation is highly
contingent on the choice of tokenizer and features, which are typically task-dependent.</p>
        <p>We construct a representation independent of feature representations by using a superpixel tokenizer
and compress images using a hierarchical pyramid representation. Each pixel is associated with a parent
region using the SPiT tokenizer [30] to form a rooted tree, i.e., we aggregate regions until only a single
node remains. We reuse the very same SPiT partitions that the model computes during training. We
then construct a graph Laplacian pyramid by representing all non-root RGB embeddings mapped to
YCbCr with a difference transform over the depth of the tree. Since a region is typically similar to its
parent, this provides a sparse representation, particularly in leaf nodes representing individual pixels.</p>
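        <p>A minimal sketch of the parent-difference transform (NumPy; the tree is abstracted as a parent array, and the names are ours – the tokenizer supplies the actual hierarchy):</p>
        <preformat>
import numpy as np

def difference_transform(values, parent):
    """Residual of each node against its parent in a rooted tree.

    values: (num_nodes, 3) YCbCr embeddings; parent[i] is node i's
    parent, with the root pointing to itself. Since a region tends to
    resemble its parent, residuals are near zero, which is what the
    downstream entropy coder exploits.
    """
    root = parent == np.arange(len(parent))
    residual = values - values[parent]
    residual[root] = values[root]        # keep the root's absolute value
    return residual

def inverse_transform(residual, parent, depth_order):
    """Reconstruct values by accumulating residuals from root to leaves."""
    values = residual.copy()
    for i in depth_order:                # parents must precede children
        if parent[i] != i:
            values[i] += values[parent[i]]
    return values

# Tiny example: a chain 0 &lt;- 1 &lt;- 2 (node 0 is the root).
vals = np.array([[100.0, 0.0, 0.0], [101.0, 0.0, 1.0], [103.0, 1.0, 1.0]])
par = np.array([0, 0, 1])
rec = inverse_transform(difference_transform(vals, par), par, depth_order=[0, 1, 2])
assert np.allclose(rec, vals)
        </preformat>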
        <p>To encode the overall structure, we extract the Prüfer sequence of the graph [54] along with the
ELUT. In conjunction with a Huffman encoding scheme, this yields a basic lossless compression format,
useful for evaluating the informational capacity of ELUT representations. We measure the information
density by evaluating the compression ratio (CR) and bits per pixel (BPP), contrast them with standard
image formats, and report them in Tab. 3. Surprisingly, this simple strategy yields better compression
than PNG, the industry standard. We note that while this is by no means a state-of-the-art approach, it
shows that hierarchical ELUTs provide highly effective image representations.</p>
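        <p>For reference, both reported quantities reduce to simple ratios; a sketch with hypothetical byte counts for an 8-bit RGB image:</p>
        <preformat>
def compression_stats(compressed_bytes, H, W, channels=3, bits_per_channel=8):
    """Compression ratio (CR) and bits per pixel (BPP) for an H x W image."""
    raw_bits = H * W * channels * bits_per_channel
    bpp = 8 * compressed_bytes / (H * W)
    cr = raw_bits / (8 * compressed_bytes)
    return cr, bpp

cr, bpp = compression_stats(compressed_bytes=60_000, H=512, W=768)
        </preformat>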
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>In this work, we introduce Embedded Lookup Tables as extensions of well-established representations in
computer graphics to machine learning settings. We show that by factorizing an H × W × d map into
an integer index image S and a compact table E ∈ ℝ^{n×d}, ELUTs replace the 𝒪(Nd) memory wall
with 𝒪(N + nd). Our experiments provide empirical evidence showing how ELUTs provide an
effective alternative in practical modeling tasks requiring pixel-level granularity. Since objectives and
metrics can be rewritten using contingency counts, arithmetic shrinks roughly in proportion to the
factor ρ = n/N.</p>
      <p>While our results are not contingent on a single tokenization framework, future work will
incorporate more extensive studies on the effect on training in dense prediction tasks across multiple
tokenization frameworks. While our scope is limited to 2D imaging in this work, our theoretical results naturally
extend to higher dimensional imaging. Extensions to volumetric data, light-field stacks, and video
frames should benefit even further as the dense baseline grows with an extra dimension while the index
map stays tractable.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Computations were performed on resources provided by Sigma2 (NRIS, Norway), Project NN8104K. We
acknowledge Sigma2 for awarding access to the LUMI supercomputer, owned by the EuroHPC Joint
Undertaking, hosted by CSC (Finland) and the LUMI consortium through Sigma2, Norway, Project
no. 465001382. This work was funded in part by the Research Council of Norway, via the Visual
Intelligence Centre for Research-based Innovation (grant no. 309439), and Consortium Partners.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o and Writefull for grammar and spelling
checks and formatting assistance.<sup>1</sup> After using these services, the authors reviewed and edited the
content as needed and take full responsibility for the publication's content.</p>
      <p><sup>1</sup>Following the taxonomy from the CEUR-WS Policy.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[19] Adobe Systems Inc., TIFF revision 6.0 specification, Online, 1992. URL: http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf, originally published by Aldus Corporation as TIFF 6.0.</p>
      <p>[20] X. Ren, J. Malik, Learning a classification model for segmentation, in: IEEE Inter. Conf. Comput. Vis. (ICCV), 2003, pp. 10–17 vol. 1. doi:10.1109/ICCV.2003.1238308.</p>
      <p>[21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: Inter. Conf. Learn. Represent. (ICLR), 2021.</p>
      <p>[22] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. S. Torr, L. Zhang, Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: IEEE/CVF Inter. Conf. Comput. Vis. Pattern Recog. (CVPR), 2021.</p>
      <p>[23] R. Ranftl, A. Bochkovskiy, V. Koltun, Vision transformers for dense prediction, in: IEEE Inter. Conf. Comput. Vis. (ICCV), 2021.</p>
      <p>[24] F. Wang, Y. Yu, G. Wei, W. Shao, Y. Zhou, A. Yuille, C. Xie, Scaling laws in patchification: An image is worth 50,176 tokens and more, 2025. URL: https://arxiv.org/abs/2502.03738, arXiv preprint arXiv:2502.03738.</p>
      <p>[25] R. Mojtahedi, M. Hamghalam, R. K. G. Do, A. L. Simpson, Towards optimal patch size in vision transformers for tumor segmentation, in: International Workshop on Multiscale Multimodal Medical Imaging (MMMI 2022), 2023. doi:10.1007/978-3-031-18814-5_11.</p>
      <p>[26] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, A. Weller, Rethinking attention with performers, in: Inter. Conf. Learn. Represent. (ICLR), 2021. arXiv preprint arXiv:2009.14794.</p>
      <p>[27] Y. Han, Z. Xu, X. Shang, L. Wang, Y. Yang, S. Li, X. Liu, B. Wu, J. Bai, et al., FLatten transformer: Vision transformer using focused linear attention, in: IEEE Inter. Conf. Comput. Vis. (ICCV), 2023.</p>
      <p>[28] J. D. Havtorn, A. Royer, T. Blankevoort, B. E. Bejnordi, MSViT: Dynamic mixed-scale tokenization for vision transformers, in: IEEE Inter. Conf. Comput. Vis. Wksps. (ICCVW), 2023, pp. 838–848.</p>
      <p>[29] T. Ronen, O. Levy, A. Golbert, Vision transformers with mixed-resolution tokenization, in: IEEE/CVF Inter. Conf. Comput. Vis. Pattern Recog. (CVPR), IEEE Computer Society, 2023. doi:10.1109/CVPRW59228.2023.00486.</p>
      <p>[30] M. Aasan, O. Kolbjørnsen, A. Schistad Solberg, A. Ramírez Rivera, A spitting image: Modular superpixel tokenization in vision transformers, in: European Conf. Comput. Vis. Wksps. (ECCVW), 2024.</p>
      <p>[31] J. Lew, S. Jang, J. Lee, S.-K. Yoo, E. Kim, S. Lee, J.-Y. C. J.-H. M. Y.-I. Mok, S. Kim, S. Yoon, Superpixel tokenization for vision transformers: Preserving semantic integrity in visual tokens, ArXiv abs/2412.04680 (2024). URL: https://api.semanticscholar.org/CorpusID:274581264.</p>
      <p>[32] D. Chen, S. Cahyawijaya, J. Liu, B. Wang, P. Fung, Subobject-level image tokenization, ArXiv abs/2402.14327 (2024). URL: https://api.semanticscholar.org/CorpusID:267782983.</p>
      <p>[33] V. Jampani, D. Sun, M.-Y. Liu, M.-H. Yang, J. Kautz, Superpixel sampling networks, in: European Conf. Comput. Vis. (ECCV), 2018.</p>
      <p>[34] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Inter. Conf. Learn. Represent. (ICLR), 2017. URL: https://openreview.net/forum?id=SJU4ayYgl.</p>
      <p>[35] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: Inter. Conf. Learn. Represent. (ICLR), 2019. URL: https://openreview.net/forum?id=ryGs6iA5Km.</p>
      <p>[36] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, in: Inter. Conf. Learn. Represent. (ICLR), 2018. URL: https://openreview.net/forum?id=rJXMpikCZ.</p>
      <p>[37] C. Playout, Z. Legault, R. Duval, M. C. Boucher, F. Cheriet, A region-based approach to diabetic retinopathy classification with superpixel tokenization, in: Inter. Conf. Med. Imag. Comput.-Assist. Interv. (MICCAI), volume LNCS 15005, Springer Nature Switzerland, 2024.</p>
      <p>[38] Y. Lee, J. H. Park, S. Oh, K. Shin, J. Sun, M. Jung, C. Lee, H. Kim, J.-H. Chung, K. C. Moon, et al., Derivation of prognostic contextual histopathological features from whole-slide images of tumours via graph deep learning, Nature Biomedical Engineering (2022) 1–15.</p>
      <p>[39] V. P. Dwivedi, L. Rampášek, M. Galkin, A. Parviz, G. Wolf, A. T. Luu, D. Beaini, Long range graph benchmark, in: Adv. Neural Inf. Process. Sys. (NeurIPS), 2022. URL: https://openreview.net/forum?id=in7XC5RcjEn.</p>
      <p>[40] M. Fey, J. E. Lenssen, Fast graph representation learning with PyTorch Geometric, in: Inter. Conf. Learn. Represent. Wksps. (ICLRW), New Orleans, USA, 2019. URL: https://arxiv.org/abs/1903.02428.</p>
      <p>[41] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 2274–2282. doi:10.1109/TPAMI.2012.120.</p>
      <p>[42] L. Najman, M. Schmitt, Watershed of a continuous function, Signal Process. (1994). URL: https://hal.science/hal-00622129. doi:10.1016/0165-1684(94)90059-0.</p>
      <p>[43] X. Wei, Q. Yang, Y. Gong, N. Ahuja, M. Yang, Superpixel hierarchy, IEEE Trans. Image Process. (2018). doi:10.1109/TIP.2018.2836300.</p>
      <p>[44] H. Lieberman, How to color in a coloring book, ACM Conf. Spec. Inter. Group Graphics Interact. Techn. (SIGGRAPH) 12 (1978). doi:10.1145/965139.807380.</p>
      <p>[45] U. Shani, Filling regions in binary raster images: A graph-theoretic approach, ACM Conf. Spec. Inter. Group Graphics Interact. Techn. (SIGGRAPH) 14 (1980).</p>
      <p>[46] J. Cousty, G. Bertrand, L. Najman, M. Couprie, Watershed cuts: Minimum spanning forests and the drop of water principle, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 1362–1374. doi:10.1109/TPAMI.2008.173.</p>
      <p>[47] J. Stolfi, R. de Alencar Lotufo, A. X. Falcão, The image foresting transform: Theory, algorithms, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 19–29. doi:10.1109/TPAMI.2004.10012.</p>
      <p>[48] C. Couprie, L. Grady, L. Najman, H. Talbot, Power watershed: A unifying graph-based optimization framework, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011) 1384–1399. doi:10.1109/TPAMI.2010.200.</p>
      <p>[49] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: an imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Sys. (NeurIPS) (2019).</p>
      <p>[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: IEEE/CVF Inter. Conf. Comput. Vis. Pattern Recog. (CVPR), IEEE, 2009, pp. 248–255.</p>
      <p>[51] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context, in: D. J. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), European Conf. Comput. Vis. (ECCV), 2014. doi:10.1007/978-3-319-10602-1_48.</p>
      <p>[52] S. Fu, M. Hamilton, L. E. Brandt, A. Feldmann, Z. Zhang, W. T. Freeman, FeatUp: A model-agnostic framework for features at any resolution, in: Inter. Conf. Learn. Represent. (ICLR), 2024. URL: https://openreview.net/forum?id=GkJiNn2QDF.</p>
      <p>[53] R. Franzen, Kodak lossless true color image suite, 2012. URL: https://r0k.us/graphics/kodak/.</p>
      <p>[54] S. Caminiti, I. Finocchi, R. Petreschi, On coding labeled trees, Theoretical Computer Science 382 (2007) 97–108.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-Net:
          <article-title>Convolutional networks for biomedical image segmentation</article-title>
          , in: N.
          <string-name>
            <surname>Navab</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hornegger</surname>
            ,
            <given-names>W. M. W.</given-names>
          </string-name>
          <string-name>
            <surname>III</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. F. Frangi</surname>
          </string-name>
          (Eds.),
          <source>Inter. Conf. Med. Imag. Comput.-Assist. Interv. (MICCAI)</source>
          , volume
          <volume>9351</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -24574-4\_
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          , T. Cheng, B.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Mu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Xiao</surname>
          </string-name>
          ,
          <article-title>Deep high-resolution representation learning for visual recognition</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>43</volume>
          (
          <year>2021</year>
          )
          <fpage>3349</fpage>
          -
          <lpage>3364</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2020</year>
          .
          <volume>2983686</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Cover</surname>
          </string-name>
          ,
          <article-title>Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition</article-title>
          ,
          <source>IEEE Transactions on Electronic Computers EC-14</source>
          (
          <year>1965</year>
          )
          <fpage>326</fpage>
          -
          <lpage>334</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shalev-Shwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ben-David</surname>
          </string-name>
          ,
          <source>Understanding Machine Learning: From Theory to Algorithms</source>
          , Cambridge University Press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hardle</surname>
          </string-name>
          , L. Simar,
          <source>Applied Multivariate Statistical Analysis</source>
          , Springer Berlin Heidelberg,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Field</surname>
          </string-name>
          ,
          <article-title>Relations between the statistics of natural images and the response properties of cortical cells</article-title>
          ,
          <source>Journal of the Optical Society of America A</source>
          <volume>4</volume>
          (
          <year>1987</year>
          )
          <fpage>2379</fpage>
          -
          <lpage>2394</lpage>
          . doi:
          <volume>10</volume>
          .1364/JOSAA. 4.002379.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kersten</surname>
          </string-name>
          ,
          <article-title>Predictability and redundancy of natural images</article-title>
          ,
          <source>Journal of the Optical Society of America A</source>
          <volume>4</volume>
          (
          <year>1987</year>
          )
          <fpage>2395</fpage>
          -
          <lpage>2400</lpage>
          . doi:
          <volume>10</volume>
          .1364/JOSAA.4.002395.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Simoncelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Olshausen</surname>
          </string-name>
          ,
          <source>Natural image statistics and neural representation, Annual Review of Neuroscience</source>
          <volume>24</volume>
          (
          <year>2001</year>
          )
          <fpage>1193</fpage>
          -
          <lpage>1216</lpage>
          . doi:
          <volume>10</volume>
          .1146/annurev.neuro.
          <volume>24</volume>
          .1.1193.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Seeram</surname>
          </string-name>
          ,
          <article-title>Advances in imaging-the changing environment for the imaging technologist</article-title>
          ,
          <source>Radiologic Technology</source>
          <volume>82</volume>
          (
          <year>2011</year>
          )
          <fpage>417</fpage>
          -
          <lpage>438</lpage>
          . URL: https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC3076980/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Abramovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pensky</surname>
          </string-name>
          ,
          <article-title>Classification with many classes: Challenges and pluses</article-title>
          ,
          <source>J. Multivar. Anal</source>
          .
          <volume>174</volume>
          (
          <year>2019</year>
          )
          <article-title>104536</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/ S0047259X19302763.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. Zhang,</surname>
          </string-name>
          <article-title>Principled approach to the selection of the embedding dimension of networks</article-title>
          ,
          <source>Nature Communications</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . doi:
          <volume>10</volume>
          .1038/ s41467-021-23795-5.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Kajiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Sutherland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Cheadle</surname>
          </string-name>
          ,
          <article-title>A random-access video frame buffer, Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>1998</year>
          , p.
          <fpage>315</fpage>
          -
          <lpage>320</lpage>
          . doi:
          <volume>10</volume>
          .1145/280811.281022.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Heckbert</surname>
          </string-name>
          ,
          <article-title>Color image quantization for frame buffer display</article-title>
          ,
          <source>in: ACM Conf. Spec. Inter. Group Graphics Interact. Techn. (SIGGRAPH)</source>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>1982</year>
          , p.
          <fpage>297</fpage>
          -
          <lpage>307</lpage>
          . doi:
          <volume>10</volume>
          .1145/800064.801294.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>HAZIZA</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Assran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Galuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
          , P.-Y. Huang,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Labatut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , P. Bojanowski,
          <article-title>DINOv2: Learning robust visual features without supervision</article-title>
          ,
          <source>Trans. Mach. Learn. Res</source>
          . (
          <year>2024</year>
          ). URL: https://openreview.net/forum?id=a68SUt6zFt.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Naeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tombari</surname>
          </string-name>
          , SILC:
          <article-title>Improving vision language pretraining with self-distillation</article-title>
          ,
          <source>in: European Conf. Comput. Vis. (ECCV)</source>
          ,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>031</fpage>
          -72664-
          <issue>4</issue>
          _
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mintun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rolland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gustafson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whitehead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , W.-Y. Lo,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Segment anything</article-title>
          ,
          <source>in: IEEE Inter. Conf. Comput. Vis. (ICCV)</source>
          ,
          <year>2023</year>
          . URL: https://openaccess.thecvf.com/content/ICCV2023/papers/Kirillov_Segment_Anything_ ICCV_
          <year>2023</year>
          <article-title>_paper</article-title>
          .pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Selan</surname>
          </string-name>
          ,
          <article-title>High-quality real-time rendering with multi-dimensional lookup tables, in: GPU Gems 2: Programming Techniques for High-Performance Graphics</article-title>
          and
          <string-name>
            <surname>General-Purpose</surname>
            <given-names>Computation</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Addison-Wesley Professional</surname>
          </string-name>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>CompuServe</given-names>
            <surname>Incorporated</surname>
          </string-name>
          ,
          <article-title>Graphics interchange format (gif) specification, version 89a</article-title>
          ,
          <string-name>
            <surname>Online</surname>
          </string-name>
          ,
          <year>1989</year>
          . URL: https://www.w3.org/Graphics/GIF/spec-gif89a.
          <fpage>txt</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>