<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>with Embedded Lookup Tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marius Aasan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adín Ramírez Rivera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <addr-line>Problemveien 11, 0313 Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SFI Visual Intelligence</institution>
          ,
          <addr-line>P.O. Box 6050 Langnes, 9037 Tromsø</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Pixel-level prediction tasks inherently face constraints imposed by image resolution and the number of prediction classes. As image resolutions and dimensionalities increase, these constraints lead to significant memory bottlenecks when explicitly modeling high-dimensional embeddings for each individual pixel. Addressing these bottlenecks requires the development of alternative, more efficient representations to facilitate continued progress in the field. In this work, we discuss Embedded Lookup Tables (ELUTs) with indexed segmentation maps as an alternative data structure for more memory-efficient representations in image processing. We show that ELUTs are inherently compatible with cost functionals and metrics for pixel-level prediction tasks with a significant reduction in memory overhead.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer Vision</kwd>
        <kwd>Segmentation</kwd>
        <kwd>Representation Learning</kwd>
        <kwd>Data Structures</kwd>
        <kwd>Image Compression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Figure 1: An H × W × d dense representation (a, float32), compared to an H × W indexed segmented representation (b, int64) with an n × d embedded lookup table (c). Our proposition utilizes (b) and (c) to form a sparse representation of (a) with significantly lower memory overhead.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Dense raster tensors are natural representations for convolutional networks (CNNs), whose design
natively consumes and emits features in grids. As discussed in Sec. 1, the demands of higher resolution
images in conjunction with higher embedding dimensions indicate that the 𝒪(Nd) overhead is a
bottleneck. Current state-of-the-art foundational models [14] make use of over 131k pseudo-classes for
instance- and patch-level predictions. Extending this to pixel-level predictions remains computationally
intractable without alternative representations. While open-vocabulary approaches [15, 16] alleviate
the issue by aligning with natural language embeddings instead of fixed classes, they necessitate
high-dimensional embeddings and induce the same bottleneck.</p>
      <sec id="sec-2-1">
        <title>Frame Buffers and Palettes.</title>
        <p>In the formative stages of computer graphics, Kajiya et al. [12]
introduced the modern frame buffer for overcoming limitations of memory and compute. Early renderers
were designed such that each pixel stores an index into a color lookup table (CLUT) with RGB color
representations to enable compact frame buffers and real-time hardware color mapping. Later work
expanded on quantization and dithering for optimizing color representations in imaging [13], paving
the way for further developments with palette-based image quantization and lookups for hardware and
software rendering. Color quantization and CLUTs are actively used for effective photometric transforms
in modern GPU rendering [17] and compression in standard image formats [18, 19].</p>
      </sec>
      <sec id="sec-2-2">
        <title>Superpixels and Objecthood.</title>
        <p>One remedy to the dense bottleneck is to decouple where a feature
lives from what the feature is. If a color lookup table can recover an RGB triplet from an index, the
same pattern generalizes: store a d-dimensional feature vector in the table and let neighboring pixels
that share semantics reuse the index. Superpixels [20] were introduced for this exact purpose, and serve
as a strong indicator for pre-segmentation by partitioning an image into n ≪ N
connected regions.</p>
        <p>By associating each region with an index and each index with a feature embedding, we can construct
image representations that align with semantic content in an image.</p>
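        <p>The palette analogy is direct in code. As a minimal sketch (PyTorch, with hypothetical shapes and variable names of our choosing), a dense feature map is recovered from an index map and a lookup table by a single gather:</p>
        <preformat>
import torch

H, W, n, d = 64, 64, 128, 16          # hypothetical sizes
S = torch.randint(0, n, (H, W))       # index map: each pixel points to a region
E = torch.randn(n, d)                 # lookup table: one d-dim embedding per region

# Recover the dense H x W x d map by indexing the table with the map.
dense = E[S]                          # shape (H, W, d)
assert dense.shape == (H, W, d)
        </preformat>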
      </sec>
      <sec id="sec-2-3">
        <title>Token-based Architectures.</title>
        <p>Research has seen widespread adoption of Vision Transformers [21]
(ViTs), representing a move away from local receptive fields towards attention operators to model
global interaction between discrete tokens. Tokenization has largely been driven by a simplified
patchification procedure; square patches are extracted as tokens with positional embeddings to encode
spatial information. ViTs for pixel-level predictions have focused on patch upsampling and concatenation
of multiple layer activations [22, 23] to yield full resolution predictions. While studies have shown
that smaller patches are beneficial for predictive power [24, 25], the computational complexity of more
granular tokens is quadratic [26, 27] due to the attention operator.</p>
        <p>Recent work has looked to improve on the patch-based tokenization process in a move towards
more adaptive tokenization methods [28, 29]. Adaptive superpixel tokenization has been successfully
applied to allow ViTs to work with pixel-level granularity [30, 31], and has been shown to improve
state-of-the-art open-vocabulary segmentation models [32]. As every region carries an integer index,
the image is now an index map; the per-region embeddings form a lookup table, giving exactly the
palette-style pairs illustrated in Fig. 1.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Convolutions on Discrete Topologies.</title>
        <p>Although CNNs are used for feature extraction and
superpixel tokenization [33], they are not inherently suited to discrete embedding spaces in the way
transformers are. By design, a CNN natively consumes and emits features in grids, necessitating a
fully realized dense representation Z ∈ ℝ^{d×H×W}. However, the fundamental neighborhood operations
performed by a CNN can be extended to discrete topologies via graph neural networks [34, 35, 36]
(GNNs). Superpixel-based GNNs have been successfully applied in medical imaging [37, 38], and the
recent long range graph benchmark [39] (LRGB) includes two superpixel graph datasets for segmentation,
encouraging further research in vision-based GNN approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Embedded Lookup Tables with Indexed Segmentation Maps</title>
      <p>
        Adaptive tokenization in ViTs with semantic edge detection provides an opportunity to rethink
underlying data structures for pixel-level prediction. Because the multi-head self-attention operator is
inherently permutation-invariant, spatial relationships must be injected through positional encodings
rather than being implicit via grid topology. This observation suggests that we no longer need to
commit to a fixed, dense lattice as the primitive representation. Instead, we encode each image as a pair
(S ∈ {1, …, n}^{H×W}, E ∈ ℝ^{n×d}),  (1)
where S(p) assigns each pixel p to one of n superpixel tokens, and E_τ ∈ ℝ^d is the d-dimensional feature
for token τ. In training, a batch of B images {(S_b, E_b)}_{b=1}^{B} is combined by simple concatenation of tables,
S̃_b(p) = S_b(p) + ∑_{a&lt;b} n_a,  Ẽ = [E_1, E_2, …, E_B].  (2)
      </p>
      <p>No padding is required – each region's index range is shifted by the cumulative token counts of preceding
samples. This is the same implementation used in GNNs; variable-size graphs are merged into a single
supergraph by offsetting node indices and merging edge lists, preserving each subgraph's topology
without forcing a common size [40].</p>
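      <p>A minimal sketch of this batching scheme (PyTorch; the helper name and shapes are ours, not a published API) shifts each sample's indices by the cumulative token counts of the preceding tables before concatenating, mirroring the supergraph batching of PyTorch Geometric [40]:</p>
      <preformat>
import torch

def batch_eluts(pairs):
    """Merge [(S_1, E_1), ..., (S_B, E_B)] into one flat (S, E) pair.

    Each S_b holds indices in [0, n_b); tables are stacked, and each
    index map is shifted by the token counts of preceding samples.
    """
    offset, maps, tables = 0, [], []
    for S_b, E_b in pairs:
        maps.append(S_b.flatten() + offset)   # shift into the global index range
        tables.append(E_b)
        offset += E_b.shape[0]                # cumulative token count
    return torch.cat(maps), torch.cat(tables)

# Heterogeneous resolutions and token counts require no padding.
pairs = [(torch.randint(0, 10, (32, 32)), torch.randn(10, 8)),
         (torch.randint(0, 25, (48, 40)), torch.randn(25, 8))]
S, E = batch_eluts(pairs)
assert E.shape == (35, 8) and S.max() &lt; 35
      </preformat>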
      <p>Conceptually, ELUTs can be viewed as an extension of indexed color representations in early frame
buffers; S is a two-dimensional index map and E is a lookup table of learned embeddings. The memory
savings follow immediately; instead of storing H × W × d dense features, we store only n × d entries plus
an H × W integer map. As long as the segmentation captures spatial redundancy, this yields compression
of pixel-level features, with support for batching of heterogeneous resolutions and token counts.</p>
      <sec id="sec-3-1">
        <title>3.1. Implementation Details</title>
        <p>An ELUT representation is straightforward to apply in ViTs, since token representations correspond
directly to the embeddings in the lookup table. Any tokenization process that partitions the image
into connected regions can be applied to extract segmentation maps S^(i). Different approaches can be
applied for feature extraction; interpolation to a fixed feature size via bounding-boxes in addition to a
joint histogram for positional embeddings provides commensurability with existing ViT approaches [30].
Alternatively, a vision encoder such as a lightweight CNN can be applied for feature extraction to
construct a dense feature map Z^(i) ∈ ℝ^{d×H′×W′} [31, 32], and features can be extracted by taking
E^(i)_τ = ⨁_{S^(i)(p)=τ} Z^(i)_p, where ⨁ denotes a permutation-invariant aggregation over the pixels in
region τ. Superpixel tokenizers such as SLIC [41] and watershed methods [42] partition images into regions that
align with semantic boundaries. SPiT [30] applies ground-up community detection mechanisms using a
graph-based hierarchical region merging [43].</p>
        <p>Parallel computations. Interestingly, watershed and graph region merging approaches give two
different approaches to flood filling [44, 45]. Watershed applies filling based on a set of seeds given a
gradient image, while graph region merging proceeds by processing all pixels uniformly in parallel while updating
the graph representation. Specifically, both approaches can be viewed as optimal spanning forests [46,
47] and are equivalent to optimizing a random walk over the graph [48]. Efficient implementations in
PyTorch [49] can be constructed without external dependencies or custom CUDA kernels [30].</p>
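        <p>To make graph region merging concrete, the following is a deliberately simplified stand-in (NumPy union-find on a 4-connected grid; the merge criterion and all names are ours, not the SPiT procedure), merging neighboring pixels whose intensity difference falls below a threshold:</p>
        <preformat>
import numpy as np

def merge_regions(img, tau):
    """Toy graph region merging on a 4-connected grid via union-find."""
    H, W = img.shape
    parent = np.arange(H * W)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for y in range(H):
        for x in range(W):
            i = y * W + x
            for j in (i + 1 if x + 1 &lt; W else None,     # right neighbor
                      i + W if y + 1 &lt; H else None):    # bottom neighbor
                if j is not None and abs(float(img.flat[i]) - float(img.flat[j])) &lt; tau:
                    parent[find(i)] = find(j)           # union the two regions

    roots = np.array([find(i) for i in range(H * W)])
    _, S = np.unique(roots, return_inverse=True)        # relabel regions to 0..n-1
    return S.reshape(H, W)

S = merge_regions(np.random.rand(16, 16), tau=0.1)      # an indexed segmentation map
        </preformat>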
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Space Complexity Analysis</title>
        <p>Let N = H × W denote the number of pixels in an image, d the feature dimension, and n ≤ N the number
of regions. We compare the memory footprint of a standard dense feature map
M_dense = N × d,  (3)
against that of an ELUT representation, which is given by
M_ELUT = N + n × d,  (4)
where the first term accounts for the index map S ∈ {1, …, n}^{H×W} and the second for the table E ∈ ℝ^{n×d}.
We can express the relative memory usage as
M_ELUT / M_dense = (N + nd) / (Nd) = 1/d + n/N = 1/d + ρ,  ρ ≡ n/N ∈ (0, 1].  (5)
Given this, ELUTs provide an overall memory reduction factor
γ = M_dense / M_ELUT = 1 / (ρ + 1/d).  (6)</p>
        <p>Sufficiently high-dimensional embeddings (d ≫ 1) give a "rule-of-thumb" estimate
γ ≈ ρ^{−1}.  (7)
The savings can be illustrated by the following example; if the regions compress to ρ = 0.05 of the pixel count, ELUTs
use only ∼5% of the dense footprint – a ∼20× savings. For an extended empirical analysis, see Sec. 4.1.</p>
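        <p>The reduction factor of Eq. (6) is easy to verify numerically; a small sketch counting stored entries as in Eqs. (3) and (4), with the example ratio above and d = 768 as used in Sec. 4.1:</p>
        <preformat>
def reduction_factor(N, n, d):
    """gamma = M_dense / M_ELUT = 1 / (rho + 1/d), counting stored entries."""
    return (N * d) / (N + n * d)

# Example from the text: regions compress to 5% of the pixel count.
N, d = 224 * 224, 768                     # hypothetical resolution and dimension
gamma = reduction_factor(N, n=int(0.05 * N), d=d)
print(f"gamma = {gamma:.1f}x")            # ~19.5x, near the 1/rho = 20x rule of thumb
        </preformat>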
        <p>Worst-case vs. best-case. For n = N (no compression), ρ = 1 and M_ELUT / M_dense = 1 + 1/d,
hence ELUTs incur negligible extra overhead in high-dimensional regimes. In the ideal limit n ≪ N,
memory scales as 𝒪(N + nd) ≪ 𝒪(Nd). This demonstrates how ELUTs improve on the 𝒪(Nd)
bottleneck of dense tensor representations, trading off a linear-in-pixels index map for a much smaller,
segmentation-driven embedding table.</p>
        <p>Batch Processing. Eq. (2) shows that ELUT batching can be implemented by table concatenation
without overhead. For a batch of B images with pixel counts N_b and token counts n_b,
M^(B)_dense = d ∑_b N_b,  M^(B)_ELUT = ∑_b N_b + d ∑_b n_b,  (8)
and the same ratio analysis applies, giving
ρ_B = ∑_b n_b / ∑_b N_b.  (9)
Thus an ELUT incurs 𝒪(N + nd + nd) = 𝒪(N + 2nd) = 𝒪(N + nd)
work per image, whereas a dense raster evaluates the same objective in 𝒪(Nd). As with the space
complexity, for ρ = 1 (no compression) a dense representation outperforms ELUTs, but since
𝒪(N + Nd) = 𝒪(N(d + 1)) = 𝒪(Nd), the representations technically have the same complexity.
However, for the expected case ρ ≪ 1 we see a significant reduction in compute for ELUT
representations, tied to the reduction factor γ. In a practical modeling setting, the sparsity of the
contingency table C (Sec. 3.3) also plays a role, conditional on how well S aligns with the ground
truth Y.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Equivalence of Cost functionals and Metrics</title>
        <p>An ELUT representation (S, E) allows for standard computation of cost metrics – such as MSE, cross-entropy,
or focal loss – as well as metrics such as the Jaccard index (IoU) or Dice coefficients. Given a set of predicted
regions S : Ω → {1, …, n} as well as ground-truth labels Y : Ω → {1, …, K}, we define region sizes and
the region–class contingency counts
ν_r = ∑_{p∈Ω} 1[S(p)=r],  C_rk = ∑_{p∈Ω} 1[S(p)=r, Y(p)=k],  (10)
where 1[⋅] denotes the Iverson bracket. Any loss or metric expressible as a sum over pixels
ℒ = ∑_{p∈Ω} ℓ(E_{S(p)}, Y(p))
can be rewritten exactly as a weighted sum over the contingency table
ℒ = ∑_{r=1}^{n} ∑_{k=1}^{K} C_rk ℓ(E_r, k),  (11)
recalling that E_r is the embedding or logits emitted by token r and k is a target embedding or class
index. Denote by 𝒮 = {(r, k) : C_rk &gt; 0} the set of non-empty region–class overlaps. We can then take
(11) as an identity; it trades N pixel visits for the at most m = |𝒮| non-empty overlaps.</p>
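        <p>A minimal sketch of identity (11) for cross-entropy (PyTorch; all shapes and names are hypothetical): the contingency table is built with a single bincount over fused indices, after which the pixel-mean loss is evaluated over region–class overlaps instead of pixels:</p>
        <preformat>
import torch
import torch.nn.functional as F

n, K, N = 50, 10, 64 * 64            # regions, classes, pixels
S = torch.randint(0, n, (N,))        # predicted region per pixel
Y = torch.randint(0, K, (N,))        # ground-truth class per pixel
E = torch.randn(n, K)                # per-region logits (the lookup table)

# One O(N) pass: C[r, k] = number of pixels with S(p) = r and Y(p) = k.
C = torch.bincount(S * K + Y, minlength=n * K).reshape(n, K).float()

# Dense evaluation: gather logits per pixel, average the pixel losses.
dense = F.cross_entropy(E[S], Y)

# ELUT evaluation: weight the per-(region, class) losses by the counts.
logp = F.log_softmax(E, dim=-1)      # (n, K) log-probabilities
elut = -(C * logp).sum() / N         # Eq. (11) with mean reduction

assert torch.allclose(dense, elut, atol=1e-5)
        </preformat>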
        <p>Batch Processing. Computing cost functionals and metrics in a mini-batch regime requires only
a slight modification for computing contingency tables, where each ground-truth segment needs to
be unique within each sample. This can be achieved by concatenating Ỹ = [Y_b + (b − 1)K]_{b=1}^{B}, such
that the original targets can be recovered using a simple modulo operation to determine the global
frequency for the final contingency table.</p>
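        <p>A sketch of the batched contingency computation (PyTorch; the function and argument names are ours): labels are shifted into disjoint per-sample ranges before fusing, and a modulo recovers the class index when accumulating the global table:</p>
        <preformat>
import torch

def batched_contingency(S, Y, sample_id, n, K, B):
    """Single-pass contingency counts over a merged batch of B samples.

    S: global region indices, already offset as in Eq. (2).
    Y: per-pixel class labels in {0, ..., K-1}.
    sample_id: index of the sample each pixel belongs to.
    """
    Y_shift = Y + sample_id * K                   # unique label range per sample
    fused = S * (B * K) + Y_shift                 # fuse (region, shifted label)
    counts = torch.bincount(fused, minlength=n * B * K)
    idx = counts.nonzero().squeeze(-1)            # visit only non-empty overlaps
    r = idx // (B * K)
    k = (idx % (B * K)) % K                       # modulo recovers the class index
    C = torch.zeros(n, K)
    C.index_put_((r, k), counts[idx].float(), accumulate=True)
    return C

# Two samples, already merged as in Eq. (2): regions 0-1 and 2-3 respectively.
S = torch.tensor([0, 0, 1, 2, 2, 3]); Y = torch.tensor([1, 1, 0, 1, 0, 0])
sid = torch.tensor([0, 0, 0, 1, 1, 1])
C = batched_contingency(S, Y, sid, n=4, K=2, B=2)
        </preformat>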
        <sec id="sec-3-3-1">
          <title>Complexity reduction.</title>
          <p>
            Building the contingency table C requires a single 𝒪(N) scan of the pixels;
each visit increments the counter (r, k) ↦ C_rk + 1. After that pass, any additive loss or metric of the form
(11) touches only the m non-zero overlaps, plus the n region marginals. However, the sparsity of C is
contingent on how well S matches Y. Given that K &lt; n &lt; N, the worst case yields
m = nK,  (12)
i.e., every region overlaps every class.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>Our experiments are designed to demonstrate feasibility for ELUT representations. In Sec. 4.1 we
perform an empirical study on complexity, showing effective reduction in memory and computational
costs, verifying derivations from Sections 3.2 and 3.3. We then demonstrate the effectiveness of
hierarchical image representations in a simple lossless compression scheme.</p>
      <sec id="sec-4-1">
        <title>4.1. Empirical Complexity Analysis</title>
        <p>We compute empirical estimates of expected region counts over ImageNet [50] using an adaptive
tokenization framework with m ∈ {1, 2, 3, 4} merge iterations [30]. Table 1 shows that even at conservative
levels (m = 1), ELUTs quarter the memory footprint, while deeper merges further compress the feature
map without altering pixel-wise losses or metrics – cf. Sec. 3.3.</p>
        <p>Next, we compute memory overhead and wall-clock throughput for variable sized batches for images
sampled from COCO [51]. We compare a dense feature representation extracted via upscaling [52] with
d = 768 and H = W = 224 to an ELUT representation. Features are extracted via the same backbone
over the same batch of images, so the only difference is their representation.</p>
        <p>We compute equivalent formulations of cross-entropy for dense and ELUT representations, and
report results in Tab. 2. While our results do not verify that ELUTs provide significant empirical
computational benefits in terms of throughput, the reduction in memory overhead remains significant. The
loss is evaluated by gathering the n region logits from one contiguous lookup table and weighting them
with the m sparse overlaps, so the only overhead comes from scatter/gather operations.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Information Content and Compression</title>
        <p>For a data structure to be effective, it should ideally result in low entropy representations. Empirically
validating the informational content of general ELUTs is non-trivial, as the representation is highly
contingent on the choice of tokenizer and features, which are typically task-dependent.</p>
        <p>We construct a representation independent of feature representations by using a superpixel tokenizer
and compress images using a hierarchical pyramid representation. Each pixel is associated with a parent
region using the SPiT tokenizer [30] to form a rooted tree, i.e., we aggregate regions until only a single
node remains. We reuse the very same SPiT partitions that the model computes during training. We
then construct a graph Laplacian pyramid by representing all non-root RGB embeddings mapped to
YCbCr with a difference transform over the depth of the tree. Since a region is typically similar to its
parent, this provides a sparse representation, particularly in leaf nodes representing individual pixels.</p>
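        <p>A minimal sketch of the parent-difference transform (NumPy; the tree is abstracted as a parent array, and the names are ours – the tokenizer supplies the actual hierarchy):</p>
        <preformat>
import numpy as np

def difference_transform(values, parent):
    """Residual of each node against its parent in a rooted tree.

    values: (num_nodes, 3) YCbCr embeddings; parent[i] is node i's
    parent, with the root pointing to itself. Since a region tends to
    resemble its parent, residuals are near zero, which is what the
    downstream entropy coder exploits.
    """
    root = parent == np.arange(len(parent))
    residual = values - values[parent]
    residual[root] = values[root]        # keep the root's absolute value
    return residual

def inverse_transform(residual, parent, depth_order):
    """Reconstruct values by accumulating residuals from root to leaves."""
    values = residual.copy()
    for i in depth_order:                # parents must precede children
        if parent[i] != i:
            values[i] += values[parent[i]]
    return values

# Tiny example: a chain 0 &lt;- 1 &lt;- 2 (node 0 is the root).
vals = np.array([[100.0, 0.0, 0.0], [101.0, 0.0, 1.0], [103.0, 1.0, 1.0]])
par = np.array([0, 0, 1])
rec = inverse_transform(difference_transform(vals, par), par, depth_order=[0, 1, 2])
assert np.allclose(rec, vals)
        </preformat>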
        <p>To encode the overall structure, we extract the Prüfer sequence of the graph [54] along with the
ELUT. In conjunction with a Huffman encoding scheme, this yields a basic lossless compression format,
useful for evaluating the informational capacity of ELUT representations. We measure the information
density by evaluating the compression ratio (CR) and bits per pixel (BPP), contrast them with standard
image formats, and report them in Tab. 3. Surprisingly, this simple strategy yields better compression
than PNG, the industry standard. We note that while this is by no means a state-of-the-art approach, it
shows that hierarchical ELUTs provide highly effective image representations.</p>
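        <p>For reference, both reported quantities reduce to simple ratios; a sketch with hypothetical byte counts for an 8-bit RGB image:</p>
        <preformat>
def compression_stats(compressed_bytes, H, W, channels=3, bits_per_channel=8):
    """Compression ratio (CR) and bits per pixel (BPP) for an H x W image."""
    raw_bits = H * W * channels * bits_per_channel
    bpp = 8 * compressed_bytes / (H * W)
    cr = raw_bits / (8 * compressed_bytes)
    return cr, bpp

cr, bpp = compression_stats(compressed_bytes=60_000, H=512, W=768)
        </preformat>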
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>In this work, we introduce Embedded Lookup Tables as extensions of well-established representations in
computer graphics to machine learning settings. We show that by factorizing an H × W × d map into
an integer index image S and a compact table E ∈ ℝ^{n×d}, ELUTs replace the 𝒪(Nd) memory wall
with 𝒪(N + nd). Our experiments provide empirical evidence showing how ELUTs provide an
effective alternative in practical modeling tasks requiring pixel-level granularity. Since objectives and
metrics can be rewritten using contingency counts, arithmetic shrinks roughly in proportion to the
factor ρ = n/N.</p>
      <p>While our results are not contingent on a single tokenization framework, future work will
incorporate more extensive studies on the effect on training in dense prediction tasks across multiple
tokenization frameworks. While our scope is limited to 2D imaging in this work, our theoretical results naturally
extend to higher dimensional imaging. Extensions to volumetric data, light-field stacks, and video
frames should benefit even further as the dense baseline grows with an extra dimension while the index
map stays tractable.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Computations were performed on resources provided by Sigma2 (NRIS, Norway), Project NN8104K. We
acknowledge Sigma2 for awarding access to the LUMI supercomputer, owned by the EuroHPC Joint
Undertaking, hosted by CSC (Finland) and the LUMI consortium through Sigma2, Norway, Project
no. 465001382. This work was funded in part by the Research Council of Norway, via the Visual
Intelligence Centre for Research-based Innovation (grant no. 309439), and Consortium Partners.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o and Writefull for grammar and spelling
checks and formatting assistance.<sup>1</sup> After using these services, the authors reviewed and edited the
content as needed and take full responsibility for the publication's content.</p>
      <p><sup>1</sup>Following the taxonomy from the CEUR-WS Policy.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[19] Adobe Systems Inc., TIFF revision 6.0 specification, Online, 1992. URL: http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf, originally published by Aldus Corporation as TIFF 6.0.</p>
      <p>[20] X. Ren, J. Malik, Learning a classification model for segmentation, in: IEEE Inter. Conf. Comput. Vis. (ICCV), 2003, pp. 10–17 vol. 1. doi:10.1109/ICCV.2003.1238308.</p>
      <p>[21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: Inter. Conf. Learn. Represent. (ICLR), 2021.</p>
      <p>[22] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. S. Torr, L. Zhang, Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: IEEE/CVF Inter. Conf. Comput. Vis. Pattern Recog. (CVPR), 2021.</p>
      <p>[23] R. Ranftl, A. Bochkovskiy, V. Koltun, Vision transformers for dense prediction, in: IEEE Inter. Conf. Comput. Vis. (ICCV), 2021.</p>
      <p>[24] F. Wang, Y. Yu, G. Wei, W. Shao, Y. Zhou, A. Yuille, C. Xie, Scaling laws in patchification: An image is worth 50,176 tokens and more, 2025. URL: https://arxiv.org/abs/2502.03738, arXiv preprint arXiv:2502.03738.</p>
      <p>[25] R. Mojtahedi, M. Hamghalam, R. K. G. Do, A. L. Simpson, Towards optimal patch size in vision transformers for tumor segmentation, in: International Workshop on Multiscale Multimodal Medical Imaging (MMMI 2022), 2023. doi:10.1007/978-3-031-18814-5_11.</p>
      <p>[26] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, A. Weller, Rethinking attention with performers, in: Inter. Conf. Learn. Represent. (ICLR), 2021. arXiv preprint arXiv:2009.14794.</p>
      <p>[27] Y. Han, Z. Xu, X. Shang, L. Wang, Y. Yang, S. Li, X. Liu, B. Wu, J. Bai, et al., FLatten transformer: Vision transformer using focused linear attention, in: IEEE Inter. Conf. Comput. Vis. (ICCV), 2023.</p>
      <p>[28] J. D. Havtorn, A. Royer, T. Blankevoort, B. E. Bejnordi, MSViT: Dynamic mixed-scale tokenization for vision transformers, in: IEEE Inter. Conf. Comput. Vis. Wksps. (ICCVW), 2023, pp. 838–848.</p>
      <p>[29] T. Ronen, O. Levy, A. Golbert, Vision transformers with mixed-resolution tokenization, in: IEEE/CVF Inter. Conf. Comput. Vis. Pattern Recog. (CVPR), IEEE Computer Society, 2023. doi:10.1109/CVPRW59228.2023.00486.</p>
      <p>[30] M. Aasan, O. Kolbjørnsen, A. Schistad Solberg, A. Ramírez Rivera, A spitting image: Modular superpixel tokenization in vision transformers, in: European Conf. Comput. Vis. Wksps. (ECCVW), 2024.</p>
      <p>[31] J. Lew, S. Jang, J. Lee, S.-K. Yoo, E. Kim, S. Lee, J.-Y. C. J.-H. M. Y.-I. Mok, S. Kim, S. Yoon, Superpixel tokenization for vision transformers: Preserving semantic integrity in visual tokens, ArXiv abs/2412.04680 (2024). URL: https://api.semanticscholar.org/CorpusID:274581264.</p>
      <p>[32] D. Chen, S. Cahyawijaya, J. Liu, B. Wang, P. Fung, Subobject-level image tokenization, ArXiv abs/2402.14327 (2024). URL: https://api.semanticscholar.org/CorpusID:267782983.</p>
      <p>[33] V. Jampani, D. Sun, M.-Y. Liu, M.-H. Yang, J. Kautz, Superpixel sampling networks, in: European Conf. Comput. Vis. (ECCV), 2018.</p>
      <p>[34] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Inter. Conf. Learn. Represent. (ICLR), 2017. URL: https://openreview.net/forum?id=SJU4ayYgl.</p>
      <p>[35] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: Inter. Conf. Learn. Represent. (ICLR), 2019. URL: https://openreview.net/forum?id=ryGs6iA5Km.</p>
      <p>[36] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, in: Inter. Conf. Learn. Represent. (ICLR), 2018. URL: https://openreview.net/forum?id=rJXMpikCZ.</p>
      <p>[37] C. Playout, Z. Legault, R. Duval, M. C. Boucher, F. Cheriet, A region-based approach to diabetic retinopathy classification with superpixel tokenization, in: Inter. Conf. Med. Imag. Comput.-Assist. Interv. (MICCAI), volume LNCS 15005, Springer Nature Switzerland, 2024.</p>
      <p>[38] Y. Lee, J. H. Park, S. Oh, K. Shin, J. Sun, M. Jung, C. Lee, H. Kim, J.-H. Chung, K. C. Moon, et al., Derivation of prognostic contextual histopathological features from whole-slide images of tumours via graph deep learning, Nature Biomedical Engineering (2022) 1–15.</p>
      <p>[39] V. P. Dwivedi, L. Rampášek, M. Galkin, A. Parviz, G. Wolf, A. T. Luu, D. Beaini, Long range graph benchmark, in: Adv. Neural Inf. Process. Sys. (NeurIPS), 2022. URL: https://openreview.net/forum?id=in7XC5RcjEn.</p>
      <p>[40] M. Fey, J. E. Lenssen, Fast graph representation learning with PyTorch Geometric, in: Inter. Conf. Learn. Represent. Wksps. (ICLRW), New Orleans, USA, 2019. URL: https://arxiv.org/abs/1903.02428.</p>
      <p>[41] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 2274–2282. doi:10.1109/TPAMI.2012.120.</p>
      <p>[42] L. Najman, M. Schmitt, Watershed of a continuous function, Signal Process. (1994). URL: https://hal.science/hal-00622129. doi:10.1016/0165-1684(94)90059-0.</p>
      <p>[43] X. Wei, Q. Yang, Y. Gong, N. Ahuja, M. Yang, Superpixel hierarchy, IEEE Trans. Image Process. (2018). doi:10.1109/TIP.2018.2836300.</p>
      <p>[44] H. Lieberman, How to color in a coloring book, ACM Conf. Spec. Inter. Group Graphics Interact. Techn. (SIGGRAPH) 12 (1978). doi:10.1145/965139.807380.</p>
      <p>[45] U. Shani, Filling regions in binary raster images: A graph-theoretic approach, ACM Conf. Spec. Inter. Group Graphics Interact. Techn. (SIGGRAPH) 14 (1980).</p>
      <p>[46] J. Cousty, G. Bertrand, L. Najman, M. Couprie, Watershed cuts: Minimum spanning forests and the drop of water principle, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 1362–1374. doi:10.1109/TPAMI.2008.173.</p>
      <p>[47] J. Stolfi, R. de Alencar Lotufo, A. X. Falcão, The image foresting transform: Theory, algorithms, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 19–29. doi:10.1109/TPAMI.2004.10012.</p>
      <p>[48] C. Couprie, L. Grady, L. Najman, H. Talbot, Power watershed: A unifying graph-based optimization framework, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011) 1384–1399. doi:10.1109/TPAMI.2010.200.</p>
      <p>[49] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: an imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Sys. (NeurIPS) (2019).</p>
      <p>[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: IEEE/CVF Inter. Conf. Comput. Vis. Pattern Recog. (CVPR), IEEE, 2009, pp. 248–255.</p>
      <p>[51] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context, in: D. J. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), European Conf. Comput. Vis. (ECCV), 2014. doi:10.1007/978-3-319-10602-1_48.</p>
      <p>[52] S. Fu, M. Hamilton, L. E. Brandt, A. Feldmann, Z. Zhang, W. T. Freeman, FeatUp: A model-agnostic framework for features at any resolution, in: Inter. Conf. Learn. Represent. (ICLR), 2024. URL: https://openreview.net/forum?id=GkJiNn2QDF.</p>
      <p>[53] R. Franzen, Kodak lossless true color image suite, 2012. URL: https://r0k.us/graphics/kodak/.</p>
      <p>[54] S. Caminiti, I. Finocchi, R. Petreschi, On coding labeled trees, Theoretical Computer Science 382 (2007) 97–108.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-Net:
          <article-title>Convolutional networks for biomedical image segmentation</article-title>
          , in: N.
          <string-name>
            <surname>Navab</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hornegger</surname>
            ,
            <given-names>W. M. W.</given-names>
          </string-name>
          <string-name>
            <surname>III</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. F. Frangi</surname>
          </string-name>
          (Eds.),
          <source>Inter. Conf. Med. Imag. Comput.-Assist. Interv. (MICCAI)</source>
          , volume
          <volume>9351</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -24574-4\_
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          , T. Cheng, B.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Mu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Xiao</surname>
          </string-name>
          ,
          <article-title>Deep high-resolution representation learning for visual recognition</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>43</volume>
          (
          <year>2021</year>
          )
          <fpage>3349</fpage>
          -
          <lpage>3364</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2020</year>
          .
          <volume>2983686</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Cover</surname>
          </string-name>
          ,
          <article-title>Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition</article-title>
          ,
          <source>IEEE Transactions on Electronic Computers EC-14</source>
          (
          <year>1965</year>
          )
          <fpage>326</fpage>
          -
          <lpage>334</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shalev-Shwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ben-David</surname>
          </string-name>
          ,
          <source>Understanding Machine Learning: From Theory to Algorithms</source>
          , Cambridge University Press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hardle</surname>
          </string-name>
          , L. Simar,
          <source>Applied Multivariate Statistical Analysis</source>
          , Springer Berlin Heidelberg,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Field</surname>
          </string-name>
          ,
          <article-title>Relations between the statistics of natural images and the response properties of cortical cells</article-title>
          ,
          <source>Journal of the Optical Society of America A</source>
          <volume>4</volume>
          (
          <year>1987</year>
          )
          <fpage>2379</fpage>
          -
          <lpage>2394</lpage>
          . doi:
          <volume>10</volume>
          .1364/JOSAA. 4.002379.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kersten</surname>
          </string-name>
          ,
          <article-title>Predictability and redundancy of natural images</article-title>
          ,
          <source>Journal of the Optical Society of America A</source>
          <volume>4</volume>
          (
          <year>1987</year>
          )
          <fpage>2395</fpage>
          -
          <lpage>2400</lpage>
          . doi:
          <volume>10</volume>
          .1364/JOSAA.4.002395.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Simoncelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Olshausen</surname>
          </string-name>
          ,
          <source>Natural image statistics and neural representation, Annual Review of Neuroscience</source>
          <volume>24</volume>
          (
          <year>2001</year>
          )
          <fpage>1193</fpage>
          -
          <lpage>1216</lpage>
          . doi:
          <volume>10</volume>
          .1146/annurev.neuro.
          <volume>24</volume>
          .1.1193.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Seeram</surname>
          </string-name>
          ,
          <article-title>Advances in imaging-the changing environment for the imaging technologist</article-title>
          ,
          <source>Radiologic Technology</source>
          <volume>82</volume>
          (
          <year>2011</year>
          )
          <fpage>417</fpage>
          -
          <lpage>438</lpage>
          . URL: https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC3076980/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Abramovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pensky</surname>
          </string-name>
          ,
          <article-title>Classification with many classes: Challenges and pluses</article-title>
          ,
          <source>J. Multivar. Anal</source>
          .
          <volume>174</volume>
          (
          <year>2019</year>
          )
          <article-title>104536</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/ S0047259X19302763.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. Zhang,</surname>
          </string-name>
          <article-title>Principled approach to the selection of the embedding dimension of networks</article-title>
          ,
          <source>Nature Communications</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . doi:
          <volume>10</volume>
          .1038/ s41467-021-23795-5.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Kajiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Sutherland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Cheadle</surname>
          </string-name>
          ,
          <article-title>A random-access video frame buffer, Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>1998</year>
          , p.
          <fpage>315</fpage>
          -
          <lpage>320</lpage>
          . doi:
          <volume>10</volume>
          .1145/280811.281022.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Heckbert</surname>
          </string-name>
          ,
          <article-title>Color image quantization for frame buffer display</article-title>
          ,
          <source>in: ACM Conf. Spec. Inter. Group Graphics Interact. Techn. (SIGGRAPH)</source>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>1982</year>
          , p.
          <fpage>297</fpage>
          -
          <lpage>307</lpage>
          . doi:
          <volume>10</volume>
          .1145/800064.801294.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>HAZIZA</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Assran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Galuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Howes</surname>
          </string-name>
          , P.-Y. Huang,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Labatut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , P. Bojanowski,
          <article-title>DINOv2: Learning robust visual features without supervision</article-title>
          ,
          <source>Trans. Mach. Learn. Res</source>
          . (
          <year>2024</year>
          ). URL: https://openreview.net/forum?id=a68SUt6zFt.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Naeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tombari</surname>
          </string-name>
          , SILC:
          <article-title>Improving vision language pretraining with self-distillation</article-title>
          ,
          <source>in: European Conf. Comput. Vis. (ECCV)</source>
          ,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>031</fpage>
          -72664-
          <issue>4</issue>
          _
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mintun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rolland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gustafson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whitehead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , W.-Y. Lo,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Segment anything</article-title>
          ,
          <source>in: IEEE Inter. Conf. Comput. Vis. (ICCV)</source>
          ,
          <year>2023</year>
          . URL: https://openaccess.thecvf.com/content/ICCV2023/papers/Kirillov_Segment_Anything_ ICCV_
          <year>2023</year>
          <article-title>_paper</article-title>
          .pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Selan</surname>
          </string-name>
          ,
          <article-title>High-quality real-time rendering with multi-dimensional lookup tables, in: GPU Gems 2: Programming Techniques for High-Performance Graphics</article-title>
          and
          <string-name>
            <surname>General-Purpose</surname>
            <given-names>Computation</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Addison-Wesley Professional</surname>
          </string-name>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>CompuServe</given-names>
            <surname>Incorporated</surname>
          </string-name>
          ,
          <article-title>Graphics interchange format (gif) specification, version 89a</article-title>
          ,
          <string-name>
            <surname>Online</surname>
          </string-name>
          ,
          <year>1989</year>
          . URL: https://www.w3.org/Graphics/GIF/spec-gif89a.
          <fpage>txt</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>