<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learned Lossy Image Compression for Volumetric Medical Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Kotera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Wödlinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Keglevic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CVL, TU Wien</institution>
          ,
          <addr-line>Favoritenstraße 9/11, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information Theory</institution>
          ,
          <addr-line>CAS, Pod Vodárenskou věží 4, 182 00 Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work addresses the problem of lossy compression of volumetric images consisting of individual slices such as those produced by CT scans and MRI machines in medical imaging. We propose an extension of a single-image lossy compression method with an autoregressive context module to a sequential encoding of the volumetric slices. In particular, we remove the intra-slice autoregressive relation and instead condition the entropy model of the latent on the previous slice in the sequence. This modification alleviates the typical disadvantages of autoregressive contexts and leads to a significant increase in performance compared to encoding each slice independently. We test the proposed method on a dataset of diverse CT scan images in a setting with an emphasis on high-fidelity reconstruction required in medical imaging and show that it compares favorably against several established state-of-the-art codecs in both performance and runtime.</p>
      </abstract>
      <kwd-group>
        <kwd>Learned Image Compression</kwd>
        <kwd>Medical Image Data</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Medical imaging is a set of techniques and processes that produce images of the interior of the body for the purpose of clinical analysis, medical intervention, or visual representation of the function of the internal organs. Examples of common types of imaging systems are x-rays, computed tomography (CT) scans, magnetic resonance imaging (MRI), or ultrasound (US). Medical imaging has become a staple tool not only for medical diagnosis and treatment but also a crucial component of research, as it allows researchers and physicians to establish a knowledge base of normal anatomy and physiology to make it possible to identify abnormalities and study the effects of medical intervention. For these reasons, the amount of image data produced in healthcare and medical research is huge and increasing [<xref ref-type="bibr" rid="ref1">1</xref>], as are the requirements for efficient transmission and especially storage.
      </p>
      <p>Figure 1: Illustrative example of a single uncompressed slice from the CT scan test set [<xref ref-type="bibr" rid="ref6">6</xref>] used for performance evaluation.</p>
      <p>
        Image compression methods are designed for exactly that – to enable more efficient coding of image data with little or no loss in visual quality. The first successful image compression techniques were developed in the early 1990s and some of those are still widely used today, such as for example the well-known JPEG method [<xref ref-type="bibr" rid="ref2">2</xref>]. In recent years the development of novel compression methods for image and video has accelerated, in line with the growing amount of streamed image and video data. Modern image compression codecs such as BPG [<xref ref-type="bibr" rid="ref3">3</xref>], AVIF [<xref ref-type="bibr" rid="ref4">4</xref>], or WebP [<xref ref-type="bibr" rid="ref5">5</xref>] typically appear as by-products of video codec development – the intra-frame component is extracted from the video codec and used as a standalone image codec.
      </p>
      <p>
        For mainstream everyday use in applications such as image or video streaming, video calls, or online gaming, the goal is for the reconstructed image to appear “natural and artefact-free” at first glance while achieving compression ratios high enough to make these applications feasible. General-purpose video codecs are therefore developed for and tested mainly on natural sequences, screen content, or synthetic scenes (e.g. [<xref ref-type="bibr" rid="ref7">7</xref>]) and are typically benchmarked in the perceptually lossy range of &lt; 40dB reconstruction PSNR (e.g. [<xref ref-type="bibr" rid="ref8">8</xref>]); similarly for image codecs. In the case of medical imaging, the fundamental requirement is that the reconstruction error must not alter the subsequent clinical analysis: the reconstructed image must remain true to the original up to imperceptible “noise” void of any structure. We argue that using an established and straightforward objective metric such as PSNR for measuring the reconstruction error is the right approach here to ensure that the reconstructed image is truly nearly identical to the original when the reconstruction error is near zero. In our subjective tests (on an HDR display) we found that we were not able to distinguish between the original and reconstructed images above 55dB PSNR, so that is approximately our target quality range. On the other hand, below 50dB we could identify loss of subtle structure in some images. Having the images analyzed by medical experts is unfortunately too resource-intensive and beyond the scope of this work.
      </p>
      <p>
        Another solution common in practice is using only lossless compression, but such methods never achieve anywhere near as high compression ratios (by an order of magnitude) as lossy methods – for example, the study [<xref ref-type="bibr" rid="ref9">9</xref>] finds that on medical data the traditional lossless codecs hardly achieve compression ratios over 4:1, while on the test set the proposed method has an average ratio over 40:1 at PSNR &gt; 55dB. Proper research into lossy methods is therefore surely justified.
      </p>
      <p>
        The traditional approach to image compression is hand-designed codecs implemented as hard-coded algorithms based on human experience and intuition (see Sec. 2). As with many problems in image processing and computer vision in the last decade, avenues are being explored on how to learn optimal codecs from data. Modern research in learned image compression started with the work of Toderici et al. [<xref ref-type="bibr" rid="ref10">10</xref>] as the first fully learned method applicable to large images and outperforming some established traditional codecs. A surge of interest in learned image compression came after the seminal works of Ballé et al. [<xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>] and Minnen et al. [<xref ref-type="bibr" rid="ref13">13</xref>]. These works laid the groundwork for further research and it can be argued that most state-of-the-art (SOTA) methods nowadays are extensions of these methods.
      </p>
      <p>
        The core structure of a learned method typically consists of an autoencoder which transforms the input and produces a latent representation of the image which will constitute the bitstream. This representation is then quantized so that it can be passed to an entropy coder which losslessly converts the discrete representation to an actual bitstream. The third integral component is an entropy model of the latent, i.e. a probability distribution model of the symbols (after quantization) of the latent representation, as this is required by the entropy coder. This pipeline can be trained end-to-end in an unsupervised manner and the minimized loss is the sum of two terms: the distortion of the image reconstruction and the entropy (i.e. expected bitrate) of the latent. The entropy coder is used off-the-shelf and is not subject to training. One of the great advantages of learned image compression is that the training is relatively simple and cheap, which makes it possible to adapt a method to a particular modality, such as medical images, whereas for conventional hand-designed codecs such adaptation is not feasible.
      </p>
      <p>
        The proposed method extends [<xref ref-type="bibr" rid="ref13">13</xref>] to volumetric medical data consisting of individual slices, i.e. a sequence of 2D images. This type of data is acquired for example by a CT scan (see Fig. 1 for an example) or an MRI. The individual slices are encoded in order. The transform from image data to the latent representation is done for each slice independently, but in the entropy estimation step the probability model of each slice (except the first) is conditioned on the previous slice, which enables a more accurate estimation of the latent distribution since neighboring slices typically have high mutual information. This allows for higher compression ratios with no loss in reconstruction quality. On the decoding side, the images are decoded in the same order, so that the previous slice is again available when decoding the next. Note that the proposed method works with already digitized uncompressed images in a normalized intensity range (typically 8bit–16bit); it does not in any way enter the process of image generation by the above mentioned imaging techniques.
      </p>
      <p>
        We show in the experimental section that this relatively simple addition considerably outperforms the baseline approach in which all slices are processed completely independently by a single-image compression method. Additionally, compared to processing the full volume at once, our approach requires a fraction of the time and memory (in practice, it would be necessary to split the volume into small chunks and compress those separately anyway). We tested the method on a dataset consisting of CT scans of various human body parts and the proposed approach is competitive even compared to established standards such as JPEG, BPG, AVIF, and even VVC-intra.
      </p>
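      <p>The rate-distortion structure sketched above – a transform, scalar quantization, and an entropy model that prices the quantized symbols in bits, balanced against reconstruction distortion – can be illustrated numerically. The following is a toy sketch only (a hand-made 1-D “latent” and a Laplace entropy model; none of the names or numbers come from the paper's actual networks):</p>
      <preformat><![CDATA[
```python
import math

# Toy sketch of the learned-compression objective: bitrate of the quantized
# latent under a Laplace entropy model, plus a weighted distortion term.

def laplace_cdf(x, mu, sigma):
    z = (x - mu) / sigma
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)

def rate_bits(symbols, mu, sigma):
    # -log2 of the probability mass of each quantized symbol; this is the
    # (theoretical) number of bits an entropy coder needs to store them.
    total = 0.0
    for k in symbols:
        p = laplace_cdf(k + 0.5, mu, sigma) - laplace_cdf(k - 0.5, mu, sigma)
        total += -math.log2(p)
    return total

y = [0.2, 1.7, -0.9, 0.4]               # toy latent produced by a transform
y_hat = [float(round(v)) for v in y]    # scalar integer quantization
rate = rate_bits(y_hat, mu=0.0, sigma=1.0)
dist = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)  # MSE distortion
lam = 0.1                               # rate-distortion tradeoff weight
loss = rate + lam * dist
```
]]></preformat>
      <p>Increasing the distortion weight pushes a trained model toward higher-fidelity reconstructions at a larger bitrate, which is exactly the tradeoff the compared codecs move along.</p>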
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        For a long time, lossy image and video compression was a problem solved exclusively in the traditional way by hand-designed methods. Some of these methods, such as for example the H.264 [<xref ref-type="bibr" rid="ref14">14</xref>] or H.265 [<xref ref-type="bibr" rid="ref15">15</xref>] video codecs or JPEG image compression [<xref ref-type="bibr" rid="ref2">2</xref>], are now in widespread use in many areas of industry, research, and everyday life. Relatively recently, the first learned codecs appeared that were able to challenge some of the traditional methods. Arguably the biggest rise of interest started after the works of Ballé et al. [<xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>] and later Minnen et al. [<xref ref-type="bibr" rid="ref13">13</xref>], which laid the foundation for learned image compression. These works formulated the main rate-distortion objective in a learnable way, presented a model containing the three fundamental components now present in the vast majority of learned codecs – the autoencoder for the image transform, and the hyper-prior and the context module for entropy estimation – and provided the solution for dealing with the discrete quantization in training. Subsequent methods increased the performance for example by richer/larger model architectures (e.g. using attention-like modules) [<xref ref-type="bibr" rid="ref16">16</xref>], improved context modules [<xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>], richer entropy models (e.g. Gaussian mixtures) [<xref ref-type="bibr" rid="ref16">16</xref>], or different simulation of quantization [<xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>].
      </p>
      <p>
        Recently, a promising research direction is coercing the reconstruction to better satisfy the expectations of the human visual system even at the expense of objective (e.g. PSNR) quality. This can be achieved for example by augmenting the loss with a term that better models human perception (such as LPIPS [<xref ref-type="bibr" rid="ref22">22</xref>]) [<xref ref-type="bibr" rid="ref19">19</xref>], or by training the decoder in an adversarial manner as in GANs [<xref ref-type="bibr" rid="ref23">23, 24</xref>]. Such approaches can achieve significant bitrate savings but unfortunately are not suitable for medical data, where the reconstructed image must be objectively undistorted and not just look natural.
      </p>
      <p>
        Literature on learned compression for medical images is relatively scarce; this area is still dominated by more traditional approaches such as compression in the wavelet domain [25]. Probably the closest match for the proposed method is the lossless compression of 3D volumes by Chen et al. [26]. In our work, however, we focus on lossy compression. Other works propose partitioning the image into relevant (for the diagnosis) and less relevant regions and applying different compression ratios there [27]. Learned lossy compression for 2D medical images is investigated for example in [28].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        The proposed approach is based on the single-image compression method by Minnen et al. [<xref ref-type="bibr" rid="ref13">13</xref>], which we extend for multi-slice volumetric images. The method [<xref ref-type="bibr" rid="ref13">13</xref>] consists of three main components:
      </p>
      <p>• An encoder/decoder which performs the transform between the input image space and the latent representation (commonly called the “latent”).</p>
      <p>• A hyper-encoder/decoder (called hyper-prior) which analyzes the latent and stores a small piece of side information into the bitstream that is used later to estimate the parameters of the probability distribution of the latent (the entropy model).</p>
      <p>• A context module that processes the image latent in an autoregressive fashion (i.e. causally) and is also a part of the entropy model parameter estimation.</p>
      <p>
        The encoding and decoding branches of the pipeline are connected only via the bitstream which stores the latent and hyper-latent representations of the image. To this end the latents must be quantized, for which scalar integer rounding is used, because the entropy coder that converts the values into their corresponding bit codes can only operate on discrete data (continuous values cannot be stored in the bitstream).
      </p>
      <p>
        The advantage of the context module is that the entropy parameters can be very accurate and image-specific; the disadvantage is that the autoregressive processing does not play well with the parallel processing common in deep learning. For each new pixel to be decoded, the entropy parameters must first be estimated, the pixel decoded, and only then can the decoding move to the next pixel. As a result, a usually parallelized operation such as convolution cannot be computed for the whole image at once but pixel by pixel in alternation with the entropy coder. Another disadvantage is that the context prevents using the so-called mean-subtracted quantization, which will be specified in the next section. We get rid of these drawbacks in the proposed method by replacing the autoregressive context from [<xref ref-type="bibr" rid="ref13">13</xref>] with an analogous module that runs on the previous slice in the sequence.
      </p>
      <p>
        Model details. The input to our method is a sequence of 2D slices x^0, …, x^(T−1) (superscripts denote slices, subscripts pixel indices) which are processed in order. The transforms to and from the latent representation, denoted y, are done for each slice independently, but the entropy model, i.e. the probability distribution p̂(ŷ^t) of the quantized latent ŷ^t (the hat denotes the quantization operation), is conditioned on the latent of the previous slice, ŷ^(t−1). This helps decrease the entropy of ŷ^t and therefore the necessary bitrate while avoiding the disadvantages of an autoregressive context model. It is done as follows: instead of running the context model on the currently encoded slice in an autoregressive fashion, we run it on the (quantized) latent ŷ^(t−1) of the previous slice. During decoding, the slices are processed in the same order, so ŷ^(t−1) has already been decoded in full and is available when ŷ^t is being decoded, and the entropy model can again use information from the previous slice. This approach does not require autoregressive processing but can instead be done in parallel for the whole slice without waiting for each new pixel to be decoded. In other words, the context module is autoregressive in the slice sequence, but that does not restrict any 2D operations contained within one slice such as convolutions – instead of decoding individual pixels we can decode whole slices in parallel.
      </p>
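      <p>As a minimal structural sketch of this slice-sequential processing: the loop below mirrors the encode/decode ordering (entropy parameters for slice t depend only on the already-processed slice t−1), but every module body is a trivial stand-in invented for illustration, not the paper's learned networks, and the first slice is handled with a zero context instead of the auxiliary single-slice model described later:</p>
      <preformat><![CDATA[
```python
# Structural sketch (hypothetical stand-ins for the learned modules).

def encoder(x):            # stand-in for the per-slice analysis transform
    return [v * 0.5 for v in x]

def decoder(y):            # stand-in for the synthesis transform
    return [v * 2.0 for v in y]

def entropy_params(prev_latent):
    # Stand-in for context module + hyper-prior: predicts a per-element mean
    # from the PREVIOUS slice's quantized latent (no intra-slice autoregression).
    return list(prev_latent)

def quantize(y, mu):
    # Mean-subtracted integer rounding.
    return [round(v - m) + m for v, m in zip(y, mu)]

def encode_volume(slices):
    latents = []
    prev = [0.0] * len(slices[0])      # first slice: no previous context
    for x in slices:
        y = encoder(x)
        mu = entropy_params(prev)      # depends only on the previous slice
        y_hat = quantize(y, mu)
        latents.append(y_hat)          # conceptually: entropy-coded using mu
        prev = y_hat
    return latents

def decode_volume(latents):
    # Decoding mirrors the order: parameters for slice t are recomputed from
    # the already-decoded slice t-1, so each slice is processed in one shot.
    return [decoder(y_hat) for y_hat in latents]

vol = [[0.0, 2.0, 4.0], [0.2, 2.1, 3.9]]
rec = decode_volume(encode_volume(vol))
```
]]></preformat>
      <p>The key property is that nothing inside a slice is serialized pixel by pixel: the per-slice calls can run as ordinary batched 2D operations, with only the slice order being sequential.</p>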
      <p>Figure 2: Scheme of the proposed pipeline for one slice – encoder/decoder, quantizers, entropy encoder and decoder, hyper-encoder and hyper-decoder, context module, and entropy module.</p>
      <p>
        We model the distribution p̂(ŷ) of the quantized latent ŷ by a per-dimension (i.e. spatial pixel and channel) independent Laplace distribution with mean and scale parameters (μ_i, σ_i). These two parameters are estimated adaptively for each image and each pixel i (incl. channels) of the latent by the hyper-prior and the context module. For quantization of the latent we use integer rounding with mean-subtraction, meaning that the value is first offset by the estimated mean of its distribution before being rounded (image index omitted):
      </p>
      <p>ŷ_i = ⌊y_i − μ_i⌉ + μ_i, (1)</p>
      <p>
        where ⌊·⌉ is integer rounding. This improves performance because quantization then doesn't change the mean of the distribution, but it requires that the entropy parameters of the latent are estimated before the latent is quantized. In particular, both of the entropy estimation modules (hyper-prior and context) must operate on the non-quantized values y, otherwise an implicit relation would arise. This is difficult to achieve in a single-image autoregressive context model – for example the quantization in [<xref ref-type="bibr" rid="ref13">13</xref>] does not use mean-subtraction – but since in the proposed method the context module uses the previous slice, using mean-subtraction is possible.
      </p>
      <p>
        The full procedure of processing a slice x^t is illustrated in Fig. 2. The image is passed through the encoder, producing the latent y^t. The latent is concatenated with the latent of the previous slice, y^(t−1), and passed through the hyper-encoder h_e, producing the hyper-latent z^t = h_e([y^(t−1), y^t]). This hyper-latent is quantized so that it can be stored in the bitstream. The parameters of the entropy model of the quantized latent ŷ^t are estimated as follows. The context module processes the previous slice's latent ŷ^(t−1) and the hyper-decoder h_d processes the hyper-latent ẑ^t. These two outputs are concatenated and passed through the entropy module to produce the final entropy parameters (μ_i, σ_i) for each pixel i of the latent. With these parameters available, the latent can be quantized and stored in the bitstream, and the encoding proceeds to the next slice.
      </p>
      <p>
        During decoding, the operations responsible for estimating the entropy model p̂(ŷ^t) have to be executed again because the entropy model is required by the coder to decode ŷ^t from the bitstream. The hyper-latent ẑ^t is decoded first, and since the latent of the previous slice ŷ^(t−1) is already decoded and available, the estimation of the entropy parameters (μ, σ) proceeds as during encoding. Having those, ŷ^t can be decoded and passed through the decoder to finally produce the reconstructed image x̂^t. The decoding then proceeds to the next slice.
      </p>
      <p>
        What remains to specify is the entropy model p̂(ẑ) of the hyper-latent ẑ, since that is also processed by the entropy coder and stored in the bitstream. We model it by a per-channel Laplace distribution, meaning that each channel of z has its own mean and scale parameters (μ, σ), but those are spatially constant so that the model is not tied to a fixed image resolution. These parameters are subject to training but fixed once the model has been trained (i.e. unlike p̂(ŷ) it is not image-adaptive). For quantization of z we again use mean-subtracted rounding in a similar fashion as in Eq. (1).
      </p>
      <p>Details of the model architecture are concisely summarized in Tab. 1.</p>
      <p>Table 1: Model architecture details. conv is a Conv2D layer with kernel size k, stride s and output channels c. transpose is a similarly specified ConvTranspose2D. GDN and IGDN are the generalized divisive normalization layer [<xref ref-type="bibr" rid="ref11">11</xref>] and its inverse, respectively. PReLU is the parametric ReLU [30].
Encoder: conv k5 s2 c192 → GDN → conv k5 s2 c192 → GDN → conv k5 s2 c192 → GDN → conv k5 s2 c192.
Decoder: transpose k5 s2 c192 → IGDN → transpose k5 s2 c192 → IGDN → transpose k5 s2 c192 → IGDN → transpose k5 s2 c1.
Hyper-encoder: conv k3 s1 c192 → PReLU → conv k5 s2 c192 → PReLU → conv k5 s2 c192.
Hyper-decoder: conv k5 s2 c192 → PReLU → conv k5 s2 c288 → PReLU → conv k3 s1 c384.
Context: conv k5 s1 c384.
Entropy module: conv k1 s1 c768 → PReLU → conv k1 s1 c576 → PReLU → conv k1 s1 c384.</p>
      <p>
        Training details. In training we optimize the rate-distortion loss L (image indices omitted)
      </p>
      <p>L = E_(x∼p_x)[−log2 p̂(ŷ)] + E_(x∼p_x)[−log2 p̂(ẑ)] + λ · 255² · E_(x∼p_x)[‖x − x̂‖₂²], (2)</p>
      <p>
        where λ controls the rate-distortion tradeoff (it determines the approximate target bitrate) and p_x, the distribution of uncompressed images, is evaluated by batch averaging. The first two terms on the right-hand side are approximate (theoretical) bitrates required by the entropy coder to encode the latents. These are used in training as an estimate of the actual bitrates because the non-differentiable entropy coders are removed from training.
      </p>
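      <p>The behaviour of the mean-subtracted rounding can be checked with a small scalar example (plain Python invented for illustration, not the model code): the quantized values land on a grid centred on the estimated mean μ rather than on the integers, so a value exactly at the mean is preserved, while the quantization error stays at most 1/2 as with plain rounding:</p>
      <preformat><![CDATA[
```python
# Numeric check of mean-subtracted rounding: y_hat = round(y - mu) + mu.

def quantize_ms(y, mu):
    return round(y - mu) + mu

mu = 0.3                        # entropy-model mean for this element
ys = [0.1, 0.3, 0.55, 1.2]
q_ms = [quantize_ms(y, mu) for y in ys]     # values on the grid mu + integers
q_plain = [float(round(y)) for y in ys]     # plain integer rounding

# Maximum quantization error of the mean-subtracted variant.
err = max(abs(y, ) if False else abs(y - q) for y, q in zip(ys, q_ms))
```
]]></preformat>
      <p>Here quantize_ms(0.3, 0.3) returns 0.3 exactly, whereas plain rounding would move it to 0; in both schemes the error is bounded by half the quantization step.</p>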
      <p>
        Our description of p̂(ŷ) and p̂(ẑ) so far was somewhat simplified. The Laplace parametric density is used only as a model to conveniently parametrize the discrete distribution over the symbols after quantization. In the actual evaluation, however, we have to account for the whole interval corresponding to each discrete value because of quantization. This is done by integrating the parametric density over the corresponding interval, for example
      </p>
      <p>p̂(ŷ_i) = ∫ from ŷ_i − 1/2 to ŷ_i + 1/2 of p̃(y) dy, (3)</p>
      <p>
        where p̃ is the continuous Laplace density parametrized by the (μ_i, σ_i) corresponding to p̂(ŷ_i), the discrete distribution of ŷ_i. In practice, this is done by using the cumulative distribution function of the Laplace density.
      </p>
      <p>
        In each training iteration we randomly sample a small subset of s consecutive slices from each image in the batch and process those through the model as a small volume. For the first slice x^0 of this subset we calculate the latent y^0 using an auxiliary single-image model which shares the same encoder with the multi-slice model. For x^1, …, x^(s−1) we proceed as described above and these slices are used to evaluate the loss in Eq. (2). The first slice x^0 is excluded from the optimization of the multi-slice model but is used to train the auxiliary single-slice model used for compression of the first slice in each volumetric series. This model has the same encoder/decoder as the multi-slice model and the same architecture (not weights) of the hyper-prior but does not include the context and entropy module – the hyper-decoder directly predicts the (μ, σ) parameters of the latent entropy model. In validation and testing, we use this auxiliary model to compress the first slice of the volume and then proceed sequentially with the multi-slice model.
      </p>
      <p>
        The quantization operation must be approximated during training because it has zero gradient almost everywhere. For both the latents y and the hyper-latents z we use straight-through quantization [29], which performs integer rounding in the forward pass but acts as identity in the backward pass. For evaluation of the bitrate in the entropy models, however, we simulate quantization by adding uniform noise from the (−1/2, 1/2) range. This way the decoder and the hyper-decoder see values produced by integer rounding (with the mean-subtraction as in Eq. (1)), but the entropy estimation is calculated using the uniform-noise simulation, which reportedly leads to better performance [<xref ref-type="bibr" rid="ref20">20</xref>].
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        Dataset. We trained and tested the method on the Pediatric-CT-SEG dataset of CT-scan images of various organs downloaded from the Cancer Imaging Archive [<xref ref-type="bibr" rid="ref6">6</xref>] (patient and acquisition parameters are specified therein). We chose this dataset for its diverse content. The dataset consists of 359 volumetric images, each with a different number of slices ranging from 41 to 1104. We randomly selected 10 of the volumetric images for testing (2184 slices in total) and the rest for training. The 2D slices are 12bit grayscale images with a resolution of 512×512, originally stored uncompressed at 16 bits per pixel (bpp). An example slice from the dataset is shown in Fig. 1.
      </p>
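      <p>The compression ratios quoted in this section follow directly from the storage format: the slices are stored at 16 bpp, and the tested operating range of roughly 0.05–0.65 bpp therefore corresponds to ratios of about 320:1 down to about 25:1. A small sanity-check calculation (illustrative only):</p>
      <preformat><![CDATA[
```python
# Arithmetic behind the quoted compression ratios.

ORIGINAL_BPP = 16.0            # uncompressed storage of the 12-bit slices

def compression_ratio(bpp):
    return ORIGINAL_BPP / bpp

lo = compression_ratio(0.65)   # about 24.6, i.e. roughly 25:1
hi = compression_ratio(0.05)   # 320, i.e. 320:1
```
]]></preformat>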
      <p>An example slice from the dataset is in Fig. 1.
60.0
Training We trained the model on random spatial 52.5
crops of size 256× 256 and tested it on full-resolution ]dB
images. For training, we randomly chose  = 3 consecu- [RN50.0
tainvde esxlipcelositaisnga tghoeosdeqcoumenptriaolmpirsoecbeesstwinege. nWteratirnaiinngedspweiethd PS47.5 PBraospeolisneed
batch size 8 using the Adam optimizer [31] with an initial 45.0 VVVVCC-intra
ldeeacrrneiansgedrattheeolefa1rni−ng4rfaotre 1toM1it e−ra5tifoonrsaanfotethrewrh2i0c0hkwite- 42.5 JAABPVVPEGI1GF
erations. We trained a new model for 6 values of  in the 40.00.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
range from 0.032 to 3.2, which on the test set results in bits per pixel
0.05 to 0.65 bits per pixel, thus achieving a compression Figure 3: Rate-distortion performance of the proposed and
ratio of 25:1 to 320:1 with respect to the original images. benchmark methods on the test set of CT-scan images.
Benchmark methods We compare the performance
of the proposed method with a baseline learned single- AV1 video codec (essentially AV1-intra), one of today’s
image compression model and a number of established top codecs from those that are readily available e.g. in
traditional image compression methods. The single- browsers. In the comparison we used the libaom-av1
image baseline is a learned model with the same archi- encoder via ffmpeg configured to 12bit internal
processtecture as the auxiliary model we use to compress the ing, each slice in the series is compressed individually.
ifrst slice and was trained on the same train set. Compar- AV1 [32] is a video codec in terms of quality
approxiison with this method shows performance gain from the mately on the level of or slightly outperforming HEVC
proposed sequential processing and the context module. but unlike HEVC its use is royalty-free, it is therefore
The traditional methods are a broad selection ranging arguably the best video codec readily available today
from well-known and established codecs commonly used (with production-level encoders and decoders available).
in practice to the state-of-the-art prototype. Such com- In our comparison, we used ffmpeg/libaom-av1 in
parison therefore well positions the proposed method 12bit mode and compressed each volumetric image as a
in the landscape of existing methods and gives insight video sequence consisting of the individual slices. VVC
into its properties in potential use in practice. Below we [33] (H.266) is the best existing video codec nowadays
briefly describe each of the methods used in the com- but its development is still ongoing and the available
enparison and optionally its configuration, afterwards we coders/decoders are on the prototype level and for most
provide commentary on the results summarized in Fig. 3 practical use cases prohibitively slow. Its adoption in
and Tab. 2. practice, medical or otherwise, is also hindered by the</p>
      <p>
        Baseline is a learned single-image compression model fact that its use is not royalty free. We used the VTM
with the same architecture as the proposed method but 18 reference implementation in 12bit mode and again
without the context and entropy module (the hyper- compressed each volumetric image as a video sequence
decoder directly predicts the entropy parameters). It consisting of the individual slices. VVC-intra [33] is the
is trained on the same train set as the proposed and uses intra mode of VVC. For single-image compression, it is
the same training schedule. JPEG [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a well-known the best available codec nowadays but currently inherits
widely used compression method developed in the 90s. the disadvantages listed above for VVC. We used it in the
Although used for medical data and having the advan- same configuration as VVC video but compressed each
tage of being very fast both in encode and decode, it is slice individually.
arguably not a very suitable method for such use as its
performance is relatively low by today’s standards. We Results The rate-distortion curves of the benchmarked
use the implementation in pillow. BPG [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is essen- methods on the CT-scan test set are in shown in Fig. 3,
tially a single-image wrapper of the intra-frame compres- their ranking and quantitative comparison with respect
sion of the HEVC (also known as H.265) video codec. to VVC-intra is in Tab. 2 and finally, Tab. 3 shows
apAlthough not widespread, it is one of the top methods proximate relative runtimes required to process the test
currently available for everyday use. We used the jctvc set. In the testing we focused on high-PSNR range since
encoder via the public BPG library configured to 12bit in- we envision the proposed method being used primarily
ternal bitdepth. AVIF [32] is a single-image format of the in the medical domain, where sliced volumetric images
therefore no match for VVC but we will see that in that
are common. Let us provide some commentary on the comparison it wins on runtime.
results. A clear and quantitative ranking of the methods is
      </p>
      <p>The baseline learned method performs on the level of AVIF – the curves almost overlap. Although AVIF is undoubtedly a better codec in a general setting, the learned baseline exploits the advantage of domain specificity – it has been trained on similar CT data. BPG generally performs well on natural images, where the target PSNR is usually lower, but to achieve imperceptible distortion in medical data we observed that the reconstruction PSNR should be above 55dB (for typical images with sufficient structure). We suspect there is some issue with the configuration of the encoder at high-bitdepth processing, because BPG obviously struggles with achieving high PSNRs. It is no surprise that JPEG cannot compete with the latest methods. VVC-intra does very well and outperforms AVIF by a large margin in the whole range. With AV1 we experienced similar problems as with BPG – it apparently “saturates” at higher bitrates and struggles to achieve high PSNR, which is possibly again some issue with the high-bitdepth configuration of the encoder (although we used the same encoder as for AVIF and in that case it worked fine). But from the comparison with AVIF in low to mid bitrates we can see that the sequential “video” processing of the image volume is clearly beneficial, with a noticeable performance gain. This conclusion is further strengthened by the results of the VVC (video) codec, which on performance alone is the clear winner of the whole comparison, outperforming all other methods (including the proposed) by a margin in the whole range.</p>
      <p>The proposed method is significantly better than the baseline (compare the green and orange curves in Fig. 3), on average achieving almost 30% rate savings (for the same quality) and a 1.8dB quality increase (for the same rate). It also outperforms all image codecs such as AVIF, BPG, and especially VVC-intra, which is no small feat. This is due solely to the proposed sequential context, because the baseline alone is significantly below VVC-intra. It is, however, still a relatively small and simple model.</p>
      <p>A clear and quantitative ranking of the methods is provided in Tab. 2, which shows the average bitrate increase/savings and PSNR quality loss/gain evaluated by BD-Rate and BD-PSNR [34], respectively. We positioned VVC-intra as the reference SOTA image codec and compared all the others to it, as they perform on the test set. The table shows the average bitrate savings and performance gain in the middle and right columns, respectively. Only the proposed method and VVC video achieve an improvement (BD-Rate is negative and BD-PSNR is positive).</p>
      <p>Finally, in Tab. 3 we show the relative runtimes required for processing (encode and decode) the whole test set (10 volumetric images consisting of 2184 slices) with respect to the proposed method (i.e. a value &lt; 1 means the method is faster than ours, &gt; 1 means it is slower). These runtimes are listed for bpp = 0.3, approximately the middle of the tested range, since the traditional methods are slower at higher bitrates (the proposed method has constant speed across the range). Here the ranking is quite different than in the performance comparison. JPEG and of course the baseline are the only methods faster than the proposed; all the others are slower, some of them quite significantly, especially the well-performing VVC, which is clearly prohibitively slow. We argue that the video codecs are simply not fast enough for practical use. In fairness, the proposed method and the baseline run on a GPU (though still processing each slice sequentially) while the traditional methods are CPU-only, without any external parallelization. On the other hand, our implementation is intended only as a proof of concept and we did not invest much effort into runtime optimization. For example, in both encode and decode the encoder and decoder process each slice independently; in testing we process them sequentially for simplicity, while it is possible to “batch” them and process several in parallel (as many as the GPU memory permits), which would reduce the runtime.</p>
      <p>Contrary to usual custom, we do not provide examples and a qualitative comparison of image reconstructions because, due to the high reconstruction quality and similar performance of the benchmarked methods, we were not able to come up with example images that demonstrate any noticeable difference – on screen all the results look identical.</p>
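<p>For context, BD-Rate figures of the kind reported in Tab. 2 follow Bjøntegaard's procedure [34]: fit log-rate as a cubic polynomial of PSNR for each codec and compare the average rates over the overlapping quality interval. A compact sketch (illustrative only, not the evaluation code used here):</p>

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec vs. the reference.
    Fits log-rate as a cubic polynomial of PSNR and integrates the
    difference over the overlapping PSNR interval (Bjontegaard metric)."""
    log_ref = np.log(np.asarray(rate_ref, dtype=float))
    log_test = np.log(np.asarray(rate_test, dtype=float))
    fit_ref = np.polyfit(psnr_ref, log_ref, 3)
    fit_test = np.polyfit(psnr_test, log_test, 3)
    # overlapping PSNR range of the two curves
    lo = max(np.min(psnr_ref), np.min(psnr_test))
    hi = min(np.max(psnr_ref), np.max(psnr_test))
    int_ref = np.polyint(fit_ref)
    int_test = np.polyint(fit_test)
    avg_ref = (np.polyval(int_ref, hi) - np.polyval(int_ref, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)
    return (np.exp(avg_test - avg_ref) - 1.0) * 100.0
```

<p>A negative value means the test codec needs less rate for the same quality; a codec whose rate is uniformly 10% lower than the reference at every quality level, for instance, yields a BD-Rate of −10%.</p>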
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <p>We presented an extension of a single-image learned compression method to volumetric multi-slice images, with an emphasis on the medical domain, where such images are quite common. Although the modification is relatively simple and straightforward, it provides several benefits – namely, using a context module without introducing any problems with parallel processing in the decode, and using mean-subtracted quantization. Both of these improve performance without compromising the runtime. This we verified in a comparison with a number of established compression methods. The comparison shows:</p>
      <p>• A clear performance gain with respect to the baseline due to the proposed sequential context.</p>
      <p>• Good performance in absolute numbers with respect to the established codecs.</p>
      <p>• Very competitive runtimes (if GPUs are allowed).</p>
      <p>The testing was carried out with an emphasis on low-error reconstruction, and even at PSNR = 55dB (in most cases indistinguishable from the original) the proposed method achieves an average compression ratio of 40:1 with respect to the uncompressed original. We consider these results a solid proof of concept for compression of volumetric medical data.</p>
      <p>Nevertheless, there are a number of things which can be improved or investigated further. For example, the used baseline model is far from SOTA, so higher absolute performance can be gained by adopting one of the SOTA single-image learned methods as a backbone and extending it with the proposed context model. In this work, however, we focused more on investigating the relative gains from the sequential context rather than absolute performance. Next, in decode our method currently does not permit random access (as in “show me slice 42”): the whole volume needs to be decoded sequentially from the beginning. But this can be remedied by introducing intra-frames compressed by the single-image auxiliary method we use for the first slice. If we use a GOP size of 8 (meaning at most 8 slices need to be decoded for any chosen slice), we can estimate that the performance figures in Tab. 2 would drop approximately −11.4% → −7.5% in rate and +0.64dB → +0.42dB in PSNR, which is still a solid improvement over VVC-intra with practically usable runtimes in both encode and decode.</p>
      <p>By looking at the results of VVC we see that further gains are undoubtedly possible, and we hypothesize that those can be achieved, for example, by a stronger context module (ours is a rather simple stack of convolutions, not in any way input-adaptive) and possibly by introducing P-frames and B-frames as in video encoding. It is our hope that this work will motivate further research into such possibilities.</p>
      <p>Acknowledgments. This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 965502.</p>
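<p>The random-access scheme estimated above can be illustrated as follows (a hypothetical layout, assuming intra slices are placed at every multiple of the GOP size; the paper only estimates this variant):</p>

```python
def slices_to_decode(k: int, gop: int = 8) -> list:
    """Slice indices that must be decoded to show slice k when every
    gop-th slice is an intra slice and all other slices are predicted
    from their predecessor (0-based indexing)."""
    first_intra = (k // gop) * gop  # nearest intra slice at or before k
    return list(range(first_intra, k + 1))
```

<p>For “show me slice 42” with a GOP size of 8, only slices 40–42 need to be decoded, and no request ever touches more than 8 slices.</p>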
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Radiation risk from medical imaging, https://www.health.harvard.edu/cancer/radiation-risk-from-medical-imaging, Sep 2021.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>The JPEG still picture compression standard</article-title>
          ,
          <source>IEEE Transactions on Consumer Electronics</source>
          <volume>38</volume>
          (
          <year>1992</year>
          )
          <article-title>xviii-xxxiv.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] F. Bellard, BPG Image format, https://bellard.org/bpg, 2018. Accessed: 2021-09-24.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] AVIF image format, https://aomediacodec.github.io/av1-avif, 2022. Accessed: 2022-12.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Google, WebP Image format, https://developers.google.com/speed/webp, 2018. Accessed: 2021-09-24.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Pediatric-CT-SEG, Cancer Imaging Archive, https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=89096588, Aug 2022.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] AOM common test conditions v2.0, https://aomedia.org/docs/CWG-B075o_AV2_CTC_v2.pdf, Aug 2021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] F. Mentzer, G. Toderici, D. Minnen, S.-J. Hwang, S. Caelles, M. Lucic, E. Agustsson, VCT: A video compression transformer, 2022. URL: https://arxiv.org/abs/2206.07307.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kivijärvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ojala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaukoranta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nyúl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nevalainen</surname>
          </string-name>
          ,
          <article-title>A comparison of lossless compression methods for medical images</article-title>
          ,
          <source>Computerized Medical Imaging and Graphics</source>
          <volume>22</volume>
          (
          <year>1998</year>
          )
          <fpage>323</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Johnston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Jin</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Minnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Covell</surname>
          </string-name>
          ,
          <article-title>Full resolution image compression with recurrent neural networks</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ballé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Laparra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Simoncelli</surname>
          </string-name>
          ,
          <article-title>End-to-end optimized image compression</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ballé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Minnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          , N. John- tems
          <volume>33</volume>
          (
          <year>2020</year>
          ).
          <article-title>ston, Variational image compression with a scale [</article-title>
          24]
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          , Y. Chen, hyperprior, in: International Conference on Learn-
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Po-elic: Perceptioning Representations,
          <year>2018</year>
          .
          <article-title>oriented eficient learned image coding</article-title>
          , in: Pro-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Minnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ballé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <article-title>Joint autoregres- ceedings of the IEEE/CVF Conference on Computer sive and hierarchical priors for learned image com- Vision and Pattern Recognition (CVPR) Workshops, pression</article-title>
          , in: S. Bengio,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <year>2022</year>
          , pp.
          <fpage>1764</fpage>
          -
          <lpage>1769</lpage>
          . K. Grauman,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          , R. Garnett (Eds.), Ad- [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bruylants</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Munteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schelkens</surname>
          </string-name>
          ,
          <source>Wavelet vances in Neural Information Processing Systems, based volumetric medical image compression, Sigvolume</source>
          <volume>31</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2018</year>
          . nal Processing:
          <source>Image Communication</source>
          <volume>31</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          , G. Bjontegaard,
          <volume>112</volume>
          -
          <fpage>133</fpage>
          . A.
          <string-name>
            <surname>Luthra</surname>
            , Overview of the h. 264/avc video cod- [26]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Exploiting intra-slice ing standard, IEEE Transactions on circuits and and inter-slice redundancy for learning-based losssystems for video technology 13 (</article-title>
          <year>2003</year>
          )
          <fpage>560</fpage>
          -
          <lpage>576</lpage>
          .
          <article-title>less volumetric image compression</article-title>
          , IEEE Transac-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Ohm</surname>
          </string-name>
          , W.-J. Han,
          <string-name>
            <surname>T</surname>
          </string-name>
          . Wiegand,
          <source>tions on Image Processing</source>
          <volume>31</volume>
          (
          <year>2022</year>
          )
          <fpage>1697</fpage>
          -
          <lpage>1707</lpage>
          .
          <article-title>Overview of the High Eficiency Video Coding</article-title>
          [27]
          <string-name>
            <surname>M. U. A. Ayoobkhan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Chikkannan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Ramakrish(HEVC) standard, IEEE Transactions on Circuits nan, Feed-forward neural network-based predictive</article-title>
          and
          <source>Systems for Video Technology</source>
          <volume>22</volume>
          (
          <year>2012</year>
          )
          <article-title>1649- image coding for medical image compression, Ara1668</article-title>
          .
          <source>bian Journal for Science and Engineering</source>
          <volume>43</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , H. Sun,
          <string-name>
            <given-names>M.</given-names>
            <surname>Takeuchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Katto</surname>
          </string-name>
          , Learned im-
          <volume>4239</volume>
          -4247.
          <article-title>age compression with discretized gaussian mixture</article-title>
          [28]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Lossy medical likelihoods and attention modules, in: Proceedings image compression using residual learning-based of the IEEE/CVF Conference on Computer Vision dual autoencoder model</article-title>
          ,
          <source>in: 2020 IEEE 7th Utand Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          . tar Pradesh Section International Conference on
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Minnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Channel-wise autoregressive Electrical, Electronics and Computer Engineering entropy models for learned image compression</article-title>
          ,
          <source>in: (UPCON)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . 2020 IEEE International Conference on Image Pro- [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Estimating or propagating gradients cessing (ICIP</article-title>
          ),
          <year>2020</year>
          , pp.
          <fpage>3339</fpage>
          -
          <lpage>3343</lpage>
          . through stochastic neurons,
          <year>2013</year>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          , Checker- arxiv.org/abs/1305.2982.
          <article-title>board context model for eficient learned image</article-title>
          [30]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Delving deep into compression, in: Proceedings of the IEEE/CVF Con- rectifiers: Surpassing human-level performance on ference on Computer Vision and Pattern Recogni- imagenet classification</article-title>
          ,
          <source>in: 2015 IEEE International tion (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>14771</fpage>
          -
          <lpage>14780</lpage>
          . Conference on Computer Vision (ICCV),
          <year>2015</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          , R. Ma, H. Qin,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <volume>1026</volume>
          -
          <fpage>1034</fpage>
          . Elic:
          <article-title>Eficient learned image compression with un-</article-title>
          [31]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic evenly grouped space-channel contextual adaptive optimization</article-title>
          ,
          <source>in: ICLR (Poster)</source>
          ,
          <year>2015</year>
          . coding,
          <source>in: Proceedings of the IEEE/CVF</source>
          Confer- [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Chiang</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Grange, ence on Computer Vision and
          <string-name>
            <surname>Pattern Recognition C. Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Parker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Joshi</surname>
          </string-name>
          , et al.,
          <source>(CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>5718</fpage>
          -
          <lpage>5727</lpage>
          .
          <article-title>A technical overview of AV1</article-title>
          , Proceedings of the
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Theis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huszár</surname>
          </string-name>
          ,
          <string-name>
            <surname>Lossy</surname>
            <given-names>IEEE</given-names>
          </string-name>
          109 (
          <year>2021</year>
          )
          <fpage>1435</fpage>
          -
          <lpage>1462</lpage>
          .
          <article-title>image compression with compressive autoencoders</article-title>
          , [33]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. J. in: International Conference on Learning Represen- Sullivan,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Ohm</surname>
          </string-name>
          ,
          <article-title>Overview of the Versatile Video tations</article-title>
          ,
          <year>2017</year>
          .
          <article-title>Coding (VVC) standard and its applications</article-title>
          , IEEE
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Soft then hard: Rethinking the quantization in neural image compression</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2021</year>
          , pp.
          <fpage>3920</fpage>
          -
          <lpage>3929</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bjontegaard</surname>
          </string-name>
          ,
          <article-title>Calculation of average PSNR differences between RD-curves</article-title>
          , VCEG-M33 (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shechtman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>The unreasonable effectiveness of deep features as a perceptual metric</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mentzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschannen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agustsson</surname>
          </string-name>
          ,
          <article-title>High-fidelity generative image compression</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>