=Paper=
{{Paper
|id=Vol-3349/paper9
|storemode=property
|title=Learned Lossy Image Compression for Volumetric Medical Data
|pdfUrl=https://ceur-ws.org/Vol-3349/paper9.pdf
|volume=Vol-3349
|authors=Jan Kotera,Matthias Woedlinger,Manuel Keglevic
|dblpUrl=https://dblp.org/rec/conf/cvww/KoteraWK23
}}
==Learned Lossy Image Compression for Volumetric Medical Data==
Learned Lossy Image Compression for Volumetric Medical Data

Jan Kotera¹,²,*, Matthias Wödlinger¹ and Manuel Keglevic¹

¹ CVL, TU Wien, Favoritenstraße 9/11, 1040 Vienna, Austria
² Institute of Information Theory, CAS, Pod Vodárenskou věží 4, 182 00 Prague, Czech Republic

* Corresponding author. kotera@utia.cas.cz (J. Kotera); mwoedlinger@cvl.tuwien.ac.at (M. Wödlinger); keglevic@cvl.tuwien.ac.at (M. Keglevic)

Abstract. This work addresses the problem of lossy compression of volumetric images consisting of individual slices, such as those produced by CT scanners and MRI machines in medical imaging. We propose an extension of a single-image lossy compression method with an autoregressive context module to a sequential encoding of the volumetric slices. In particular, we remove the intra-slice autoregressive relation and instead condition the entropy model of the latent on the previous slice in the sequence. This modification alleviates the typical disadvantages of autoregressive contexts and leads to a significant increase in performance compared to encoding each slice independently. We test the proposed method on a dataset of diverse CT scan images, in a setting with an emphasis on the high-fidelity reconstruction required in medical imaging, and show that it compares favorably against several established state-of-the-art codecs in both performance and runtime.

Keywords: Learned Image Compression, Medical Image Data, Deep Learning

1. Introduction

Medical imaging is a set of techniques and processes that produce images of the interior of the body for the purpose of clinical analysis, medical intervention, or visual representation of the function of internal organs. Examples of common imaging systems are X-rays, computed tomography (CT) scanners, magnetic resonance imaging (MRI), and ultrasound (US). Medical imaging has become a staple tool not only for medical diagnosis and treatment but also a crucial component of research, as it allows researchers and physicians to establish a knowledge base of normal anatomy and physiology, making it possible to identify abnormalities and study the effects of medical intervention. For these reasons, the amount of image data produced in healthcare and medical research is huge and increasing [1], as are the requirements for efficient transmission and especially storage.

[Figure 1: Illustrative example of a single uncompressed slice from the CT scan test set [6] used for performance evaluation.]

Image compression methods are designed for exactly that – to enable more efficient coding of image data with little or no loss in visual quality. The first successful image compression techniques were developed in the early 1990s and some of them are still widely used today, such as the well-known JPEG method [2]. In recent years the development of novel compression methods for image and video has accelerated, in line with the growing amount of streamed image and video data. Modern image compression codecs such as BPG [3], AVIF [4], or WebP [5] typically appear as by-products of video codec development – the intra-frame component is extracted from the video codec and used as a standalone image codec.

For mainstream everyday use in applications such as image or video streaming, video calls, or online gaming, the goal is for the reconstructed image to appear "natural and artefact-free" at first glance while achieving compression ratios high enough to make these applications feasible.
General-purpose video codecs are therefore developed for and tested mainly on natural sequences, screen content, or synthetic scenes (e.g. [7]) and are typically benchmarked in the perceptually lossy range below 40 dB reconstruction PSNR (e.g. [8]); the same holds for image codecs. In medical imaging, by contrast, the fundamental requirement is that the reconstruction error must not alter the subsequent clinical analysis: the reconstructed image must remain true to the original up to imperceptible "noise" void of any structure. We argue that using an established and straightforward objective metric such as PSNR to measure the reconstruction error is the right approach here, as it ensures that the reconstructed image is truly nearly identical to the original when the reconstruction error is near zero. In our subjective tests (on an HDR display) we were not able to distinguish between original and reconstructed images above 55 dB PSNR, so that is approximately our target quality range; below 50 dB, on the other hand, we could identify loss of subtle structure in some images. Having the images analyzed by medical experts is unfortunately too resource-intensive and beyond the scope of this work.

Another solution common in practice is using only lossless compression, but such methods never achieve compression ratios anywhere near (within an order of magnitude of) those of lossy methods – for example, the study [9] finds that on medical data traditional lossless codecs hardly achieve compression ratios over 4:1, while on our test set the proposed method achieves an average ratio over 40:1 at PSNR > 55 dB. Proper research into lossy methods is therefore surely justified.

The traditional approach to image compression is hand-designed codecs implemented as hard-coded algorithms, based on human experience and intuition (see Sec. 2). As with many problems in image processing and computer vision in the last decade, avenues are being explored for learning optimal codecs from data. Modern research in learned image compression started with the work of Toderici et al. [10], the first fully learned method applicable to large images that outperformed some established traditional codecs. A surge of interest in learned image compression came after the seminal works of Ballé et al. [11, 12] and Minnen et al. [13]. These works laid the groundwork for further research, and it can be argued that most state-of-the-art (SOTA) methods nowadays are extensions of them.

The core structure of a learned method typically consists of an autoencoder which transforms the input into a latent representation of the image that will constitute the bitstream. This representation is quantized so that it can be passed to an entropy coder, which losslessly converts the discrete representation into an actual bitstream. The third integral component is an entropy model of the latent, i.e. a probability distribution model of the symbols (after quantization) of the latent representation, as this is required by the entropy coder. The pipeline can be trained end-to-end in an unsupervised manner, minimizing a loss that is the sum of two terms: the distortion of the image reconstruction and the entropy (i.e. expected bitrate) of the latent. The entropy coder is used off-the-shelf and is not subject to training. One of the great advantages of learned image compression is that training is relatively simple and cheap, which makes it possible to adapt a method to a particular modality, such as medical images, whereas for conventional hand-designed codecs such adaptation is not feasible.
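To make this three-part structure concrete, the following is a minimal PyTorch sketch of such a pipeline. It is a toy illustration under our own assumptions (module sizes, a fixed per-channel Laplace entropy model), not the authors' implementation; real models use the deeper transforms of Tab. 1 and an actual arithmetic coder in place of the theoretical rate.

```python
import torch
import torch.nn as nn
from torch.distributions import Laplace

class ToyLearnedCodec(nn.Module):
    """Minimal sketch of a learned codec: autoencoder + quantizer + entropy model."""
    def __init__(self, ch: int = 192):
        super().__init__()
        # Analysis/synthesis transforms (real models stack several strided
        # convolutions with GDN nonlinearities, cf. Tab. 1).
        self.encoder = nn.Conv2d(1, ch, 5, stride=2, padding=2)
        self.decoder = nn.ConvTranspose2d(ch, 1, 5, stride=2, padding=2, output_padding=1)
        # Toy entropy model: a learned, fixed-per-channel Laplace scale.
        self.log_scale = nn.Parameter(torch.zeros(ch))

    def forward(self, x: torch.Tensor):
        y = self.encoder(x)          # latent representation
        y_hat = torch.round(y)       # quantization (training needs a surrogate, see Sec. 3)
        x_hat = self.decoder(y_hat)  # reconstruction
        # Probability of each integer symbol = Laplace mass over its unit bin;
        # -log2 of it is the (theoretical) bitrate an entropy coder would need.
        scale = self.log_scale.exp().view(1, -1, 1, 1).expand_as(y_hat)
        dist = Laplace(torch.zeros_like(y_hat), scale)
        p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
        bits = -torch.log2(p.clamp_min(1e-9)).sum()
        return x_hat, bits

# e.g.: x_hat, bits = ToyLearnedCodec()(torch.randn(1, 1, 64, 64))
```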
The proposed method extends [13] to volumetric medical data consisting of individual slices, i.e. a sequence of 2D images. This type of data is acquired for example by a CT scanner (see Fig. 1 for an example) or an MRI machine. The individual slices are encoded in order. The transform from image data to the latent representation is done for each slice independently, but in the entropy estimation step the probability model of each slice (except the first) is conditioned on the previous slice, which enables a more accurate estimation of the latent distribution, since neighboring slices typically have high mutual information. This allows for higher compression ratios with no loss in reconstruction quality. On the decoding side, the images are decoded in the same order, so that the previous slice is again available when decoding the next. Note that the proposed method works with already digitized uncompressed images in a normalized intensity range (typically 8–16 bit); it does not in any way enter the image generation process of the above-mentioned imaging techniques.

We show in the experimental section that this relatively simple addition considerably outperforms the baseline approach, in which all slices are processed completely independently by a single-image compression method. Additionally, compared to processing the full volume at once, our approach requires a fraction of the time and memory (in practice, it would be necessary to split the volume into small chunks and compress those separately anyway). We tested the method on a dataset consisting of CT scans of various human body parts; the proposed approach is competitive even against established standards such as JPEG, BPG, AVIF, and even VVC-intra.

2. Related work

For a long time, lossy image and video compression was a problem solved exclusively in the traditional way, by hand-designed methods. Some of these methods, such as the H.264 [14] and H.265 [15] video codecs or JPEG image compression [2], are now in widespread use in many areas of industry, research, and everyday life. Relatively recently, the first learned codecs appeared that were able to challenge some of the traditional methods. Arguably the biggest rise of interest started after the works of Ballé et al. [11, 12] and later Minnen et al. [13], which laid the foundation for learned image compression. These works formulated the main rate-distortion objective in a learnable way, presented a model containing the three fundamental components now present in the vast majority of learned codecs – the autoencoder for the image transform, and the hyper-prior and the context module for entropy estimation – and provided a solution for dealing with discrete quantization in training. Subsequent methods increased performance, for example with richer/larger model architectures (e.g. using attention-like modules) [16], improved context modules [17, 18, 19], richer entropy models (e.g. Gaussian mixtures) [16], or different simulations of quantization [20, 21].
Recently, a promising research direction is coercing the reconstruction to better satisfy the expectations of the human visual system, even at the expense of objective (e.g. PSNR) quality. This can be achieved, for example, by augmenting the loss with a term that better models human perception (such as LPIPS [22]) [19], or by training the decoder in an adversarial manner as in GANs [23, 24]. Such approaches can achieve significant bitrate savings but unfortunately are not suitable for medical data, where the reconstructed image must be objectively undistorted and not just look natural.

Literature on learned compression for medical images is relatively scarce; this area is still dominated by more traditional approaches such as compression in the wavelet domain [25]. Probably the closest match to the proposed method is the lossless compression of 3D volumes by Chen et al. [26]; in our work, however, we focus on lossy compression. Other works propose partitioning the image into regions that are relevant (for the diagnosis) and less relevant and applying different compression ratios to each [27]. Learned lossy compression for 2D medical images is investigated, for example, in [28].

3. Method

The proposed approach is based on the single-image compression method by Minnen et al. [13], which we extend to multi-slice volumetric images. The method [13] consists of three main components:

• An encoder/decoder which performs the transform between the input image space and the latent representation (commonly called the "latent").
• A hyper-encoder/decoder (called the hyper-prior) which analyzes the latent and stores a small piece of side information in the bitstream that is later used to estimate the parameters of the probability distribution of the latent (the entropy model).
• A context module that processes the image latent in an autoregressive fashion (i.e. causally) and is also part of the entropy model parameter estimation.

The encoding and decoding branches of the pipeline are connected only via the bitstream, which stores the latent and hyper-latent representations of the image. To this end the latents must be quantized, for which scalar integer rounding is used, because the entropy coder that converts the values into their corresponding bit codes can only operate on discrete data (continuous values cannot be stored in the bitstream).

The advantage of the context module is that the entropy parameters can be very accurate and image-specific; the disadvantage is that the autoregressive processing does not play well with the parallel processing common in deep learning. For each new pixel to be decoded, the entropy parameters must first be estimated and the pixel decoded; only then can decoding move to the next pixel. As a result, a usually parallelized operation such as convolution cannot be computed for the whole image at once but pixel by pixel, in alternation with the entropy coder. Another disadvantage is that the context prevents using so-called mean-subtracted quantization, which will be specified in the next section. We remove these drawbacks in the proposed method by replacing the autoregressive context from [13] with an analogous module that runs on the previous slice in the sequence.

Model details. The input to our method is a sequence of 2D slices $x^0, \dots, x^{N-1}$ (superscripts denote slices, subscripts pixel indices) which are processed in order. The transforms to and from the latent representation, denoted $y^i$, are done for each slice independently, but the entropy model, i.e. the probability distribution $p_{\hat{y}}(\hat{y}^i)$ of the quantized latent $\hat{y}^i$ (the hat denotes the quantization operation), is conditioned on the latent of the previous slice, $\hat{y}^{i-1}$. This helps decrease the entropy of $\hat{y}^i$, and therefore the necessary bitrate, while avoiding the disadvantages of an autoregressive context model.
It is done as follows: instead of running the context model on the currently encoded slice in an autoregressive fashion, we run it on the (quantized) latent $\hat{y}^{i-1}$ of the previous slice. During decoding, the slices are processed in the same order, so $\hat{y}^{i-1}$ has already been decoded in full and is available when $\hat{y}^i$ is being decoded, and the entropy model can again use information from the previous slice. This approach does not require autoregressive processing within a slice; it can instead be done in parallel for the whole slice, without waiting for each new pixel to be decoded. In other words, the context module is autoregressive in the slice sequence, but that does not restrict any 2D operations contained within one slice, such as convolutions – instead of decoding individual pixels we can decode whole slices in parallel.

We model the distribution $p_{\hat{y}}(\hat{y}^i)$ of the quantized latent $\hat{y}^i$ by a per-dimension $j$ (i.e. spatial pixel and channel) independent Laplace distribution with mean and scale parameters $(\mu^i_j, \sigma^i_j)$. These two parameters are estimated adaptively for each image $i$ and each pixel $j$ (incl. channels) of the latent by the hyper-prior and the context module.
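The following PyTorch sketch illustrates how such slice-conditioned entropy parameter estimation can be wired up. Channel counts follow Tab. 1 (192-channel latent, 384-channel hyper-decoder output); the class name, the softplus used to keep the scale positive, and other details are our own illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SliceConditionedEntropyModel(nn.Module):
    """Sketch: entropy parameters (mu, sigma) of slice i estimated from the
    previous slice's quantized latent plus the hyper-decoder output."""
    def __init__(self, ch: int = 192):
        super().__init__()
        # Context module: a plain convolution over the PREVIOUS slice's latent,
        # so nothing here is autoregressive within the current slice.
        self.context = nn.Conv2d(ch, 2 * ch, kernel_size=5, stride=1, padding=2)
        # Entropy module: 1x1 convolutions fusing context and hyper-prior features
        # (channel widths 768 -> 576 -> 384 as in Tab. 1).
        self.entropy = nn.Sequential(
            nn.Conv2d(4 * ch, 4 * ch, 1), nn.PReLU(),
            nn.Conv2d(4 * ch, 3 * ch, 1), nn.PReLU(),
            nn.Conv2d(3 * ch, 2 * ch, 1),
        )

    def forward(self, y_prev_hat: torch.Tensor, hyper_out: torch.Tensor):
        # One parallel pass over the whole slice: once slice i-1 is decoded,
        # all entropy parameters of slice i are available at once.
        feats = torch.cat([self.context(y_prev_hat), hyper_out], dim=1)
        mu, raw_scale = self.entropy(feats).chunk(2, dim=1)
        sigma = nn.functional.softplus(raw_scale)  # keep the Laplace scale positive
        return mu, sigma
```

For the first slice of a volume there is no $\hat{y}^{i-1}$, which is why an auxiliary single-image model is used for it (see the training details below).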
[Figure 2: Overview of the proposed compression pipeline. Connectors: green are operations performed only in encode, red only in decode, and blue in both; the checkerboard pattern denotes the bitstream. Procedure: the input image is passed through an encoder, producing the latent $y^i$. The latent is concatenated with the latent of the previous slice, $y^{i-1}$, and passed through the hyper-encoder, producing the hyper-latent $z^i$, which is quantized to $\hat{z}^i$ and stored using the fixed entropy model $p(\hat{z})$. Parameters of the image-adaptive entropy model $p(\hat{y}^i)$ are estimated by a context module that processes the previous slice's latent $\hat{y}^{i-1}$ and a hyper-decoder that processes the hyper-latent $\hat{z}^i$; these two are concatenated and passed through an entropy module to produce the entropy parameters $(\mu^i, \sigma^i)$. The latent $\hat{y}^i$ is stored in the bitstream. In decode, the hyper-decoder, context, and entropy module have to run again because the parameters $(\mu^i, \sigma^i)$ are required for decoding $\hat{y}^i$ from the bitstream; for this, the latent of the previous slice $\hat{y}^{i-1}$ is already available. The decoded latent $\hat{y}^i$ is passed through the decoder to produce the reconstructed image $\hat{x}^i$.]

For quantization of the latent we use integer rounding with mean-subtraction, meaning that the value is first offset by the estimated mean of its distribution before being rounded (image index omitted):

$$\hat{y}_j = \lfloor y_j - \mu_j \rceil + \mu_j, \qquad (1)$$

where $\lfloor\cdot\rceil$ is integer rounding. This improves performance because quantization then does not change the mean of the distribution, but it requires that the entropy parameters of the latent are estimated before the latent is quantized. In particular, both of the entropy estimation modules (hyper-prior and context) must operate on non-quantized values $y^i$, otherwise an implicit relation would arise. This is difficult to achieve in a single-image autoregressive context model – for example, the quantization in [13] does not use mean-subtraction – but since in the proposed method the context module uses the previous slice, mean-subtraction is possible.
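Eq. (1) amounts to a one-line helper; the sketch below (our own, hypothetical function name) also highlights that the mean estimate must be available before quantization.

```python
import torch

def mean_subtracted_quantize(y: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    """Eq. (1): round the residual (y - mu) to integers, then add mu back.
    The estimated mean mu must therefore be known BEFORE quantization, which
    the slice-sequential context makes possible. Only the integer residuals
    round(y - mu) would actually be written to the bitstream."""
    return torch.round(y - mu) + mu
```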
The full procedure of processing a slice $x^i$ is illustrated in Fig. 2. The image is passed through an encoder $E$, producing the latent $y^i = E(x^i)$. The latent is concatenated with the latent of the previous slice, $y^{i-1}$, and passed through the hyper-encoder $E_h$, producing the hyper-latent $z^i = E_h([y^{i-1}, y^i])$. This hyper-latent is quantized, $\hat{z}^i = Q(z^i)$, so that it can be stored in the bitstream. The parameters of the entropy model of the quantized latent $\hat{y}^i$ are estimated as follows: a context module $C$ processes the previous slice's latent $\hat{y}^{i-1}$ and the hyper-decoder $D_h$ processes the hyper-latent $\hat{z}^i$; these two outputs are concatenated and passed through an entropy module $E_p$ to produce the final entropy parameters $(\mu^i_j, \sigma^i_j) = E_p([C(\hat{y}^{i-1}), D_h(\hat{z}^i)])_j$ for each pixel $j$ of the latent. With these parameters available, the latent can be quantized and stored in the bitstream, and encoding proceeds to the next slice.

During decoding, the operations responsible for estimating the entropy model $p_{\hat{y}}(\hat{y}^i)$ have to be executed again, because the entropy model is required by the coder to decode $\hat{y}^i$ from the bitstream. The hyper-latent $\hat{z}^i$ is decoded first, and since the latent of the previous slice $\hat{y}^{i-1}$ has already been decoded and is available, the estimation of the entropy parameters $(\mu, \sigma)$ proceeds as during encoding. Having those, $\hat{y}^i$ can be decoded and passed through the decoder $D$ to finally produce the reconstructed image $\hat{x}^i = D(\hat{y}^i)$. Decoding then proceeds to the next slice.

What remains to specify is the entropy model $p_{\hat{z}}(\hat{z})$ of the hyper-latent $\hat{z}$, since it is also processed by the entropy coder and stored in the bitstream. We model it by a per-channel Laplace distribution, meaning that each channel of $z^i$ has its own mean and scale parameters $(\mu, \sigma)$, but these are spatially constant so that the model is not tied to a fixed image resolution. These parameters are subject to training but fixed once the model has been trained (i.e. unlike $p_{\hat{y}}(\hat{y})$ it is not image-adaptive). For quantization of $z$ we again use mean-subtracted rounding in the fashion of Eq. (1).

Details of the model architecture are concisely summarized in Tab. 1.

Table 1: Model architecture details. conv is a Conv2D layer with kernel size k, stride s and output channels c; transpose is a similarly specified ConvTranspose2D. GDN and IGDN are the generalized divisive normalization layer [11] and its inverse, respectively; PReLU is the parametric ReLU [30].

Encoder: conv k5 s2 c192 → GDN → conv k5 s2 c192 → GDN → conv k5 s2 c192 → GDN → conv k5 s2 c192
Decoder: transpose k5 s2 c192 → IGDN → transpose k5 s2 c192 → IGDN → transpose k5 s2 c192 → IGDN → transpose k5 s2 c1
Hyper-encoder: conv k3 s1 c192 → PReLU → conv k5 s2 c192 → PReLU → conv k5 s2 c192
Hyper-decoder: conv k5 s2 c192 → PReLU → conv k5 s2 c288 → PReLU → conv k3 s1 c384
Context: conv k5 s1 c384
Entropy module: conv k1 s1 c768 → PReLU → conv k1 s1 c576 → PReLU → conv k1 s1 c384
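To make Tab. 1 concrete, here is a hedged PyTorch rendering of the encoder row. PyTorch has no built-in GDN, so we use a simplified stand-in (full GDN implementations exist, e.g., in the CompressAI library); the padding choices are our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGDN(nn.Module):
    """Simplified generalized divisive normalization [11]: each channel is
    divided by a learned combination of the squared channel activations.
    (A stand-in for exposition; libraries such as CompressAI ship a full GDN.)"""
    def __init__(self, ch: int):
        super().__init__()
        self.gamma = nn.Parameter(0.1 * torch.eye(ch).view(ch, ch, 1, 1))
        self.beta = nn.Parameter(torch.ones(ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        norm = F.conv2d(x * x, self.gamma, self.beta)
        return x * torch.rsqrt(norm.clamp_min(1e-6))

def make_encoder(ch: int = 192) -> nn.Sequential:
    """Encoder row of Tab. 1: four stride-2 5x5 convolutions with GDN between,
    mapping a 1-channel slice to a 192-channel latent at 1/16 resolution."""
    return nn.Sequential(
        nn.Conv2d(1, ch, 5, stride=2, padding=2), SimpleGDN(ch),
        nn.Conv2d(ch, ch, 5, stride=2, padding=2), SimpleGDN(ch),
        nn.Conv2d(ch, ch, 5, stride=2, padding=2), SimpleGDN(ch),
        nn.Conv2d(ch, ch, 5, stride=2, padding=2),
    )

# e.g. a 512x512 CT slice yields a latent of shape (1, 192, 32, 32):
# latent = make_encoder()(torch.randn(1, 1, 512, 512))
```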
Training details. In training we optimize the rate-distortion loss $L$ (image indices omitted)

$$L = \mathbb{E}_{x\sim p_x}\left[-\log_2 p_{\hat{y}}(\hat{y})\right] + \mathbb{E}_{x\sim p_x}\left[-\log_2 p_{\hat{z}}(\hat{z})\right] + \lambda \cdot 255^2 \cdot \mathbb{E}_{x\sim p_x}\left[\lVert x - \hat{x}\rVert_2^2\right], \qquad (2)$$

where $\lambda$ controls the rate-distortion tradeoff (it determines the approximate target bitrate) and $p_x$, the distribution of uncompressed images, is evaluated by batch averaging. The first two terms on the right-hand side are the approximate (theoretical) bitrates required by the entropy coder to encode the latents. These are used in training as an estimate of the actual bitrates, because the non-differentiable entropy coders are removed from training.

Our description of $p_{\hat{y}}(\hat{y})$ and $p_{\hat{z}}(\hat{z})$ so far was somewhat simplified. The Laplace parametric density is used only as a model to conveniently parametrize the discrete distribution over the symbols after quantization. In the actual evaluation, however, we have to account for the whole interval corresponding to each discrete value. This is done by integrating the parametric density over the corresponding interval, for example

$$p_{\hat{y}}(\hat{y}^i_j) = \int_{\hat{y}^i_j - \frac{1}{2}}^{\hat{y}^i_j + \frac{1}{2}} P_{\hat{y}^i_j}(t)\,dt, \qquad (3)$$

where $P_{\hat{y}^i_j}$ is the continuous Laplace density parametrized by the $(\mu, \sigma)$ corresponding to $p_{\hat{y}}(\hat{y}^i_j)$, the discrete distribution of $\hat{y}^i_j$. In practice, this is computed using the cumulative distribution function of the Laplace density.
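A sketch of how Eqs. (2) and (3) translate to code; the function names, the per-pixel rate normalization, and the assumption that images are scaled to [0, 1] are our own choices for illustration.

```python
import torch
from torch.distributions import Laplace

def symbol_bits(y_hat, mu, sigma):
    """Eq. (3): the discrete probability of each quantized symbol is the
    Laplace mass over its unit-wide bin, evaluated via the CDF; returns -log2 p."""
    dist = Laplace(mu, sigma)
    p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
    return -torch.log2(p.clamp_min(1e-9))

def rd_loss(x, x_hat, bits_y, bits_z, lam):
    """Eq. (2): rate terms plus lambda * 255^2 * MSE (x assumed in [0, 1])."""
    num_pixels = x.shape[0] * x.shape[-2] * x.shape[-1]
    rate = (bits_y.sum() + bits_z.sum()) / num_pixels   # bits per pixel
    distortion = 255.0 ** 2 * torch.mean((x - x_hat) ** 2)
    return rate + lam * distortion
```

During training, `symbol_bits` would be evaluated on noise-quantized latents while the decoder consumes rounded values, as described next.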
3 practical use cases prohibitively slow. Its adoption in and Tab. 2. practice, medical or otherwise, is also hindered by the Baseline is a learned single-image compression model fact that its use is not royalty free. We used the VTM with the same architecture as the proposed method but 18 reference implementation in 12bit mode and again without the context and entropy module (the hyper- compressed each volumetric image as a video sequence decoder directly predicts the entropy parameters). It consisting of the individual slices. VVC-intra [33] is the is trained on the same train set as the proposed and uses intra mode of VVC. For single-image compression, it is the same training schedule. JPEG [2] is a well-known the best available codec nowadays but currently inherits widely used compression method developed in the 90s. the disadvantages listed above for VVC. We used it in the Although used for medical data and having the advan- same configuration as VVC video but compressed each tage of being very fast both in encode and decode, it is slice individually. arguably not a very suitable method for such use as its performance is relatively low by today’s standards. We Results The rate-distortion curves of the benchmarked use the implementation in pillow. BPG [3] is essen- methods on the CT-scan test set are in shown in Fig. 3, tially a single-image wrapper of the intra-frame compres- their ranking and quantitative comparison with respect sion of the HEVC (also known as H.265) video codec. to VVC-intra is in Tab. 2 and finally, Tab. 3 shows ap- Although not widespread, it is one of the top methods proximate relative runtimes required to process the test currently available for everyday use. We used the jctvc set. In the testing we focused on high-PSNR range since encoder via the public BPG library configured to 12bit in- we envision the proposed method being used primarily ternal bitdepth. AVIF [32] is a single-image format of the in the medical domain, where sliced volumetric images 6 Jan Kotera et al. CEUR Workshop Proceedings 1–9 Table 2 Table 3 Relative bitrate increase (BD-Rate [34], negative means sav- Approximate relative time required to encode and decode the ings) and quality gain (BD-PSNR [34], positive means improve- full test set at bpp = .3 compared to the proposed method ment) of the benchmarked methods compared to VVC-intra (𝑡 = 35 seconds). Times include file I/O where unavoidable. in the range PSNR > 45dB. Method Device Time [𝑡] Method BD-Rate [%] BD-PSNR [dB] JPEG CPU 2e-1 JPEG +248.2 -7.57 Baseline GPU 7e-1 BPG +51.6 -2.40 Proposed GPU 1. AVIF +22.7 -1.26 BPG CPU 7e1 Baseline +20.4 -1.14 AVIF CPU 1.4e2 AV1 +6.2 -0.40 AV1 CPU 1e3 VVC-intra 0.0 0.00 VVC-intra CPU 1.5e3 Proposed -11.4 +0.64 VVC CPU 5.5e3 VVC -23.6 +1.44 therefore no match for VVC but we will see that in that are common. Let us provide some commentary on the comparison it wins on runtime. results. A clear and quantitative ranking of the methods is The baseline learned method performs on the level of provided in Tab. 2, which shows the average bitrate in- AVIF – the curves almost overlap. Although AVIF is un- crease/savings and PSNR quality loss/gain evaluated by doubtedly a better codec in a general setting, the learned the BD-Rate and BD-PSNR [34], respectively. We posi- baseline exploits the advantage of domain specificity – it tioned VVC-intra as the reference SOTA image codec has been trained on similar CT data. 
4. Results

Dataset. We trained and tested the method on the Pediatric-CT-SEG dataset of CT-scan images of various organs, downloaded from the Cancer Imaging Archive [6] (patient and acquisition parameters are specified therein). We chose this dataset for its diverse content. The dataset consists of 359 volumetric images, each with a different number of slices, ranging from 41 to 1104. We randomly selected 10 of the volumetric images for testing (2184 slices in total) and the rest for training. The 2D slices are 12-bit grayscale images with a resolution of 512×512, originally stored uncompressed at 16 bits per pixel (bpp). An example slice from the dataset is shown in Fig. 1.

Training. We trained the model on random spatial crops of size 256×256 and tested it on full-resolution images. For training, we randomly chose $n = 3$ consecutive slices as a good compromise between training speed and exploiting the sequential processing. We trained with batch size 8 using the Adam optimizer [31] with an initial learning rate of 1e−4 for 1M iterations, after which we decreased the learning rate to 1e−5 for another 200k iterations. We trained a new model for each of 6 values of $\lambda$ in the range from 0.032 to 3.2, which on the test set results in 0.05 to 0.65 bits per pixel, thus achieving compression ratios of 25:1 to 320:1 with respect to the original images.

[Figure 3: Rate-distortion performance of the proposed and benchmark methods on the test set of CT-scan images. Axes: bits per pixel (0.0–0.7) vs. PSNR [dB] (40–60); curves: Proposed, Baseline, VVC, VVC-intra, AV1, AVIF, BPG, JPEG.]

Benchmark methods. We compare the performance of the proposed method with a baseline learned single-image compression model and a number of established traditional image compression methods. The single-image baseline is a learned model with the same architecture as the auxiliary model we use to compress the first slice, trained on the same train set; comparison with it shows the performance gain from the proposed sequential processing and the context module. The traditional methods are a broad selection, ranging from well-known codecs commonly used in practice to a state-of-the-art prototype. Such a comparison positions the proposed method well in the landscape of existing methods and gives insight into its properties in potential practical use. Below we briefly describe each of the methods used in the comparison and, where applicable, its configuration; afterwards we provide commentary on the results summarized in Fig. 3 and Tab. 2.

Baseline is a learned single-image compression model with the same architecture as the proposed method but without the context and entropy modules (the hyper-decoder directly predicts the entropy parameters). It is trained on the same train set as the proposed method, with the same training schedule. JPEG [2] is the well-known, widely used compression method developed in the 1990s. Although used for medical data and very fast in both encode and decode, it is arguably not very suitable for such use, as its performance is relatively low by today's standards; we use the implementation in pillow. BPG [3] is essentially a single-image wrapper of the intra-frame compression of the HEVC (H.265) video codec. Although not widespread, it is one of the top methods currently available for everyday use; we used the jctvc encoder via the public BPG library configured to 12-bit internal bitdepth. AVIF [32] is the single-image format of the AV1 video codec (essentially AV1-intra), one of today's top codecs among those readily available, e.g. in browsers; we used the libaom-av1 encoder via ffmpeg configured for 12-bit internal processing, compressing each slice in the series individually. AV1 [32] is a video codec approximately on the level of, or slightly outperforming, HEVC in quality, but unlike HEVC its use is royalty-free, which makes it arguably the best video codec readily available today (with production-level encoders and decoders); we used ffmpeg/libaom-av1 in 12-bit mode and compressed each volumetric image as a video sequence of its slices. VVC [33] (H.266) is the best existing video codec today, but its development is still ongoing, the available encoders/decoders are prototypes that are prohibitively slow for most practical use cases, and its adoption, medical or otherwise, is also hindered by the fact that its use is not royalty-free; we used the VTM 18 reference implementation in 12-bit mode and again compressed each volumetric image as a video sequence of its slices. VVC-intra [33] is the intra mode of VVC; for single-image compression it is the best codec available today, but it currently inherits the disadvantages listed above for VVC. We used it in the same configuration as VVC video but compressed each slice individually.
Results. The rate-distortion curves of the benchmarked methods on the CT-scan test set are shown in Fig. 3; their ranking and quantitative comparison with respect to VVC-intra is in Tab. 2; and Tab. 3 shows the approximate relative runtimes required to process the test set. In the testing we focused on the high-PSNR range, since we envision the proposed method being used primarily in the medical domain, where sliced volumetric images are common. Let us provide some commentary on the results.

The baseline learned method performs on the level of AVIF – the curves almost overlap. Although AVIF is undoubtedly a better codec in a general setting, the learned baseline exploits the advantage of domain specificity: it has been trained on similar CT data. BPG generally performs well on natural images, where the target PSNR is usually lower, but to achieve imperceptible distortion in medical data we observed that the reconstruction PSNR should be above 55 dB (for typical images with sufficient structure); we suspect there is some issue with the configuration of the encoder at high bitdepths, because BPG visibly struggles to achieve high PSNRs. It is no surprise that JPEG cannot compete with the latest methods. VVC-intra does very well and outperforms AVIF by a large margin over the whole range. With AV1 we experienced similar problems as with BPG – it apparently "saturates" at higher bitrates and struggles to achieve high PSNR, possibly again due to some issue with the high-bitdepth configuration of the encoder (although we used the same encoder as for AVIF, where it worked fine). From the comparison with AVIF at low to mid bitrates, however, we can see that the sequential "video" processing of the image volume is clearly beneficial, with a noticeable performance gain. This conclusion is further strengthened by the results of the VVC (video) codec, which on performance alone is the clear winner of the whole comparison, outperforming all other methods (including the proposed one) by a margin over the whole range.

The proposed method is significantly better than the baseline (compare the green and orange curves in Fig. 3), on average achieving almost 30% rate savings (at the same quality) and a 1.8 dB quality increase (at the same rate). It also outperforms all image codecs, such as AVIF, BPG, and especially VVC-intra, which is no small feat. This is due solely to the proposed sequential context, because the baseline alone is significantly below VVC-intra. It is, however, still a relatively small and simple model and therefore no match for VVC, but as we will see, in that comparison it wins on runtime.

A clear quantitative ranking of the methods is provided in Tab. 2, which shows the average bitrate increase/savings and PSNR quality loss/gain on the test set, evaluated by BD-Rate and BD-PSNR [34], respectively. We positioned VVC-intra as the reference SOTA image codec and compared all others to it. Only the proposed method and VVC video achieve an improvement (negative BD-Rate and positive BD-PSNR).

Table 2: Relative bitrate increase (BD-Rate [34], negative means savings) and quality gain (BD-PSNR [34], positive means improvement) of the benchmarked methods compared to VVC-intra in the range PSNR > 45 dB.

Method | BD-Rate [%] | BD-PSNR [dB]
JPEG | +248.2 | −7.57
BPG | +51.6 | −2.40
AVIF | +22.7 | −1.26
Baseline | +20.4 | −1.14
AV1 | +6.2 | −0.40
VVC-intra | 0.0 | 0.00
Proposed | −11.4 | +0.64
VVC | −23.6 | +1.44

Finally, in Tab. 3 we show the relative runtimes required for processing (encode and decode) the whole test set (10 volumetric images consisting of 2184 slices) with respect to the proposed method (i.e. a value < 1 means the method is faster than ours, > 1 that it is slower). These runtimes are listed for bpp = 0.3, approximately the middle of the tested range, since the traditional methods are slower at higher bitrates (the proposed method has constant speed across the range).

Table 3: Approximate relative time required to encode and decode the full test set at bpp = 0.3 compared to the proposed method (t = 35 seconds). Times include file I/O where unavoidable.

Method | Device | Time [t]
JPEG | CPU | 2e−1
Baseline | GPU | 7e−1
Proposed | GPU | 1
BPG | CPU | 7e1
AVIF | CPU | 1.4e2
AV1 | CPU | 1e3
VVC-intra | CPU | 1.5e3
VVC | CPU | 5.5e3

Here the ranking is quite different from the performance ranking. JPEG and, of course, the baseline are the only methods faster than the proposed one; all others are slower, some quite significantly, especially the well-performing VVC, which is clearly prohibitively slow. We argue that the video codecs are simply not fast enough for practical use. In fairness, the proposed method and the baseline run on a GPU (though still processing each slice sequentially) while the traditional methods are CPU-only, without any external parallelization. On the other hand, our implementation is intended only as a proof of concept and we did not invest much effort into runtime optimization. For example, in both encode and decode the encoder and decoder process each slice independently; in testing we actually process them sequentially for simplicity, while it would be possible to batch them and process them in parallel (as many as the GPU memory permits), which would reduce the runtime.

Contrary to usual custom, we do not provide examples and a qualitative comparison of image reconstructions, because due to the high reconstruction quality and similar performance of the benchmarked methods we were not able to find example images demonstrating any noticeable difference – on screen, all the results look identical.
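For reference, BD-Rate values like those in Tab. 2 can be computed from rate-distortion sample points with the standard Bjøntegaard procedure [34]. The following is a common NumPy re-implementation of that procedure (our sketch, not the authors' evaluation code), requiring at least four RD points per codec.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate [34]: average bitrate difference (in %) of the
    test codec vs. the anchor at equal quality. Fits a cubic polynomial of
    log-rate as a function of PSNR and integrates the difference over the
    overlapping PSNR interval."""
    log_r_anchor = np.log10(rate_anchor)
    log_r_test = np.log10(rate_test)
    fit_anchor = np.polyfit(psnr_anchor, log_r_anchor, 3)
    fit_test = np.polyfit(psnr_test, log_r_test, 3)
    # Integrate both fits over the PSNR range covered by both codecs.
    lo = max(np.min(psnr_anchor), np.min(psnr_test))
    hi = min(np.max(psnr_anchor), np.max(psnr_test))
    int_anchor = np.polyval(np.polyint(fit_anchor), hi) - np.polyval(np.polyint(fit_anchor), lo)
    int_test = np.polyval(np.polyint(fit_test), hi) - np.polyval(np.polyint(fit_test), lo)
    avg_log_diff = (int_test - int_anchor) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100  # negative = bitrate savings
```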
5. Conclusion

We presented an extension of a single-image learned compression method to volumetric multi-slice images, with an emphasis on the medical domain, where this type of image is quite common. Although the modification is relatively simple and straightforward, it provides several benefits – namely, using a context module without introducing any problems with parallel processing in decode, and enabling mean-subtracted quantization. Both of these improve performance without compromising runtime, as we verified in the comparison with a number of established compression methods. The comparison shows:

• a clear performance gain with respect to the baseline, due to the proposed sequential context;
• good performance in absolute numbers with respect to the established codecs;
• very competitive runtimes (if GPUs are allowed).

The testing was carried out with an emphasis on low-error reconstruction, and even at PSNR = 55 dB (in most cases indistinguishable from the original) the proposed method achieves an average compression ratio of 40:1 with respect to the uncompressed original. We consider these results a solid proof of concept for compression of volumetric medical data.

Nevertheless, there are a number of things that can be improved or investigated further. For example, the baseline model used is far from SOTA, so higher absolute performance could be gained by adopting one of the SOTA single-image learned methods as a backbone and extending it with the proposed context model; in this work, however, we focused on investigating the relative gains from the sequential context rather than absolute performance. Next, in decode our method currently does not permit random access (as in "show me slice 42"); the whole volume needs to be decoded sequentially from the beginning. This can be remedied by introducing intra-frames compressed by the single-image auxiliary method we use for the first slice. With a GOP size of 8 (meaning at most 8 slices need to be decoded for any chosen slice), we estimate that the performance in Tab. 2 would drop from approximately −11.4% to −7.5% in rate and from +0.64 dB to +0.42 dB in PSNR, which is still a solid improvement over VVC-intra with practically usable runtimes in both encode and decode. Looking at the results of VVC, though, further gains are undoubtedly possible, and we hypothesize that they can be achieved, for example, by a stronger context module (ours is a rather simple stack of convolutions, not in any way input-adaptive) and possibly by introducing P-frames and B-frames as in video encoding. It is our hope that this work will motivate further research into such possibilities.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 965502.
References

[1] Radiation risk from medical imaging, https://www.health.harvard.edu/cancer/radiation-risk-from-medical-imaging, Sep 2021.
[2] G. Wallace, The JPEG still picture compression standard, IEEE Transactions on Consumer Electronics 38 (1992) xviii–xxxiv.
[3] F. Bellard, BPG Image format, https://bellard.org/bpg, 2018. Accessed: 2021-09-24.
[4] AVIF image format, https://aomediacodec.github.io/av1-avif, 2022. Accessed: 2022-12.
[5] Google, WebP Image format, https://developers.google.com/speed/webp, 2018. Accessed: 2021-09-24.
[6] Pediatric-CT-SEG, Cancer Imaging Archive, https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=89096588, Aug 2022.
[7] AOM common test conditions v2.0, https://aomedia.org/docs/CWG-B075o_AV2_CTC_v2.pdf, Aug 2021.
[8] F. Mentzer, G. Toderici, D. Minnen, S.-J. Hwang, S. Caelles, M. Lucic, E. Agustsson, VCT: A video compression transformer, 2022. URL: https://arxiv.org/abs/2206.07307.
[9] J. Kivijärvi, T. Ojala, T. Kaukoranta, A. Kuba, L. Nyúl, O. Nevalainen, A comparison of lossless compression methods for medical images, Computerized Medical Imaging and Graphics 22 (1998) 323–339.
[10] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, M. Covell, Full resolution image compression with recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] J. Ballé, V. Laparra, E. P. Simoncelli, End-to-end optimized image compression, in: International Conference on Learning Representations, 2017.
[12] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, N. Johnston, Variational image compression with a scale hyperprior, in: International Conference on Learning Representations, 2018.
[13] D. Minnen, J. Ballé, G. D. Toderici, Joint autoregressive and hierarchical priors for learned image compression, in: Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
[14] T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology 13 (2003) 560–576.
[15] G. J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand, Overview of the High Efficiency Video Coding (HEVC) standard, IEEE Transactions on Circuits and Systems for Video Technology 22 (2012) 1649–1668.
[16] Z. Cheng, H. Sun, M. Takeuchi, J. Katto, Learned image compression with discretized Gaussian mixture likelihoods and attention modules, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[17] D. Minnen, S. Singh, Channel-wise autoregressive entropy models for learned image compression, in: 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 3339–3343.
[18] D. He, Y. Zheng, B. Sun, Y. Wang, H. Qin, Checkerboard context model for efficient learned image compression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 14771–14780.
[19] D. He, Z. Yang, W. Peng, R. Ma, H. Qin, Y. Wang, ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5718–5727.
[20] L. Theis, W. Shi, A. Cunningham, F. Huszár, Lossy image compression with compressive autoencoders, in: International Conference on Learning Representations, 2017.
[21] Z. Guo, Z. Zhang, R. Feng, Z. Chen, Soft then hard: Rethinking the quantization in neural image compression, in: International Conference on Machine Learning, PMLR, 2021, pp. 3920–3929.
[22] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[23] F. Mentzer, G. D. Toderici, M. Tschannen, E. Agustsson, High-fidelity generative image compression, Advances in Neural Information Processing Systems 33 (2020).
[24] D. He, Z. Yang, H. Yu, T. Xu, J. Luo, Y. Chen, C. Gao, X. Shi, H. Qin, Y. Wang, PO-ELIC: Perception-oriented efficient learned image coding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 1764–1769.
[25] T. Bruylants, A. Munteanu, P. Schelkens, Wavelet based volumetric medical image compression, Signal Processing: Image Communication 31 (2015) 112–133.
[26] Z. Chen, S. Gu, G. Lu, D. Xu, Exploiting intra-slice and inter-slice redundancy for learning-based lossless volumetric image compression, IEEE Transactions on Image Processing 31 (2022) 1697–1707.
[27] M. U. A. Ayoobkhan, E. Chikkannan, K. Ramakrishnan, Feed-forward neural network-based predictive image coding for medical image compression, Arabian Journal for Science and Engineering 43 (2018) 4239–4247.
[28] D. Mishra, S. K. Singh, R. K. Singh, Lossy medical image compression using residual learning-based dual autoencoder model, in: 2020 IEEE 7th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), 2020, pp. 1–5.
[29] Y. Bengio, Estimating or propagating gradients through stochastic neurons, 2013. URL: https://arxiv.org/abs/1305.2982.
[30] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[31] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR (Poster), 2015.
[32] J. Han, B. Li, D. Mukherjee, C.-H. Chiang, A. Grange, C. Chen, H. Su, S. Parker, S. Deng, U. Joshi, et al., A technical overview of AV1, Proceedings of the IEEE 109 (2021) 1435–1462.
[33] B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, J.-R. Ohm, Overview of the Versatile Video Coding (VVC) standard and its applications, IEEE Transactions on Circuits and Systems for Video Technology (2021).
[34] G. Bjontegaard, Calculation of average PSNR differences between RD-curves, VCEG-M33 (2001).