Learned Lossy Image Compression for Volumetric Medical
Data
Jan Kotera1,2,* , Matthias Wödlinger1 and Manuel Keglevic1
1 CVL, TU Wien, Favoritenstraße 9/11, 1040 Vienna, Austria
2 Institute of Information Theory, CAS, Pod Vodárenskou věží 4, 182 00 Prague, Czech Republic

26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15–17, 2023
* Corresponding author.
kotera@utia.cas.cz (J. Kotera); mwoedlinger@cvl.tuwien.ac.at (M. Wödlinger); keglevic@cvl.tuwien.ac.at (M. Keglevic)


Abstract
This work addresses the problem of lossy compression of volumetric images consisting of individual slices, such as those produced by CT scanners and MRI machines in medical imaging. We propose an extension of a single-image lossy compression method with an autoregressive context module to a sequential encoding of the volumetric slices. In particular, we remove the intra-slice autoregressive relation and instead condition the entropy model of the latent on the previous slice in the sequence. This modification alleviates the typical disadvantages of autoregressive contexts and leads to a significant increase in performance compared to encoding each slice independently. We test the proposed method on a dataset of diverse CT scan images in a setting with an emphasis on the high-fidelity reconstruction required in medical imaging, and show that it compares favorably against several established state-of-the-art codecs in both performance and runtime.

Keywords
Learned Image Compression, Medical Image Data, Deep Learning



1. Introduction

Medical imaging is a set of techniques and processes that produce images of the interior of the body for the purpose of clinical analysis, medical intervention, or visual representation of the function of the internal organs. Examples of common types of imaging systems are X-rays, computed tomography (CT) scans, magnetic resonance imaging (MRI), or ultrasound (US). Medical imaging has become not only a staple tool for medical diagnosis and treatment but also a crucial component of research, as it allows researchers and physicians to establish a knowledge base of normal anatomy and physiology, making it possible to identify abnormalities and study the effects of medical intervention. For these reasons, the amount of image data produced in healthcare and medical research is huge and increasing [1], as are the requirements for efficient transmission and especially storage.

Figure 1: Illustrative example of a single uncompressed slice from the CT scan test set [6] used for performance evaluation.

Image compression methods are designed for exactly that – to enable more efficient coding of image data with little or no loss in visual quality. The first successful image compression techniques were developed in the early 1990s and some of them are still widely used today, for example the well-known JPEG method [2]. In recent years the development of novel compression methods for image and video has accelerated, in line with the growing amount of streamed image and video data. Modern image compression codecs such as BPG [3], AVIF [4], or WebP [5] typically appear as by-products of video codec development – the intra-frame component is extracted from the video codec and used as a standalone image codec.

For mainstream everyday use in applications such as image or video streaming, video calls, or online gaming, the goal is for the reconstructed image to appear "natural and artefact-free" at first glance while achieving compression ratios high enough to make the above-mentioned applications feasible. General-purpose video codecs are therefore developed for and tested mainly on natural sequences, screen content, or synthetic scenes (e.g. [7]) and are typically benchmarked in the perceptually lossy
range of < 40 dB reconstruction PSNR (e.g. [8]); the same holds for image codecs. In the case of medical imaging, the fundamental requirement is that the reconstruction error must not alter the subsequent clinical analysis. The reconstructed image must remain true to the original up to imperceptible "noise" void of any structure. We argue that using an established and straightforward objective metric such as PSNR for measuring the reconstruction error is the right approach here to ensure that the reconstructed image is truly nearly identical to the original when the reconstruction error is near zero. In our subjective tests (on an HDR display) we find that we are not able to distinguish between the original and reconstructed images above 55 dB PSNR, so that is approximately our target quality range. On the other hand, below 50 dB we could identify loss of subtle structure in some images. Having the images analyzed by medical experts is unfortunately too resource-intensive and beyond the scope of this work.

Another solution common in practice is using only lossless compression, but such methods never achieve compression ratios anywhere near as high as lossy methods (the difference is an order of magnitude) – for example, the study [9] finds that on medical data traditional lossless codecs hardly achieve compression ratios over 4:1, while on the test set the proposed method has an average ratio over 40:1 at PSNR > 55 dB. Proper research into lossy methods is therefore surely justified.

The traditional approach to image compression is hand-designed codecs implemented as hard-coded algorithms, based on human experience and intuition (see Sec. 2). As with many problems in image processing and computer vision in the last decade, avenues are being explored on how to learn optimal codecs from data. Modern research in learned image compression started with the work of Toderici et al. [10], the first fully learned method applicable to large images that outperformed some established traditional codecs. A surge of interest in learned image compression came after the seminal works of Ballé et al. [11, 12] and Minnen et al. [13]. These works laid the groundwork for further research and it can be argued that most state-of-the-art (SOTA) methods nowadays are extensions of these methods.

The core structure of a learned method typically consists of an autoencoder which transforms the input and produces a latent representation of the image which will constitute the bitstream. This representation is then quantized so that it can be passed to an entropy coder which losslessly converts the discrete representation to an actual bitstream. The third integral component is an entropy model of the latent, i.e. a probability distribution model of the symbols (after quantization) of the latent representation, as this is required by the entropy coder. This pipeline can be trained end-to-end in an unsupervised manner and the minimized loss is the sum of two terms: the distortion of the image reconstruction and the entropy (i.e. expected bitrate) of the latent. The entropy coder is used off-the-shelf and is not subject to training. One of the great advantages of learned image compression is that the training is relatively simple and cheap, which makes it possible to adapt a method to a particular modality, such as medical images, whereas for conventional hand-designed codecs such adaptation is not feasible.

The proposed method extends [13] to volumetric medical data consisting of individual slices, i.e. a sequence of 2D images. This type of data is acquired for example by a CT scanner (see Fig. 1 for an example) or an MRI. The individual slices are encoded in order. The transform from image data to the latent representation is done for each slice independently, but in the entropy estimation step the probability model of each slice (except the first) is conditioned on the previous slice, which enables a more accurate estimation of the latent distribution since neighboring slices typically have high mutual information. This allows for higher compression ratios with no loss in reconstruction quality. On the decoding side, the images are decoded in the same order, so that the previous slice is again available when decoding the next. Note that the proposed method works with already digitized uncompressed images in a normalized intensity range (typically 8–16 bit); it does not in any way enter the process of image generation by the above-mentioned imaging techniques.

We show in the experimental section that this relatively simple addition considerably outperforms the baseline approach in which all slices are processed completely independently by a single-image compression method. Additionally, compared to processing the full volume at once, our approach requires a fraction of the time and memory (in practice, it would be necessary to split the volume into small chunks and compress those separately anyway). We tested the method on a dataset consisting of CT scans of various human body parts and the proposed approach is competitive even compared to established standards such as JPEG, BPG, AVIF, and even VVC-intra.
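To make the pipeline described above concrete, the following minimal PyTorch-style sketch shows one rate-distortion training step. It is illustrative only: the encoder, decoder, and entropy_model callables, the straight-through rounding, and the per-pixel bit estimate are assumptions of this sketch, not the implementation evaluated in this paper.

import torch
import torch.nn.functional as F

def rate_distortion_step(encoder, decoder, entropy_model, x, lam):
    # Transform the image to its latent and quantize with a straight-through gradient.
    y = encoder(x)
    y_hat = y + (torch.round(y) - y).detach()
    # Rate term: expected bits, assuming entropy_model returns per-symbol
    # probabilities of the quantized latent.
    rate_bpp = -torch.log2(entropy_model(y_hat)).sum() / x.numel()
    # Distortion term: reconstruction error of the decoded image.
    distortion = F.mse_loss(decoder(y_hat), x)
    return rate_bpp + lam * distortion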



2. Related work

For a long time, lossy image and video compression was a problem solved exclusively in the traditional way by hand-designed methods. Some of these methods, such as for example the H.264 [14] or H.265 [15] video codecs or JPEG image compression [2], are now in widespread use in many areas of industry, research, and everyday life. Relatively recently, the first learned codecs appeared that were able to challenge some of the traditional methods. Arguably the biggest rise of interest started after the works of Ballé et al. [11, 12] and later Minnen et al. [13], which laid the foundation for learned image compression. These works formulated the main rate-distortion objective in a learnable way, presented a model containing the three fundamental components now present in the vast majority of learned codecs – the autoencoder for the image transform, and the hyper-prior and the context module for entropy estimation – and provided the solution for dealing with the discrete quantization in training. Subsequent methods increased the performance, for example, by richer/larger model architectures (e.g. using attention-like modules) [16], improved context modules [17, 18, 19], richer entropy models (e.g. Gaussian mixtures) [16], or different simulations of quantization [20, 21].

Recently, a promising research direction is coercing the reconstruction to better satisfy the expectations of the human visual system even at the expense of objective (e.g. PSNR) quality. This can be achieved for example by augmenting the loss with a term that better models human perception (such as LPIPS [22]) [19], or by training the decoder in an adversarial manner as in GANs [23, 24]. Such approaches can achieve significant bitrate savings but unfortunately are not suitable for medical data, where the reconstructed image must be objectively undistorted and not just look natural.

Literature on learned compression for medical images is relatively scarce; this area is still dominated by more traditional approaches such as compression in the wavelet domain [25]. Probably the closest match for the proposed method is the lossless compression of 3D volumes by Chen et al. [26]. In our work, however, we focus on lossy compression. Other works propose partitioning the image into relevant (for the diagnosis) and less relevant regions and applying different compression ratios to each [27]. Learned lossy compression for 2D medical images is investigated for example in [28].

3. Method

The proposed approach is based on the single-image compression method by Minnen et al. [13], which we extend to multi-slice volumetric images. The method [13] consists of three main components:
• An encoder/decoder which performs the transform between the input image space and the latent representation (commonly called "latent").
• A hyper-encoder/decoder (called hyper-prior) which analyzes the latent and stores a small piece of side information into the bitstream that is used later to estimate the parameters of the probability distribution of the latent (the entropy model).
• A context module that processes the image latent in an autoregressive fashion (i.e. causally) and is also a part of the entropy model parameter estimation.
The encoding and decoding branches of the pipeline are connected only via the bitstream, which stores the latent and hyper-latent representation of the image. To this end the latents must be quantized, for which scalar integer rounding is used, because the entropy coder that converts the values into their corresponding bit codes can only operate on discrete data (continuous values cannot be stored in the bitstream).

The advantage of the context module is that the entropy parameters can be very accurate and image-specific; the disadvantage is that the autoregressive processing does not play well with the parallel processing common in deep learning. For each new pixel to be decoded, the entropy parameters must first be estimated and the pixel decoded, and only then can the decoding move to the next pixel. As a result, a usually parallelized operation such as convolution cannot be computed for the whole image at once but pixel by pixel, in alternation with the entropy coder. Another disadvantage is that the context prevents using the so-called mean-subtracted quantization, which will be specified in the next section. We get rid of these drawbacks in the proposed method by replacing the autoregressive context from [13] with an analogous module that runs on the previous slice in the sequence.

Model details The input to our method is a sequence of 2D slices x^0, ..., x^{N−1} (superscripts denote slices, subscripts pixel indices) which are processed in order. The transforms to and from the latent representation, denoted y^i, are done for each slice independently, but the entropy model, i.e. the probability distribution p_ŷ(ŷ^i) of the quantized latent ŷ^i (the hat denotes quantization), is conditioned on the latent of the previous slice ŷ^{i−1}. This helps decrease the entropy of ŷ^i and therefore the necessary bitrate while avoiding the disadvantages of an autoregressive context model. It is done as follows: instead of running the context model on the currently encoded slice in an autoregressive fashion, we run it on the (quantized) latent ŷ^{i−1} of the previous slice. During decoding, the slices are processed in the same order, so ŷ^{i−1} has already been decoded in full and is available when ŷ^i is being decoded, and the entropy model can again use information from the previous slice. This approach does not require autoregressive processing but can instead be done in parallel for the whole slice, without waiting for each new pixel to be decoded. In other words, the context module is autoregressive in the slice sequence but that does not restrict any 2D operations contained within one slice such as convolutions – instead of decoding individual pixels we can decode whole slices in parallel.
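To illustrate this design, the sketch below (PyTorch) computes the entropy parameters of the current slice in a single parallel pass from the previous slice's quantized latent and the hyper-decoder output. The channel counts follow Tab. 1, but the module itself (including the softplus used to keep the scale positive) is an assumption of this sketch, not the authors' released code.

import torch
import torch.nn as nn

class PrevSliceEntropyModel(nn.Module):
    def __init__(self, latent_ch=192):
        super().__init__()
        # Stride-1 2D convolution over the previous slice's quantized latent;
        # no spatial autoregression within the current slice.
        self.context = nn.Conv2d(latent_ch, 2 * latent_ch, kernel_size=5, padding=2)
        # 1x1 convolutions merging context and hyper-prior information
        # (channel counts as in Tab. 1: 768 -> 768 -> 576 -> 384).
        self.entropy = nn.Sequential(
            nn.Conv2d(4 * latent_ch, 4 * latent_ch, 1), nn.PReLU(),
            nn.Conv2d(4 * latent_ch, 3 * latent_ch, 1), nn.PReLU(),
            nn.Conv2d(3 * latent_ch, 2 * latent_ch, 1),
        )

    def forward(self, y_hat_prev, hyper_out):
        ctx = self.context(y_hat_prev)            # whole slice processed in parallel
        params = self.entropy(torch.cat([ctx, hyper_out], dim=1))
        mu, sigma = params.chunk(2, dim=1)        # per-pixel Laplace parameters
        return mu, nn.functional.softplus(sigma)  # scale must be positive

Because nothing here is causal within the slice, the same code runs unchanged in encode and decode, which is exactly the property the autoregressive context of [13] lacks.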




Figure 2: Overview of the proposed compression pipeline. Connectors: green are operations performed only in encode, red are operations performed only in decode, and blue are operations performed in both encode and decode. Checkerboard denotes the bitstream. Procedure: the input image is passed through an encoder, producing the latent y^i. The latent is concatenated with the latent of the previous slice, y^{i−1}, and passed through the hyper-encoder, producing the hyper-latent z^i. This hyper-latent is quantized to ẑ^i and stored using the fixed entropy model p(ẑ). Parameters of the image-adaptive entropy model p(ŷ^i) are estimated by a context module that processes the previous slice's latent ŷ^{i−1}, and a hyper-decoder that processes the hyper-latent ẑ^i. These two are concatenated and passed through an entropy module to produce the entropy parameters (μ^i, σ^i). The latent ŷ^i is stored in the bitstream. In decode, the hyper-decoder, context, and entropy module have to run again because the parameters (μ^i, σ^i) are required for decoding ŷ^i from the bitstream; for this, the latent of the previous slice ŷ^{i−1} is already available. The decoded latent ŷ^i is passed through the decoder to produce the reconstructed image x̂^i.



We model the distribution p_ŷ(ŷ^i) of the quantized latent ŷ^i by a Laplace distribution, independent per dimension j (i.e. per spatial pixel and channel), with mean and scale parameters (μ^i_j, σ^i_j). These two parameters are estimated adaptively for each image i and each pixel j (incl. channels) of the latent by the hyper-prior and the context module. For quantization of the latent we use integer rounding with mean-subtraction, meaning that the value is first offset by the estimated mean of its distribution before being rounded (image index omitted):

ŷ_j = ⌊y_j − μ_j⌉ + μ_j,    (1)

where ⌊·⌉ is integer rounding. This improves performance because quantization then does not change the mean of the distribution, but it requires that the entropy parameters of the latent are estimated before the latent is quantized. In particular, both of the entropy estimation modules (hyper-prior and context) must operate on non-quantized values y^i, otherwise an implicit relation would arise. This is difficult to achieve in a single-image autoregressive context model – for example, the quantization in [13] does not use mean-subtraction – but since in the proposed method the context module uses the previous slice, using mean-subtraction is possible.

The full procedure of processing a slice x^i is illustrated in Fig. 2. The image is passed through an encoder E, producing the latent y^i = E(x^i). The latent is concatenated with the latent of the previous slice, y^{i−1}, and passed through the hyper-encoder E_h, producing the hyper-latent z^i = E_h([y^{i−1}, y^i]). This hyper-latent is quantized, ẑ^i = Q(z^i), so that it can be stored in the bitstream. The parameters of the entropy model of the quantized latent ŷ^i are estimated as follows. A context module C processes the previous slice's latent ŷ^{i−1} and a hyper-decoder D_h processes the hyper-latent ẑ^i. These two are concatenated and passed through an entropy module E_p to produce the final entropy parameters (μ^i_j, σ^i_j) = E_p([C(ŷ^{i−1}), D_h(ẑ^i)])^i_j for each pixel j of the latent. With these parameters available, the latent can be quantized and stored in the bitstream, and the encoding proceeds to the next slice.

During decoding, the operations responsible for estimating the entropy model p_ŷ(ŷ^i) have to be executed again because the entropy model is required by the coder to decode ŷ^i from the bitstream. The hyper-latent ẑ^i is decoded first, and since the latent of the previous slice ŷ^{i−1} is already decoded and available, the estimation of the entropy parameters (μ, σ) proceeds as during encoding. Having those, ŷ^i can be decoded and passed through the decoder D to finally produce the reconstructed image x̂^i = D(ŷ^i). The decoding then proceeds to the next slice.
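A two-line PyTorch sketch of the mean-subtracted rounding of Eq. (1) may be helpful; the toy values are ours, for illustration only.

import torch

def quantize_mean_subtracted(y, mu):
    # Eq. (1): round relative to the predicted mean, then add the mean back,
    # so quantization does not shift the mean of the latent distribution.
    return torch.round(y - mu) + mu

y = torch.tensor([1.3, -0.2, 4.7])
mu = torch.tensor([1.1, 0.1, 4.9])
print(quantize_mean_subtracted(y, mu))  # tensor([1.1000, 0.1000, 4.9000])

Every latent value lands on the integer grid shifted to its predicted mean, which is why only the integer offsets ⌊y_j − μ_j⌉ need to be entropy-coded.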



What remains is to specify the entropy model p_ẑ(ẑ) of the hyper-latent ẑ, since that is also processed by the entropy coder and stored in the bitstream. We model it by a per-channel Laplace distribution, meaning that each channel of z^i has its own mean and scale parameters (μ, σ), but these are spatially constant so that the model is not tied to a fixed image resolution. These parameters are subject to training but fixed once the model has been trained (i.e. unlike p_ŷ(ŷ), this model is not image-adaptive). For quantization of z we again use mean-subtracted rounding in a similar fashion as in Eq. (1).

Details of the model architecture are concisely summarized in Tab. 1.

Table 1
Model architecture details. conv is a Conv2D layer with kernel size k, stride s and output channels c. transpose is a similarly specified ConvTranspose2D. GDN and IGDN are the generalized divisive normalization layer [11] and its inverse, respectively. PReLU is the parametric ReLU [30].
Encoder: conv k5 s2 c192 → GDN → conv k5 s2 c192 → GDN → conv k5 s2 c192 → GDN → conv k5 s2 c192
Decoder: transpose k5 s2 c192 → IGDN → transpose k5 s2 c192 → IGDN → transpose k5 s2 c192 → IGDN → transpose k5 s2 c1
Hyper-encoder: conv k3 s1 c192 → PReLU → conv k5 s2 c192 → PReLU → conv k5 s2 c192
Hyper-decoder: conv k5 s2 c192 → PReLU → conv k5 s2 c288 → PReLU → conv k3 s1 c384
Context: conv k5 s1 c384
Entropy module: conv k1 s1 c768 → PReLU → conv k1 s1 c576 → PReLU → conv k1 s1 c384

Training details In training we optimize the rate-distortion loss L (image indices omitted)

L = E_{x∼p_x}[−log₂ p_ŷ(ŷ)] + E_{x∼p_x}[−log₂ p_ẑ(ẑ)] + λ · 255² · E_{x∼p_x}[‖x − x̂‖₂²],    (2)

where λ controls the rate-distortion tradeoff (it determines the approximate target bitrate) and the expectation over p_x, the distribution of uncompressed images, is evaluated by batch averaging. The first two terms on the right-hand side are the approximate (theoretical) bitrates required by the entropy coder to encode the latents. These are used in training as an estimate of the actual bitrates because the non-differentiable entropy coders are removed from training.

Our description of p_ŷ(ŷ) and p_ẑ(ẑ) so far was somewhat simplified. The Laplace parametric density is used only as a model to conveniently parametrize the discrete distribution over the symbols after quantization. In the actual evaluation, however, we have to account for the whole interval corresponding to each discrete value because of quantization. This is done by integrating the parametric density over the corresponding interval, for example

p_ŷ(ŷ^i_j) = ∫_{ŷ^i_j − 1/2}^{ŷ^i_j + 1/2} P_{ŷ^i_j}(t) dt,    (3)

where P_{ŷ^i_j} is the continuous Laplace density parametrized by the (μ, σ) corresponding to p_ŷ(ŷ^i_j), the discrete distribution of ŷ^i_j. In practice, this is done using the cumulative distribution function of the Laplace density.

In each training iteration we randomly sample a small subset of n consecutive slices from each image in the batch and process those through the model as a small volume. For the first slice x^0 of this subset we calculate the latent y^0 = E(x^0) using an auxiliary single-image model which shares the same encoder with the multi-slice model. For x^1, ..., x^{n−1} we proceed as described above, and these slices are used to evaluate the loss in Eq. (2). The first slice x^0 is excluded from the optimization of the multi-slice model but is used to train the auxiliary single-slice model used for compression of the first slice in each volumetric series. This auxiliary model has the same encoder/decoder as the multi-slice model and the same architecture (not weights) of the hyper-prior, but does not include the context and entropy module – the hyper-decoder directly predicts the (μ, σ) parameters of the latent entropy model. In validation and testing, we use this auxiliary model to compress the first slice of the volume and then proceed sequentially with the multi-slice model.

The quantization operation must be approximated during training because it has zero gradient almost everywhere. For both the latents y and hyper-latents z we use straight-through quantization [29], which performs integer rounding in the forward pass but acts as the identity in the backward pass. For the evaluation of the bitrate in the entropy models, however, we simulate quantization by additive uniform noise from the (−1/2, 1/2) range. This way the hyper-decoder and decoder get the more realistic integer-rounded values (with mean-subtraction as in Eq. (1)), but the entropy estimation is calculated using the uniform noise simulation, which reportedly leads to better performance [20].
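The following PyTorch sketch makes Eq. (3) and the two training-time quantization surrogates concrete. The closed-form Laplace CDF is standard; the clamping constant and the exact form of the straight-through expression are choices made for this sketch.

import torch

def laplace_cdf(t, mu, sigma):
    # Closed-form CDF of the Laplace distribution with mean mu and scale sigma.
    z = (t - mu) / sigma
    return 0.5 + 0.5 * torch.sign(z) * (1.0 - torch.exp(-z.abs()))

def discrete_likelihood(y_hat, mu, sigma):
    # Eq. (3): probability mass of the +-1/2 interval around each quantized
    # value; the small clamp guards the subsequent logarithm.
    p = laplace_cdf(y_hat + 0.5, mu, sigma) - laplace_cdf(y_hat - 0.5, mu, sigma)
    return p.clamp_min(1e-9)

def train_time_latents(y, mu):
    # Straight-through mean-subtracted rounding for the decoder path
    # (forward: Eq. (1), backward: identity) ...
    ste = y + (torch.round(y - mu) + mu - y).detach()
    # ... and additive uniform noise used for the rate term only.
    noisy = y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return ste, noisy

The rate term of Eq. (2) is then evaluated as -torch.log2(discrete_likelihood(noisy, mu, sigma)).sum() over the batch.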
4. Results

Dataset We trained and tested the method on the Pediatric-CT-SEG dataset of CT-scan images of various organs downloaded from the Cancer Imaging Archive [6] (patient and acquisition parameters are specified therein). We chose this dataset for its diverse content. The dataset consists of 359 volumetric images, each with a different number of slices ranging from 41 to 1104. We randomly selected 10 of the volumetric images for testing (2184 slices in total) and the rest for training. The 2D slices are 12-bit grayscale images with a resolution of 512×512, originally stored uncompressed at 16 bits per pixel (bpp). An example slice from the dataset is shown in Fig. 1.

Training We trained the model on random spatial crops of size 256×256 and tested it on full-resolution images. For training, we randomly chose n = 3 consecutive slices as a good compromise between training speed and exploiting the sequential processing. We trained with batch size 8 using the Adam optimizer [31] with an initial learning rate of 1e−4 for 1M iterations, after which we decreased the learning rate to 1e−5 for another 200k iterations. We trained a new model for 6 values of λ in the range from 0.032 to 3.2, which on the test set results in 0.05 to 0.65 bits per pixel, thus achieving a compression ratio of 25:1 to 320:1 with respect to the original images.

Figure 3: Rate-distortion performance of the proposed and benchmark methods on the test set of CT-scan images.

Benchmark methods We compare the performance of the proposed method with a baseline learned single-image compression model and a number of established traditional image compression methods. The single-image baseline is a learned model with the same architecture as the auxiliary model we use to compress the first slice and was trained on the same train set. Comparison with this method shows the performance gain from the proposed sequential processing and the context module. The traditional methods are a broad selection ranging from well-known and established codecs commonly used in practice to state-of-the-art prototypes. Such a comparison therefore positions the proposed method well in the landscape of existing methods and gives insight into its properties in potential practical use. Below we briefly describe each of the methods used in the comparison and, where relevant, its configuration; afterwards we provide commentary on the results summarized in Fig. 3 and Tab. 2.

Baseline is a learned single-image compression model with the same architecture as the proposed method but without the context and entropy module (the hyper-decoder directly predicts the entropy parameters). It is trained on the same train set as the proposed method and uses the same training schedule. JPEG [2] is a well-known, widely used compression method developed in the 1990s. Although used for medical data and having the advantage of being very fast in both encode and decode, it is arguably not a very suitable method for such use, as its performance is relatively low by today's standards. We use the implementation in Pillow. BPG [3] is essentially a single-image wrapper of the intra-frame compression of the HEVC (also known as H.265) video codec. Although not widespread, it is one of the top methods currently available for everyday use. We used the jctvc encoder via the public BPG library configured to 12-bit internal bitdepth. AVIF [32] is a single-image format of the AV1 video codec (essentially AV1-intra), one of today's top codecs among those that are readily available, e.g. in browsers. In the comparison we used the libaom-av1 encoder via ffmpeg configured to 12-bit internal processing; each slice in the series is compressed individually. AV1 [32] is a video codec approximately on the level of, or slightly outperforming, HEVC in terms of quality, but unlike HEVC its use is royalty-free; it is therefore arguably the best video codec readily available today (with production-level encoders and decoders available). In our comparison, we used ffmpeg/libaom-av1 in 12-bit mode and compressed each volumetric image as a video sequence consisting of the individual slices. VVC [33] (H.266) is the best existing video codec nowadays, but its development is still ongoing, the available encoders/decoders are at the prototype level, and for most practical use cases it is prohibitively slow. Its adoption in practice, medical or otherwise, is also hindered by the fact that its use is not royalty-free. We used the VTM 18 reference implementation in 12-bit mode and again compressed each volumetric image as a video sequence consisting of the individual slices. VVC-intra [33] is the intra mode of VVC. For single-image compression it is the best available codec nowadays but currently inherits the disadvantages listed above for VVC. We used it in the same configuration as VVC video but compressed each slice individually.
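For illustration, a volume can be fed to libaom-av1 as a video sequence with a call along the following lines; the file pattern, CRF value, and pixel format here are assumptions of this sketch, not the exact settings behind the reported numbers.

import subprocess

def encode_volume_av1(slice_pattern: str, out_path: str, crf: int = 20) -> None:
    # Encode the slice sequence with libaom-av1 in constant-quality mode,
    # requesting 12-bit grayscale processing (hypothetical configuration).
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", "25",             # nominal; the slices are not a real video
        "-i", slice_pattern,            # e.g. "volume_%04d.png", 16-bit PNG slices
        "-c:v", "libaom-av1",
        "-crf", str(crf), "-b:v", "0",  # constant-quality rate control
        "-pix_fmt", "gray12le",         # 12-bit grayscale, if the build supports it
        out_path,
    ], check=True)

encode_volume_av1("volume_%04d.png", "volume.mkv")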
Results The rate-distortion curves of the benchmarked methods on the CT-scan test set are shown in Fig. 3, their ranking and a quantitative comparison with respect to VVC-intra are in Tab. 2, and finally Tab. 3 shows the approximate relative runtimes required to process the test set. In the testing we focused on the high-PSNR range since we envision the proposed method being used primarily in the medical domain, where sliced volumetric images are common. Let us provide some commentary on the results.

Table 2
Relative bitrate increase (BD-Rate [34], negative means savings) and quality gain (BD-PSNR [34], positive means improvement) of the benchmarked methods compared to VVC-intra in the range PSNR > 45 dB.
Method       BD-Rate [%]   BD-PSNR [dB]
JPEG            +248.2        -7.57
BPG              +51.6        -2.40
AVIF             +22.7        -1.26
Baseline         +20.4        -1.14
AV1               +6.2        -0.40
VVC-intra          0.0         0.00
Proposed         -11.4        +0.64
VVC              -23.6        +1.44

Table 3
Approximate relative time required to encode and decode the full test set at bpp = 0.3 compared to the proposed method (t = 35 seconds). Times include file I/O where unavoidable.
Method       Device   Time [t]
JPEG         CPU        2e-1
Baseline     GPU        7e-1
Proposed     GPU          1.
BPG          CPU         7e1
AVIF         CPU       1.4e2
AV1          CPU         1e3
VVC-intra    CPU       1.5e3
VVC          CPU       5.5e3

The baseline learned method performs on the level of AVIF – the curves almost overlap. Although AVIF is undoubtedly a better codec in a general setting, the learned baseline exploits the advantage of domain specificity – it has been trained on similar CT data. BPG generally performs well on natural images where the target PSNR is usually lower, but to achieve imperceptible distortion in medical data we observed that the reconstruction PSNR should be above 55 dB (for typical images with sufficient structure). We suspect there is some issue with the configuration of the encoder at high-bitdepth processing, because BPG visibly struggles to achieve high PSNRs. It is no surprise that JPEG cannot compete with the latest methods. VVC-intra does very well and outperforms AVIF by a large margin over the whole range. With AV1 we experienced similar problems as with BPG – it apparently "saturates" at higher bitrates and struggles to achieve high PSNR, which is possibly again some issue with the high-bitdepth configuration of the encoder (although we used the same encoder as for AVIF, and in that case it worked fine). But from the comparison with AVIF at low to mid bitrates we can see that the sequential "video" processing of the image volume is clearly beneficial, with a noticeable performance gain. This conclusion is further strengthened by the results of the VVC (video) codec, which on performance alone is the clear winner of the whole comparison, outperforming all other methods (including the proposed) by a margin over the whole range.

The proposed method is significantly better than the baseline (compare the green and orange curves in Fig. 3), on average achieving almost 30% rate savings (for the same quality) and a 1.8 dB quality increase (for the same rate). It also outperforms all image codecs such as AVIF, BPG, and especially VVC-intra, which is no small feat. This is due solely to the proposed sequential context, because the baseline alone is significantly below VVC-intra. It is, however, still a relatively small and simple model and therefore no match for VVC, but we will see that in that comparison it wins on runtime.

A clear and quantitative ranking of the methods is provided in Tab. 2, which shows the average bitrate increase/savings and PSNR quality loss/gain evaluated by BD-Rate and BD-PSNR [34], respectively. We positioned VVC-intra as the reference SOTA image codec and compared all the others against it, based on how they perform on the test set. The table shows the average bitrate savings and performance gain in the middle and right columns, respectively. Only the proposed method and VVC video achieve an improvement (BD-Rate is negative and BD-PSNR is positive).

Finally, in Tab. 3 we show the relative runtimes required for processing (encode and decode) the whole test set (10 volumetric images consisting of 2184 slices) with respect to the proposed method (i.e. a value < 1 means the method is faster than ours, > 1 means it is slower). These runtimes are listed for bpp = 0.3, approximately the middle of the tested range, since the traditional methods are slower at higher bitrates (the proposed method has constant speed across the range). Here the ranking is quite different from the performance ranking. JPEG and of course the baseline are the only methods faster than the proposed one; all others are slower, some of them quite significantly, especially the well-performing VVC, which is clearly prohibitively slow. We argue that the video codecs are simply not fast enough for practical use. In fairness, the proposed method and the baseline run on a GPU (though still each slice sequentially) while the traditional methods are CPU-only without any external parallelization. On the other hand, our implementation is intended only as a proof of concept and we did not invest much effort into runtime optimization. For example, in both encode and decode the encoder and decoder process each slice independently. In testing we process them sequentially for simplicity, while it is possible to "batch" them and process them in parallel (as many as the GPU memory permits), which would reduce the runtime.
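For reference, the BD-Rate numbers in Tab. 2 follow the standard Bjontegaard calculation [34]; a compact numpy sketch is given below. The cubic fit of log-rate against PSNR is the conventional choice rather than anything specific to this paper, and restricting the input curves to the PSNR > 45 dB range of Tab. 2 is assumed to be done by the caller.

import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    # Fit log10(rate) as a cubic function of PSNR for both codecs.
    p_ref = np.polyfit(psnr_ref, np.log10(rate_ref), 3)
    p_test = np.polyfit(psnr_test, np.log10(rate_test), 3)
    # Integrate both fits over the common PSNR range.
    lo = max(np.min(psnr_ref), np.min(psnr_test))
    hi = min(np.max(psnr_ref), np.max(psnr_test))
    P_ref, P_test = np.polyint(p_ref), np.polyint(p_test)
    avg_diff = ((np.polyval(P_test, hi) - np.polyval(P_test, lo))
                - (np.polyval(P_ref, hi) - np.polyval(P_ref, lo))) / (hi - lo)
    # Average relative bitrate difference in percent; negative means savings.
    return (10.0 ** avg_diff - 1.0) * 100.0

Called with VVC-intra as the reference and the proposed method as the test codec, this should reproduce a value close to the −11.4% in Tab. 2, up to fitting details.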
Contrary to usual customs, we do not provide examples and a qualitative comparison of image reconstructions because, due to the high reconstruction quality and similar performance of the benchmarked methods, we were not able to come up with example images that demonstrate any noticeable difference – on screen all the results look identical.

5. Conclusion

We presented an extension of a single-image learned compression method to volumetric multi-slice images with an emphasis on the medical domain, where such images are quite common. Although the modification is relatively simple and straightforward, it provides several benefits – namely using a context module without introducing any problems with parallel processing in decode, and using mean-subtracted quantization. Both of these improve performance without compromising the runtime. We verified this in the comparison with a number of established compression methods. The comparison shows:
• Clear performance gain with respect to the baseline due to the proposed sequential context.
• Good performance in absolute numbers with respect to the established codecs.
• Very competitive runtimes (if GPUs are allowed).
The testing was carried out with an emphasis on low-error reconstruction, and even at PSNR = 55 dB (in most cases indistinguishable from the original) the proposed method achieves an average compression ratio of 40:1 with respect to the uncompressed original. We consider these results a solid proof of concept for compression of volumetric medical data.

Nevertheless, there are a number of things which can be improved or investigated further. For example, the baseline model used is far from SOTA, so higher absolute performance can be gained by adopting one of the SOTA single-image learned methods as a backbone and extending it with the proposed context model. In this work, however, we focused more on investigating the relative gains from the sequential context rather than on absolute performance. Next, in decode our method currently does not permit random access (as in "show me slice 42"); the whole volume needs to be decoded sequentially from the beginning. But this can be remedied by introducing intra-frames compressed by the single-image auxiliary method we use for the first slice. If we use a GOP size of 8 (meaning at most 8 slices need to be decoded for any chosen slice), we can estimate that the performance in Tab. 2 would drop approximately from −11.4% to −7.5% in rate and from +0.64 dB to +0.42 dB in PSNR, which is still a solid improvement over VVC-intra with practically usable runtimes in both encode and decode. But by looking at the results of VVC we see that further gains are undoubtedly possible, and we hypothesize that those can be achieved for example by a stronger context module (ours is a rather simple stack of convolutions, not in any way input-adaptive) and possibly by introducing P-frames and B-frames as in video encoding. It is our hope that this work will motivate further research into such possibilities.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 965502.

References

[1] Radiation risk from medical imaging, https://www.health.harvard.edu/cancer/radiation-risk-from-medical-imaging, Sep 2021.
[2] G. Wallace, The JPEG still picture compression standard, IEEE Transactions on Consumer Electronics 38 (1992) xviii–xxxiv.
[3] F. Bellard, BPG image format, https://bellard.org/bpg, 2018. Accessed: 2021-09-24.
[4] AVIF image format, https://aomediacodec.github.io/av1-avif, 2022. Accessed: 2022-12.
[5] Google, WebP image format, https://developers.google.com/speed/webp, 2018. Accessed: 2021-09-24.
[6] Pediatric-CT-SEG, Cancer Imaging Archive, https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=89096588, Aug 2022.
[7] AOM common test conditions v2.0, https://aomedia.org/docs/CWG-B075o_AV2_CTC_v2.pdf, Aug 2021.
[8] F. Mentzer, G. Toderici, D. Minnen, S.-J. Hwang, S. Caelles, M. Lucic, E. Agustsson, VCT: A video compression transformer, 2022. URL: https://arxiv.org/abs/2206.07307.
[9] J. Kivijärvi, T. Ojala, T. Kaukoranta, A. Kuba, L. Nyúl, O. Nevalainen, A comparison of lossless compression methods for medical images, Computerized Medical Imaging and Graphics 22 (1998) 323–339.
[10] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, M. Covell, Full resolution image compression with recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] J. Ballé, V. Laparra, E. P. Simoncelli, End-to-end optimized image compression, in: International Conference on Learning Representations, 2017.
[12] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, N. Johnston, Variational image compression with a scale hyperprior, in: International Conference on Learning Representations, 2018.
[13] D. Minnen, J. Ballé, G. D. Toderici, Joint autoregressive and hierarchical priors for learned image compression, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
[14] T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology 13 (2003) 560–576.
[15] G. J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand, Overview of the High Efficiency Video Coding (HEVC) standard, IEEE Transactions on Circuits and Systems for Video Technology 22 (2012) 1649–1668.
[16] Z. Cheng, H. Sun, M. Takeuchi, J. Katto, Learned image compression with discretized Gaussian mixture likelihoods and attention modules, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[17] D. Minnen, S. Singh, Channel-wise autoregressive entropy models for learned image compression, in: 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 3339–3343.
[18] D. He, Y. Zheng, B. Sun, Y. Wang, H. Qin, Checkerboard context model for efficient learned image compression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 14771–14780.
[19] D. He, Z. Yang, W. Peng, R. Ma, H. Qin, Y. Wang, ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5718–5727.
[20] L. Theis, W. Shi, A. Cunningham, F. Huszár, Lossy image compression with compressive autoencoders, in: International Conference on Learning Representations, 2017.
[21] Z. Guo, Z. Zhang, R. Feng, Z. Chen, Soft then hard: Rethinking the quantization in neural image compression, in: International Conference on Machine Learning, PMLR, 2021, pp. 3920–3929.
[22] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[23] F. Mentzer, G. D. Toderici, M. Tschannen, E. Agustsson, High-fidelity generative image compression, Advances in Neural Information Processing Systems 33 (2020).
[24] D. He, Z. Yang, H. Yu, T. Xu, J. Luo, Y. Chen, C. Gao, X. Shi, H. Qin, Y. Wang, PO-ELIC: Perception-oriented efficient learned image coding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 1764–1769.
[25] T. Bruylants, A. Munteanu, P. Schelkens, Wavelet based volumetric medical image compression, Signal Processing: Image Communication 31 (2015) 112–133.
[26] Z. Chen, S. Gu, G. Lu, D. Xu, Exploiting intra-slice and inter-slice redundancy for learning-based lossless volumetric image compression, IEEE Transactions on Image Processing 31 (2022) 1697–1707.
[27] M. U. A. Ayoobkhan, E. Chikkannan, K. Ramakrishnan, Feed-forward neural network-based predictive image coding for medical image compression, Arabian Journal for Science and Engineering 43 (2018) 4239–4247.
[28] D. Mishra, S. K. Singh, R. K. Singh, Lossy medical image compression using residual learning-based dual autoencoder model, in: 2020 IEEE 7th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), 2020, pp. 1–5.
[29] Y. Bengio, Estimating or propagating gradients through stochastic neurons, 2013. URL: https://arxiv.org/abs/1305.2982.
[30] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[31] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR (Poster), 2015.
[32] J. Han, B. Li, D. Mukherjee, C.-H. Chiang, A. Grange, C. Chen, H. Su, S. Parker, S. Deng, U. Joshi, et al., A technical overview of AV1, Proceedings of the IEEE 109 (2021) 1435–1462.
[33] B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, J.-R. Ohm, Overview of the Versatile Video Coding (VVC) standard and its applications, IEEE Transactions on Circuits and Systems for Video Technology (2021) 1–1.
[34] G. Bjontegaard, Calculation of average PSNR differences between RD-curves, VCEG-M33 (2001).