<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learned Lossy Image Compression for Volumetric Medical Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Kotera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Wödlinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Keglevic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CVL, TU Wien</institution>
          ,
          <addr-line>Favoritenstraße 9/11, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information Theory</institution>
          ,
          <addr-line>CAS, Pod Vodárenskou věží 4, 182 00 Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work addresses the problem of lossy compression of volumetric images consisting of individual slices such as those produced by CT scans and MRI machines in medical imaging. We propose an extension of a single-image lossy compression method with an autoregressive context module to a sequential encoding of the volumetric slices. In particular, we remove the intra-slice autoregressive relation and instead condition the entropy model of the latent on the previous slice in the sequence. This modification alleviates the typical disadvantages of autoregressive contexts and leads to a significant increase in performance compared to encoding each slice independently. We test the proposed method on a dataset of diverse CT scan images in a setting with an emphasis on high-fidelity reconstruction required in medical imaging and show that it compares favorably against several established state-of-the-art codecs in both performance and runtime.</p>
      </abstract>
      <kwd-group>
        <kwd>Learned Image Compression</kwd>
        <kwd>Medical Image Data</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Medical imaging is a set of techniques and processes that produce images of the interior of the body for the purpose of clinical analysis, medical intervention, or visual representation of the function of the internal organs. Examples of common types of imaging systems are x-rays, computed tomography (CT) scans, magnetic resonance imaging (MRI), or ultrasound (US). Medical imaging has become a staple tool not only for medical diagnosis and treatment but also a crucial component of research, as it allows researchers and physicians to establish a knowledge base of normal anatomy and physiology to make it possible to identify abnormalities and study the effects of medical intervention. For these reasons, the amount of image data produced in healthcare and medical research is huge and increasing [<xref ref-type="bibr" rid="ref1">1</xref>], as are the requirements for efficient transmission and especially storage.
      </p>
      <p>Figure 1: Illustrative example of a single uncompressed slice from the CT scan test set [<xref ref-type="bibr" rid="ref6">6</xref>] used for performance evaluation.</p>
      <p>
        Image compression methods are designed for exactly that – to enable more efficient coding of image data with little or no loss in visual quality. The first successful image compression techniques were developed in the early 1990s and some of those are still widely used today, such as for example the well-known JPEG method [<xref ref-type="bibr" rid="ref2">2</xref>]. In recent years the development of novel compression methods for image and video has accelerated, in line with the growing amount of streamed image and video data. Modern image compression codecs such as BPG [<xref ref-type="bibr" rid="ref3">3</xref>], AVIF [<xref ref-type="bibr" rid="ref4">4</xref>], or WebP [<xref ref-type="bibr" rid="ref5">5</xref>] typically appear as by-products of video codec development – the intra-frame component is extracted from the video codec and used as a standalone image codec.
      </p>
      <p>
        For mainstream everyday use in applications such as image or video streaming, video calls, or online gaming, the goal is for the reconstructed image to appear “natural and artefact-free” at first glance while achieving compression ratios high enough to make these applications feasible. General-purpose video codecs are therefore developed for and tested mainly on natural sequences, screen content, or synthetic scenes (e.g. [<xref ref-type="bibr" rid="ref7">7</xref>]) and are typically benchmarked in the perceptually lossy range of &lt; 40dB reconstruction PSNR (e.g. [<xref ref-type="bibr" rid="ref8">8</xref>]); similarly for image codecs. In the case of medical imaging, the fundamental requirement is that the reconstruction error must not alter the subsequent clinical analysis: the reconstructed image must remain true to the original up to imperceptible “noise” void of any structure. We argue that using an established and straightforward objective metric such as PSNR for measuring the reconstruction error is the right approach here to ensure that the reconstructed image is truly nearly identical to the original when the reconstruction error is near zero. In our subjective tests (on an HDR display) we found that we were not able to distinguish between the original and reconstructed images above 55dB PSNR, so that is approximately our target quality range. On the other hand, below 50dB we could identify loss of subtle structure in some images. Having the images analyzed by medical experts is unfortunately too resource-intensive and beyond the scope of this work.
      </p>
      <p>
        Another solution common in practice is using only lossless compression, but such methods never achieve anywhere near as high compression ratios (by an order of magnitude) as lossy methods – for example, the study [<xref ref-type="bibr" rid="ref9">9</xref>] finds that on medical data the traditional lossless codecs hardly achieve compression ratios over 4:1, while on the test set the proposed method has an average ratio over 40:1 at PSNR &gt; 55dB. Proper research into lossy methods is therefore surely justified.
      </p>
      <p>
        The traditional approach to image compression is hand-designed codecs implemented as hard-coded algorithms based on human experience and intuition (see Sec. 2). As with many problems in image processing and computer vision in the last decade, avenues are being explored on how to learn optimal codecs from data. Modern research in learned image compression started with the work of Toderici et al. [<xref ref-type="bibr" rid="ref10">10</xref>] as the first fully learned method applicable to large images and outperforming some established traditional codecs. A surge of interest in learned image compression came after the seminal works of Ballé et al. [<xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>] and Minnen et al. [<xref ref-type="bibr" rid="ref13">13</xref>]. These works laid the groundwork for further research and it can be argued that most state-of-the-art (SOTA) methods nowadays are extensions of these methods.
      </p>
      <p>
        The core structure of a learned method typically consists of an autoencoder which transforms the input and produces a latent representation of the image which will constitute the bitstream. This representation is then quantized so that it can be passed to an entropy coder which losslessly converts the discrete representation to an actual bitstream. The third integral component is an entropy model of the latent, i.e. a probability distribution model of the symbols (after quantization) of the latent representation, as this is required by the entropy coder. This pipeline can be trained end-to-end in an unsupervised manner and the minimized loss is the sum of two terms: the distortion of the image reconstruction and the entropy (i.e. expected bitrate) of the latent. The entropy coder is used off-the-shelf and is not subject to training. One of the great advantages of learned image compression is that the training is relatively simple and cheap, which makes it possible to adapt a method to a particular modality, such as medical images, whereas for conventional hand-designed codecs such adaptation is not feasible.
      </p>
      <p>
        The proposed method extends [<xref ref-type="bibr" rid="ref13">13</xref>] to volumetric medical data consisting of individual slices, i.e. a sequence of 2D images. This type of data is acquired for example by a CT scan (see Fig. 1 for an example) or an MRI. The individual slices are encoded in order. The transform from image data to the latent representation is done for each slice independently, but in the entropy estimation step the probability model of each slice (except the first) is conditioned on the previous slice, which enables a more accurate estimation of the latent distribution since neighboring slices typically have high mutual information. This allows for higher compression ratios with no loss in reconstruction quality. On the decoding side, the images are decoded in the same order, so that the previous slice is again available when decoding the next. Note that the proposed method works with already digitized uncompressed images in a normalized intensity range (typically 8bit–16bit); it does not in any way enter the process of image generation by the above mentioned imaging techniques.
      </p>
      <p>
        We show in the experimental section that this relatively simple addition considerably outperforms the baseline approach in which all slices are processed completely independently by a single-image compression method. Additionally, compared to processing the full volume at once, our approach requires a fraction of the time and memory (in practice, it would be necessary to split the volume into small chunks and compress those separately anyway). We tested the method on a dataset consisting of CT scans of various human body parts and the proposed approach is competitive even compared to established standards such as JPEG, BPG, AVIF, and even VVC-intra.
      </p>
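      <p>The rate-distortion structure sketched above – a transform, scalar quantization, and an entropy model that prices the quantized symbols in bits, balanced against reconstruction distortion – can be illustrated numerically. The following is a toy sketch only (a hand-made 1-D “latent” and a Laplace entropy model; none of the names or numbers come from the paper's actual networks):</p>
      <preformat><![CDATA[
```python
import math

# Toy sketch of the learned-compression objective: bitrate of the quantized
# latent under a Laplace entropy model, plus a weighted distortion term.

def laplace_cdf(x, mu, sigma):
    z = (x - mu) / sigma
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)

def rate_bits(symbols, mu, sigma):
    # -log2 of the probability mass of each quantized symbol; this is the
    # (theoretical) number of bits an entropy coder needs to store them.
    total = 0.0
    for k in symbols:
        p = laplace_cdf(k + 0.5, mu, sigma) - laplace_cdf(k - 0.5, mu, sigma)
        total += -math.log2(p)
    return total

y = [0.2, 1.7, -0.9, 0.4]               # toy latent produced by a transform
y_hat = [float(round(v)) for v in y]    # scalar integer quantization
rate = rate_bits(y_hat, mu=0.0, sigma=1.0)
dist = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)  # MSE distortion
lam = 0.1                               # rate-distortion tradeoff weight
loss = rate + lam * dist
```
]]></preformat>
      <p>Increasing the distortion weight pushes a trained model toward higher-fidelity reconstructions at a larger bitrate, which is exactly the tradeoff the compared codecs move along.</p>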
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        For a long time, lossy image and video compression was a problem solved exclusively in the traditional way by hand-designed methods. Some of these methods, such as for example the H.264 [<xref ref-type="bibr" rid="ref14">14</xref>] or H.265 [<xref ref-type="bibr" rid="ref15">15</xref>] video codecs or JPEG image compression [<xref ref-type="bibr" rid="ref2">2</xref>], are now in widespread use in many areas of industry, research, and everyday life. Relatively recently, the first learned codecs appeared that were able to challenge some of the traditional methods. Arguably the biggest rise of interest started after the works of Ballé et al. [<xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>] and later Minnen et al. [<xref ref-type="bibr" rid="ref13">13</xref>], which laid the foundation for learned image compression. These works formulated the main rate-distortion objective in a learnable way, presented a model containing the three fundamental components now present in the vast majority of learned codecs – the autoencoder for the image transform, and the hyper-prior and the context module for entropy estimation – and provided the solution for dealing with the discrete quantization in training. Subsequent methods increased the performance for example by richer/larger model architectures (e.g. using attention-like modules) [<xref ref-type="bibr" rid="ref16">16</xref>], improved context modules [<xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>], richer entropy models (e.g. Gaussian mixtures) [<xref ref-type="bibr" rid="ref16">16</xref>], or different simulation of quantization [<xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>].
      </p>
      <p>
        Recently, a promising research direction is coercing the reconstruction to better satisfy the expectations of the human visual system even at the expense of objective (e.g. PSNR) quality. This can be achieved for example by augmenting the loss with a term that better models human perception (such as LPIPS [<xref ref-type="bibr" rid="ref22">22</xref>]) [<xref ref-type="bibr" rid="ref19">19</xref>], or by training the decoder in an adversarial manner as in GANs [<xref ref-type="bibr" rid="ref23">23, 24</xref>]. Such approaches can achieve significant bitrate savings but unfortunately are not suitable for medical data, where the reconstructed image must be objectively undistorted and not just look natural.
      </p>
      <p>
        Literature on learned compression for medical images is relatively scarce; this area is still dominated by more traditional approaches such as compression in the wavelet domain [25]. Probably the closest match for the proposed method is the lossless compression of 3D volumes by Chen et al. [26]. In our work, however, we focus on lossy compression. Other works propose partitioning the image into relevant (for the diagnosis) and less relevant regions and applying different compression ratios there [27]. Learned lossy compression for 2D medical images is investigated for example in [28].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        The proposed approach is based on the single-image compression method by Minnen et al. [<xref ref-type="bibr" rid="ref13">13</xref>], which we extend for multi-slice volumetric images. The method [<xref ref-type="bibr" rid="ref13">13</xref>] consists of three main components:
      </p>
      <p>• An encoder/decoder which performs the transform between the input image space and the latent representation (commonly called the “latent”).</p>
      <p>• A hyper-encoder/decoder (called hyper-prior) which analyzes the latent and stores a small piece of side information into the bitstream that is used later to estimate the parameters of the probability distribution of the latent (the entropy model).</p>
      <p>• A context module that processes the image latent in an autoregressive fashion (i.e. causally) and is also a part of the entropy model parameter estimation.</p>
      <p>
        The encoding and decoding branches of the pipeline are connected only via the bitstream which stores the latent and hyper-latent representations of the image. To this end the latents must be quantized, for which scalar integer rounding is used, because the entropy coder that converts the values into their corresponding bit codes can only operate on discrete data (continuous values cannot be stored in the bitstream).
      </p>
      <p>
        The advantage of the context module is that the entropy parameters can be very accurate and image-specific; the disadvantage is that the autoregressive processing does not play well with the parallel processing common in deep learning. For each new pixel to be decoded, the entropy parameters must first be estimated, the pixel decoded, and only then can the decoding move to the next pixel. As a result, a usually parallelized operation such as convolution cannot be computed for the whole image at once but pixel by pixel in alternation with the entropy coder. Another disadvantage is that the context prevents using the so-called mean-subtracted quantization, which will be specified in the next section. We get rid of these drawbacks in the proposed method by replacing the autoregressive context from [<xref ref-type="bibr" rid="ref13">13</xref>] with an analogous module that runs on the previous slice in the sequence.
      </p>
      <p>
        Model details. The input to our method is a sequence of 2D slices x^0, …, x^(T−1) (superscripts denote slices, subscripts pixel indices) which are processed in order. The transforms to and from the latent representation, denoted y, are done for each slice independently, but the entropy model, i.e. the probability distribution p̂(ŷ^t) of the quantized latent ŷ^t (the hat denotes the quantization operation), is conditioned on the latent of the previous slice, ŷ^(t−1). This helps decrease the entropy of ŷ^t and therefore the necessary bitrate while avoiding the disadvantages of an autoregressive context model. It is done as follows: instead of running the context model on the currently encoded slice in an autoregressive fashion, we run it on the (quantized) latent ŷ^(t−1) of the previous slice. During decoding, the slices are processed in the same order, so ŷ^(t−1) has already been decoded in full and is available when ŷ^t is being decoded, and the entropy model can again use information from the previous slice. This approach does not require autoregressive processing but can instead be done in parallel for the whole slice without waiting for each new pixel to be decoded. In other words, the context module is autoregressive in the slice sequence, but that does not restrict any 2D operations contained within one slice such as convolutions – instead of decoding individual pixels we can decode whole slices in parallel.
      </p>
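      <p>As a minimal structural sketch of this slice-sequential processing: the loop below mirrors the encode/decode ordering (entropy parameters for slice t depend only on the already-processed slice t−1), but every module body is a trivial stand-in invented for illustration, not the paper's learned networks, and the first slice is handled with a zero context instead of the auxiliary single-slice model described later:</p>
      <preformat><![CDATA[
```python
# Structural sketch (hypothetical stand-ins for the learned modules).

def encoder(x):            # stand-in for the per-slice analysis transform
    return [v * 0.5 for v in x]

def decoder(y):            # stand-in for the synthesis transform
    return [v * 2.0 for v in y]

def entropy_params(prev_latent):
    # Stand-in for context module + hyper-prior: predicts a per-element mean
    # from the PREVIOUS slice's quantized latent (no intra-slice autoregression).
    return list(prev_latent)

def quantize(y, mu):
    # Mean-subtracted integer rounding.
    return [round(v - m) + m for v, m in zip(y, mu)]

def encode_volume(slices):
    latents = []
    prev = [0.0] * len(slices[0])      # first slice: no previous context
    for x in slices:
        y = encoder(x)
        mu = entropy_params(prev)      # depends only on the previous slice
        y_hat = quantize(y, mu)
        latents.append(y_hat)          # conceptually: entropy-coded using mu
        prev = y_hat
    return latents

def decode_volume(latents):
    # Decoding mirrors the order: parameters for slice t are recomputed from
    # the already-decoded slice t-1, so each slice is processed in one shot.
    return [decoder(y_hat) for y_hat in latents]

vol = [[0.0, 2.0, 4.0], [0.2, 2.1, 3.9]]
rec = decode_volume(encode_volume(vol))
```
]]></preformat>
      <p>The key property is that nothing inside a slice is serialized pixel by pixel: the per-slice calls can run as ordinary batched 2D operations, with only the slice order being sequential.</p>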
      <p>Figure 2: Scheme of the proposed pipeline for one slice – encoder/decoder, quantizers, entropy encoder and decoder, hyper-encoder and hyper-decoder, context module, and entropy module.</p>
      <p>
        We model the distribution p̂(ŷ) of the quantized latent ŷ by a per-dimension (i.e. spatial pixel and channel) independent Laplace distribution with mean and scale parameters (μ_i, σ_i). These two parameters are estimated adaptively for each image and each pixel i (incl. channels) of the latent by the hyper-prior and the context module. For quantization of the latent we use integer rounding with mean-subtraction, meaning that the value is first offset by the estimated mean of its distribution before being rounded (image index omitted):
      </p>
      <p>ŷ_i = ⌊y_i − μ_i⌉ + μ_i, (1)</p>
      <p>
        where ⌊·⌉ is integer rounding. This improves performance because quantization then doesn't change the mean of the distribution, but it requires that the entropy parameters of the latent are estimated before the latent is quantized. In particular, both of the entropy estimation modules (hyper-prior and context) must operate on the non-quantized values y, otherwise an implicit relation would arise. This is difficult to achieve in a single-image autoregressive context model – for example the quantization in [<xref ref-type="bibr" rid="ref13">13</xref>] does not use mean-subtraction – but since in the proposed method the context module uses the previous slice, using mean-subtraction is possible.
      </p>
      <p>
        The full procedure of processing a slice x^t is illustrated in Fig. 2. The image is passed through the encoder, producing the latent y^t. The latent is concatenated with the latent of the previous slice, y^(t−1), and passed through the hyper-encoder h_e, producing the hyper-latent z^t = h_e([y^(t−1), y^t]). This hyper-latent is quantized so that it can be stored in the bitstream. The parameters of the entropy model of the quantized latent ŷ^t are estimated as follows. The context module processes the previous slice's latent ŷ^(t−1) and the hyper-decoder h_d processes the hyper-latent ẑ^t. These two outputs are concatenated and passed through the entropy module to produce the final entropy parameters (μ_i, σ_i) for each pixel i of the latent. With these parameters available, the latent can be quantized and stored in the bitstream, and the encoding proceeds to the next slice.
      </p>
      <p>
        During decoding, the operations responsible for estimating the entropy model p̂(ŷ^t) have to be executed again because the entropy model is required by the coder to decode ŷ^t from the bitstream. The hyper-latent ẑ^t is decoded first, and since the latent of the previous slice ŷ^(t−1) is already decoded and available, the estimation of the entropy parameters (μ, σ) proceeds as during encoding. Having those, ŷ^t can be decoded and passed through the decoder to finally produce the reconstructed image x̂^t. The decoding then proceeds to the next slice.
      </p>
      <p>
        What remains to specify is the entropy model p̂(ẑ) of the hyper-latent ẑ, since that is also processed by the entropy coder and stored in the bitstream. We model it by a per-channel Laplace distribution, meaning that each channel of z has its own mean and scale parameters (μ, σ), but those are spatially constant so that the model is not tied to a fixed image resolution. These parameters are subject to training but fixed once the model has been trained (i.e. unlike p̂(ŷ) it is not image-adaptive). For quantization of z we again use mean-subtracted rounding in a similar fashion as in Eq. (1).
      </p>
      <p>Details of the model architecture are concisely summarized in Tab. 1.</p>
      <p>Table 1: Model architecture details. conv is a Conv2D layer with kernel size k, stride s and output channels c. transpose is a similarly specified ConvTranspose2D. GDN and IGDN are the generalized divisive normalization layer [<xref ref-type="bibr" rid="ref11">11</xref>] and its inverse, respectively. PReLU is the parametric ReLU [30].
Encoder: conv k5 s2 c192 → GDN → conv k5 s2 c192 → GDN → conv k5 s2 c192 → GDN → conv k5 s2 c192.
Decoder: transpose k5 s2 c192 → IGDN → transpose k5 s2 c192 → IGDN → transpose k5 s2 c192 → IGDN → transpose k5 s2 c1.
Hyper-encoder: conv k3 s1 c192 → PReLU → conv k5 s2 c192 → PReLU → conv k5 s2 c192.
Hyper-decoder: conv k5 s2 c192 → PReLU → conv k5 s2 c288 → PReLU → conv k3 s1 c384.
Context: conv k5 s1 c384.
Entropy module: conv k1 s1 c768 → PReLU → conv k1 s1 c576 → PReLU → conv k1 s1 c384.</p>
      <p>
        Training details. In training we optimize the rate-distortion loss L (image indices omitted)
      </p>
      <p>L = E_(x∼p_x)[−log2 p̂(ŷ)] + E_(x∼p_x)[−log2 p̂(ẑ)] + λ · 255² · E_(x∼p_x)[‖x − x̂‖₂²], (2)</p>
      <p>
        where λ controls the rate-distortion tradeoff (it determines the approximate target bitrate) and p_x, the distribution of uncompressed images, is evaluated by batch averaging. The first two terms on the right-hand side are approximate (theoretical) bitrates required by the entropy coder to encode the latents. These are used in training as an estimate of the actual bitrates because the non-differentiable entropy coders are removed from training.
      </p>
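      <p>The behaviour of the mean-subtracted rounding can be checked with a small scalar example (plain Python invented for illustration, not the model code): the quantized values land on a grid centred on the estimated mean μ rather than on the integers, so a value exactly at the mean is preserved, while the quantization error stays at most 1/2 as with plain rounding:</p>
      <preformat><![CDATA[
```python
# Numeric check of mean-subtracted rounding: y_hat = round(y - mu) + mu.

def quantize_ms(y, mu):
    return round(y - mu) + mu

mu = 0.3                        # entropy-model mean for this element
ys = [0.1, 0.3, 0.55, 1.2]
q_ms = [quantize_ms(y, mu) for y in ys]     # values on the grid mu + integers
q_plain = [float(round(y)) for y in ys]     # plain integer rounding

# Maximum quantization error of the mean-subtracted variant.
err = max(abs(y, ) if False else abs(y - q) for y, q in zip(ys, q_ms))
```
]]></preformat>
      <p>Here quantize_ms(0.3, 0.3) returns 0.3 exactly, whereas plain rounding would move it to 0; in both schemes the error is bounded by half the quantization step.</p>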
      <p>
        Our description of p̂(ŷ) and p̂(ẑ) so far was somewhat simplified. The Laplace parametric density is used only as a model to conveniently parametrize the discrete distribution over the symbols after quantization. In the actual evaluation, however, we have to account for the whole interval corresponding to each discrete value because of quantization. This is done by integrating the parametric density over the corresponding interval, for example
      </p>
      <p>p̂(ŷ_i) = ∫ from ŷ_i − 1/2 to ŷ_i + 1/2 of p̃(y) dy, (3)</p>
      <p>
        where p̃ is the continuous Laplace density parametrized by the (μ_i, σ_i) corresponding to p̂(ŷ_i), the discrete distribution of ŷ_i. In practice, this is done by using the cumulative distribution function of the Laplace density.
      </p>
      <p>
        In each training iteration we randomly sample a small subset of s consecutive slices from each image in the batch and process those through the model as a small volume. For the first slice x^0 of this subset we calculate the latent y^0 using an auxiliary single-image model which shares the same encoder with the multi-slice model. For x^1, …, x^(s−1) we proceed as described above and these slices are used to evaluate the loss in Eq. (2). The first slice x^0 is excluded from the optimization of the multi-slice model but is used to train the auxiliary single-slice model used for compression of the first slice in each volumetric series. This model has the same encoder/decoder as the multi-slice model and the same architecture (not weights) of the hyper-prior but does not include the context and entropy module – the hyper-decoder directly predicts the (μ, σ) parameters of the latent entropy model. In validation and testing, we use this auxiliary model to compress the first slice of the volume and then proceed sequentially with the multi-slice model.
      </p>
      <p>
        The quantization operation must be approximated during training because it has zero gradient almost everywhere. For both the latents y and the hyper-latents z we use straight-through quantization [29], which performs integer rounding in the forward pass but acts as identity in the backward pass. For evaluation of the bitrate in the entropy models, however, we simulate quantization by adding uniform noise from the (−1/2, 1/2) range. This way the decoder and the hyper-decoder see values produced by integer rounding (with the mean-subtraction as in Eq. (1)), but the entropy estimation is calculated using the uniform-noise simulation, which reportedly leads to better performance [<xref ref-type="bibr" rid="ref20">20</xref>].
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        Dataset. We trained and tested the method on the Pediatric-CT-SEG dataset of CT-scan images of various organs downloaded from the Cancer Imaging Archive [<xref ref-type="bibr" rid="ref6">6</xref>] (patient and acquisition parameters are specified therein). We chose this dataset for its diverse content. The dataset consists of 359 volumetric images, each with a different number of slices ranging from 41 to 1104. We randomly selected 10 of the volumetric images for testing (2184 slices in total) and the rest for training. The 2D slices are 12bit grayscale images with a resolution of 512×512, originally stored uncompressed at 16 bits per pixel (bpp). An example slice from the dataset is shown in Fig. 1.
      </p>
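      <p>The compression ratios quoted in this section follow directly from the storage format: the slices are stored at 16 bpp, and the tested operating range of roughly 0.05–0.65 bpp therefore corresponds to ratios of about 320:1 down to about 25:1. A small sanity-check calculation (illustrative only):</p>
      <preformat><![CDATA[
```python
# Arithmetic behind the quoted compression ratios.

ORIGINAL_BPP = 16.0            # uncompressed storage of the 12-bit slices

def compression_ratio(bpp):
    return ORIGINAL_BPP / bpp

lo = compression_ratio(0.65)   # about 24.6, i.e. roughly 25:1
hi = compression_ratio(0.05)   # 320, i.e. 320:1
```
]]></preformat>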
      <p>An example slice from the dataset is in Fig. 1.
60.0
Training We trained the model on random spatial 52.5
crops of size 256× 256 and tested it on full-resolution ]dB
images. For training, we randomly chose  = 3 consecu- [RN50.0
tainvde esxlipcelositaisnga tghoeosdeqcoumenptriaolmpirsoecbeesstwinege. nWteratirnaiinngedspweiethd PS47.5 PBraospeolisneed
batch size 8 using the Adam optimizer [31] with an initial 45.0 VVVVCC-intra
ldeeacrrneiansgedrattheeolefa1rni−ng4rfaotre 1toM1it e−ra5tifoonrsaanfotethrewrh2i0c0hkwite- 42.5 JAABPVVPEGI1GF
erations. We trained a new model for 6 values of  in the 40.00.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
range from 0.032 to 3.2, which on the test set results in bits per pixel
0.05 to 0.65 bits per pixel, thus achieving a compression Figure 3: Rate-distortion performance of the proposed and
ratio of 25:1 to 320:1 with respect to the original images. benchmark methods on the test set of CT-scan images.
Benchmark methods We compare the performance
of the proposed method with a baseline learned single- AV1 video codec (essentially AV1-intra), one of today’s
image compression model and a number of established top codecs from those that are readily available e.g. in
traditional image compression methods. The single- browsers. In the comparison we used the libaom-av1
image baseline is a learned model with the same archi- encoder via ffmpeg configured to 12bit internal
processtecture as the auxiliary model we use to compress the ing, each slice in the series is compressed individually.
ifrst slice and was trained on the same train set. Compar- AV1 [32] is a video codec in terms of quality
approxiison with this method shows performance gain from the mately on the level of or slightly outperforming HEVC
proposed sequential processing and the context module. but unlike HEVC its use is royalty-free, it is therefore
The traditional methods are a broad selection ranging arguably the best video codec readily available today
from well-known and established codecs commonly used (with production-level encoders and decoders available).
in practice to the state-of-the-art prototype. Such com- In our comparison, we used ffmpeg/libaom-av1 in
parison therefore well positions the proposed method 12bit mode and compressed each volumetric image as a
in the landscape of existing methods and gives insight video sequence consisting of the individual slices. VVC
into its properties in potential use in practice. Below we [33] (H.266) is the best existing video codec nowadays
briefly describe each of the methods used in the com- but its development is still ongoing and the available
enparison and optionally its configuration, afterwards we coders/decoders are on the prototype level and for most
provide commentary on the results summarized in Fig. 3 practical use cases prohibitively slow. Its adoption in
and Tab. 2. practice, medical or otherwise, is also hindered by the</p>
      <p>
        Baseline is a learned single-image compression model fact that its use is not royalty free. We used the VTM
with the same architecture as the proposed method but 18 reference implementation in 12bit mode and again
without the context and entropy module (the hyper- compressed each volumetric image as a video sequence
decoder directly predicts the entropy parameters). It consisting of the individual slices. VVC-intra [33] is the
is trained on the same train set as the proposed and uses intra mode of VVC. For single-image compression, it is
the same training schedule. JPEG [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a well-known the best available codec nowadays but currently inherits
widely used compression method developed in the 90s. the disadvantages listed above for VVC. We used it in the
Although used for medical data and having the advan- same configuration as VVC video but compressed each
tage of being very fast both in encode and decode, it is slice individually.
arguably not a very suitable method for such use as its
performance is relatively low by today’s standards. We Results The rate-distortion curves of the benchmarked
use the implementation in pillow. BPG [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is essen- methods on the CT-scan test set are in shown in Fig. 3,
tially a single-image wrapper of the intra-frame compres- their ranking and quantitative comparison with respect
sion of the HEVC (also known as H.265) video codec. to VVC-intra is in Tab. 2 and finally, Tab. 3 shows
apAlthough not widespread, it is one of the top methods proximate relative runtimes required to process the test
currently available for everyday use. We used the jctvc set. In the testing we focused on high-PSNR range since
encoder via the public BPG library configured to 12bit in- we envision the proposed method being used primarily
ternal bitdepth. AVIF [32] is a single-image format of the in the medical domain, where sliced volumetric images
therefore no match for VVC but we will see that in that
are common. Let us provide some commentary on the comparison it wins on runtime.
results. A clear and quantitative ranking of the methods is
      </p>
      <p>The baseline learned method performs on the level of AVIF – the curves almost overlap. Although AVIF is undoubtedly a better codec in a general setting, the learned baseline exploits the advantage of domain specificity – it has been trained on similar CT data. BPG generally performs well on natural images, where the target PSNR is usually lower, but to achieve imperceptible distortion in medical data we observed that the reconstruction PSNR should be above 55dB (for typical images with sufficient structure). We suspect there is some issue with the configuration of the encoder at high-bitdepth processing, because BPG obviously struggles with achieving high PSNRs. It is no surprise that JPEG cannot compete with the latest methods. VVC-intra does very well and outperforms AVIF by a large margin in the whole range. With AV1 we experienced similar problems as with BPG – it apparently “saturates” at higher bitrates and struggles to achieve high PSNR, which is possibly again some issue with the high-bitdepth configuration of the encoder (although we used the same encoder as for AVIF and in that case it worked fine). But from the comparison with AVIF in low to mid bitrates we can see that the sequential “video” processing of the image volume is clearly beneficial, with a noticeable performance gain. This conclusion is further strengthened by the results of the VVC (video) codec, which on performance alone is the clear winner of the whole comparison, outperforming all other methods (including the proposed) by a margin in the whole range.</p>
      <p>The proposed method is significantly better than the baseline (compare the green and orange curves in Fig. 3), on average achieving almost 30% rate savings (for the same quality) and a 1.8dB quality increase (for the same rate). It also outperforms all image codecs such as AVIF, BPG, and especially VVC-intra, which is no small feat. This is due solely to the proposed sequential context, because the baseline alone is significantly below VVC-intra. It is, however, still a relatively small and simple model.</p>
      <p>A clear and quantitative ranking of the methods is provided in Tab. 2, which shows the average bitrate increase/savings and PSNR quality loss/gain evaluated by BD-Rate and BD-PSNR [34], respectively. We positioned VVC-intra as the reference SOTA image codec and compared all the others to it, as they perform on the test set. The table shows the average bitrate savings and performance gain in the middle and right columns, respectively. Only the proposed method and VVC video achieve an improvement (BD-Rate is negative and BD-PSNR is positive).</p>
      <p>Finally, in Tab. 3 we show the relative runtimes required for processing (encode and decode) the whole test set (10 volumetric images consisting of 2184 slices) with respect to the proposed method (i.e. a value &lt; 1 means the method is faster than ours, &gt; 1 means it is slower). These runtimes are listed for bpp = 0.3, approximately the middle of the tested range, since the traditional methods are slower at higher bitrates (the proposed method has constant speed across the range). Here the ranking is quite different than in the performance comparison. JPEG and of course the baseline are the only methods faster than the proposed; all the others are slower, some of them quite significantly, especially the well-performing VVC, which is clearly prohibitively slow. We argue that the video codecs are simply not fast enough for practical use. In fairness, the proposed method and the baseline run on a GPU (though still processing each slice sequentially) while the traditional methods are CPU-only, without any external parallelization. On the other hand, our implementation is intended only as a proof of concept and we did not invest much effort into runtime optimization. For example, in both encode and decode the encoder and decoder process each slice independently; in testing we process them sequentially for simplicity, while it is possible to “batch” them and process several in parallel (as many as the GPU memory permits), which would reduce the runtime.</p>
      <p>Contrary to usual custom, we do not provide examples and a qualitative comparison of image reconstructions because, due to the high reconstruction quality and similar performance of the benchmarked methods, we were not able to come up with example images that demonstrate any noticeable difference – on screen all the results look identical.</p>
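<p>For context, BD-Rate figures of the kind reported in Tab. 2 follow Bjøntegaard's procedure [34]: fit log-rate as a cubic polynomial of PSNR for each codec and compare the average rates over the overlapping quality interval. A compact sketch (illustrative only, not the evaluation code used here):</p>

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec vs. the reference.
    Fits log-rate as a cubic polynomial of PSNR and integrates the
    difference over the overlapping PSNR interval (Bjontegaard metric)."""
    log_ref = np.log(np.asarray(rate_ref, dtype=float))
    log_test = np.log(np.asarray(rate_test, dtype=float))
    fit_ref = np.polyfit(psnr_ref, log_ref, 3)
    fit_test = np.polyfit(psnr_test, log_test, 3)
    # overlapping PSNR range of the two curves
    lo = max(np.min(psnr_ref), np.min(psnr_test))
    hi = min(np.max(psnr_ref), np.max(psnr_test))
    int_ref = np.polyint(fit_ref)
    int_test = np.polyint(fit_test)
    avg_ref = (np.polyval(int_ref, hi) - np.polyval(int_ref, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)
    return (np.exp(avg_test - avg_ref) - 1.0) * 100.0
```

<p>A negative value means the test codec needs less rate for the same quality; a codec whose rate is uniformly 10% lower than the reference at every quality level, for instance, yields a BD-Rate of −10%.</p>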
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <p>We presented an extension of a single-image learned compression method to volumetric multi-slice images, with an emphasis on the medical domain, where such images are quite common. Although the modification is relatively simple and straightforward, it provides several benefits – namely, using a context module without introducing any problems with parallel processing in the decode, and using mean-subtracted quantization. Both of these improve performance without compromising the runtime. This we verified in a comparison with a number of established compression methods. The comparison shows:</p>
      <p>• A clear performance gain with respect to the baseline due to the proposed sequential context.</p>
      <p>• Good performance in absolute numbers with respect to the established codecs.</p>
      <p>• Very competitive runtimes (if GPUs are allowed).</p>
      <p>The testing was carried out with an emphasis on low-error reconstruction, and even at PSNR = 55dB (in most cases indistinguishable from the original) the proposed method achieves an average compression ratio of 40:1 with respect to the uncompressed original. We consider these results a solid proof of concept for compression of volumetric medical data.</p>
      <p>Nevertheless, there are a number of things which can be improved or investigated further. For example, the used baseline model is far from SOTA, so higher absolute performance can be gained by adopting one of the SOTA single-image learned methods as a backbone and extending it with the proposed context model. In this work, however, we focused more on investigating the relative gains from the sequential context rather than absolute performance. Next, in decode our method currently does not permit random access (as in “show me slice 42”): the whole volume needs to be decoded sequentially from the beginning. But this can be remedied by introducing intra-frames compressed by the single-image auxiliary method we use for the first slice. If we use a GOP size of 8 (meaning at most 8 slices need to be decoded for any chosen slice), we can estimate that the performance figures in Tab. 2 would drop approximately −11.4% → −7.5% in rate and +0.64dB → +0.42dB in PSNR, which is still a solid improvement over VVC-intra with practically usable runtimes in both encode and decode.</p>
      <p>By looking at the results of VVC we see that further gains are undoubtedly possible, and we hypothesize that those can be achieved, for example, by a stronger context module (ours is a rather simple stack of convolutions, not in any way input-adaptive) and possibly by introducing P-frames and B-frames as in video encoding. It is our hope that this work will motivate further research into such possibilities.</p>
      <p>Acknowledgments. This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 965502.</p>
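<p>The random-access scheme estimated above can be illustrated as follows (a hypothetical layout, assuming intra slices are placed at every multiple of the GOP size; the paper only estimates this variant):</p>

```python
def slices_to_decode(k: int, gop: int = 8) -> list:
    """Slice indices that must be decoded to show slice k when every
    gop-th slice is an intra slice and all other slices are predicted
    from their predecessor (0-based indexing)."""
    first_intra = (k // gop) * gop  # nearest intra slice at or before k
    return list(range(first_intra, k + 1))
```

<p>For “show me slice 42” with a GOP size of 8, only slices 40–42 need to be decoded, and no request ever touches more than 8 slices.</p>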
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Radiation risk from medical imaging, https://www.health.harvard.edu/cancer/radiation-risk-from-medical-imaging, Sep 2021.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>The JPEG still picture compression standard</article-title>
          ,
          <source>IEEE Transactions on Consumer Electronics</source>
          <volume>38</volume>
          (
          <year>1992</year>
          )
          <article-title>xviii-xxxiv.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] F. Bellard, BPG Image format, https://bellard.org/bpg, 2018. Accessed: 2021-09-24.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] AVIF image format, https://aomediacodec.github.io/av1-avif, 2022. Accessed: 2022-12.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Google, WebP Image format, https://developers.google.com/speed/webp, 2018. Accessed: 2021-09-24.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Pediatric-CT-SEG, Cancer Imaging Archive, https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=89096588, Aug 2022.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] AOM common test conditions v2.0, https://aomedia.org/docs/CWG-B075o_AV2_CTC_v2.pdf, Aug 2021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] F. Mentzer, G. Toderici, D. Minnen, S.-J. Hwang, S. Caelles, M. Lucic, E. Agustsson, VCT: A video compression transformer, 2022. URL: https://arxiv.org/abs/2206.07307.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kivijärvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ojala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaukoranta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nyúl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nevalainen</surname>
          </string-name>
          ,
          <article-title>A comparison of lossless compression methods for medical images</article-title>
          ,
          <source>Computerized Medical Imaging and Graphics</source>
          <volume>22</volume>
          (
          <year>1998</year>
          )
          <fpage>323</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Johnston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Jin</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Minnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Covell</surname>
          </string-name>
          ,
          <article-title>Full resolution image compression with recurrent neural networks</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ballé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Laparra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Simoncelli</surname>
          </string-name>
          ,
          <article-title>End-to-end optimized image compression</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ballé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Minnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          , N. John- tems
          <volume>33</volume>
          (
          <year>2020</year>
          ).
          <article-title>ston, Variational image compression with a scale [</article-title>
          24]
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          , Y. Chen, hyperprior, in: International Conference on Learn-
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Po-elic: Perceptioning Representations,
          <year>2018</year>
          .
          <article-title>oriented eficient learned image coding</article-title>
          , in: Pro-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Minnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ballé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <article-title>Joint autoregres- ceedings of the IEEE/CVF Conference on Computer sive and hierarchical priors for learned image com- Vision and Pattern Recognition (CVPR) Workshops, pression</article-title>
          , in: S. Bengio,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <year>2022</year>
          , pp.
          <fpage>1764</fpage>
          -
          <lpage>1769</lpage>
          . K. Grauman,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          , R. Garnett (Eds.), Ad- [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bruylants</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Munteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schelkens</surname>
          </string-name>
          ,
          <source>Wavelet vances in Neural Information Processing Systems, based volumetric medical image compression, Sigvolume</source>
          <volume>31</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2018</year>
          . nal Processing:
          <source>Image Communication</source>
          <volume>31</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          , G. Bjontegaard,
          <volume>112</volume>
          -
          <fpage>133</fpage>
          . A.
          <string-name>
            <surname>Luthra</surname>
            , Overview of the h. 264/avc video cod- [26]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Exploiting intra-slice ing standard, IEEE Transactions on circuits and and inter-slice redundancy for learning-based losssystems for video technology 13 (</article-title>
          <year>2003</year>
          )
          <fpage>560</fpage>
          -
          <lpage>576</lpage>
          .
          <article-title>less volumetric image compression</article-title>
          , IEEE Transac-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Ohm</surname>
          </string-name>
          , W.-J. Han,
          <string-name>
            <surname>T</surname>
          </string-name>
          . Wiegand,
          <source>tions on Image Processing</source>
          <volume>31</volume>
          (
          <year>2022</year>
          )
          <fpage>1697</fpage>
          -
          <lpage>1707</lpage>
          .
          <article-title>Overview of the High Eficiency Video Coding</article-title>
          [27]
          <string-name>
            <surname>M. U. A. Ayoobkhan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Chikkannan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Ramakrish(HEVC) standard, IEEE Transactions on Circuits nan, Feed-forward neural network-based predictive</article-title>
          and
          <source>Systems for Video Technology</source>
          <volume>22</volume>
          (
          <year>2012</year>
          )
          <article-title>1649- image coding for medical image compression, Ara1668</article-title>
          .
          <source>bian Journal for Science and Engineering</source>
          <volume>43</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , H. Sun,
          <string-name>
            <given-names>M.</given-names>
            <surname>Takeuchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Katto</surname>
          </string-name>
          , Learned im-
          <volume>4239</volume>
          -4247.
          <article-title>age compression with discretized gaussian mixture</article-title>
          [28]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Lossy medical likelihoods and attention modules, in: Proceedings image compression using residual learning-based of the IEEE/CVF Conference on Computer Vision dual autoencoder model</article-title>
          ,
          <source>in: 2020 IEEE 7th Utand Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          . tar Pradesh Section International Conference on
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Minnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Channel-wise autoregressive Electrical, Electronics and Computer Engineering entropy models for learned image compression</article-title>
          ,
          <source>in: (UPCON)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . 2020 IEEE International Conference on Image Pro- [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Estimating or propagating gradients cessing (ICIP</article-title>
          ),
          <year>2020</year>
          , pp.
          <fpage>3339</fpage>
          -
          <lpage>3343</lpage>
          . through stochastic neurons,
          <year>2013</year>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          , Checker- arxiv.org/abs/1305.2982.
          <article-title>board context model for eficient learned image</article-title>
          [30]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Delving deep into compression, in: Proceedings of the IEEE/CVF Con- rectifiers: Surpassing human-level performance on ference on Computer Vision and Pattern Recogni- imagenet classification</article-title>
          ,
          <source>in: 2015 IEEE International tion (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>14771</fpage>
          -
          <lpage>14780</lpage>
          . Conference on Computer Vision (ICCV),
          <year>2015</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          , R. Ma, H. Qin,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <volume>1026</volume>
          -
          <fpage>1034</fpage>
          . Elic:
          <article-title>Eficient learned image compression with un-</article-title>
          [31]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic evenly grouped space-channel contextual adaptive optimization</article-title>
          ,
          <source>in: ICLR (Poster)</source>
          ,
          <year>2015</year>
          . coding,
          <source>in: Proceedings of the IEEE/CVF</source>
          Confer- [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Chiang</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Grange, ence on Computer Vision and
          <string-name>
            <surname>Pattern Recognition C. Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Parker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Joshi</surname>
          </string-name>
          , et al.,
          <source>(CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>5718</fpage>
          -
          <lpage>5727</lpage>
          .
          <article-title>A technical overview of AV1</article-title>
          , Proceedings of the
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Theis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huszár</surname>
          </string-name>
          ,
          <string-name>
            <surname>Lossy</surname>
            <given-names>IEEE</given-names>
          </string-name>
          109 (
          <year>2021</year>
          )
          <fpage>1435</fpage>
          -
          <lpage>1462</lpage>
          .
          <article-title>image compression with compressive autoencoders</article-title>
          , [33]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. J. in: International Conference on Learning Represen- Sullivan,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Ohm</surname>
          </string-name>
          ,
          <article-title>Overview of the Versatile Video tations</article-title>
          ,
          <year>2017</year>
          .
          <article-title>Coding (VVC) standard and its applications</article-title>
          , IEEE
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Soft then hard: Rethinking the quantization in neural image compression</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2021</year>
          , pp.
          <fpage>3920</fpage>
          -
          <lpage>3929</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bjontegaard</surname>
          </string-name>
          ,
          <article-title>Calculation of average PSNR differences between RD-curves</article-title>
          , VCEG-M33 (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shechtman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>The unreasonable effectiveness of deep features as a perceptual metric</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mentzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschannen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agustsson</surname>
          </string-name>
          ,
          <article-title>High-fidelity generative image compression</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>