<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Standard vs. Learning-based Codecs for Real Time Endoscopic Video Transmission</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aldo Marzullo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martina Golini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Catellani</string-name>
          <email>mcatellani@asst-pg23.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena D</string-name>
          <email>elena.demomig@polimi.it</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Calabria</institution>
          ,
          <addr-line>Italy https://</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NVIDIA, NVIDIA AI Technology Center</institution>
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Urology Department, Surgery Division, ASST Papa Giovanni XXIII</institution>
          ,
          <addr-line>Bergamo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We compare traditional encoding/decoding methods for real-time video streaming, such as H.264/AVC and H.265/HEVC, with deep learning based methods, which are expected to deliver higher video quality at lower bandwidth in the near future. We focus on endoscopic videos, where streaming is part of a closed-loop system and robot-assisted minimally invasive surgery is performed on a patient in real time. Beyond low bandwidth and high video quality, this application also demands low latency to guarantee the stability of the closed-loop system and thus high safety standards. We analyze the pros and cons of the deep learning approach in this domain, highlighting the areas where deep neural networks outperform the traditional approach and those that require further development. Our observations may be used as guidelines for future research on video streaming in the surgical domain as well as in areas with similar requirements.</p>
      </abstract>
      <kwd-group>
        <kwd>Surgical Video</kwd>
        <kwd>Latency</kwd>
        <kwd>Bandwidth</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Compression</kwd>
        <kwd>Codec</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The advent of Deep Learning (DL) has allowed significant advances in many
scientific and technological fields, including video compression and transmission.
Learning-based solutions have shown great effectiveness in reducing bandwidth
requirements while ensuring high quality of the transmitted video at the same
time. For this reason, they are nowadays widely investigated in different
application fields by many companies, including the principal streaming providers,
like Disney [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or Netflix [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where the best compression rates can be achieved
by resorting to specialized, per-title coding [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Copyright ©2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        In recent years, advances in telecommunication technology and video
coding systems have opened a new perspective also for surgical telementoring,
remote diagnosis and teleoperation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this scenario, qualified surgeons give
real-time supervision and technical help to the on-site physician during
surgical procedures. This offers a transformative opportunity for accessing and
delivering high-quality healthcare in resource-poor settings, particularly, but not
exclusively, in disaster-affected and distant rural areas. However, in these
contexts, large amounts of data (including high-resolution video frames) need to be
transmitted, and bandwidth constraints constitute one of the primary
bottlenecks for achieving real-time performance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Indeed, it has been found that
transmission errors (i.e., packet loss) can significantly reduce the perceived video
quality for several surgical procedures, with implications for the success
of surgery tasks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For these reasons, proper compression methods are crucial.
Furthermore, when dealing with remote surgery, which often includes a closed-loop
control mechanism and a surgery robot in the system, latency must also be kept
under strict control to guarantee the stability (and, consequently, the safety) of the
entire system.
      </p>
      <p>
        Lossless compression algorithms are not well suited for these tasks, because
of the large bandwidth and high latency they entail. Conversely, lossy compression
methods provide a valid alternative for real-time streaming, as they consume less
bandwidth, but video quality and latency need to be carefully controlled to
guarantee a good user experience. The H.264/AVC codec, which is widely adopted in
several applications and can be hardware accelerated on many devices, is currently
the most viable choice also in the field of Minimally Invasive Surgery (MIS) [
        <xref ref-type="bibr" rid="ref10 ref2">2, 10</xref>
        ]. Its successor, H.265/HEVC, outperforms it in terms of quality, but it is
computationally more demanding and, for this reason, not as widespread as H.264.
Both H.264 and HEVC are based on the hybrid prediction/transform coding
method, first proposed in 1979 by Netravali and Stuller [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and
can introduce block artefacts and other forms of compression-induced quality
degradation because of the quantization step applied before data transmission.
      </p>
      <p>Fig. 1. In a typical minimally invasive surgery scenario, a remote surgeon
controls a robot equipped with cameras (e.g. endoscopes) to frame the surgery
field. Images are transmitted to the surgeon, who takes action and controls the
robot movements in a closed-loop system. The latency introduced by data
transmission has to be limited to guarantee fine and stable control of the robot.
The amount of data transmitted per second has to be compatible with the bandwidth
of the transmission channel. Finally, the quality of the images received at the
remote site must be high enough to guarantee that no clinically significant
information is lost in the data transmission process.</p>
      <p>
        However, neither H.264 nor HEVC leverages the potential of learning methods
in general, and DL in particular, which have only recently begun to be exploited for
off-line video compression and streaming [
        <xref ref-type="bibr" rid="ref1 ref15 ref3">15, 3, 1</xref>
        ]. In this context, solutions for
increasing the performance of one of the five main modules of the traditional
codecs (intra-prediction, inter-prediction, quantization, entropy coding and loop
filtering) have been proposed, as well as brand new codecs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Here we perform a careful analysis of one of the state-of-the-art DL methods
(Deep Video Compression, DVC [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) for real-time video compression and
transmission, and compare it against H.264 and HEVC. We perform our analysis
in the context of MIS, as this is one of the most challenging applications,
with strict requirements in terms of quality and latency at the same time. More
specifically, to realize a stable and clinically effective system, three constraints
must necessarily be met (see Fig. 1):
        <list list-type="bullet">
          <list-item>
            <p>Quality: the quality of the transmitted frames has to be good enough to
guarantee that the surgeon can detect any clinically relevant detail,
such as a small bleeding or unexpected tumor masses [
              <xref ref-type="bibr" rid="ref10 ref2">2, 10</xref>
              ].</p>
          </list-item>
          <list-item>
            <p>Latency: the images of the surgery field acquired by the camera, as well
as the control signal going back to the robotic arm, must be received with
the smallest possible delay, to guarantee the stability (no oscillations) of the
closed-loop system [
              <xref ref-type="bibr" rid="ref11">11</xref>
              ]; this aspect is particularly critical if the surgeon is
remote. Notice that both average and maximum latency are important in
this context.</p>
          </list-item>
          <list-item>
            <p>Bandwidth: as for any video transmission system, the bandwidth required
to transmit the data must not exceed the bandwidth allowed by the
transmission system.</p>
          </list-item>
        </list>
      </p>
      <p>Our experiments highlight the points in favor of DL and the aspects that
deserve more attention in future research for its adoption in real-time
streaming of endoscopic videos, but our findings can be of use in other
application fields with similar requirements. In more detail, for the tested DL-based
codec (named DVC), we measured higher image quality and reduced bandwidth
in comparison to H.264 and HEVC, at the cost of an increase in latency,
which can however be reduced with optimized implementations. In general, when
compared to traditional codecs, DVC better preserves the high frequency
components while introducing a slight color shift, which is perceptually not too
relevant. A more refined analysis revealed that the quality, as well as the latency
in the transmission of the frames with DVC, have a large range of variation,
which is detrimental for its practical adoption. Moreover, we found that I-Frames
transmitted with DVC are characterized by high latency, and that their high quality
strongly (and positively) influences the quality of the following frames, which
use prediction to be transmitted with limited bandwidth. This suggests that one
of the key aspects for the adoption of DL in video transmission systems is the
development of computationally efficient, single frame compression methods.</p>
      <p>The paper is organized as follows: we perform a detailed review of the
state-of-the-art video compression and transmission methods in the next section, then
we introduce the experimental setup adopted to compare H.264, HEVC, and
DVC, and we present and discuss the results towards the end of the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Among the many traditional video codecs, H.264 and HEVC are the most
widely adopted [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Both codecs are based on the hybrid
prediction/transform coding method, first proposed in 1979 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Although HEVC
outperforms H.264 in several aspects, the latter is hardware friendly, and thus easily
implemented and distributed in its accelerated version. This makes it the de
facto standard for most existing video streaming applications, including
the surgical domain [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. As a consequence of its optimized implementation, it
is also characterized by small latency. On the downside, as both H.264 and HEVC
use block-based coding (i.e., square blocks in the transmitted frames are
processed independently of each other), these schemes can introduce block
artifacts; quality degradation can also be associated with the quantization of the
compressed data stream. For these reasons, DL-based codecs have started to be
explored as a promising alternative.
      </p>
      <p>
        Researchers developed several DL-based methods in recent years [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], where
neural networks have been used for building brand new end-to-end schemes for
compression, enhancement and restoration of the video quality, or as a tool
to increase the performance of one of the five main modules of the traditional
codecs: intra-prediction, inter-prediction, quantization, entropy coding and loop
filtering [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        For instance, Lu et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] proposed Deep Video Compression (DVC), one of
the first end-to-end DL-based video codecs. In DVC, each step of the traditional
compression pipeline is replaced by DL models that are jointly trained to
minimize the reconstruction error while reducing the bits used for compression,
reaching state-of-the-art results at the time of publication. As this is one of
the most comprehensive DL-based approaches, covering all the aspects of video
encoding, transmission and decoding, it is also the one we considered for our
experiments.
      </p>
      <p>
        When used for replacing single modules of the codec pipeline, DL-based
intra/inter prediction solutions, as well as post-processing filtering techniques, have
shown good performance, especially in low bit-rate scenarios. Li et al. proposed
a five-layer CNN-based block up-sampling scheme for intra-frame coding [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Each Coding Tree Unit (the basic processing unit of HEVC) is first
down-sampled, then coded by HEVC, and eventually decoded, up-sampled to its
original resolution and post-processed by the CNN. This scheme achieved an
important reduction in required bandwidth (5.5% for HEVC common
test sequences and 9.0% for Ultra High Definition test sequences). On the other
hand, compression noise due to the dependency of the CNN on the
Quantization Parameters (QPs) used in the compressed training videos has been highlighted
in some cases. Moreover, the CNN encoding/decoding time was significantly
higher when compared to HEVC (although without any optimization for speed
on the CNN side [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). The same authors proposed an extension of their scheme
for inter-frame prediction in 2019 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Feng et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] developed a dual network
structure to improve the reconstruction quality of videos compressed at low
resolution. Here, an enhancement module operates before a super-resolution network
to deal with sampling and compression artifacts separately. The model achieved
about 31.5% bit-rate saving when compared to HEVC. Zhang et al. proposed
a residual convolutional neural network for loop filtering in HEVC [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. In this
scheme, the QP range is divided into several bands and a dedicated network
is trained with a progressive training scheme for each of them. This framework
achieved substantial coding gains, especially for low bit rates, but the encoding
time increased heavily [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Another approach, based on transmitting frames
using the traditional H.264 codec and then refining the transmitted frames, was
proposed in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], where a small encoder network is first trained to generate a
binary code that is transmitted together with the frame data, whereas on the
decoder side a small DNN decoder applies a residual correction to the frames
decoded by H.264.
      </p>
      <p>
        The literature about the use of DL for video compression and transmission
in the surgical domain is, on the other hand, limited. It has been shown that the
detection of clinically relevant spatio-temporal information can be exploited to
save compression time, while maintaining high quality in the transmitted frames.
To this aim, CNNs are used to segment the input frames and detect Regions of
Interest (ROI) whose quality needs to be better preserved. In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], Munzer et al.
identified domain-specific features of endoscopic videos that can be exploited for
efficient compression. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Ghamsarian et al. proposed a cataract surgery
video compression approach, based on HEVC, with the aim of preserving high
quality in meaningful regions. Two separate networks were employed for the
classification and segmentation of such regions, and larger distortion on
irrelevant content was allowed. Hassan et al. proposed a CNN-based segmentation
network (S-CNN) which has been shown to be useful for real-time applications
in a limited bandwidth scenario [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It is composed of four convolution layers
that identify the surgical regions that need to be transmitted in high quality,
as opposed to the background. Low QP values (corresponding to high quality
outputs) are then used for SR regions, whereas high QP values are used for the
background. In comparison to the standard HEVC scheme, S-CNN achieved an
average bit-rate reduction of 88.8% at HQ settings (QP in the range 0–20) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Unlike DVC, none of the aforementioned approaches adopted in the
surgery context covers all the five components of a codec system.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Testing of Standard and DL-based Codecs</title>
      <p>
        In our experiments we performed a comparison in the surgical domain between
traditional (H.264, HEVC) codecs and DVC [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], one of the first completely
DL-based video codecs.
      </p>
      <p>
        We selected robotic assisted radical prostatectomy (RARP) as a
representative procedure, as it is one of the most frequently performed robotic assisted
MIS operations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. RARP includes three phases. The first one is the pelvic
lymphadenectomy, where the focus is at the level of the iliac vessels;
in this phase, the most delicate structures in the center of the field are the blood
vessels that must be freed from the lymph nodes; the surgical field is small, and
the surgical movements are slow and more delicate. In the following step, called
the "demolition phase", the prostate is isolated posteriorly from the bladder, laterally
from the nerve bands and anteriorly from the urethra; here the surgical field
is wider, movements are faster and the organ of interest, the prostate, is in the
center of the visual field; the peripheral area is occupied by the iliac vessels
laterally and by the pubic bone above. In the last, reconstructive phase, the bladder
neck is sutured to the urethra; the surgical field is tight since the anastomosis
between bladder and urethra is performed in the small pelvis; movements are
small and mostly in the center of the surgical field.
      </p>
      <p>We extracted ten clips (40 seconds each) from a 94-minute, high quality
(1200 × 720) YouTube video with the endoscopic view captured during RARP
(Fig. 2). The clips were selected from different phases of RARP to maximize their
diversity, and thus they include different anatomical sections, surgery
instruments, levels of illumination and degrees of action performed in the surgery field.</p>
      <p>
        For each 40 s clip, our aim was to measure quality, bandwidth and latency as
a function of the adopted codec. To do so, we compressed and decompressed each
clip using the H.264 and HEVC implementations provided by ffmpeg [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Each
clip was encoded at different bitrates, ranging from 1 to 30 Mb/s (corresponding
approximately to 0.30 to 9.25 Bits Per Pixel, BPP). Compressing at different bit
rates allowed us to investigate the codec performance as a function of the
transmission bandwidth. In ffmpeg, H.264 and HEVC come with predefined presets that
achieve different compression ratio / frame quality / compression time (latency)
compromises. More specifically, some presets are designed to compress the frame
in a short amount of time (low latency) at the cost of decreased quality and
larger bandwidth, while others achieve the highest compression rate and frame
quality, but require more processing time. In our experiments, we considered the
Ultrafast, Medium and Slow presets, whose names indicate their position in the
quality / bandwidth / latency trade-off.
      </p>
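      <p>As a rough illustration of the encoding sweep described above, the following
Python sketch builds one ffmpeg command line per codec, preset and target bitrate.
The encoder names (libx264, libx265), the use of the -b:v option for rate control,
the intermediate bitrate values and the file names are assumptions made for
illustration only; the exact command lines used in the experiments are not
reported here.</p>
      <preformat>
```python
from itertools import product

# Assumed encoder names: libx264 for H.264 and libx265 for HEVC.
ENCODERS = {"h264": "libx264", "hevc": "libx265"}
PRESETS = ["ultrafast", "medium", "slow"]
# Target bitrates in Mb/s. Only the 1 to 30 Mb/s range comes from the text;
# the intermediate values are hypothetical.
BITRATES_MBPS = [1, 5, 10, 20, 30]


def build_command(src, codec, preset, mbps):
    """Return an ffmpeg argument list that encodes src with the given settings."""
    dst = f"{codec}_{preset}_{mbps}M.mp4"  # hypothetical output file name
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", ENCODERS[codec],
        "-preset", preset,
        "-b:v", f"{mbps}M",
        "-an",  # drop audio: only the video stream is evaluated
        dst,
    ]


def build_sweep(src):
    """Enumerate every codec / preset / bitrate combination for one clip."""
    return [
        build_command(src, codec, preset, mbps)
        for codec, preset, mbps in product(ENCODERS, PRESETS, BITRATES_MBPS)
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; clip01.mp4 is a placeholder.
    for cmd in build_sweep("clip01.mp4"):
        print(" ".join(cmd))
```
      </preformat>
      <p>Each argument list could then be passed to subprocess.run and timed to obtain
the encoding time for that configuration.</p>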
      <p>For each frame in the encoded/decoded clip, and each BPP/preset pair, we
then measured the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity
(SSIM), where the original, uncompressed video is the ground truth.
Furthermore, we measured the encoding and decoding times which, once added to the
transmission time, give the total latency.</p>
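      <p>For clarity, the per-frame PSNR can be sketched as follows. This is a minimal
example assuming 8-bit frames (peak value 255) passed as flattened sequences of
pixel values; the function name and signature are illustrative, and in practice an
array library would typically be used instead of plain Python lists.</p>
      <preformat>
```python
import math


def psnr(reference, decoded, max_value=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("frames must be non-empty and have the same size")
    # Mean squared error between the ground-truth and the decoded frame.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [10, 20, 30, 40]
    reconstructed = [11, 21, 31, 41]  # every pixel is off by one level
    print(round(psnr(original, reconstructed), 2))
```
      </preformat>
      <p>Averaging this value over all the frames of a clip gives the average PSNR for
one BPP/preset pair.</p>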
      <p>
        To perform our comparison, we encoded and decoded the same clips with
DVC [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], where all the encoding and decoding components (motion estimation,
motion compensation, residual compression, motion compression, quantization,
and bit rate estimation) are implemented by end-to-end neural networks. In
particular, a learning-based optical flow estimation network obtains the motion
information; two auto-encoders compress and reconstruct the corresponding
motion and residual information. All the components are jointly trained to optimize
a bit rate / distortion (measured by PSNR or SSIM) trade-off through a single
loss function, controlled by the hyperparameter λ. We considered the DVC model
optimized for PSNR, trained on Vimeo-90k (composed of a large variety of real
world scenes and actions) and using λ = 2048 to achieve the best reconstruction
quality. It is worth noting that in our DVC implementation, only differential
frames are encoded using the DL models, while I-Frames are encoded through
the BPG compression scheme. We measured PSNR, SSIM and compression /
decompression times for DVC (as well as for BPG) and compared them with
those obtained for H.264 and HEVC. To evaluate the impact of the I-Frame
coding on the overall performance of DVC, we tested three DVC configurations,
where the I-Frame was encoded every 5, 10 and 150 frames.
      </p>
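      <p>For reference, the rate-distortion objective described above can be written
schematically as follows. The symbols are our notation and are not taken from the
text above: L is the training loss, D the distortion between the original and the
reconstructed frame, R the estimated number of bits spent on the compressed motion
and residual information, and λ the hyperparameter that balances the two terms.</p>
      <preformat>
```latex
L = \lambda \, D + R
```
      </preformat>
      <p>Under this convention a larger λ weights the distortion term more heavily,
which is consistent with choosing a large value such as 2048 when the best
reconstruction quality is desired.</p>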
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Average quality and encoding / decoding time</title>
        <p>The first row of Fig. 3 shows the average PSNR and SSIM computed over the
entire set of clips, together with the encoding and decoding time, as a function
of the BPP, for H.264 and different codec presets. As expected, the quality of
the decoded frames increases with the bandwidth. On the other hand, the
encoding and decoding times also increase with BPP (as more time is required to
process a larger amount of data), up to the point (at least for the
slow preset with BPP &gt; 2; see Fig. 3, first row, third panel, green line) where encoding
is no longer feasible in real time, at least for the video resolution considered
here. HEVC shows a similar behavior (second row in the same figure), although,
when compared to H.264, it achieves a slightly better frame quality in terms of
both PSNR and SSIM, but at the cost of a higher encoding time, which makes it
unsuitable for real-time streaming (at least for the video resolution considered
here) unless the fast preset is used.</p>
        <p>Fig. 3 also reports the performance of the DL-based DVC codec, which achieves
a higher average frame quality while encoding data at a lower bit rate. On the
other hand, the encoding and decoding times for the unoptimized DVC
implementation considered here are significantly (one order of magnitude or more)
higher than those measured for H.264, which is highly optimized for speed and
hardware accelerated.</p>
        <p>By increasing the frequency with which the I-Frames are encoded in DVC,
an overall higher average frame quality can be achieved. This trend is clearly
explained by the fact that the pure BPG codec, which is characterized by minimal
compression loss, achieves the best PSNR/BPP compromises (and the overall
best SSIM), but at the price of a much higher encoding time, which renders BPG
unsuitable for real time applications.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Per frame quality and encoding / decoding time</title>
        <p>To investigate the performance of the three codecs in more detail, we also
conducted a per-frame analysis. Figs. 4 and 5 report the per-frame PSNR and SSIM
for two 40 s clips extracted from the RARP video. The two clips are
characterized by different content, luminance levels and dynamic conditions observed by
the endoscopic camera (Fig. 2). On video 1, DVC achieves an average frame
quality comparable to that obtained by H.264 and HEVC under the medium
and slow presets, while it performs better than H.264 under the ultrafast preset.
Both the traditional codecs and DVC show an oscillatory pattern in terms of
quality that is associated with the transmission of the I-Frames in high quality
(BPG compression in the case of DVC). Fig. 4 also shows a sudden performance
decrease in terms of PSNR (and a less significant decrease in terms of SSIM)
for DVC around frame 540 (denoted as (A)). Visual inspection (Fig. 2, video
1 (A)) reveals that smoke is present at this point in the clip, which alters the
colors of the scene in a way that is not well captured by the DVC compression
algorithm. Furthermore, we can observe that H.264 and HEVC achieve better
quality (compared to DVC) in the first part of the clip, characterized by slow
motion and static scenes, while DVC performs better in the second part ((B) in
Fig. 4), where the surgical instruments execute fast incisions and drifts of the
camera field of view are visible. While the quality of the frames compressed and
transmitted by H.264 and HEVC drifts significantly along the clip, DVC
performance in terms of quality is more stable. The quality of the frames processed
by DVC also remains fairly constant over time in video 2, where the RARP
procedure is significantly more dynamic compared to video 1 (Fig. 5). In these
conditions, DVC outperforms H.264 and HEVC in terms of quality in almost every
frame.</p>
        <p>Finally, Fig. 6 shows the residual error between the original frames and the
same frames compressed and reconstructed with H.264, HEVC and DVC, for
three video sequences randomly extracted from our set. When H.264 is adopted,
errors are mostly concentrated in the high frequency domain, i.e. mostly around
edges and small image details. HEVC clearly achieves higher quality, with
residual errors in the middle / high frequency domain. When DVC is adopted, on
the other hand, the error is dominated by low frequency residuals, which can be
seen as a slight color shift of large objects, while edges and small details appear
to be well reconstructed. The result of the visual inspection is thus consistent
with the PSNR and SSIM metrics measured in our previous experiments: while
DVC does not always outperform H.264 and HEVC in terms of PSNR, because
of a generalized color shift that creates a numerically large error with (likely)
poor clinical significance, it preserves edges better, which leads to a better
quality perceived by a human observer, as captured by SSIM (a metric that was
designed to loosely resemble the response of the human visual system).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion and Conclusion</title>
      <p>We have analyzed the problem of video transmission in the specific case of
surgical operations, which is characterized by peculiar constraints: to guarantee the
stability of the system, the latency must be kept under control, while bandwidth
may be limited; at the same time, the quality of the transmitted frames has to
be sufficient to guarantee that no significant clinical information is lost in the
process of compressing, transmitting and reconstructing frames.</p>
      <p>Our results show that, although some of the existing codecs are already capable
of (and, indeed, already used for) transmitting frames at low latency and good
quality, DNNs (or, at least, the DVC method considered here) are potentially
capable of transmitting video frames at higher quality while consuming less
bandwidth. Furthermore, it is worth noting that the considered network was
trained on a dataset of real-world scenes, and higher reconstruction quality can
be expected if the network is trained on endoscopic videos. However, the naive
implementation of DVC considered here (and likely those of many other DNNs)
does not satisfy the latency constraint. In other words, whereas traditional
approaches based on the H.264 and HEVC codecs are already largely optimized for
speed and quality, DNNs have a much larger margin of improvement that has not
been completely explored yet. Some of the possible optimizations in terms of both
quality and speed are therefore mentioned in the following.</p>
      <p>We observed that, when using DVC, the average quality of the transmitted
frames significantly increases when many I-Frames are encoded through BPG.</p>
      <p>
        On the other hand, the BPG encoding time (around one second per frame in
our experiments) largely exceeds the maximum allowed by surgical practice and
consequently obliges surgeons to change their workflow (e.g. adopting a
move-and-wait strategy), while also affecting the effectiveness of the surgical
operation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Therefore, more efficient methods to compress and transmit
I-Frames at high quality need to be explored, so that the latency introduced while
transmitting them becomes less critical. Solutions that exploit both past and future frames
(e.g. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]) cannot be applied in the case of real-time streaming, whereas other
directions like partial (block-based) transmission of I-Frames or the adoption of
ad-hoc vocabularies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] promise to deliver significant improvements in the future.
      </p>
      <p>
        In the case of the non optimized DVC implementations considered here,
the latency (even without considering the transmission of the I-Frames) still
exceeds the limit that allows a surgeon to e ectively operate without being
a ected by it (around 160ms [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]); even more important, as the encoding time
is larger than 33ms per frame1, it does not even allow real time transmission at
30Hz, as a signi cant delay would be accumulated over time. There are however
several methods to accelerate DNNs that can be easily exploited, ranging from
lightweight learning based solutions for image compression [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to software-based
solutions that prune DNNs to make them more efficient [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], hardware-aware
optimizations of the network implementation [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and up to the adoption of
GPU accelerators for DNNs such as Tensor Cores [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] or even ad-hoc hardware DNN
implementations [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] that can be seen as equivalent to hardware-accelerated H.264
encoders and decoders.
        <fn>
          <label>1</label>
          <p>It is worth noting that the total latency is given by the sum of the encoding,
transmission, and decoding times; on the other hand, the maximum between (encoding
+ transmission) and (transmission + decoding) defines the working frequency of
the system: e.g., if encoding and transmission together take 100 ms, the maximum
number of frames transmitted per second without accumulating delays will be
1 s / (100 ms/frame) = 10 frames.</p>
        </fn>
      </p>
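      <p>
        The relations stated in footnote 1 can be made concrete with a small sketch. The Python
fragment below is only an illustration under stated assumptions: the names StageTimes,
total_latency_ms, max_frame_rate_hz and accumulated_delay_ms are hypothetical, the stage
durations are made-up inputs rather than measurements from the experiments, and the linear
backlog model is a simplification introduced here to show how delay builds up when the
pipeline is slower than the camera.
      </p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    """Per-frame stage durations in milliseconds (hypothetical inputs)."""
    encode_ms: float
    transmit_ms: float
    decode_ms: float


def pipeline_period_ms(t: StageTimes) -> float:
    # Footnote 1: the working frequency is set by the larger of
    # (encoding + transmission) and (transmission + decoding).
    return max(t.encode_ms + t.transmit_ms, t.transmit_ms + t.decode_ms)


def total_latency_ms(t: StageTimes) -> float:
    # Footnote 1: latency of one frame is the sum of the three stages.
    return t.encode_ms + t.transmit_ms + t.decode_ms


def max_frame_rate_hz(t: StageTimes) -> float:
    # Highest frame rate that can be sustained without accumulating delay.
    return 1000.0 / pipeline_period_ms(t)


def accumulated_delay_ms(t: StageTimes, source_hz: float, n_frames: int) -> float:
    # Assumed linear model: every frame adds the gap between the pipeline
    # period and the camera period; no backlog if the pipeline keeps up.
    source_period_ms = 1000.0 / source_hz
    per_frame_gap_ms = max(0.0, pipeline_period_ms(t) - source_period_ms)
    return per_frame_gap_ms * n_frames


if __name__ == "__main__":
    # Illustrative values only: 80 ms encoding, 20 ms transmission, 40 ms decoding.
    times = StageTimes(encode_ms=80.0, transmit_ms=20.0, decode_ms=40.0)
    print(total_latency_ms(times))
    print(max_frame_rate_hz(times))
    print(accumulated_delay_ms(times, source_hz=30.0, n_frames=30))
```
      </preformat>
      <p>
        With these illustrative values the per-frame latency is 140 ms, which is below the
160 ms limit, yet the pipeline sustains only 10 frames per second; fed by a 30 Hz camera, it
would fall roughly two seconds behind after 30 frames. This is why the encoding time, and
not only the total latency, has to be reduced.
      </p>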
      <p>
        Finally, it is worth noting that surgical procedures often also make use
of stereo images to enable the perception of 3D information in the surgical field.
This complicates the transmission procedure, as the size of the data doubles, but
it also offers the possibility to take advantage of the redundancy between the left and right
views for the development of novel stereo codecs (see for instance [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) that
can find application in fields far removed from the medical one, such as
video games or virtual reality.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Per-title encode optimization</article-title>
          (
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chaabouni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaudeau</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lambert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moureaux</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gallet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>H.264 medical video compression for telemedicine: A performance analysis</article-title>
          .
          <source>IRBM</source>
          <volume>37</volume>
          (
          <issue>1</issue>
          ),
          <fpage>40</fpage>
          –
          <lpage>48</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bampis</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norkin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bovik</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>ProxIQA: A proxy approach to perceptual optimization of learned image compression</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          <volume>30</volume>
          ,
          <fpage>360</fpage>
          –
          <lpage>373</lpage>
          (
          <year>2021</year>
          ). https://doi.org/10.1109/TIP.2020.3036752
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Collins</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beaulieu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hung</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>Telementoring for minimally invasive surgery</article-title>
          .
          <source>In: Digital Surgery</source>
          , pp.
          <fpage>361</fpage>
          –
          <lpage>378</lpage>
          . Springer (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dasgupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirby</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          :
          <article-title>The current status of robot-assisted radical prostatectomy</article-title>
          .
          <source>Asian Journal of Andrology</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>90</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A dual-network based super-resolution for compressed high definition video</article-title>
          .
          <source>In: Pacific Rim Conference on Multimedia</source>
          . pp.
          <fpage>600</fpage>
          –
          <lpage>610</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ghamsarian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amirpourazarian</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Timmerer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taschwer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Scho mann, K.:
          <article-title>Relevance-based compression of cataract surgery videos using convolutional neural networks</article-title>
          .
          <source>In: Proceedings of the 28th ACM International Conference on Multimedia</source>
          . pp.
          <volume>3577</volume>
          {
          <issue>3585</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedram</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horowitz</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dally</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          :
          <article-title>EIE: Efficient inference engine on compressed deep neural network</article-title>
          .
          <source>ISCA '16</source>
          , IEEE Press (
          <year>2016</year>
          ). https://doi.org/10.1109/ISCA.2016.30
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hassan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghafoor</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tariq</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zia</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>High efficiency video coding (HEVC)–based surgical telementoring system using shallow convolutional neural network</article-title>
          .
          <source>Journal of Digital Imaging</source>
          <volume>32</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1027</fpage>
          –
          <lpage>1043</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kumcu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bombeke</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jovanov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Platisa</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Looy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nieuwenhove</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schelkens</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philips</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Visual quality assessment of H.264/AVC compressed laparoscopic video</article-title>
          (01
          <year>2015</year>
          ). https://doi.org/10.1117/12.2044336
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kumcu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vermeulen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elprama</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duysburgh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Platisa</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Nieuwenhove</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van De Winkel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jacobs</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Looy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philips</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Effect of video lag on laparoscopic surgery: correlation between performance and usability at low latencies</article-title>
          .
          <source>The International Journal of Medical Robotics and Computer Assisted Surgery</source>
          <volume>13</volume>
          (
          <issue>2</issue>
          ),
          <elocation-id>e1758</elocation-id>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural network-based block up-sampling for intra frame coding</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>28</volume>
          (
          <issue>9</issue>
          ),
          <fpage>2316</fpage>
          –
          <lpage>2330</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1109/TCSVT.2017.2727682
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural network-based block up-sampling for HEVC</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>29</volume>
          (
          <issue>12</issue>
          ),
          <fpage>3701</fpage>
          –
          <lpage>3715</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1109/TCSVT.2018.2884203
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>DSIC: Deep stereo image compression</article-title>
          .
          <source>In: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          . pp.
          <fpage>3136</fpage>
          –
          <lpage>3145</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
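          <string-name>
            <surname>Lombardo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,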
          <string-name>
            <surname>Schroers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Deep generative video compression</article-title>
          . In:
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Alché-Buc</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garnett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          . vol.
          <volume>32</volume>
          . Curran Associates, Inc. (
          <year>2019</year>
          ), https://proceedings.neurips.cc/paper/2019/file/f1ea154c843f7cf3677db7ce922a2d17-Paper.pdf
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ouyang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>DVC: An end-to-end deep video compression framework</article-title>
          .
          <source>In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>June 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Image and video compression with neural networks: A review</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>30</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1683</fpage>
          –
          <lpage>1698</lpage>
          (
          <year>2020</year>
          ). https://doi.org/10.1109/TCSVT.2019.2910119
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Markidis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chien</surname>
            ,
            <given-names>S.W.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laure</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>I.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vetter</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          :
          <article-title>NVIDIA Tensor Core programmability, performance and precision</article-title>
          .
          <source>2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)</source>
          (May
          <year>2018</year>
          ). https://doi.org/10.1109/IPDPSW.2018.00091
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Molchanov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kautz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fusi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vahdat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>HANT: Hardware-aware network transformation</article-title>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Molchanov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mallya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tyree</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frosio</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kautz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Importance estimation for neural network pruning</article-title>
          .
          <source>In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . pp.
          <fpage>11256</fpage>
          –
          <lpage>11264</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1109/CVPR.2019.01152
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. Munzer,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Schoe</surname>
          </string-name>
          <string-name>
            <surname>mann</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          , Boszormenyi, L.:
          <article-title>Domain-speci c video compression for long-term archiving of endoscopic surgery videos</article-title>
          .
          <source>In: 2016 IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS)</source>
          . pp.
          <volume>312</volume>
          {
          <issue>317</issue>
          (
          <year>2016</year>
          ). https://doi.org/10.1109/CBMS.
          <year>2016</year>
          .28
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Punchihewa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bailey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>A review of emerging video codecs: Challenges and opportunities</article-title>
          .
          <source>In: 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          . IEEE
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Tomar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Converting video formats with FFmpeg</article-title>
          .
          <source>Linux Journal</source>
          <volume>2006</volume>
          (
          <issue>146</issue>
          ),
          <fpage>10</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Tsai</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kautz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Learning binary residual representations for domain-specific video streaming</article-title>
          .
          <source>In: AAAI</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mentzer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gool</surname>
            ,
            <given-names>L.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Timofte</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Learning for video compression with hierarchical quality and recurrent enhancement</article-title>
          .
          <source>In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>6628</fpage>
          –
          <lpage>6637</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Residual highway convolutional neural networks for in-loop filtering in HEVC</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          <volume>27</volume>
          (
          <issue>8</issue>
          ),
          <fpage>3827</fpage>
          –
          <lpage>3841</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1109/TIP.2018.2815841
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>