=Paper= {{Paper |id=Vol-2744/paper4 |storemode=property |title=Pyramid of Filters - Fast Image Filtering without FFT |pdfUrl=https://ceur-ws.org/Vol-2744/paper4.pdf |volume=Vol-2744 |authors=Sergey Ershov,Ildar Valiev,Alexey Voloboy }}
Pyramid of Filters — Fast Image Filtering without FFT⋆

 Sergey Ershov[0000−0002−5493−1076] , Ildar Valiev[0000−0003−2937−8480] , and Alexey
                          Voloboy[0000−0003−1252−8294]

    Keldysh Institute of Applied Mathematics RAS, Miusskaya sq. 4, 125047 Moscow, Russia
                       {ersh,valiev,voloboy}@gin.keldysh.ru


        Abstract. Methods of computer graphics which have already become common
        in the design of new optical systems and materials are currently finding new
        applications such as stomatology and ophthalmology. Some modern imaging
        systems are now designed in conjunction with the human vision system at their
        end. As a result, simulation of the effects of human vision becomes necessary.
        These include partial defocusing and the resulting “blur” of the image, scatter-
        ing, halo/corona, and so on. Such effects are usually simulated by convolving the
        original, “ideal” image with the pixel spread function. The latter frequently has
        a support about the size of the source image, so a straightforward calculation of
        the convolution would take a giant number of operations. Therefore, for high res-
        olutions a decent speed is usually achieved by using the Fast Fourier Transform
        (FFT) for convolution. But since FFT operates on periodic functions on a lattice
        whose resolution is an integer power of a prime number, the required working
        resolution may considerably exceed that of the original image, and the required
        memory becomes inadmissible. This paper presents an alternative method that
        allows calculations in much smaller memory, avoiding the overhead introduced
        by the FFT requirements.

        Keywords: Optical Simulation · Imaging System Design · Human Vision Effects
        · Image Processing · Filters · Convolution.


1     Introduction
Methods of computer graphics which have already become common in the design of
new optical systems and materials are now applied in stomatology and ophthalmology.
Some imaging systems are now designed in conjunction with the human vision system
at their end. Simulation of many effects of human vision, as well as of defocusing and
aberration of lenses etc., uses 2D convolution of the original, “ideal” image with a
kernel for the pixel spread function [1]:
                        \bar{f}(x, y) = \sum_{x', y'} K(x' - x, y' - y)\, f(x', y')               (1)

where (x, y) are the pixel coordinates, f is the source image, and K(x, y) is the filter’s
kernel. Summation goes over the neighborhood of the “target” pixel (x, y) where the
kernel is nonzero.
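The sum (1) can be sketched directly in NumPy (our own illustration, not the authors' code; the function name and the zero-padded boundary handling are our choices):

```python
import numpy as np

def direct_convolve(f, K):
    """Direct evaluation of Eq. (1): for every target pixel, sum the
    kernel-weighted neighborhood.  K is a centered (2R+1) x (2R+1) array,
    so K[R + dx, R + dy] plays the role of K(x' - x, y' - y)."""
    R = K.shape[0] // 2
    fp = np.pad(f, R)                   # zeros outside the image
    out = np.zeros(f.shape)
    for dx in range(-R, R + 1):         # loop over kernel offsets,
        for dy in range(-R, R + 1):     # not over pixels
            out += K[R + dx, R + dy] * fp[R + dx : R + dx + f.shape[0],
                                          R + dy : R + dy + f.shape[1]]
    return out
```

The cost is one full pass over the image per kernel offset, i.e. about (2R+1)² operations per pixel, which is exactly what makes this direct approach prohibitive for image-sized kernels.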
    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons Li-
    cense Attribution 4.0 International (CC BY 4.0).
⋆ Supported by RFBR, grants 18-01-00569, 18-31-20032 and 19-01-00435.

     In simulation of such human vision effects as halo and glare [2–6], the size of the
filter (read: of its support) can be about the size of the source image. For HD photos and
other high-precision means the image size can be as huge as 40000×25000 pixels. Ob-
viously, if for each of about a billion target pixels we calculate a sum of about a billion
terms, the speed of such direct calculation will be below any expectation.
     Since the kernel depends only on the difference between the source and target coor-
dinates, efficiency can be drastically improved using the Fast Fourier Transform (FFT)
[7], with which the number of operations is nearly proportional to the number of pixels.
     However, FFT is not good regarding memory. First, FFT operates on periodic func-
tions (input, kernel, and result alike) so that the boundaries of the image are glued. As a
result, applying the FFT convolution to a black source image with the only white pixel
at (0,0) produces a filtered image which is gray not only near (0,0) but also near the
opposite end (x = Nx , y = Ny ). To avoid this artifact one has to add “margins” of size
at least twice the filter radius and pad the source image there with zeros.
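This padding step can be illustrated with a small NumPy sketch (our own illustration, not the authors' implementation; NumPy's FFT accepts arbitrary sizes, so only the zero padding against wrap-around is shown, not the radix-2 restriction):

```python
import numpy as np

def fft_convolve(f, K):
    """FFT convolution with zero padding against the periodic wrap-around
    ("glued boundaries").  K is a centered, symmetric kernel, for which
    convolution and the correlation-style sum of Eq. (1) coincide."""
    nx, ny = f.shape
    kx, ky = K.shape
    px, py = nx + kx, ny + ky           # enlarged working resolution
    conv = np.fft.irfft2(np.fft.rfft2(f, s=(px, py)) *
                         np.fft.rfft2(K, s=(px, py)), s=(px, py))
    # The kernel origin sat at (kx//2, ky//2); crop with that offset
    # to realign the result with the source image.
    return conv[kx // 2 : kx // 2 + nx, ky // 2 : ky // 2 + ny]
```

For an image-sized kernel the padded transform arrays dominate memory, which is precisely the overhead discussed below.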
     Meanwhile, as said above, in human vision simulation that radius is about the image
size, thus we apply the FFT convolution to a thrice larger image, 120000×75000 pixels
in the above example. And this is not the end. The usual FFT is radix-2, i.e. the resolution
must be an integer power of two. Although there do exist radix-3, radix-5 and other
variants [7], these exotic variants are less efficient. The nearest larger power of two for
120000×75000 is already 131072×131072 pixels.
     Since the number of FFT operations is nearly linear in the total number of pixels,
it still remains more or less admissible, and modern PCs can do that in half an hour
or an hour for 3 color components. The really critical thing is though memory, because
131072×131072 pixels, times 3 color channels, is already 192 Gb. One must keep the
source image, the filtered image and the kernel at the same time, all at that resolution,
plus several auxiliary arrays etc. So for our example this totals to more than 512 Gb,
which is rarely available nowadays.
     Meanwhile, the source image is just about 12 Gb, so all this huge memory is merely
the FFT overhead. Happily, for symmetric kernels it is possible to design an alternative
method that works in much less memory while still having decent speed. This paper
describes two such algorithms (which can also be combined).


2   Pyramid of resolutions
This is closely related to the well-known Level Of Detail or mipmap [1] or multigrids
in finite-difference methods. Decomposition into a series of images with decreasing res-
olution (based on the Laplacian pyramid) was even used in simulation of human vision
[8]. Other alternatives to FFT such as wavelets etc. are discussed in [9]. We shall proceed
from the multigrid idea, though differently, and obtain a pyramid of filters (and filtered
images). The idea of changing the resolution is much like that in the well-known image
pyramid [1].
    The kernel for human vision is rather sophisticated [4, 6] but is mainly a weighted
sum of isotropic components like
                                 K(r) = \frac{1}{(1 + (r/a)^2)^\beta}                                (2)
where r ≡ \sqrt{x^2 + y^2} is the distance between the source and target pixels. The spread
a can be both small (less than a pixel) for some components, and large for other com-
ponents. As a result, for small r the kernel changes very steeply, while at large r it
inevitably changes more and more slowly.


2.1   Intentions

To utilize this, let us decompose our filter into the sum of a “central” and a “periph-
eral” part, so that the former is distinct from zero only for, say, r ≤ 50, while the
“periphery”, which can be nonzero everywhere, is smooth (slowly changing) enough.
     The convolution with the original filter is then the sum of convolutions with the
“central” and “peripheral” sub-kernels. The former, because of the small filter radius,
can be calculated directly, without any FFT and thus at the original resolution. It
requires about 100² operations per pixel, which is a lot but still admissible for modern
PCs.
     As to the smooth peripheral part, it is insensitive to the fine-scale detail of the source
image, and the filtered image is smooth. So we can calculate it at a decreased resolution
(say tenfold) and then interpolate the result to the original resolution. At this highly
reduced resolution we can already apply FFT, because the overhead is admissible.
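A minimal sketch of this downsample-convolve-upsample step (our own illustration: coarse pixels store group sums as in Sec. 2.2, and a nearest-neighbor expansion stands in for the linear interpolation actually used by the method):

```python
import numpy as np

def peripheral_pass(f, Kp_coarse, s):
    """Convolve the smooth peripheral kernel at s-fold reduced resolution.
    Each coarse pixel stores the sum over its s x s group; image sides
    are assumed divisible by s (pad the image otherwise)."""
    nx, ny = f.shape
    coarse = f.reshape(nx // s, s, ny // s, s).sum(axis=(1, 3))
    # Small direct convolution on the reduced grid (zero-padded borders).
    R = Kp_coarse.shape[0] // 2
    cp = np.pad(coarse, R)
    out = np.zeros(coarse.shape)
    cx, cy = coarse.shape
    for dx in range(-R, R + 1):
        for dy in range(-R, R + 1):
            out += Kp_coarse[R + dx, R + dy] * cp[R + dx : R + dx + cx,
                                                  R + dy : R + dy + cy]
    # Expand the smooth result back to the original resolution.
    return np.kron(out, np.ones((s, s)))
```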
     Alternatively, we can repeat the procedure, now decomposing the peripheral compo-
nent into the “central part of the periphery” and the “periphery of the periphery”. Again
the central part will be limited to, say, 50 pixels, and the peripheral part will be so smooth
that it admits another 10-fold reduction of resolution (as compared to level 1, so already
10² = 100-fold as compared to the original resolution!). In our example this is merely
400×250, so the filter, even if it covers the whole image area, can be applied directly
(without FFT). So in this case we proceed without FFT at all levels.
     The first, central component is termed the first level of the pyramid. The “central part
of the periphery” is the second level of the pyramid. And the “periphery of the periphery”
constitutes the third level (which in principle can be subdivided in the same manner,
leading to the fourth level and further). How this pyramid looks in reality is shown in
Figure 1.
     While convolving with the 1st level, we work in the original resolution and keep the
source and the processed images which totals to about 24 Gb. The memory to keep the
filter of 50 pixels radius is negligible.
     At the second level, all is the same but for the tenfold reduced resolution, so it
needs merely about 24/10² Gb = 240 Mb, which is negligible. The third level takes even
less.
     So the total memory and speed are determined by the first level and are admissible.


2.2   Real reduction of resolution

In the above reasoning we spoke about a tenfold reduction of resolution, which was
purely illustrative. In reality the reduction ratio is determined by the desired accuracy
(which is in turn determined by the kernel and the requirements on the “smoothness” of
the periphery), and is usually just 3. That is, the resolution of the image processed at the
next level is 3 times lower than that at the current level, and each pixel of the next level

“comprises” a group of 3×3 = 9 current-level pixels. In the simplest case, the value stored
in that pixel is just the sum over that group (i.e. 9 pixels of the current level), though
later we shall also use other relations. The first 3 levels of the pyramid for one of the
filters used in simulation of human vision acuity [4] are shown in Figure 1.
     For this lower reduction ratio we shall need more levels to finally reach a resolution
low enough to be processed directly (without FFT); the number of levels can reach 7.
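The 3-fold reduction step, and a rough count of levels, can be sketched as follows (illustrative only; the 400-pixel threshold for "cheap direct processing" is our arbitrary assumption, not a value from the method):

```python
import numpy as np

def reduce3(f):
    """One pyramid step with reduction ratio 3 (M = 1): each next-level
    pixel stores the sum over its 3 x 3 group of current-level pixels.
    Sides are assumed divisible by 3 (otherwise pad the image first)."""
    nx, ny = f.shape
    return f.reshape(nx // 3, 3, ny // 3, 3).sum(axis=(1, 3))

# How many levels until the smaller side of a 40000 x 25000 image drops
# below a (hypothetical) 400-pixel threshold for direct processing:
side, levels = 25000, 1
while side >= 400:
    side //= 3          # 25000 -> 8333 -> 2777 -> 925 -> 308
    levels += 1
```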




Fig. 1. Decomposition of the kernel K(r) = (1 + (10r)^2)^{-3/2}. Dashed line is the original kernel.
Red is the first level (central part in the original resolution), green is the second level (thrice lower
resolution), blue and further (indistinguishable) — the third level and further. The sum over all
levels gives exactly the original kernel.



    The radius of the central part, which in our outline was said to be 50 pixels, is, of
course, not that round a number. It is calculated from the criterion that outside that area
the function changes so gradually that the inaccuracy of linear interpolation from the
3-fold reduced resolution is below the allowed threshold. For human vision kernels, it is
usually somewhere between 7 and 25 pixels depending on other parameters.
    All that and other details are explained below.


2.3   Calculation of levels. “Second order” pyramid (with gradient)

How is the second level (of the kernel and image) constructed? Let the reduction ratio
be 2M + 1 where M is a positive integer. Then each pixel of the next level somehow
replaces a group of (2M + 1) × (2M + 1) pixels of the current level “around” it.
    It is convenient to take M = 1 so that the reduction of resolution is 3-fold, and in our
calculations we used that value, though M = 2 and larger is also possible.
    Obviously, the convolution of the 1st level (i.e. the original) image with the 1st level
kernel can be written as
                      \bar{f}(x, y) = \sum_{x', y'} K((x' - x)\Delta, (y' - y)\Delta)\, f(x', y')                  (3)

where x, y, x', y' are the pixel coordinates, i.e. the integer indices, K is the kernel as a
function of the physical (continuous!) coordinates (meters, degrees, ...), while ∆ is the
pixel step, i.e. the change of that physical coordinate from one pixel to another.
    Grouping the pixels of the 1st level into “clusters” of (2M + 1) × (2M + 1) pixels,
so that on the original resolution grid their centers are

                            \big( (2M + 1)X' + 1,\; (2M + 1)Y' + 1 \big)                          (4)

where (X', Y') are the pixel coordinates at the reduced resolution, we can write the result
of the convolution at the point x = (2M + 1)X + 1, y = (2M + 1)Y + 1 (present in both
resolutions!) as


    \bar{f}(x, y) = \sum_{X', Y'} \sum_{\xi = -M}^{M} \sum_{\eta = -M}^{M}
        K\big( (X' - X)\Delta' + \xi\Delta,\; (Y' - Y)\Delta' + \eta\Delta \big)\, f(X' + \xi,\, Y' + \eta)                (5)

    \Delta' \equiv (2M + 1)\Delta

Near the origin the kernel usually changes steeply while far from it, for r ≥ r0 (below
we shall calculate that r0 ), the kernel becomes so smooth that for it the original high
resolution becomes superfluous. So according to our idea the kernel (which we shall
assume isotropic just for convenience) is decomposed into the sum of the “central part”
(vanishing for r > r0 ) and “periphery”:

                                  K(r) = K_c(r) + K_p(r)                                 (6)

    The peripheral part K_p(r) is known for r ≥ r_0, where by construction it equals
K(r), while for r ≤ r_0 it is arbitrary. It is natural to require that there K_p(r) be maxi-
mally simple and smooth, e.g. a parabola seamlessly going into K(r) at r = r_0:

            K_p(r) = \begin{cases} \alpha + \beta r^2, & r \le r_0 \\ K(r), & r \ge r_0 \end{cases}                 (7)
where

                        \alpha = K(r_0) - \frac{r_0 K'(r_0)}{2}                                  (8)

                        \beta = \frac{K'(r_0)}{2 r_0}                                            (9)
Then the central part is

            K_c(r) = \begin{cases} K(r) - \alpha - \beta r^2, & r \le r_0 \\ 0, & r \ge r_0 \end{cases}             (10)
and the convolution takes the form



    \bar{f}(x, y) = \underbrace{\sum_{x', y'} K_c\big( (x' - x)\Delta, (y' - y)\Delta \big)\, f(x', y')}_{\bar{f}_c(x, y)}

        + \underbrace{\sum_{X', Y'} \sum_{\xi = -M}^{M} \sum_{\eta = -M}^{M}
            K_p\big( (X' - X)\Delta' + \xi\Delta,\; (Y' - Y)\Delta' + \eta\Delta \big)\,
            f(X' + \xi,\, Y' + \eta)}_{\bar{f}_p(X, Y)}


So far we have introduced no approximation but just regrouped the original expres-
sion, naming the results of convolution with the central and peripheral parts of the
kernel \bar{f}_c(x, y) and \bar{f}_p(x, y), respectively.
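The split of Eqs. (6)-(10) can be sketched as follows (our own code, not the authors'; K and its derivative K' are passed as callables):

```python
import numpy as np

def split_kernel(K, dK, r0, r):
    """Central/peripheral split of an isotropic kernel, Eqs. (6)-(10).
    Inside r0 the periphery is the parabola alpha + beta*r^2 that matches
    K and K' at r0, so Kc = K - Kp vanishes smoothly at r = r0."""
    alpha = K(r0) - r0 * dK(r0) / 2.0            # Eq. (8)
    beta = dK(r0) / (2.0 * r0)                   # Eq. (9)
    Kp = np.where(r <= r0, alpha + beta * r**2, K(r))        # Eq. (7)
    Kc = np.where(r <= r0, K(r) - alpha - beta * r**2, 0.0)  # Eq. (10)
    return Kc, Kp
```

By construction Kc + Kp reproduces K exactly, so this stage introduces no approximation.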
    Now let us assume that r_0 is so large that inside one pixel of the 2nd level (i.e. for all
(2M + 1) × (2M + 1) pixels of the 1st level it comprises) K_p can be approximated by
a linear function (effectively the first terms of the Taylor series)

    K_p\big( (X' - X)\Delta' + \xi\Delta,\; (Y' - Y)\Delta' + \eta\Delta \big)
        \approx K_p\big( (X' - X)\Delta', (Y' - Y)\Delta' \big)
        + (X' - X)\, D\big( (X' - X)\Delta', (Y' - Y)\Delta' \big)\, \xi
        + (Y' - Y)\, D\big( (X' - X)\Delta', (Y' - Y)\Delta' \big)\, \eta

where

    D(r) \equiv \Delta' \Delta \times r^{-1} \frac{\partial K_p}{\partial r}
        = (2M + 1)\Delta^2 \times r^{-1} \frac{\partial K_p}{\partial r}
        = 2(2M + 1)\Delta^2 \times \frac{\partial K_p}{\partial (r^2)}

Then, introducing

            F_{m,n}(X, Y) \equiv \sum_{\xi = -M}^{M} \sum_{\eta = -M}^{M} \xi^m \eta^n\, f(X + \xi, Y + \eta)

we have in the 2nd level pixel (X, Y),

    \bar{f}_p(X, Y) \approx \sum_{X', Y'} K_p\big( (X' - X)\Delta', (Y' - Y)\Delta' \big)\, F_{0,0}(X', Y')
        + \sum_{X', Y'} (X' - X)\, D\big( (X' - X)\Delta', (Y' - Y)\Delta' \big)\, F_{1,0}(X', Y')
        + \sum_{X', Y'} (Y' - Y)\, D\big( (X' - X)\Delta', (Y' - Y)\Delta' \big)\, F_{0,1}(X', Y')          (11)

which is nothing but the convolution on the coarse pixel grid of 2nd level of resolution.
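The three moment images are cheap block reductions; one possible NumPy sketch (ours; the function name and einsum axis labels are arbitrary illustrative choices):

```python
import numpy as np

def moment_images(f, M=1):
    """Moments F_mn of each (2M+1) x (2M+1) pixel group: F00 is the plain
    block sum, while F10 and F01 weight each sub-pixel by its offset xi
    (resp. eta) from the group center.  Sides must be divisible by 2M+1."""
    s = 2 * M + 1
    nx, ny = f.shape
    g = f.reshape(nx // s, s, ny // s, s)      # axes: (X, xi, Y, eta)
    w = np.arange(-M, M + 1, dtype=float)      # offsets -M .. M
    F00 = g.sum(axis=(1, 3))
    F10 = np.einsum('xiye,i->xy', g, w)        # sum of xi  * f
    F01 = np.einsum('xiye,e->xy', g, w)        # sum of eta * f
    return F00, F10, F01
```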
   Deviation of the kernel from its linear approximation can be estimated through the
second derivatives. For r ≥ r0 the relative error is


            \le \frac{(\Delta')^2}{2} \cdot \frac{ \frac{\partial K}{\partial (r^2)} + 2 r^2 \frac{\partial}{\partial (r^2)} \frac{\partial K}{\partial (r^2)} }{ K }                         (12)

while within the central zone it is

            \le \frac{(\Delta')^2}{4} \cdot \left. \frac{ \frac{\partial K}{\partial (r^2)} }{ K } \right|_{r = r_0}                         (13)

The use of the derivative in r^2, not in r, is convenient when the kernel is given as an
analytic function of r^2, as it is in human vision. If the kernel is defined as a tabulated
function of r,

            \frac{\partial K}{\partial (r^2)} = \frac{1}{2r} \frac{\partial K}{\partial r}.                         (14)
Usually \big( \frac{\partial K}{\partial (r^2)} + 2 r^2 \frac{\partial}{\partial (r^2)} \frac{\partial K}{\partial (r^2)} \big) / K decreases for r → ∞, so the relative error in any pixel
is bounded by

            \frac{(\Delta')^2}{2} \cdot \left. \frac{ \frac{\partial K}{\partial (r^2)} + 2 r^2 \frac{\partial}{\partial (r^2)} \frac{\partial K}{\partial (r^2)} }{ K } \right|_{r = r_0}                         (15)

Then r_0 can be found as the radius for which the above expression equals the allowed
approximation error. This can be done with, e.g., bisection thanks to the monotone decrease.
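A sketch of that bisection (illustrative only; `rel_err` stands for the bound of Eq. (15) evaluated at a candidate radius):

```python
def find_r0(rel_err, tol, r_lo, r_hi, n_iter=60):
    """Bisection for the central-part radius: rel_err(r) is assumed to
    decrease monotonically, and we solve rel_err(r0) = tol."""
    assert rel_err(r_lo) >= tol >= rel_err(r_hi)
    for _ in range(n_iter):
        mid = 0.5 * (r_lo + r_hi)
        if rel_err(mid) > tol:
            r_lo = mid      # error still too large: r0 lies to the right
        else:
            r_hi = mid
    return 0.5 * (r_lo + r_hi)
```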
    Construction of the third level, i.e. the decomposition of K_p itself into central and
peripheral parts, is done similarly, just taking K_p instead of K. The central part of K_p
becomes level 2 in the pyramid of filters, while the peripheral part of K_p is again decom-
posed, its central part becoming level 3 of the pyramid of filters, and so on.
    Notice the filter of the n-th (for n > 1) level is not scalar: its value is a “tuple”
\{ K_p,\; r^{-1} \partial K_p / \partial r \}. The source image at this level also keeps a “triad” \{F_{0,0}, F_{0,1}, F_{1,0}\}
of sums over sub-pixels (i.e. over the (2M + 1) × (2M + 1) pixels of the previous level)
in each of its color channels.


3   Factorization/separability

These words mean that a function of several arguments is a product of univariate func-
tions. Namely, let us look at the famous Gaussian filter kernel

                                K = C^2 e^{-(r/a)^2}.                                (16)

It is a product of two univariate functions

                        K = C e^{-(x/a)^2} \times C e^{-(y/a)^2} \equiv k(x)\, k(y)                        (17)
so the convolution with it is just two successive one-dimensional convolutions



                        \bar{f}(x, y) = \sum_{x'} k(x' - x)\, g(x', y)                        (18)

                        g(x', y) = \sum_{y'} k(y' - y)\, f(x', y')                            (19)

For a filter of radius 100 pixels a direct calculation of the 2D convolution requires about
100² operations per pixel, while two 1D convolutions need only 200 operations, i.e. 50
times less. For larger radii the gain is even greater. So it would be great if this method
could be somehow extended to work with more general kernels besides the Gaussian.
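For the Gaussian case the two 1D passes of Eqs. (18)-(19) can be sketched as follows (our own illustration; the 1D kernel is truncated at radius R and normalized, both our choices):

```python
import numpy as np

def gaussian_separable(f, a, R):
    """Gaussian filtering as two 1D convolutions: along the second axis,
    then (after transposing) along the first, cf. Eqs. (18)-(19)."""
    t = np.arange(-R, R + 1, dtype=float)
    k = np.exp(-(t / a) ** 2)
    k /= k.sum()                      # normalize so k(x)k(y) sums to 1
    def pass1d(img):
        p = np.pad(img, ((0, 0), (R, R)))   # zero borders
        return sum(k[R + d] * p[:, R + d : R + d + img.shape[1]]
                   for d in range(-R, R + 1))
    return pass1d(pass1d(f).T).T
```

Each pass costs 2R + 1 operations per pixel, i.e. 2(2R + 1) in total instead of (2R + 1)², which is the gain quoted above.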
    It turns out that a 2D convolution can indeed be calculated as a series of 1D ones for
an isotropic kernel K(r), and even for just a “Hermitian” kernel K(x, y) = K(y, x); this
is a generalization of the “separable filtering” from [10], [1].
    Indeed, a Hermitian matrix K(x, y) = K(y, x) of dimension N × N can be ex-
panded as a sum of “eigenprojectors”
                        K(x, y) = \sum_{m=0}^{N-1} \lambda_m \psi_m(x)\, \psi_m(y)                        (20)

where \psi_m is a normalized eigenvector and \lambda_m the corresponding eigenvalue.
Notice that a matrix with nonnegative elements can have some negative eigenvalues.
    The largest contribution to (20) comes from eigenvectors with larger |λ|, so we order
the terms by the absolute value of the eigenvalues and terminate the series (20) when the
sum of the remaining λ_m² becomes less than the desired accuracy. Notice that for a
smooth kernel the first eigenvectors are also smooth, while the last ones oscillate rapidly,
changing sign from pixel to pixel. This is a usual property.
    For kernels used in simulation of the human vision, a good accuracy of about 1% is
achieved with just the first 2–4 eigenvectors. The convolution is now calculated as:

                        \bar{f}(x, y) = \sum_m \sum_{x'} \psi_m(x' - x)\, g_m(x', y)                        (21)

                        g_m(x', y) = \sum_{y'} \lambda_m \psi_m(y' - y)\, f(x', y')                        (22)

It is not necessary to keep all the images gm for all pixels. For a filter of size N × N it
is enough to keep just N rows of each of them.
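The truncated expansion (20) is a few lines with a dense symmetric eigensolver (a sketch under our own conventions; for very large kernel matrices one would use an iterative solver for the leading eigenpairs instead):

```python
import numpy as np

def eigen_factor(K2d, energy=0.99):
    """Expand a symmetric kernel matrix K(x, y) = K(y, x) as in Eq. (20)
    and keep the largest-|lambda| terms until the retained lambda^2 cover
    the requested fraction of their total sum."""
    lam, psi = np.linalg.eigh(K2d)           # K2d = psi @ diag(lam) @ psi.T
    order = np.argsort(np.abs(lam))[::-1]    # sort by |lambda|, descending
    lam, psi = lam[order], psi[:, order]
    cum = np.cumsum(lam ** 2)
    keep = int(np.searchsorted(cum, energy * cum[-1])) + 1
    return lam[:keep], psi[:, :keep]
```

Each retained term λ_m ψ_m(x) ψ_m(y) is rank-1, i.e. separable, so the 2D convolution becomes that many pairs of 1D passes as in Eqs. (21)-(22).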


4   Combining factorization and pyramid of resolutions
We can combine the two approaches: first construct the pyramid of filters and then
factor each of its levels.
    In principle the opposite order is also possible, i.e. we first decompose the original
kernel with (20) and then construct the pyramid for each of its \psi_m. But for this order
we would need to calculate the eigenvectors (at least the principal ones) of a matrix of
size about 40000×25000, which is too expensive.
    For all levels beyond the 1st the filter is a tuple \{ K_p,\; r^{-1} \partial K_p / \partial r \}, so one must
factor both its components, with the same number of retained eigenvectors for each. We
find that number from the obvious condition that K_p itself be approximated well enough,
because the gradient component r^{-1} \partial K_p / \partial r is used for improvement of accuracy to
the 2nd order and is thus less essential than the main term K_p.


5   Results

The method was applied to an HDRI image of size 4000×2500 shown in Figure 2 (left).
The filter was

                        K = \frac{1}{2\pi a^2} \cdot \frac{1}{(1 + (r/a)^2)^{3/2}}

with a = 12.5 pixels, and it was defined in an area of 2500×2500 pixels.




Fig. 2. Left: The source image used for benchmarking of the method. Right: its exact filtering
(FFT).



    We compared our methods with the exact filtering based on FFT, shown in Figure 2
(right). The results of our methods are shown in Figure 3. All methods and their com-
binations give very similar, acceptable results. The difference between images obtained
by the 5 filtering methods is below 0.67%.
    The memory and speed for a computer with an i7-4800MQ @ 2.7 GHz and 16 Gb of
DDR3 RAM are shown in Table 1.


        Table 1. Time in seconds and memory in Mb for different methods of filtration

                             pyramid 1st order    pyramid 2nd order          FFT
         with factorization  12.1 sec, 276 Mb     22.3 sec, 276 Mb           n/a
          no factorization   27.8 sec, 196 Mb     39.3 sec, 218 Mb     18.8 sec, 2598 Mb




    pyramid (no factorization, no gradient)          pyramid with gradient, no factorization




    pyramid with factorization, no gradient          pyramid with factorization and gradient

                            Fig. 3. Image filtered with our methods.


6    Conclusion
We elaborated and tested a method that allows filtering images of very high resolution
without FFT. The new method needs much less memory, while its speed is comparable
with that achieved with FFT, and in the case of the pyramid with factorization and no
gradient is even higher. The deviation from the standard exact method is meanwhile very
low, below 1%, which is admissible for visual applications. Accuracy can be further im-
proved by adding support of gradients (2nd order) and tuning the settings of the method.
This however increases memory consumption and decreases speed, so the results shown
in Section 5 are just a reasonable compromise.


References
1. Szeliski, R.: Computer Vision: Algorithms and Applications. Springer (2011).
2. Franssen, L., Coppens, J.: Straylight at the retina: scattered papers. Univ. of Amsterdam, Fac-
   ulty of Medicine (2007).
3. Ginis, H.S., Pérez, G.M., Bueno, J.M., Artal, P.: The wide-angle point spread function of the
   human eye reconstructed by a new optical method. Journal of vision 12(3), 20:1–20:10 (2012).
4. Spencer, G., Shirley, P., Zimmerman, K., Greenberg, D.: Physically-based glare effects for
   digital images. Computer Graphics 29(08), 325–334 (1995).
5. Williamson, C., Strachan, E., Siebert, J.: Simulating the effects of laser dazzle on human
   vision. In: Proceedings of International Laser Safety Conference. pp. 268–277. Orlando, USA
   (2013).

6. Yoshida, A., Mittner, M., Mantiuk, R., Seidel, H.P.: Brightness of glare illusion. In: Proceed-
   ings of the 5th symposium on Applied perception in graphics and visualization APGV ’08.
   pp. 83–90 (2008).
7. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C++: The Art of
   Scientific Computing (2nd edn). Cambridge University Press (2002).
8. Pattanaik, S.N., Ferwerda, J.A., Fairchild, M.D., Greenberg, D.P.: A multiscale model of adap-
   tation and spatial vision for realistic image display. In: Proceedings of the 25th Annual Confer-
   ence on Computer Graphics and Interactive Techniques SIGGRAPH ‘98. pp. 287–298. ACM,
   New York, NY, USA (1998).
9. Dammertz, H., Sewtz, D., Hanika, J., Lensch, H.: Edge-avoiding a-trous wavelet transform
   for fast global illumination filtering. In: Proceedings of High Performance Graphics 2010. pp.
   67–75 (2010).
10. Perona, P.: Deformable kernels for early vision. IEEE Trans. Pattern Anal. Mach. Intell. 17,
   488–499 (1995).