LaMa network architecture search for image inpainting
                                Dmytro Kolodochka1, Marina Polyakova1, Oleksandr Nesteriuk1 and Victor Makarichev2
                                1
                                    Odesa Polytechnic National University, 1, Shevchenko Ave., Odesa, 65044, Ukraine
                                2


                                                   Abstract
The neural architecture search problem is to find, among alternative versions of a selected network
block, the architecture that performs best according to a pre-selected evaluation strategy. The aim of
the paper is to improve the performance of image inpainting by using neural architecture search to
apply the wavelet transform in the LaMa network. Experiments with the developed software showed
that inpainting was better for images containing significant areas of uniform intensity or of fine-grained
or structural texture. Image fragments with complex textures or detailed patterns were inpainted
worse. The LaMa-based architectures obtained by the proposed search technique differ in the trade-off
between inpainting time and the quality of the reconstructed image. Inpainting of images with large
masks based on the LaMa network is improved by applying the wavelet transform. In particular, the
quality of filling missing areas that contain edges and small details is improved. In addition, the
dependence of the quality of generated details and object edges on the properties of the image
textures, which can be described by texture descriptors, was studied. A prospect for further research
is predicting the effectiveness of image inpainting with LaMa networks from the estimated values of
the texture descriptors of the original image and the size of the missing areas.

                                                   Keywords
                                                   Image inpainting, neural architecture search, wavelet transform, LaMa network


                                1. Introduction
In many applications of computer vision systems and computer graphics it is necessary to fill
missing areas of images. Image inpainting is applied in many practical situations, such as removing
redundant elements or restoring damaged parts of a photo [1]. Another application is the
retouching of fabrics, skin and hair in photos, which takes a lot of time when done manually. Image
inpainting can also be applied to video, because a video is a sequence of frames. Compression often
damages some parts of a video, and advanced image inpainting methods are able to solve this
problem effectively. These methods are also useful for museums with limited budgets that cannot
hire a professional artist to restore paintings.
                                    The research object is natural image inpainting in computer graphics and computer vision
                                systems.
    Image inpainting methods are classified into two main types: direct methods and deep learning
methods. Direct methods include methods based on partial differential equations, semi-automatic
drawing, and texture synthesis, for example PatchMatch, which is implemented in Adobe Photoshop.
Direct methods are fast, require almost no computing resources, are easy to implement, and process
images of any size. However, direct methods fill missing areas using only the known areas of the
same image. Therefore, they cannot restore objects that have no analogues in the image. In addition,
direct methods restore large missing areas of images poorly [2].
    Deep learning image inpainting methods can generate missing areas of the image with fine local
textures and good global consistency. For example, DeepFill v1-2 [3, 4], EdgeConnect [5] and
CoModGAN [6] differ in the properties of the reconstructed images, processing time, the size of the
processed image, and the quality of filling of image regions [7]. The shortcoming of the listed
methods is that they generate both image context and texture unsatisfactorily when large masks
are used.

                                ICST-2024: Information Control Systems & Technologies, September 23-25, 2023, Odesa, Ukraine.
                                   dmitrytdr@gmail.com (D. Kolodochka); marinapolyakova943@gmail.com (M. Polyakova); nesteryuk@op.edu.ua (O.
                                Nesteriuk); v.makarichev@khai.edu (V. Makarichev)
                                   0009-0006-3329-1504 (D. Kolodochka); 0000-0001-7229-7657 (M. Polyakova); 0000-0002-0806-8259 (O. Nesteriuk); 0000-
                                0003-1481-9132 (V. Makarichev)
                                            © 2024 Copyright for this paper by its authors.
                                            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
    In general, to achieve better results in image inpainting, the convolutional neural network
(CNN) architecture is made more complex or is divided into sub-networks with separate tasks [8].
LaMa-Fourier instead uses fewer parameters and a single network [9]. It is able to obtain a good
result even when the missing areas occupy most of the image.
   The subject of the research is methods of image inpainting based on the LaMa network.
   The LaMa-Fourier network fills large areas of spectral textures with high performance. However,
LaMa-Fourier inpaints fine image details and object edges with insufficient quality. Filling missing
areas of structural textures such as grass, leaves and textile is also difficult for the LaMa-Fourier
network. To avoid this shortcoming, LaMa-Wavelet processes both global and local image features
using a wavelet transform [7]. However, there are quite a lot of options for using the wavelet
transform in CNN architectures. Neural architecture search makes it possible to design an effective
LaMa network architecture with the lowest loss on the training set of images.
   The aim of the paper is to improve the performance of image inpainting by using neural
architecture search to apply the wavelet transform in the LaMa network. The neural architecture
search justifies the selection of the network used to solve the image inpainting problem.
   The main contributions of the paper are as follows.
   A neural architecture search technique is developed for image inpainting.
   The results of the neural architecture search allow a solution to the image inpainting problem to
be chosen based on the requirements for the quality of restored images and for processing time.
   The dependence of the quality of generated details and object edges on the properties of the
image textures, which can be described by texture descriptors, was revealed and studied.

2. Problem statement
The RGB natural image is given by $I(x,y) = (I_R(x,y), I_G(x,y), I_B(x,y))$, where $x = 1, \dots, n$;
$y = 1, \dots, m$; and $I_R(x,y)$, $I_G(x,y)$, $I_B(x,y)$ are the color channels. To model missing
image areas, a mask is introduced. It is a binary image $M(x,y)$ of the same size as each channel of
the original image. The mask is multiplied element-wise by the image channels, and the image with
missing areas is defined as
$I'(x,y) = (I_R(x,y) \circ M(x,y), I_G(x,y) \circ M(x,y), I_B(x,y) \circ M(x,y))$. The image $I'(x,y)$
should be transformed so as to fill the missing areas. In this case, the resulting image should be an
approximation of the original one in the sense of some criterion [7, 9].
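The masking step can be illustrated with a minimal NumPy sketch. It assumes that a mask value of 1 marks a known pixel and 0 marks a missing one, so that the element-wise product removes the missing area; the function name `apply_mask` and the toy data are illustrative and not taken from the paper.

```python
import numpy as np

def apply_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Multiply every color channel of an (n, m, 3) image by an (n, m) binary mask.

    Assumption: mask == 1 for known pixels and mask == 0 for missing pixels.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must have the same spatial size as the image")
    # Broadcasting the mask over the channel axis gives the element-wise product.
    return image * mask[..., np.newaxis]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 5, 3)).astype(np.float64)
mask = np.ones((4, 5))
mask[1:3, 2:4] = 0  # a 2x2 missing area
masked = apply_mask(image, mask)
```

Pixels outside the missing area keep their original values, while all three channels are zeroed inside it.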
     The problem of neural architecture search is as follows [10, 11]. The neural architecture search
space includes a basic network with alternative versions of a selected block. Next, a performance
estimation strategy must be selected. It evaluates the performance of the neural networks whose
architecture includes one of the versions of the selected block. Finally, it is necessary to obtain the
neural network architecture with the version of the selected block that has the best performance
compared with the other alternative versions.
     Let the set $F = \{struct_f\}$ include the architectures $struct_f$ of the inpainting network
$f_\theta$, where $\theta$ denotes the network weights. $F$ is called the search space. Taking the
image with missing areas $I'(x,y)$, the CNN $f_\theta$ processes the input in a fully-convolutional
manner and produces the inpainted image $I_{in}(x,y) = f_\theta(I'(x,y))$, which approximates the
original image $I(x,y)$.
     Then the dataset $D$ consisting of pairs (image $I(x,y)$, mask $M(x,y)$) is selected to train the
network $f_\theta$ with architecture $struct_f \in F$ by the training strategy $T$. It is necessary to
find an architecture $struct_f \in F$, within a given time or computation budget $t$, which has the
lowest possible validation loss $L_{val}$ when trained using dataset $D$ and training strategy $T$:

$$\min_{struct_f \in F} L_{val}(f_{\theta^*}), \quad (1)$$

where $\theta^* = \arg\min_{\theta} L_{train}(f_\theta)$ and $L_{train}$ denotes the training loss.
The pairs (image $I(x,y)$, mask $M(x,y)$) of dataset $D$ are obtained from natural images and
synthetically generated masks.
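The selection rule in (1) can be sketched as a simple exhaustive search. This is a minimal illustration under stated assumptions: each candidate architecture is represented by a hypothetical callable that trains the network on the dataset with the chosen strategy and returns its validation loss, and the candidate names and loss values below are toy placeholders, not results from the paper.

```python
import time
from typing import Callable, Dict, Tuple

def search_architecture(
    search_space: Dict[str, Callable[[], float]],
    budget_seconds: float,
) -> Tuple[str, float]:
    """Return the candidate with the lowest validation loss.

    search_space maps an architecture name to a hypothetical callable that
    trains the network and returns its validation loss. Candidates are
    evaluated one by one until the time budget is spent.
    """
    best_name, best_loss = "", float("inf")
    start = time.monotonic()
    for name, train_and_validate in search_space.items():
        if time.monotonic() - start > budget_seconds:
            break  # budget exhausted: keep the best candidate found so far
        loss = train_and_validate()
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name, best_loss

# Toy placeholder losses, used only to exercise the function.
toy_space = {
    "candidate_a": lambda: 0.42,
    "candidate_b": lambda: 0.37,
    "candidate_c": lambda: 0.40,
}
best = search_architecture(toy_space, budget_seconds=60.0)
```

With a small search space, as in this paper, evaluating every candidate is feasible; larger spaces would need a search strategy that avoids training each architecture fully.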

3. Literature review
In this paper a neural architecture search is performed to improve the quality of image inpainting.
Therefore, wavelet CNN architectures are reviewed in order to generate the network blocks that
form the search space. Analysis of deep learning CNN architectures allowed the authors to
determine the main applications of wavelets in network architectures.
    The first approach is embedding wavelets in CNN layers. For example, in [12, 13] the wavelet
transform is used as an implementation of convolutional and pooling layers to obtain a multiscale
representation of the image.
    In [14, 15] it is proposed to use a wavelet transform instead of a pooling layer in a CNN
architecture for image recognition in order to reduce the dimension of feature maps. Max pooling
and average pooling are the most popular pooling methods. They are based on neighborhood
processing, which easily introduces visual distortion. To avoid this problem, a pooling method based
on the Haar wavelet transform was developed [14]. In [15] the more sophisticated Daubechies
wavelets and coiflets are used to perform the pooling. Additionally, a new pooling method for CNNs
that combines multiple wavelet transforms is proposed. The benefit of these pooling methods is
improved object recognition performance.
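The idea of wavelet-based pooling can be sketched in NumPy as follows. This is a simplified reading, not the exact procedure of [14, 15]: a single-level 2D Haar transform is computed and only the approximation subband is kept, which halves each spatial dimension. The subband naming (LL, LH, HL, HH) follows one common convention, and the function names are illustrative.

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """Single-level 2D Haar DWT of an array with even height and width.

    Returns (LL, LH, HL, HH), each with half the spatial size of x.
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_pool(feature_map: np.ndarray) -> np.ndarray:
    """Wavelet pooling sketch: keep only the approximation (LL) subband."""
    return haar_dwt2(feature_map)[0]

feature_map = np.arange(16.0).reshape(4, 4)
pooled = haar_pool(feature_map)
```

Unlike max pooling, the discarded information is exactly the three detail subbands, so it stays available if a later stage needs it.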
    Although several activation functions have been applied in neural networks, the search for new
activation functions is still an open research area. In [16] Gaussian family wavelets (the first, second
and third derivatives) are used as activation functions in neural networks. The combination of
these activation functions improves CNN performance. In [17], to enhance the proposed wavelet
CNN, the activation function of the convolutional layers is replaced by the real part of the Gabor
wavelet, while the activation function of the last layer is the sigmoid function. As a result, the
precision and accuracy of image classification on the test datasets were improved.
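For illustration, two members of the Gaussian wavelet family can be written as element-wise activation functions. The sketch below assumes the unnormalized forms derived from exp(-x^2/2); the scaling and sign conventions used in [16] may differ, and the function names are illustrative.

```python
import numpy as np

def gaussian_wavelet_1(x):
    """First derivative of exp(-x^2 / 2), up to normalization."""
    return -x * np.exp(-0.5 * x ** 2)

def mexican_hat(x):
    """Negative second derivative of exp(-x^2 / 2), up to normalization."""
    return (1.0 - x ** 2) * np.exp(-0.5 * x ** 2)

inputs = np.linspace(-3.0, 3.0, 7)
first = gaussian_wavelet_1(inputs)
second = mexican_hat(inputs)
```

Both functions are bounded and decay to zero for large inputs, and each requires an exponential per element, which is consistent with the higher computational cost relative to ReLU noted later in this section.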
    The second approach is the alternation of wavelet transform levels with CNN layers [12, 18]. In
[12] it is noted that CNNs process images directly in the spatial domain. To incorporate a spectral
approach into CNNs, multiresolution analysis and a CNN are combined into one model via the
wavelet transform, which is integrated as additional components in the network architecture.
Applying wavelet CNNs to texture classification and image annotation problems achieves better
accuracy than existing CNNs with significantly fewer parameters [12]. In [18] a multi-level wavelet
CNN architecture is designed that includes a CNN block before each level of the discrete wavelet
transform (DWT). The CNN block is a fully convolutional network without pooling that takes all
wavelet coefficient subbands as input. Each layer of the CNN block is composed of a convolutional
layer, batch normalization (BN), and ReLU activation. The last layer of the multi-level wavelet CNN
is a convolutional layer that predicts the resulting image. The experimental results show the
effectiveness of the multi-level wavelet CNN with Haar and Daubechies wavelets for image
restoration problems such as image denoising and removal of JPEG image artifacts.
    The third approach is a CNN with wavelet domain inputs. In [19] it is noted that although
pooling layers reduce the computational requirements of CNNs, they cause a loss of information
and affect the image classification accuracy. The CNN with wavelet domain inputs is proposed to
enhance the quality of the input information and increase the classification accuracy without
changing the overall structure of the pre-defined CNN or increasing the number of parameters.
Specifically, at the pre-processing stage the wavelet packet transform or the dual-tree complex
wavelet transform is applied to the original image. Then some wavelet coefficient subbands are
selected as the CNN inputs, so that the networks are trained directly in the wavelet domain. The
experiments show an improvement in image classification accuracy.
    The analysis of the considered CNN architectures that use the wavelet transform leads to the
following observations. The time for image processing by a trained network is most likely
comparable for the three approaches considered, but the training time for a CNN with wavelet
domain inputs is significantly less, since the training examples can be transformed into the space of
wavelet coefficients before the network is trained.
    Besides, the use of wavelets in convolutional layers is characterized by the complexity of
implementation and of interpretation of the results. In addition, this approach limits the learning of
image feature extraction during training, which is an advantage of CNNs. The use of the wavelet
transform in the pooling layer has shown effectiveness in object recognition, but such layers are not
used in LaMa networks.
    CNNs with wavelets as activation functions significantly increase the amount of computation
compared with ReLU. In addition, it is difficult to predict how such an application of the wavelet
transform will affect the result of image inpainting.
   Using a CNN with wavelet domain inputs limits the selection of features to the domain of
wavelet coefficients. In addition, the quality of the inpainted image is likely to be negatively
affected by the border effect of the wavelet transform.
   When wavelet transform levels are interleaved with CNN layers, the search is for image features
that emphasize image details and object edges. Although the feature selection is somewhat limited,
this approach is easier to implement. Therefore this approach and the previous one are used further
in this paper to form the neural architecture search space.

4. Materials and methods
To improve the LaMa network performance, a technique of neural architecture search for image
inpainting is proposed. The stages of this technique are as follows.

   1.   The underlying neural architecture is selected based on the analysis of existing CNNs for
        image inpainting.
   2.   To construct the neural architecture search space, a basic network block is selected for
        modification. Alternative versions of the selected block are generated and included in the
        neural architecture search space.
   3.   The loss function is defined to estimate the training and validation losses.
   4.   The image dataset is selected and pairs (image, mask) are formed to train the networks with
        the designed architectures.
   5.   The measures of image inpainting performance are selected.
   6.   Each network with a selected architecture is trained to fill missing image areas on the
        training set of images. The validation set of images is used to control the training process.
   7.   The trained networks are applied to test images and the image inpainting performance is
        evaluated depending on the size of the missing areas.
   8.   The obtained results are analyzed to determine which neural architecture inpaints images
        from the considered database better, depending on the size of the missing areas.

   If necessary, the developed technique of neural architecture search can be configured to study
other CNNs that fill missing image areas, as well as to identify disadvantages and validate the
obtained results. The implementation of the technique of neural architecture search for image
inpainting is considered next. The first, second and third stages are discussed in this section. The
following sections examine the remaining stages.
   At the first stage, to generate a neural architecture search space for image inpainting, a base
CNN architecture is selected [10, 11]. The architecture of the LaMa-Fourier network is shown in
Figure 1 [9]. The network takes as input an image with pixels that need to be inpainted. This image
is then downscaled by a factor of 3 and processed by nine residual blocks. After that, the image is
rescaled to its original size and fed to the network output [9].
   In the residual block, the double Fast Fourier Convolution (FFC) decomposes the image into
local and global textures, which are then passed through the convolution layers [9]. The global
texture is additionally processed by the spectral transform block. Then the convolution layer
outputs are added crosswise ("cross over cross"). BN and the ReLU activation are applied to them.
The results of processing the global and local textures are concatenated and summed with the
original image (Figure 2).
   In the spectral transform block the real and imaginary parts of the Fourier transformed image
are concatenated (Figure 2). Then the convolutional layer, BN and the ReLU activation function are
applied sequentially. The obtained result is split into the real and imaginary parts, which are
processed by the inverse fast Fourier transform (iFFT). The result of the iFFT is the output of the
block [9].
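The data flow of the spectral transform block can be sketched in PyTorch as below. This is a minimal illustration of the sequence described above (FFT, concatenation of real and imaginary parts, convolution, BN, ReLU, split, iFFT), not the reference LaMa implementation. The class name `SpectralTransformSketch`, the use of the real-input FFT, the `ortho` normalization and the default 3x3 kernel (taken from the label in Figure 2) are assumptions.

```python
import torch
import torch.nn as nn

class SpectralTransformSketch(nn.Module):
    """FFT -> concat(Re, Im) -> Conv -> BN -> ReLU -> split -> iFFT."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Real and imaginary parts are stacked, so the channel count doubles.
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(2 * channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        height, width = x.shape[-2:]
        spectrum = torch.fft.rfft2(x, norm="ortho")  # complex tensor
        stacked = torch.cat([spectrum.real, spectrum.imag], dim=1)
        stacked = self.relu(self.bn(self.conv(stacked)))
        real, imag = torch.chunk(stacked, 2, dim=1)
        # Passing the spatial size restores the original width after rfft2.
        return torch.fft.irfft2(torch.complex(real, imag),
                                s=(height, width), norm="ortho")

x = torch.randn(2, 4, 16, 16)
out = SpectralTransformSketch(channels=4)(x)
```

Because every frequency coefficient depends on all pixels, a single convolution in the spectral domain has an image-wide receptive field, which is why this block handles the global context.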
   To implement the second stage of the proposed technique, note that neural architecture search
is very time consuming. This is probably why this approach has not been used to design CNNs for
image inpainting; at least, the authors of this paper have not found such papers. Because training
takes so much time, it is unrealistic to use a large search space. To form the search space, only the
spectral transform block in Figure 2 is considered in this paper. This block processes the global
context of the image. Four versions of this block were designed using the DWT.
Figure 1: LaMa-Fourier network architecture [9]. The inpainting network fθ consists of downscaling
(3x), residual blocks with Fast Fourier Convolution (9x), and upscaling (3x).

Figure 2: Fast Fourier Convolution (FFC) including the spectral transform block [9]. The local and
global branches pass through Conv 3x3 layers and the spectral transform, followed by BN and ReLU.
The spectral transform block applies FFT, concatenation of Re and Im, Conv 3x3, BN and ReLU,
splitting into Re and Im, and iFFT.

    The LaMa-Wavelet v1 network applies a single-level 3D Haar wavelet transform [20, 21]. The
Fourier Unit Structure block from the original LaMa-Fourier architecture is replaced by the Wavelet
Convolution Block (one level) developed by the authors (Figure 3). This block uses a 3D wavelet
transform with the Haar wavelet, which introduces a time-frequency analysis different from the
Fourier transform. In this case, the decomposition of the obtained coefficients into real and
imaginary parts is also omitted, because the Haar wavelet has no imaginary part. The obtained
wavelet coefficients are split so that each subband represents a separate feature of the image. A
convolutional layer, BN and the ReLU activation function are sequentially applied to the split
coefficients at each level of the wavelet transform (Figure 3). The transformed subbands of wavelet
coefficients are concatenated, and then the inverse discrete wavelet transform (iDWT) is applied.
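A simplified sketch of a one-level wavelet convolution block is given below. It deliberately differs from the block described above in two ways, both assumptions made for brevity: it uses a hand-written 2D Haar transform instead of the 3D transform, and it applies one shared convolution to the stacked subbands. All names are illustrative and are not taken from wconv.py.

```python
import torch
import torch.nn as nn

def haar_dwt2(x: torch.Tensor):
    """Single-level 2D Haar DWT over the last two (even-sized) dimensions."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: interleave the four reconstructed pixel grids."""
    shape = (*ll.shape[:-2], 2 * ll.shape[-2], 2 * ll.shape[-1])
    out = ll.new_zeros(shape)
    out[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

class WaveletConvBlockSketch(nn.Module):
    """DWT -> Conv 3x3 -> BN -> ReLU on the stacked subbands -> iDWT."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4 * channels, 4 * channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(4 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        subbands = torch.cat(haar_dwt2(x), dim=1)  # four subbands as channels
        subbands = self.body(subbands)
        return haar_idwt2(*torch.chunk(subbands, 4, dim=1))

x = torch.randn(2, 3, 8, 8)
roundtrip = haar_idwt2(*haar_dwt2(x))
out = WaveletConvBlockSketch(channels=3)(x)
```

The Haar coefficients are real, so, unlike the Fourier version, no split into real and imaginary parts is needed before the convolution.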
    The LaMa-Wavelet v2 network uses a two-level Haar wavelet transform. The Fourier Unit
Structure from the architecture of the LaMa-Fourier network is replaced by the Wavelet
Convolution Block (two levels) (Figure 4). This block is based on the wavelet CNN architecture [12],
where a convolutional layer is applied to the wavelet coefficients at each level of the wavelet
transform. Initially, the eight subbands of coefficients at the first level of the 3D wavelet transform
are obtained. Then the first convolutional layer is applied to the split LLH, LHL, HLL, LHH, HLH,
HHL and HHH subbands (Figure 4). The LLL subband is passed to the second level of the 3D
wavelet transform. The obtained subbands of coefficients are concatenated with the output of the
first convolutional layer. The result of the concatenation is passed to the second convolutional
layer. After that, the processed coefficients of the second level of the 3D wavelet transform are
concatenated with the outputs of the first convolutional layer. Finally, the inverse two-level 3D
wavelet transform is applied.
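The eight subbands and the reuse of LLL at the second level can be illustrated with a plain NumPy 3D Haar transform. This sketch covers only the decomposition, not the convolutional layers, and it assumes that the three transformed axes are those of a (channels, height, width) volume with even sizes; the paper does not state the axis order, and the function names are illustrative.

```python
import numpy as np

def haar_split(x: np.ndarray, axis: int):
    """Orthonormal Haar low-pass and high-pass halves along one even-length axis."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_dwt3(volume: np.ndarray) -> dict:
    """Single-level 3D Haar DWT; returns the subbands 'LLL', 'LLH', ..., 'HHH'."""
    bands = {"": volume}
    for axis in range(3):
        next_bands = {}
        for name, data in bands.items():
            low, high = haar_split(data, axis)
            next_bands[name + "L"] = low
            next_bands[name + "H"] = high
        bands = next_bands
    return bands

volume = np.random.default_rng(0).standard_normal((4, 8, 8))
level1 = haar_dwt3(volume)          # eight subbands of half size per axis
level2 = haar_dwt3(level1["LLL"])   # only LLL is decomposed again
```

Each level halves every axis, so the second-level subbands of a 4x8x8 volume have size 1x2x2, and the transform preserves the total energy of the input.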


Figure 3: Wavelet Convolution Block (DWT, Conv 3x3, BN and ReLU, iDWT).

Figure 4: Wavelet CNN based Wavelet Convolution Block (first-level DWT, split into LLL and the other
coefficients, second-level DWT of LLL, Conv 3x3 layers with concatenation of level 1 and level 2
coefficients, two-level iDWT).
    The LaMa-Wavelet v3 network architecture applies a single-level Daubechies wavelet
transform. The Fourier Unit Structure block from the original LaMa-Fourier network architecture
is replaced by the Wavelet Convolution Block (one level) with the Daubechies 4 wavelet (Figure 3).
The Daubechies 4 wavelet is chosen because of its ability to capture more complex image features
than the Haar wavelet.
    In LaMa-Wavelet v4 the Simple Wavelet Convolution Block developed by the authors is used
instead of the Fourier Unit Structure. In this block, a two-level 3D wavelet transform of the image
using the Daubechies 4 wavelet is performed first (Figure 5). The obtained coefficients of the 3D
wavelet transform are split, and a convolutional layer, BN and ReLU are sequentially applied to the
split coefficients at each level of the wavelet transform. The obtained subbands of wavelet
coefficients are concatenated and the iDWT is applied to them; its result is the output of the block.


Figure 5: Simple Wavelet Convolution Block (DWT, split into level 1 and level 2 coefficients, Conv 3x3
with BN and ReLU for each level, concatenation, iDWT).

   At the third stage of the neural architecture search technique the loss function needs to be
defined to estimate the training and validation losses. In [7, 9] the loss $L_{final}$ is specially
designed to solve the problem of filling large missing regions. In this paper the loss function
$L_{final}$ is used, which combines the pixel loss $L_2$, the perceptual loss $L$ and the competition
(adversarial) loss $L_D$ [7, 9]:

$$L_{final} = k L_2 + a L + b L_D, \quad (2)$$

    where $k$, $a$, $b$ are constants.
    The mean square error between the original and restored images is used to estimate the $L_2$
pixel loss [7, 9]. The learned perceptual image patch similarity $L$ evaluates the perceptual
similarity between the restored and original images using a pre-trained neural network [7, 9].
    The discriminator is used to estimate the competition loss $L_D$. This additional CNN is trained
in parallel with the basic network to distinguish between real and generated images. Based on this
evaluation, the weights of the basic network are tuned to improve the realism of the generated
images. Thus, $L_D$ is the estimate of the error in the global and local textures computed from the
discriminator output [7, 9].
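The weighted combination in (2) can be sketched as follows. Only the pixel term is computed here; the perceptual and competition terms are passed in as numbers because they require a pre-trained network and a discriminator. The default weights of 1.0 are placeholders, since the paper does not report the constants, and the function names are illustrative.

```python
import numpy as np

def pixel_loss_l2(original: np.ndarray, restored: np.ndarray) -> float:
    """Mean square error between the original and restored images."""
    diff = original.astype(np.float64) - restored.astype(np.float64)
    return float(np.mean(diff ** 2))

def final_loss(l2: float, perceptual: float, competition: float,
               k: float = 1.0, a: float = 1.0, b: float = 1.0) -> float:
    """Weighted sum of the three loss terms; k, a, b are placeholder weights."""
    return k * l2 + a * perceptual + b * competition

original = np.zeros((8, 8, 3))
restored = np.ones((8, 8, 3))
l2 = pixel_loss_l2(original, restored)
total = final_loss(l2, perceptual=2.0, competition=3.0, k=10.0, a=1.0, b=0.5)
```

The weights set the balance between pixel fidelity, perceptual similarity and realism, so they would normally be tuned on the validation set.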

5. Experimental Setup
In this section the fourth, fifth and sixth stages of the neural architecture search technique are
discussed. At the fourth stage of the proposed technique, image datasets are selected and pairs
(image, mask) are formed to train the networks with the designed architectures. The 16,000 images
from the Places365 and Safebooru databases [22, 23] were scaled to a size of 256x256 pixels and
randomly split into training and validation sets in the ratio of 95% to 5%. As in [7], for each image
the generated mask consisted either of 1-4 rectangles with sides of 30-150 pixels or of 1-5 straight lines 10-200
pixels long, 1-
from narrow (10% of the image pixels) to large (80% of the image pixels). This ensures that the
networks are trained at different levels of inpainting complexity, with masks that uniformly cover
different image areas. To generate a mask of a particular format for specific practical cases, the user
can draw a mask on the original image in the software described below.
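The rectangle variant of the mask generation can be sketched in NumPy. The paper uses the mask generator of the saicinpainting library; the code below is an independent illustration, not that library's API. It assumes that 0 marks a missing pixel, that rectangles may overlap, and that masks outside the 10% to 80% coverage range are simply redrawn. The line masks are omitted, and all function names are illustrative.

```python
import numpy as np

def random_rectangle_mask(size: int, rng: np.random.Generator) -> np.ndarray:
    """Binary mask with 1-4 rectangles whose sides are 30-150 pixels (0 = missing)."""
    mask = np.ones((size, size), dtype=np.uint8)
    for _ in range(int(rng.integers(1, 5))):        # 1 to 4 rectangles
        height, width = (int(v) for v in rng.integers(30, 151, size=2))
        top = int(rng.integers(0, size - height + 1))
        left = int(rng.integers(0, size - width + 1))
        mask[top:top + height, left:left + width] = 0
    return mask

def missing_fraction(mask: np.ndarray) -> float:
    """Share of image pixels covered by the missing area."""
    return float(np.mean(mask == 0))

def sample_mask(size: int = 256, low: float = 0.10, high: float = 0.80,
                rng=None, max_tries: int = 1000) -> np.ndarray:
    """Redraw masks until the missing area lies between low and high."""
    rng = rng if rng is not None else np.random.default_rng()
    for _ in range(max_tries):
        mask = random_rectangle_mask(size, rng)
        if low <= missing_fraction(mask) <= high:
            return mask
    raise RuntimeError("no mask within the requested coverage range")

mask = sample_mask(rng=np.random.default_rng(0))
coverage = missing_fraction(mask)
```

Fixing the random seed, as in the last two lines, makes the generated masks reproducible across training runs.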
    At the fifth and sixth stages of the proposed technique the LaMa-Fourier and LaMa-Wavelet
v1-4 networks were trained, and the training results were evaluated by the Fréchet inception
distance (FID) [24]. FID measures the distance between the feature distributions of real images and
images inpainted by the network [24]. The obtained results are compared with existing image
inpainting methods based on [25]. Since FID estimates only the overall similarity of the original and
inpainted images, the peak signal-to-noise ratio (PSNR) and the structural similarity index measure
(SSIM) are also applied. These two additional measures evaluate how well the CNN obtained by the
neural architecture search inpaints edges and details [26].
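For reference, PSNR follows directly from the mean square error between the two images. The sketch below is a plain NumPy version of the standard formula, assuming 8-bit images with a peak value of 255; it is not the scikit-image routine used in the experiments, and the function name is illustrative.

```python
import numpy as np

def psnr(original: np.ndarray, restored: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    diff = original.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

original = np.full((8, 8), 100, dtype=np.uint8)
restored = np.full((8, 8), 101, dtype=np.uint8)  # every pixel off by one level
value = psnr(original, restored)
```

Because PSNR depends only on pixel-wise differences, it is sensitive to blurred edges and lost details, which complements the distribution-level view given by FID.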
    In addition, the impact of the properties of images from the considered datasets on the result
of the neural architecture search was estimated. The measures of homogeneity and uniformity
were used to evaluate how saturated an image is with large details and coarse texture [27].
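The two texture measures can be sketched as follows. The definitions are assumptions, since the paper does not give the formulas: uniformity is taken as the sum of squared gray-level histogram probabilities, and homogeneity as the co-occurrence matrix sum of p(i, j) / (1 + |i - j|) over horizontally adjacent pixels. The exact definitions in [27] may differ, and the function names are illustrative.

```python
import numpy as np

def uniformity(gray: np.ndarray, levels: int = 256) -> float:
    """Sum of squared histogram probabilities; equals 1 for a constant image."""
    hist = np.bincount(gray.astype(np.intp).ravel(), minlength=levels)
    p = hist / hist.sum()
    return float(np.sum(p ** 2))

def homogeneity(gray: np.ndarray, levels: int = 256) -> float:
    """Co-occurrence homogeneity for horizontally adjacent pixel pairs."""
    left = gray[:, :-1].astype(np.intp).ravel()
    right = gray[:, 1:].astype(np.intp).ravel()
    glcm = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(glcm, (left, right), 1.0)   # count each (left, right) pair
    glcm /= glcm.sum()
    i, j = np.indices(glcm.shape)
    return float(np.sum(glcm / (1.0 + np.abs(i - j))))

flat = np.full((8, 8), 7, dtype=np.uint8)
stripes = np.tile(np.array([0, 255], dtype=np.uint8), (8, 4))
```

A constant image gives the maximum value of 1 for both measures, while the high-contrast striped image gives a uniformity of 0.5 and a homogeneity of 1/256.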
    Special software for LaMa network architecture search has been developed for the computer
experiment. Several key libraries were used to implement LaMa-Wavelet v1-4 in Python,
specifically PyTorch, NumPy, saicinpainting, and others. PyTorch provides tools for designing and
training neural networks. It is distinguished by an intuitive API with support for dynamic
computation graphs. This is particularly convenient for designing neural network architectures.
PyTorch also allows the GPU to be used to accelerate calculations when processing large volumes
of data.
    For fast mask generation, the mask method of the saicinpainting library was used. To evaluate
the results, the pytorch-fid library was used to obtain the FID score. The PSNR and SSIM functions
of the scikit-image library were used to calculate PSNR and SSIM, respectively.
    The following resources are required to run the program. Linux is the preferred operating
system; alternatively, Windows 10 with Windows Subsystem for Linux 2 (WSL2) can be used. The
processor must support the x86-64 architecture. The program requires 1 GB of disk space, not
counting the space for the training set if one is used. The recommended RAM is 16 GB; the
minimum is 8 GB.
    A GPU with CUDA support can be used. Full HD images require 4 GB of graphics memory, 2K
images require 8 GB, and 4K images require 16 GB. The Google Colab environment was used to
process the data. It provided a pre-configured NVIDIA Tesla T4 GPU, which has 16 GB of GDDR6
memory and 2,560 CUDA cores. Google Drive was used to store the training and testing datasets
for easy access.
    The developed software includes the Python modules wconv.py and predict.py. The
wconv.py module implements the spectral blocks of LaMa-Wavelet v1-4 that form the neural
architecture search space. Computing the DWT significantly increased the processing time of the
LaMa-Wavelet v1-4 networks compared to LaMa-Fourier. Therefore, the ptwt library was used to
calculate the DWT coefficients on the GPU.
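    To make the DWT coefficients concrete, a single-level two-dimensional transform is sketched
below in NumPy. This is an illustration only and not the wconv.py implementation: the Haar wavelet
is chosen for brevity (the wavelet family is an assumption here), the function name haar_dwt2 is
hypothetical, and the sketch runs on the CPU, whereas the developed software relies on ptwt, which
operates on PyTorch tensors.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D orthonormal Haar DWT of an array with even height and width.

    Returns the approximation sub-band and a tuple of three detail sub-bands.
    Sub-band naming and sign conventions differ between libraries.
    """
    x = np.asarray(x, dtype=np.float64)
    s = np.sqrt(2.0)
    # Filter and downsample along the width (pairs of neighbouring columns).
    low = (x[:, 0::2] + x[:, 1::2]) / s
    high = (x[:, 0::2] - x[:, 1::2]) / s
    # Filter and downsample along the height (pairs of neighbouring rows).
    approx = (low[0::2, :] + low[1::2, :]) / s
    detail_a = (low[0::2, :] - low[1::2, :]) / s    # changes across rows
    detail_b = (high[0::2, :] + high[1::2, :]) / s  # changes across columns
    detail_c = (high[0::2, :] - high[1::2, :]) / s  # diagonal changes
    return approx, (detail_a, detail_b, detail_c)


# A flat image has no detail: all energy goes to the approximation sub-band.
flat = np.ones((4, 4))
flat_approx, flat_details = haar_dwt2(flat)

# The transform is orthonormal, so the signal energy is preserved.
ramp = np.arange(16, dtype=np.float64).reshape(4, 4)
ramp_approx, ramp_details = haar_dwt2(ramp)
energy_in = np.sum(ramp ** 2)
energy_out = np.sum(ramp_approx ** 2) + sum(np.sum(d ** 2) for d in ramp_details)
print(energy_in, energy_out)
```

Each sub-band has half the height and width of the input, so one decomposition level keeps the
total number of coefficients unchanged.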
    The wconv.py module can process images of any given size, so the network can be trained on
small images and then used to inpaint larger ones, which significantly reduces training time. The
predict.py module obtains the image inpainting result using the trained network weights. It
requires the following input arguments: the path to the network architecture file; the path to the
file with network weights; the path to the folder of source images and their masks; the path to
the folder for saving the inpainted images.
    The user interface of the LaMa network architecture search (Figure 6) is designed based on the
T3 template, using Next.js and Tailwind CSS for structuring and styling. The template also includes
the tRPC framework to ensure reliable communication between the client and the server. React-
DaisyUI was used to design the stylized UI components. For image masking, the react-canvas-draw
library is used, which provides tools for changing the size of the brush, undoing the last stroke and
clearing the image.
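    The four input arguments of predict.py listed above could be exposed on the command line as in
the following sketch. The flag names and the example paths are hypothetical and are not taken from
the developed software; only the meaning of the four arguments follows the description.

```python
import argparse


def build_parser():
    """Command-line parser with the four inputs described for predict.py.

    All flag names are illustrative assumptions.
    """
    parser = argparse.ArgumentParser(description="Inpaint images with a trained network")
    parser.add_argument("--config", required=True,
                        help="path to the network architecture file")
    parser.add_argument("--weights", required=True,
                        help="path to the file with network weights")
    parser.add_argument("--input-dir", required=True,
                        help="folder with source images and their masks")
    parser.add_argument("--output-dir", required=True,
                        help="folder for saving the inpainted images")
    return parser


# Parse an explicit argument list so that the example runs without a shell.
args = build_parser().parse_args([
    "--config", "arch.yaml",          # hypothetical file name
    "--weights", "weights.ckpt",      # hypothetical file name
    "--input-dir", "images/",
    "--output-dir", "inpainted/",
])
print(args.config, args.weights, args.input_dir, args.output_dir)
```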
    The interface displays the loaded image (left) and allows the user to 'paint' a mask directly over
it (shown in light green). The Clear button completely erases the drawn mask. The Undo button
erases the previous stroke. The Show Mask button toggles the display of the drawn mask.
The Compare button toggles the display of the original image in the result window for
quick comparison.
    The Brush Size slider changes the size of the mask brush. The Inpaint button starts the image
inpainting modules, after which the result is displayed in the window on the right. In addition, the
interface makes it possible to select a network for image inpainting, allowing users to search for
the best option for specific tasks.
    After image inpainting the interface shows statistics, including the generation time and the
PSNR and SSIM of the inpainted image.
   Figure 6: User interface of LaMa network architecture search software

6. Results
In this section the seventh and eighth stages of the neural architecture search technique are
discussed. The results of training the LaMa-Fourier and LaMa-Wavelet v1-4 networks were evaluated
using the FID on the training and validation sets. The image inpainting time and the training epoch
time were estimated additionally. The image inpainting time was averaged over a set of 25 images of
size 1024x1024 pixels (Table 1).
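    Averaging the inpainting time over a set of images can be sketched as follows. The helper
mean_inpainting_time and the stand-in dummy_inpaint are hypothetical; the real measurement would
call the trained network instead.

```python
import time


def mean_inpainting_time(inpaint, samples):
    """Average wall-clock time in seconds of inpaint(image, mask) over all samples."""
    elapsed = []
    for image, mask in samples:
        start = time.perf_counter()
        inpaint(image, mask)
        # Note: when the network runs on a GPU, kernels are launched asynchronously,
        # so the device should be synchronized (e.g. torch.cuda.synchronize())
        # before the timer is read.
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def dummy_inpaint(image, mask):
    """Hypothetical stand-in for a network call; returns the input unchanged."""
    return image


# 25 placeholder (image, mask) pairs, matching the size of the evaluation set.
samples = [(None, None)] * 25
mean_seconds = mean_inpainting_time(dummy_inpaint, samples)
print(f"mean inpainting time: {mean_seconds:.6f} s")
```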
    The FID of LaMa-Wavelet v4 as a function of the epoch still showed a downward trend
after 128 epochs (Figures 7, 8). This indicated that the FID could be reduced further.
Therefore, the training of LaMa-Wavelet v4 was continued to 212 epochs. As a result, the FID of
LaMa-Wavelet v4 was reduced to about 8 on the training set and to 24 on the validation set,
approaching the FID of LaMa-Fourier [7].

Table 1
The neural architecture search results on Places2 dataset [22] 256x256 images
     CNN                FID on          FID on            Epoch time,    Image inpainting
                        training set    validation set    minutes        time, seconds
     LaMa-Fourier       8.2             25.4              40.1           2.2
     LaMa-Wavelet v1    12.1            41.2              94.6           3.8
     LaMa-Wavelet v2    45.3            95.0              249.7          13.2
     LaMa-Wavelet v3    9.5             37.9              125.3          5.3
     LaMa-Wavelet v4    9.2             31.8              148.8          6.6


Figure 7: FID score versus epoch for the LaMa-Fourier network (blue line) and the LaMa-Wavelet v1-4
networks (green, orange, yellow and red lines, respectively) on the training set, logarithmic scale
Figure 8: FID score versus epoch for the LaMa-Fourier network (blue line) and the LaMa-Wavelet v1-4
networks (green, orange, yellow and red lines, respectively) on the validation set, logarithmic
scale

   When inpainting test set images using the LaMa-Fourier and LaMa-Wavelet v4 networks, it was
noticed that the inpainting was better for images containing significant areas of uniform intensity,
fine-grained or structural texture (Figure 9, c). Fragments of images that include complex textures
or detailed patterns were inpainted worse. For example, the inpainting of grass, leaves, branches,
crowds, or thin fabric fibers is difficult for both the LaMa-Wavelet v4 and LaMa-Fourier networks
(Figure 9, b). Therefore, 50 images with the lowest and highest PSNR, as well as 50 images
with the lowest and highest SSIM, were then selected after inpainting with narrow, medium and large
masks. The saturation of the original images with details was estimated with homogeneity and
uniformity [27]. These measures were calculated using the gray level co-occurrence matrix of the
image. Tables 2 and 3 give the homogeneity and uniformity values for images inpainted with high
and low PSNR and SSIM.
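
    The two texture descriptors can be sketched in NumPy as follows, using the definitions of
uniformity (sum of squared co-occurrence probabilities) and homogeneity (probabilities weighted by
1/(1 + |i - j|)) from [27]. The function names, the single horizontal offset, and the number of
gray levels are assumptions of this sketch. The offset, quantization, and scaling behind the values
in Tables 2 and 3 are not restated here, so the numbers are not directly comparable.

```python
import numpy as np


def glcm(image, levels, offset=(0, 1)):
    """Normalized gray level co-occurrence matrix for one non-negative offset (dr, dc)."""
    image = np.asarray(image).astype(np.intp)
    dr, dc = offset
    rows, cols = image.shape
    first = image[:rows - dr, :cols - dc]   # reference pixels
    second = image[dr:, dc:]                # neighbours at the given offset
    counts = np.zeros((levels, levels), dtype=np.float64)
    # Accumulate one count per (reference level, neighbour level) pair.
    np.add.at(counts, (first.ravel(), second.ravel()), 1.0)
    return counts / counts.sum()


def uniformity(p):
    """Sum of squared co-occurrence probabilities; highest for a constant image."""
    return float(np.sum(p ** 2))


def homogeneity(p):
    """Concentration of the co-occurrence matrix near its main diagonal."""
    i, j = np.indices(p.shape)
    return float(np.sum(p / (1.0 + np.abs(i - j))))


# A constant image versus a two-level checkerboard.
constant = np.zeros((4, 4), dtype=np.uint8)
checker = np.indices((4, 4)).sum(axis=0) % 2

p_constant = glcm(constant, levels=2)
p_checker = glcm(checker, levels=2)
print(uniformity(p_constant), homogeneity(p_constant))
print(uniformity(p_checker), homogeneity(p_checker))
```

The constant image concentrates all co-occurrences in one diagonal cell, so both descriptors reach
their maximum, while the checkerboard places all co-occurrences off the diagonal and both values
drop.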


 Figure 9: Original images with mask (a, c) and images inpainted by LaMa-Wavelet v4 (b, d)

Table 2
The homogeneity of original images which then were inpainted with LaMa-Fourier and LaMa-Wavelet v4
networks
   CNN, mask size                   Low PSNR     High PSNR     Low SSIM     High SSIM
   LaMa-Fourier, narrow masks       0.127        0.595         0.094        0.510
   LaMa-Wavelet v4, narrow masks    0.127        0.594         0.101        0.537
   LaMa-Fourier, medium masks       0.211        0.357         0.156        0.351
   LaMa-Wavelet v4, medium masks    0.157        0.348         0.151        0.351
   LaMa-Fourier, large masks        0.234        0.369         0.091        0.343
   LaMa-Wavelet v4, large masks     0.261        0.458         0.085        0.387

Table 3
The uniformity of original images which then were inpainted with LaMa-Fourier and LaMa-Wavelet v4
networks
   CNN, mask size                   Low PSNR     High PSNR     Low SSIM     High SSIM
   LaMa-Fourier, narrow masks       2.989        25.126        1.749        12.088
   LaMa-Wavelet v4, narrow masks    2.989        25.500        1.735        13.444
   LaMa-Fourier, medium masks       28.815       8.669         3.530        8.443
   LaMa-Wavelet v4, medium masks    2.182        8.486         3.520        8.443
   LaMa-Fourier, large masks        36.558       13.706        0.423        16.563
   LaMa-Wavelet v4, large masks     35.284       22.152        0.405        17.606

7. Discussions
Figures 7 and 8 show that the original LaMa-Fourier network is balanced in terms of image
inpainting quality and processing time. Its training was completed after 128 epochs,
providing a reliable baseline for comparison [25]. The validation dataset was used to obtain an
unbiased estimate of the network performance after each epoch. By monitoring the performance on
the validation set, the training was stopped when LaMa-Fourier started to overfit. The training
of the LaMa-Wavelet v1-4 networks was initially stopped at the same epoch as the training of the
LaMa-Fourier network for a consistent comparison. Their learning curves still trended downward,
indicating the potential for further performance improvement with continued training. A
comparison of the training results after epoch 128 (Table 1) showed the following.
    LaMa-Wavelet v2 showed the worst image inpainting quality and low computational
efficiency in comparison with the LaMa-Wavelet v1, v3, v4 networks and LaMa-Fourier. Therefore,
after the experiment it was decided not to interleave wavelet transform levels with convolutional
layers. This approach reduced the computational efficiency of the network so much that
it made no practical sense to use LaMa-Wavelet v2. The LaMa-Wavelet v1, v3, v4 networks
reached a level comparable to the original LaMa-Fourier network. These networks
demonstrated a promising balance between image inpainting quality and computational efficiency,
indicating potential for further optimization. The closest result to the LaMa-Fourier network
was shown by LaMa-Wavelet v4. The LaMa-Wavelet v1 and v3 networks showed results that were similar
to each other and slightly worse than those of LaMa-Wavelet v4. Specifically, the difference was
26% and 3% for FID on the training set, and 28% and 19% for FID on the validation set, respectively
(Table 1). LaMa-Fourier requires less training time and inpaints an image faster than LaMa-Wavelet
v1-4; the LaMa-Wavelet v1, v3, v4 networks are more time consuming.
    An experiment to estimate the quality of inpainting of edges and fine details of images by the
LaMa-Wavelet v4 and LaMa-Fourier networks was conducted by the authors in [7]. The PSNR of images
inpainted using LaMa-Wavelet v4 exceeds the results obtained using the LaMa-Fourier network by 4.5%
on average for narrow and medium masks and by 6% on average for large masks. Applying LaMa-Wavelet
can enhance SSIM by 2–4% depending on the mask size. This issue is covered in more detail in [7].
    To analyze how the quality of generated details and object edges depends on the properties of
the image textures, note the following. Image texture properties are described by the texture
descriptors of homogeneity and uniformity, which are based on the gray level co-occurrence matrix
of the image. Uniformity grows with the squares of the image intensity probabilities, so
the less random the image is, the higher its uniformity. Homogeneity characterizes the
concentration of the values of the gray level co-occurrence matrix near the main diagonal. A
matrix with larger probability values near the diagonal corresponds to a larger value of the
homogeneity descriptor. Such a matrix is typical for an image with a large content of halftones and
areas of slowly changing intensity [27].
    The results in Tables 2 and 3 showed that, to fill missing areas of images with large masks, it
is preferable to use the LaMa-Fourier network if homogeneity and uniformity are low. If homogeneity
and uniformity are high, it is better to use the LaMa-Wavelet v4 network to obtain an inpainted
image with high PSNR. To inpaint images with medium masks with high PSNR, it is preferable to use
the LaMa-Fourier and LaMa-Wavelet v4 networks if homogeneity is high and uniformity is medium. To
inpaint images with narrow masks with high PSNR, it is preferable to use the LaMa-Fourier and
LaMa-Wavelet v4 networks if homogeneity and uniformity are very high. Thus, in the case of large
masks, the LaMa-Fourier network is better at inpainting images with more random intensities, while
the LaMa-Wavelet v4 network better inpaints images with more halftones and areas of low intensity
variation. If the size of the masks is reduced, the ability of both networks to reconstruct images
with high detail content increases. However, in the case of narrow masks both networks are better
at inpainting images with areas of low intensity variation.

8. Conclusions
The relevant scientific and applied problem of neural architecture search for the inpainting of
fine image details and object edges has been considered.
    The scientific novelty is the proposed technique of neural architecture search for image
inpainting. Using it, new LaMa-based network architectures were designed that differ
in the relation between image inpainting time and reconstructed image quality. Image inpainting
with large masks based on the LaMa network is improved by applying the wavelet transform.
Specifically, the quality of filling missing areas with image edges and fine details is increased.
    The practical significance of the obtained results is that software implementing the proposed
technique of neural architecture search for image inpainting has been developed based on the LaMa
network. Experiments to study image inpainting performance were conducted. The experimental results
make it possible to determine effective conditions for applying versions of this network in
practice. In addition, the dependence of the quality of generated details and object edges
on the properties of the image textures, which can be described by texture
descriptors, was studied.
    Prospects for further research include reducing the computing time by using fast transforms
[28, 29], predicting the effectiveness of the LaMa network depending on the estimated values of
image texture descriptors, and formulating recommendations on LaMa network
applications.

References
[1] H. Xiang, Q. Zou, M. A. Nawaz, X. Huang, F. Zhang, H. Yu, Deep learning for image
    inpainting: A survey, Pattern Recognition 134.109046 (2023). doi: 10.1016/j.patcog.2022.109046.
[2] D. Kolodochka, M. Polyakova, The research of the quality of filling missing regions of images
    by methods PatchMatch and LaMa, in: Proceedings of 5th International Scientific and Practical
    Conference on Modern Resea                                         -
    2022, pp. 211–219.
[3] J. Yu, J. Yang, X. Shen, X. Lu, T. S. Huang, Generative image inpainting with contextual
    attention, in: Proceedings of Computer Vision and Pattern Recognition Workshops, CVPRW,
    IEEE/CVF, Salt Lake City, UT, USA, 2018, pp. 5505–5514. doi: 10.1109/CVPRW.2018.00577.
[4] J. Yu, Z. Lin, J. Yang et al., Free-form image inpainting with gated convolution, in: Proceedings
    of IEEE/CVF International Conference on Computer Vision, ICCV, IEEE/CVF, Seoul, Korea
    (South), 2019, pp. 4471–4480. doi: 10.1109/ICCV.2019.00457.
[5] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, M. Ebrahimi, EdgeConnect: structure guided image
    inpainting using edge prediction, in: Proceedings of IEEE/CVF Computer Vision Workshop,
    ICCVW, IEEE/CVF, Seoul, Korea (South), 2019, pp. 2462–2468. doi: 10.1109/ICCVW.2019.00408.
[6] S. Zhao, J. Cui, Y. Sheng et al., Large scale image completion via co-modulated generative
    adversarial networks, in: Proceedings of International Conference on Learning
    Representations, ICLR, Vienna, Austria, 2021. doi: 10.48550/arXiv.2103.10428.
[7] D. O. Kolodochka, M. V. Polyakova, LaMa-Wavelet: image inpainting with high quality of fine
    details and object edges, Radio Electronics, Computer Science, Control 1 (2024) 208–220. doi:
    10.15588/1607-3274-2024-1-19.
[8] L. Cao, T. Yang, Y. Wang, B. Yan, Y. Guo, Generator pyramid for high-resolution image
    inpainting, Complex & Intelligent Systems 9.7553 (2023). doi: 10.1007/s40747-023-01080-w.
[9] R. Suvorov, E. Logacheva, A. Mashikhin et al., Resolution-robust large mask inpainting with
    Fourier convolutions, in: Proceedings of IEEE Workshop/Winter Conference on Applications
    of Computer Vision, WACV, IEEE, Waikoloa, Hawaii, 2022, pp. 2149–2159. doi:
    10.1109/WACV51458.2022.00323.
[10] C. White, M. Safari, R. Sukthanker et al., Neural architecture search: insights from 1000 papers.
     doi: 10.48550/arXiv.2301.08727.
[11] T. Elsken, J. H. Metzen, F. Hutter, Neural architecture search: a survey, Journal of Machine
     Learning Research 20 (2019) 1–21.
[12] S. Fujieda, K. Takayama, T. Hachisuka, Wavelet convolutional neural networks, 2018. doi:
     10.48550/arXiv.1805.08620.
[13] A. Souza Brito, M. B. Vieira, M. L. Andrade, R. Q. Feitosa, G. A. Giraldi, Combining max-
     pooling and wavelet pooling strategies for semantic image segmentation, Expert Systems with
     Applications 183.115403 (2021). doi: 10.1016/j.eswa.2021.115403.
[14] A. Hamad, A new pooling layer based on wavelet transform for convolutional neural network,
     Journal of Advanced Research in Dynamical and Control Systems 24.4 (2020) 76–85.
     doi: 10.5373/JARDCS/V12I4/20201420.
[15] A. Ferrà, E. Aguilar, P. Radeva, Multiple wavelet pooling for CNNs, in: L. Leal-Taixé, S. Roth,
     (Eds.), Computer Vision – ECCV 2018 Workshops, volume 11132 of Lecture Notes in
     Computer Science, Springer, Cham, 2019, pp. 671–675. doi: 10.1007/978-3-030-11018-5_55.
[16] O. Herrera, B. Priego, Wavelets as activation functions in neural networks, Journal of
     Intelligent & Fuzzy Systems 42.5 (2022) 4345–4355. doi: 10.3233/JIFS-219225.
[17] J. W. Liu, F. L. Zuo, Y. X. Guo et al., Research on improved wavelet convolutional wavelet
     neural networks, Applied Intelligence 51 (2021) 4106–4126. doi: 10.1007/s10489-020-02015-5.
[18] P. Liu, H. Zhang, W. Lian, W. Zuo, Multi-level wavelet convolutional neural networks, IEEE
     Access 7 (2019) 74973–74985. doi: 10.1109/ACCESS.2019.2921451.
[19] L. Wang, Y. Sun, Image classification using convolutional neural network with wavelet
     domain inputs, IET Image Processing 16.8 (2022) 2037–2048. doi: 10.1049/ipr2.12466.
[20] I. Daubechies, Ten Lectures on Wavelets, SIAM Press, Philadelphia, 1992.
[21] J. Bobulski, Multimodal face recognition method with two-dimensional hidden Markov model,
     Bulletin of the Polish Academy of Sciences, Technical Sciences, 65.1 (2017) 121–128. doi:
     10.1515/bpasts-2017-0015.
[22] Places365 Scene Recognition Demo. URL: http://places2.csail.mit.edu/.
[23] Safebooru. URL: https://safebooru.org/index.php?page=post&s=list&tags=
      no_humans+landscape.
[24] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two
     time-scale update rule converge to a local nash equilibrium, in: Proceedings of 31st Annual
     Conference on Neural Information Processing Systems, NIPS, Long Beach, California, USA,
     2017, pp. 6629–6640. doi: 10.18034/ajase.v8i1.9.
[25] Supplementary material. URL: https://bit.ly/3zhv2rD/lama_supmat_2021.pdf.
[26] U. Sara, M. Akter, M. S. Uddin, Image quality assessment through FSIM, SSIM, MSE and PSNR –
     a comparative study, Journal of Computer and Communications 7.3 (2019) 8–18. doi:
     10.4236/jcc.2019.73002.
[27] R. C. Gonzalez, R. E. Woods, Digital Image Processing, 4th ed., Pearson, New York, NY, 2017.
[28]                                                  -friendly filtering algorithms for deep neural
     networks, Applied Science, 13.9004 (2023). doi: 10.3390/app13159004.
[29] A. Cariow, G. Cariowa, Minimal filtering algorithms for convolutional neural networks, in: C.
     van Gulijk, E. Zaitseva (Eds.), Reliability Engineering and Computational Intelligence. Studies
     in Computational Intelligence, volume 976, Springer, Cham, 2021, pp. 73–88. doi:
     10.1007/978-3-030-74556-1_5.